Albatross Analytics hands-on in practice: statistical and data science applications

Albatross Analytics is a statistical and data science platform that researchers across many disciplines can use. It makes it easy to implement fundamental analyses and a range of regressions with random effects, including Hierarchical Generalized Linear Models (HGLMs), Double Hierarchical Generalized Linear Models (DHGLMs), Multivariate Double Hierarchical Generalized Linear Models (MDHGLMs), survival analysis, frailty models, Support Vector Machines (SVMs), and Hierarchical Likelihood Structural Equation Models (HSEMs). We provide 94 example datasets.

A library is a collection of commands or functions that perform specific analyses. For instance, the dhglm package implements double hierarchical generalized linear models, in which the mean, the dispersion parameters for the variance of the random effects, and the residual variance (overdispersion) can themselves be modeled with random effects [4]; MDHGLMs were introduced by Lee [5][6][7]. This framework allows various models for multivariate response variables, where each response is assumed to follow a double hierarchical generalized linear model. See also further HGLM applications in machine learning [4], schizophrenic behavior data [8], variable selection methods [9], non-Gaussian factor models [10], factor analysis for ordinal data [11], survival analysis [12], joint modeling of longitudinal outcomes and time-to-event data [13], and recent advanced topics [14][15][16][17].
The frailtyHL package fits semi-parametric frailty and competing-risk models using the h-likelihood. This package allows lognormal or gamma frailties for the random-effect distribution, and it fits shared or multilevel frailty models for correlated survival data. Functions are provided to format and summarize the frailtyHL results [18]. The estimates of fixed effects and frailty parameters and their standard errors are calculated. We illustrate the use of the package with two well-known data sets and compare our results with alternative R procedures; see also applications to semi-competing risks data [19] and clustered survival data [20,21]. This paper explains what Albatross Analytics is and shows how to use it in statistical and data science applications. The advantage of Albatross Analytics is that users can analyze and interpret data easily. Fig. 1 shows the features of Albatross Analytics, including fundamental analysis, random-effect models, regression, survival analysis, and multiple-response analysis. This paper aims to demonstrate the application of Albatross Analytics to statistical analysis in broad areas; in short, we provide illustrative examples.

Data management
In today's world, data is the driving factor behind all establishments. As institutions keep collecting so much data, the need to manage data quality becomes more notable by the day. Data quality management is the set of measures applied by a technical team or a database management system to ensure the data can support sound new knowledge [22][23][24]. These techniques are carried out along the data management pathway, from data capture to execution, dissemination, and interpretation [24][25][26]. In line with this, data management is the process of processing, managing, and maintaining data quality [27,28]. Effective data management can increase the efficiency of research work [26,29]. Figure 2a describes the main features available in Albatross Analytics. In the import data section, users can upload data to be processed; the supported file formats are Excel and txt. For instance, Fig. 2b shows how to create a new variable, merge datasets, and add new variables.
Each expression or variable has a data type, such as numeric, integer, complex, logical, or character. In Albatross Analytics, data types are expressed through classes. A class is a combination of a data type and the operations that can be performed on that type. Albatross Analytics treats data as objects having attributes or properties; data properties are defined by the data type.
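As a toy illustration (in Python rather than the platform's own interface), the idea that a value's class determines which operations are valid can be seen directly; the variable names and values here are hypothetical:

```python
# Illustrative only: each variable carries a class (data type) that
# determines which operations are valid on it.
values = {"count": 7, "rate": 0.35, "flag": True, "label": "GN"}

for name, v in values.items():
    print(name, type(v).__name__)

# Numeric classes support arithmetic; character (string) classes do not mix with them.
assert values["count"] + 1 == 8
try:
    values["label"] + 1          # operation not defined for this class
except TypeError:
    print("operation not defined for type 'str'")
```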

Basic analysis and GLMs
Descriptive statistics are used to identify the specific characteristics of the data in the interpretation. We provide simple details of the findings and the procedures followed in Fig. 3. Together with the frequency distribution, descriptive statistics form the basis of almost all quantitative analyses of the results. Descriptive statistics present practical explanations understandably and allow one to summarize enormous amounts of data in a structured way.
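The quantities involved can be sketched in a few lines of Python (illustrative only; the sample of fault counts is hypothetical):

```python
import statistics as st

# Hypothetical sample: fault counts from a small quality-control study.
faults = [6, 4, 17, 9, 14, 8, 5, 7, 7, 7]

summary = {
    "n": len(faults),
    "mean": st.mean(faults),
    "median": st.median(faults),
    "mode": st.mode(faults),
    "sd": st.stdev(faults),     # sample standard deviation
    "min": min(faults),
    "max": max(faults),
}
for k, v in summary.items():
    print(f"{k}: {v}")
```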
The t-test can be used to compare the means of two groups of data measured on an interval scale. Sometimes we come across a study that aims to compare the mean of a sample with the mean of the entire population. Research designs like this are rare, but the researcher can still draw valuable conclusions. We can do two kinds of tests: the z-test and the t-test. The deciding condition is whether the population's standard deviation is known. If we know it, we use the z-test, but this is rarely, if ever, the case. Therefore, the most frequently used test is the t-test, because it does not require knowing the standard deviation of the population we study.
Furthermore, the use of the t-test on two samples is divided into two types based on the characteristics of the two samples. The first is the t-test on two independent samples, meaning that the two samples to be studied come from two different groups and were given different treatments.
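As a sketch of the computation behind the two-independent-samples case, the following Python computes Welch's t statistic, which does not assume equal group variances; the data are hypothetical:

```python
import math
import statistics as st

def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch), plus its df."""
    nx, ny = len(x), len(y)
    vx, vy = st.variance(x), st.variance(y)        # sample variances
    se2 = vx / nx + vy / ny
    t = (st.mean(x) - st.mean(y)) / math.sqrt(se2)
    df = se2**2 / ((vx / nx)**2 / (nx - 1) + (vy / ny)**2 / (ny - 1))
    return t, df

# Hypothetical measurements in two independent groups.
a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.9]
b = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4]

t, df = welch_t(a, b)
print(f"t = {t:.3f}, df = {df:.1f}")
```

The resulting t is compared against a t distribution with the Welch-Satterthwaite degrees of freedom.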
In research, analysis of variance is fundamental. One of the assumptions that must be met is that the population variances are the same, so we need to test this hypothesis. The purpose of the analysis of variance (ANOVA) is to assess the equality of several population means. One-way ANOVA may be used if only one factor is involved. Two types of checks can be used in ANOVA testing: formal tests and visual inspection.
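The F statistic behind one-way ANOVA is simple to compute by hand; here is a minimal Python sketch with hypothetical data from three treatments:

```python
import statistics as st

def one_way_anova_F(groups):
    """F statistic for one-way ANOVA: between-group vs within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = st.mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (st.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - st.mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical responses under three treatments.
g1 = [20.0, 21.0, 19.0, 20.0]
g2 = [24.0, 25.0, 23.0, 24.0]
g3 = [20.0, 22.0, 21.0, 21.0]
F = one_way_anova_F([g1, g2, g3])
print(f"F = {F:.2f}")
```

A large F relative to the F(k − 1, n − k) reference distribution indicates that at least one group mean differs.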
Meanwhile, the homogeneity assumption can be checked with a model-checking plot: if the plot does not form a specific pattern, homogeneity of variance is satisfied. Descriptive analysis tells us the characteristics of each variable. In addition, we may examine the relationship between variables for either normal or non-normal data [30]. With a correlation test, we want to know whether two variables trend together: when the value of one variable increases, the value of the other tends to increase or decrease along with it [31].
One main factor determines the test method used: the distribution of the data to be tested. If the data distribution is normal, we can use a parametric correlation test such as Pearson's correlation coefficient. If the data distribution is not normal, we can use Kendall's rank correlation or Spearman's rank correlation, which are nonparametric correlation tests.
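The parametric/nonparametric contrast is easy to see in code: Spearman's coefficient is just Pearson's coefficient applied to ranks. A pure-Python sketch with hypothetical data:

```python
def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    """Average ranks (ties get the mean of their positions, 1-based)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(pearson(x, y), spearman(x, y))
```

Because Spearman's rho depends only on ranks, it is unaffected by monotone transformations of the data, which is why it is preferred for non-normal distributions.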
Regression analysis tests the causal relationship between variables: one variable serves as the independent variable and another as the dependent variable. Numerous regression approaches, including Poisson regression, came into use during the 1970s. Unlike linear regression, models such as Poisson and logistic regression require an iterative estimation algorithm that maximizes the likelihood. Figure 4 shows that Albatross Analytics provides features for the linear model, GLM logit model, GLM probit model, log-linear model, and joint GLM.
GLMs describe a family of models where the response comes from the exponential family of distributions. The method used to estimate these models and to draw inferences (t-tests or F-tests) is maximum likelihood (ML). In the GLM family, an iteratively weighted least squares (IWLS) algorithm computes the ML estimates and their standard errors. Hence, the computational machinery developed for least-squares estimation of linear models can fit GLMs, while the statistical inference is based on ML.
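The IWLS idea can be sketched for a Poisson GLM with log link and a single covariate. This is a minimal pure-Python version on hypothetical counts, not the platform's implementation: at each step the working response z = η + (y − µ)/µ is regressed on the covariates with weights w = µ (the canonical-link case).

```python
import math

def iwls_poisson(x, y, iters=50):
    """IWLS for a Poisson GLM with log link, eta = b0 + b1 * x.
    Working weight w = mu; working response z = eta + (y - mu) / mu."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        # Weighted least squares of z on (1, x) with weights w = mu
        sw = sum(mu)
        swx = sum(w * xi for w, xi in zip(mu, x))
        swxx = sum(w * xi * xi for w, xi in zip(mu, x))
        swz = sum(w * zi for w, zi in zip(mu, z))
        swxz = sum(w * xi * zi for w, xi, zi in zip(mu, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

# Hypothetical count data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2, 3, 4, 6, 8, 9]
b0, b1 = iwls_poisson(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

At convergence the ML score equations hold: the raw residuals y − µ sum to zero and are orthogonal to x.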

Hierarchical generalized linear models (HGLMs)
Albatross Analytics' distinct advantage is its unified analysis of random-effect models. Various random-effect models can be represented as HGLMs and estimated by h-likelihood procedures [32]. HGLMs are defined as follows: (1) Conditional on random effects u, the responses y follow a GLM family, for which the kernel of the log-likelihood is yθ − b(θ), where θ = θ(µ) is the canonical parameter. The linear predictor takes the form in Eq. (1), η = g(µ) = Xβ + Zv, where v = v(u) for some monotone function v(·) and g(·) is the link function.
(2) The random component u follows a (conjugate) distribution to a GLM family of distributions, with its own dispersion parameter.
To infer the HGLM, Lee and Nelder [32] proposed using the h-likelihood. The h-(log-)likelihood is defined as in Eq. 2, h = log f(y | v; β, φ) + log f(v; λ), the joint log-density of the responses and the random effects. The GLM attributes of an HGLM are summarized in Fig. 4. In Bissell's fabric study, the response variable y is the number of faults in a bolt of fabric of length l. Table 1 presents the results of the fabric study. Figure 6 illustrates the negative binomial model fitted via a Poisson-gamma HGLM with saturated random effects for the complete response. In addition, the model-checking plot is presented in Fig. 5.
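As a concrete sketch (with hypothetical numbers, not the fabric data), for a Poisson-gamma HGLM with saturated random effects, one observation's contribution to the h-likelihood is log f(y | u) + log f(u) with y | u ~ Poisson(µu) and u ~ Gamma with mean 1 and variance λ. Maximizing this in u has a closed form, which the code checks numerically:

```python
import math

def h_contrib(y, u, mu, lam):
    """One observation's h-likelihood contribution for a Poisson-gamma HGLM:
    log f(y | u) + log f(u), with y | u ~ Poisson(mu * u) and
    u ~ Gamma(shape 1/lam, rate 1/lam), so E(u) = 1 and var(u) = lam."""
    a = 1.0 / lam
    log_poisson = y * math.log(mu * u) - mu * u - math.lgamma(y + 1)
    log_gamma = a * math.log(a) - math.lgamma(a) + (a - 1) * math.log(u) - a * u
    return log_poisson + log_gamma

def u_hat(y, mu, lam):
    """Closed-form maximizer of h in u (requires y + 1/lam > 1):
    setting dh/du = y/u - mu + (1/lam - 1)/u - 1/lam = 0."""
    a = 1.0 / lam
    return (y + a - 1) / (mu + a)

y, mu, lam = 6, 4.0, 0.5
u = u_hat(y, mu, lam)
print(f"u_hat = {u:.4f}")
```

Conjugacy is what makes the random-effect estimate a simple shrinkage of y/µ toward 1, with λ controlling the amount of shrinkage.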

Double hierarchical generalized linear models (DHGLMs)
HGLMs can be extended by allowing additional random effects in their various components. Lee and Nelder [32] introduced the class of double HGLMs (DHGLMs), in which random effects can be specified in both the mean and the residual variance. Heteroscedasticity between clusters can be modeled by introducing random effects in the dispersion model, just as heterogeneity between clusters is modeled via random effects in the mean model. With DHGLMs, it is also possible to obtain robust inference against outliers by allowing heavy-tailed distributions. Many models can be unified and extended further by the use of DHGLMs, including models in finance such as autoregressive conditional heteroscedasticity (ARCH) models, generalized ARCH (GARCH) models, and stochastic volatility (SV) models; these can be extended further by introducing random effects in the variance terms. Suppose that, conditional on the pair of random effects (a, u), the response y satisfies E(y | a, u) = µ and var(y | a, u) = φV(µ).
The critical extension is to introduce random effects into the component φ: (1) Given u, the linear predictor for µ takes the HGLM form in Eq. 1, where g(·) is the link function, X and Z are model matrices, v = g_M(u) for some monotone function g_M(·) are the random effects, and β are the fixed effects. Moreover, the dispersion parameters for u have the GLM form in Eq. 3, where h_M(·) is the link function, G_M is the model matrix, and γ_M are fixed effects. (2) Given a, the linear predictor for φ takes the HGLM form described in Eq. 4, where h(·) is the link function, G and F are model matrices, b = g_D(a) for some monotone function g_D(·) are the random effects, and γ are the fixed effects. Moreover, the dispersion parameters α for a have the GLM form, as shown in Eq. 5,
where h_D(·) is the link function, G_D is the model matrix, and γ_D are fixed effects. Here, the labels M and D stand for mean and dispersion, respectively. The GLM attributes of a DHGLM are summarized in Fig. 4.
We now illustrate how to fit a DHGLM. Hudak [33] presented crack-growth data, listed in Lu [34]. Each of 21 metallic specimens was subjected to 120,000 loading cycles, with the crack lengths recorded every 10,000 cycles. Let l_ij be the crack length of the i-th specimen at the j-th observation and y_ij = l_ij − l_i,j−1 be the corresponding increment of crack length (response variable), measured in inches, which always has a positive value. A detailed description of the model can be found in Table 2, and Fig. 5a and b present the results for the mean and the dispersion, respectively [5]. Compared with an HGLM, a DHGLM gives model-checking plots for both the mean and the dispersion.

Multivariate double hierarchical generalized linear models (MDHGLMs)
Using the h-likelihood, multivariate models are obtained directly by assuming correlations among the random effects in DHGLMs for different responses. The use of the h-likelihood means that the interlinked GLM fitting methods for HGLMs can be easily extended to fit multivariate DHGLMs (MDHGLMs). Moreover, the resulting algorithm is numerically efficient and gives statistically valid inferences. In this paper, we present an example for MDHGLM; for more details, see [35]. Price et al. [36] presented data from a study on the developmental toxicity of ethylene glycol (EG) in mice. Table 3 summarizes the data on malformation (binary response) and fetal weight (continuous response) and shows clear dose-related trends in both responses.
To fit the EG data, a bivariate HGLM is considered. Figure 6 shows the path diagram of the model for the EG data. The malformation model information is given in Table 4, with cAIC for model evaluation. Likewise, the results for the weight model are given in Table 5 and the correlation in Table 6.

Survival analysis
Albatross Analytics also provides features for survival analysis, shown in Fig. 7, which handle incomplete data caused by censoring in survival-time (time-to-event) data, including the Kaplan-Meier estimator, the Cox model, frailty models [7], and competing-risk models [19,37]. The Kaplan-Meier curve describes the relationship between the estimated survival function and the survival time: the vertical axis represents the estimated survival function, and the horizontal axis represents the survival time.
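The Kaplan-Meier estimator itself is short enough to sketch in Python; the times and censoring indicators below are hypothetical:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) from right-censored data.
    events[i] = 1 if the event was observed at times[i], 0 if censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []                       # (event time, S(t)) pairs
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0                        # events observed at time t
        m = 0                        # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            m += 1
            i += 1
        if d > 0:
            s *= 1.0 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= m
    return curve

# Hypothetical survival times (months); 0 marks a censored observation.
times = [3, 5, 5, 8, 10, 12, 15, 18]
events = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:2d}  S(t) = {s:.3f}")
```

Note how a censored observation does not drop the curve but does shrink the risk set for later event times.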
Cox proportional hazards (PH) regression describes the relationship between the hazard function of survival time and independent variables that are considered to affect survival time. Cox regression is commonly used in survival analysis because it does not assume a particular statistical distribution (e.g., a parametric baseline hazard) for the survival time.
Cox's PH model is widely used to analyze survival data. The method is helpful because of its semi-parametric nature, whereby the baseline hazard is non-parametric and the treatment effects are estimated parametrically. A partial likelihood is usually used to accommodate this semi-parametric form. However, the model can also be fitted with Poisson GLM methods, although these become sluggish because of the many nuisance parameters induced by the non-parametric baseline hazard. Using h-likelihood theory, we can show that Poisson HGLM methodologies apply to such models as well; this approach, too, is sluggish, since the number of nuisance parameters in the non-parametric baseline hazard grows with the number of events.
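For contrast with the h-likelihood route, the classical partial-likelihood fit can be sketched directly. The following Python runs Newton-Raphson for a single covariate on hypothetical data with no tied event times (so Breslow and exact ties handling coincide):

```python
import math

def pl_score_info(beta, times, events, x):
    """Score and information of the log partial likelihood, one covariate."""
    score, info = 0.0, 0.0
    for i in range(len(times)):
        if events[i] != 1:
            continue
        # Risk set: subjects still under observation at times[i]
        risk = [j for j in range(len(times)) if times[j] >= times[i]]
        w = [math.exp(beta * x[j]) for j in risk]
        sw = sum(w)
        m1 = sum(wj * x[j] for wj, j in zip(w, risk)) / sw
        m2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk)) / sw
        score += x[i] - m1
        info += m2 - m1 ** 2
    return score, info

def cox_beta(times, events, x, iters=20):
    """Newton-Raphson maximization of the log partial likelihood."""
    beta = 0.0
    for _ in range(iters):
        score, info = pl_score_info(beta, times, events, x)
        beta += score / info
    return beta

# Hypothetical data: binary covariate x, right-censored times.
times = [2.0, 3.0, 5.0, 7.0, 8.0, 11.0]
events = [1, 1, 1, 1, 0, 1]
x = [1, 0, 1, 0, 1, 0]
beta_hat = cox_beta(times, events, x)
print(f"beta_hat = {beta_hat:.4f}")
```

Note that no baseline-hazard parameters appear: the partial likelihood eliminates them, which is exactly the role the profile h-likelihood plays in the HGLM formulation.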

Example 1 using incomplete data caused by censoring in survival data
In Fig. 7, we study the analysis of incomplete data caused by censoring in survival data.
Cox's PH model is widely used to analyze survival data. Frailty models with a non-parametric baseline hazard extend the PH model by allowing random effects in the hazards and have been widely adopted for the analysis of correlated or clustered survival data; using h-likelihood theory, we can show that Poisson HGLM algorithms can be used to fit the frailty models [12,[38][39][40][41][42][43]. The data consist of right-censored observations from q subjects, with n_i observations each (i = 1, ..., q) and n = Σ_i n_i as the total sample size. Let T_ij be the survival time for the j-th observation of the i-th subject (j = 1, ..., n_i), C_ij the corresponding censoring time, y_ij = min(T_ij, C_ij), δ_ij = I(T_ij ≤ C_ij), and u_i the unobserved frailty for the i-th subject. The conditional hazard function of T_ij given u_i is of the form in Eq. 6, λ_ij(t | u_i) = λ_0(t) exp(x_ij^T β) u_i. Here λ_0(·) is an unspecified baseline hazard function and β = (β_1, ..., β_p)^T is a vector of regression parameters for the fixed covariates x_ij. The term x_ij^T β does not include an intercept because of identifiability. We then assume that the frailties u_i are i.i.d. random variables with a frailty parameter α: gamma frailty with E(u_i) = 1 and var(u_i) = α, or log-normal frailty with v_i = log u_i ∼ N(0, α). Multi-component frailty models can be expressed as in Eq. 7, with linear predictor η = Xβ + Σ_r Z_r v^(r), where X is the n × p model matrix for β and the Z_r are n × q_r model matrices corresponding to the frailties v^(r). The components v^(r) and v^(l) are independent for r ≠ l, and Z_r has indicator entries such that Z^(r)_st = 1 if observation s is a member of subject t in the r-th frailty component, and 0 otherwise.
As illustrations, we present two examples below. Example 1 considers the dataset on the recurrence of infections in kidney patients using a portable dialysis machine. The data consist of the first and second recurrences of kidney infection in 38 patients. The catheter is removed if infection occurs and can also be removed for other reasons, which we regard as censoring (about 24%).
In Example 1, the variables consist of 38 patients (id), time until infection since catheter insertion (time), a censoring indicator (1, infection; 0, censoring) for status, age of the patient (age), sex of the patient (sex; 1, male; 2, female), disease type (disease) among GN, AN, PKD, and other, and estimated frailty (frail). The survival times (first and second infection times) for the same patient are likely to be correlated because of a shared frailty describing the common patient effect. We thus fit log-normal frailty models with two covariates, sex and age, taking the patient as the frailty. Figure 8 presents the Kaplan-Meier plot of estimated survival probability by sex (sex1, male; sex2, female). It shows that the female group has overall higher survival (i.e., lower infection) probabilities than the male group. Table 7 summarizes the estimated results of the log-normal frailty model, and the estimated frailties are shown in Fig. 9. For further discussion of survival analysis, see [18].

Example 2 placebo-controlled rIFN-g in the treatment of CGD
Example 2 consists of a placebo-controlled trial of rIFN-g in the treatment of CGD [44,45]. One hundred twenty-eight patients from 13 centres were tracked for around 1 year. The survival times are the recurrent infection times of each patient. Censoring occurred at the last observation for all patients except one, who experienced a severe infection on the date he left the study. About 63% of the data were censored, and the recurrent infection times for a given patient are likely to be correlated. Also, each patient belongs to one of the 13 centres, so the correlation may be attributed to a patient effect and a centre effect. The variables are the recurrent infection or censoring times of each patient (tstart-tstop), 128 patients (id), 13 centers (center), rIFN-g or placebo (treat), a censoring indicator (1, infection observed; 0, censored) for status, date of randomization (random), information about patients at study entry (sex, age, height, weight), the pattern of inheritance (inherit), use of steroids at study entry (1, yes; 0, no) (steroids), use of prophylactic antibiotics at study entry (1, yes; 0, no) (propylac), categorization of the centers into four groups (hos.cat), and observation number within subject (enum). We fit a multilevel log-normal frailty model with two frailties and a single covariate, treatment. Here, the two frailties are random center and patient terms, with their structures given in Eq. 8.

Fig. 9 Estimated frailty in the kidney infection data
Here v_1 is the center frailty and v_2 is the patient frailty. To test the need for a random component, i.e., α_1 = 0 or α_2 = 0, we use the deviance based on the adjusted profile likelihood, −2p_{β,v}(h), and fit four models: M1, Cox's model without frailty (α_1 = α_2 = 0); M2, a model with center frailty only; M3, a model with patient frailty only; and M4, a model with both frailties. The deviance difference between M1 and M2 indicates the absence of random center effects, and the deviance difference between M2 and M4 (10.71) shows the necessity of random patient effects. In addition, the deviance difference between M1 and M3 (14.49) supports the random patient effect with or without random center effects. All three criteria (cAIC, mAIC, and rAIC) also choose M3 among M1-M4. Figure 10 presents the estimated frailty effects of this study. The explanation of model evaluation with these three criteria can be found in the Appendix.
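Because α = 0 lies on the boundary of the parameter space, the deviance difference for a single frailty variance is commonly referred to a 50:50 mixture of χ²₀ and χ²₁ rather than a plain χ²₁ (an assumption of this sketch, not a statement about Albatross's internals). A small Python computation using the deviance differences quoted above:

```python
import math

def chi2_sf_1df(d):
    """P(chi^2_1 > d), via the complementary error function."""
    return math.erfc(math.sqrt(d / 2.0))

def frailty_variance_pvalue(deviance_diff):
    """Boundary test for a single frailty variance: the reference
    distribution is the mixture 0.5*chi^2_0 + 0.5*chi^2_1."""
    return 0.5 * chi2_sf_1df(deviance_diff)

# Deviance differences quoted in the text.
for d in (10.71, 14.49):
    print(f"deviance diff {d}: p = {frailty_variance_pvalue(d):.6f}")
```

Both differences give very small p-values, consistent with the conclusion that the random patient effect is needed.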

Support vector machine using h-likelihood
Support Vector Machine (SVM) is a supervised learning method for classification and regression using non-linear boundaries in feature space [4,[46][47][48][49]. We present an SVM based on the HGLM method [4], in which the match between the observed response and the model output is optimized. The model output is a prognostic function, also referred to as a utility function and, more specifically in medical research, as the prognostic index or health function, defined in Eq. 9. Here u : R^d → R, w is a vector of d unknown parameters, and ϕ(x) is a transformation of the covariates x. In non-linear SVM, the transformation is handled with the "kernel trick" [50][51][52], which computes the scalar product in the form of a kernel function. The SVM model is fitted under a constraint function that yields the proper margin; the constraint function of the SVM model is shown in Eq. 10, with a regularization parameter γ ≥ 0. Ranking errors are captured by the slack variables ξ_ij ≥ 0, and v_ij is an indicator of whether two subjects with observations i and j are comparable: it is 1 if i and j are comparable and 0 otherwise. In this paper, we use a dataset on the anatomy of Abdominal Aortic Aneurysm (AAA) and Aortic Anatomy on Endovascular Aneurysm Repair (EVAR); see [53]. The variables are described as follows: Y = sex, X1 = age, X2 = aortic type (1, fusiform; 2, saccular), X3 = proximal neck length, X4 = proximal neck diameter, X5 = proximal neck angle, and X6 = maximum aneurysmal sac. For the simulation, we set the response variable to follow a Bernoulli distribution with 500 observations. The simulation shows that HGLM performs better, with high sensitivity, because some of the data represent a binary case that SVM cannot handle well.
For more information on the step-by-step construction of the hierarchical likelihood for SVM, see [4]. Table 9 shows that the use of Ensemble SVM reduces the accuracy and other measures. When mixture patterns exist in the predictors, Ensemble SVM improves SVM performance in two scenarios. Ensemble SVM performed almost as well as logistic regression, except for sensitivity. There is a decrease in the performance of the Ensemble SVM model under multicollinearity and linear combinations among the predictor variables. Meanwhile, HGLM still performs well, as shown in Fig. 11a and b.
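For readers unfamiliar with the baseline method, a plain linear SVM (not the HGLM-based ranking SVM of Eq. 10) can be sketched as subgradient descent on the regularized hinge loss; the toy data are hypothetical and linearly separable:

```python
def train_linear_svm(X, y, lam=0.01, epochs=300, lr=0.05):
    """Linear SVM via subgradient descent on the regularized hinge loss
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w.x_i + b))), y_i in {-1, +1}."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]
        gb = 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j in range(d):          # hinge active: subgradient term
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def svm_predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1

# Hypothetical, linearly separable toy data with two predictors.
X = [[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
     [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
acc = sum(svm_predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(f"training accuracy = {acc:.2f}")
```

A non-linear SVM replaces the inner products with a kernel function, which is the kernel trick referred to above.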

Using h-likelihood for structural equation models (HSEMs)
Structural equation modeling (SEM) is widely used in multidisciplinary fields [41]. To account for model uncertainty, [42,43] apply frequentist model averaging in structural equation modeling; related work includes non-linear structural equation modeling for ordinal data [44], partial least squares [45,46], and robust non-linear SEM with interactions between exogenous and endogenous latent variables [47]. With an example, we present a SEM method based on h-likelihood, called "hsem" [52].
As an application, [48] uses two-level dynamic SEM on longitudinal data in Mplus. In this paper, we explicitly discuss how to use h-likelihood in SEM. The dataset consists of 50 repetitions on regular time scales for 100 individuals. The response variable, the urge to smoke, is on a standardized scale, so 0 corresponds to the average and the standard deviation is 1. Smokers can feel drastic mood changes, from happiness to sadness, which can indicate depression; for those addicted, smoking can calm the mind for a moment. The second model answers the question of whether the urge to smoke is predicted by the latent person mean, person-mean-centered depression, and the person-mean-centered lag-1 urge to smoke. The model in Eq. 11 is given by urge_ti = β_0i + β_1i Time_ti + e_ti, with β_0i = γ_00 + u_0i and β_1i = γ_10 + u_1i. Figure 12 shows the path diagram produced by hsem. The same standard progression path across all respondents is captured by the fixed effects, whereas the person-specific random effects capture each participant's deviation from the expected path. The path diagram represents the within-level and between-level models. As a further resource, we provide the R package hsem [54].
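The random-coefficient structure of Eq. 11 can be mimicked in a small Python sketch: with noise-free, hypothetical values for γ00, γ10 and the person-specific deviations, per-person least-squares lines recover the individual coefficients exactly, and their average recovers the fixed effects. This illustrates the model's structure only, not the hsem fitting algorithm:

```python
def ols_line(t, y):
    """Least-squares intercept and slope of y on t."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    stt = sum((ti - mt) ** 2 for ti in t)
    sty = sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
    slope = sty / stt
    return my - slope * mt, slope

# Hypothetical noise-free data from urge_ti = (g00 + u0i) + (g10 + u1i) * t.
g00, g10 = 0.5, -0.1
u0 = [0.3, -0.3, 0.1, -0.1]          # person-specific intercept deviations
u1 = [0.05, -0.05, 0.02, -0.02]      # person-specific slope deviations
t = [float(k) for k in range(10)]

betas = []
for i in range(len(u0)):
    y = [(g00 + u0[i]) + (g10 + u1[i]) * ti for ti in t]
    betas.append(ols_line(t, y))      # recovers (beta_0i, beta_1i) exactly

g00_hat = sum(b0 for b0, _ in betas) / len(betas)
g10_hat = sum(b1 for _, b1 in betas) / len(betas)
print(f"g00_hat = {g00_hat:.3f}, g10_hat = {g10_hat:.3f}")
```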

Short review of Albatross Analytics
This paper explains how the Albatross software can be used for multidisciplinary data processing. We offer model estimation, model-checking plots, and visualization features to interpret information. Through data and R code, the examples reveal the benefit of HGLM-class models for particular statistical cases. The h-likelihood approach is distinct from both classical frequentist and Bayesian frameworks, as it encompasses inference for both fixed and random unknowns. Its main benefit over classical frequentist approaches is that it allows inference about unobservable quantities, such as random effects, and hence predictions of unobserved outcomes. Once a statistical model has been selected for the research, the likelihood guides the inferential statistics. In connection with the development of the h-likelihood, a wide variety of likelihoods have been established. Some arise in the theory of GLMs and GLMMs, e.g., quasi-likelihood and extended quasi-likelihood; others are used to link conventional frequentist estimation and Bayesian inference under terms such as joint likelihood, extended likelihood, and adjusted profile likelihood. We demonstrate that the h-likelihood is a fundamental likelihood from which the marginal and REML likelihoods are derived. Extended likelihood theory underlies the h-likelihood framework and shows how it follows from classical and Bayesian probability.
Predictions of random effects have many applications. A typical example: given repeated observations of hospital admissions for patients, the survival prospects of those patients can be predicted. This might involve a survival experiment with unexpected outcomes for patients, where the variance of the estimates indicates the variability of the random effect.
In the first few examples, we demonstrated analyses using normal, log-normal, gamma, Poisson, and binomial HGLMs. Binary models were compared across application areas, and the dhglm package is fast and yields consistent results. Descriptions using HGLMs, including structured dispersion, were given above. We also covered models with correlated random effects and structural equation models.
The likelihood implies that probability models offer an effective way to interpret the data if the model is accurate, so it is necessary to validate the model to verify the interpretation of the results. That said, it can be hard to ascertain all the model assumptions. In simulations using h-likelihood, the normality assumption on random effects in binary GLMMs and SEMs can give serious biases if it is incorrect.

H-likelihood theory for the frailty model
The h-likelihood gives a straightforward way of handling non-parametric baseline hazards. Under the frailty model, the h-likelihood is defined by h = Σ_ij log f(y_ij, δ_ij | u_i; β, λ_0) + Σ_i log f(v_i; α). Here the functional form of λ_0(t) is unknown; hence, we consider the cumulative baseline hazard Λ_0(t) to be a step function with jumps at the observed event times, where y_(k) is the k-th smallest distinct event time among the y_ij and λ_0k = λ_0(y_(k)). Thus, we proposed using the profile h-likelihood with λ_0 eliminated, h* = h|_{λ_0 = λ̂_0} = Σ_ij log f*(y_ij, δ_ij | u_i; β), where λ̂_0 maximizes h for given (β, v), so that the nuisance parameters λ_0k no longer appear.
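The profiling step can be sketched concretely: for fixed linear predictors η (including the log-frailties), maximizing h over the baseline-hazard jumps gives a Breslow-type estimator, λ̂_0k = d_k / Σ_{j at risk at y_(k)} exp(η_j). The following Python computes these jumps for hypothetical data; with η = 0 (β = 0 and frailty 1), the jumps reduce to d_k divided by the number at risk:

```python
import math

def breslow_hazard(times, events, eta):
    """Profile (Breslow-type) estimate of the baseline-hazard jumps:
    lambda0_k = d_k / sum of exp(eta_j) over subjects at risk at y_(k)."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    jumps = {}
    for tk in event_times:
        d_k = sum(1 for t, d in zip(times, events) if t == tk and d == 1)
        denom = sum(math.exp(e) for t, e in zip(times, eta) if t >= tk)
        jumps[tk] = d_k / denom
    return jumps

# Hypothetical right-censored data; eta = 0 means beta = 0 and frailty = 1.
times = [2.0, 3.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1, 1]
eta = [0.0] * 5
jumps = breslow_hazard(times, events, eta)
print(jumps)
```

Substituting these jumps back into h is what removes the λ_0k from the estimation of β and the frailty parameters.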