
Generalized Estimating Equations Boosting (GEEB) machine for correlated data

Abstract

Rapid development in data science has made machine learning and artificial intelligence the most popular research tools across various disciplines. While numerous articles have shown decent predictive ability, little research has examined the impact of complex correlated data. We aim to develop a more accurate model under repeated measures or hierarchical data structures. Therefore, this study proposes a novel algorithm, the Generalized Estimating Equations Boosting (GEEB) machine, which integrates the gradient boosting technique into the benchmark statistical approach for correlated data, the Generalized Estimating Equations (GEE). Unlike previous gradient boosting approaches that utilize all input features, we randomly select some input features when building the model to reduce predictive errors. The simulation study evaluates the predictive performance of the GEEB, GEE, eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) across several hierarchical structures with different sample sizes. Results suggest that the new strategy GEEB outperforms the GEE and demonstrates higher predictive accuracy than the SVM and XGBoost in most situations. An application to a real-world dataset, the Forest Fire Data, also revealed that the GEEB reduced mean squared errors by 4.5% to 25% compared to the GEE, XGBoost, and SVM. This research also provides a freely available R function that implements the GEEB machine effortlessly for longitudinal or hierarchical data.

Introduction

Correlated data arise when the dependent variable is measured repeatedly across multiple dimensions, as in longitudinal, clustered, spatial, or multilevel data [17]. Correlated data frequently occur in medicine, public health, and other research fields and require specialized statistical approaches to handle the complex correlation structure, avoid potential estimation biases, and ensure accurate estimation.

Generalized Estimating Equations, also known as GEE [6, 12, 13], is a statistical method initially proposed by Liang and Zeger [11]. It extends the framework of Generalized Linear Models (GLMs) and relaxes the assumption of independence among observations, making it particularly useful for handling correlated data. One of the strengths of GEE is that it only assumes a "working correlation matrix" to describe the correlation structure among observations. This characteristic reduces the need for restrictive distributional assumptions, and parameter estimation remains consistent under mild regularity conditions even when the working correlation matrix is misspecified [7]. The mixed-effects model is another standard statistical method for correlated data [10, 15]. The difference between the two is that GEE estimates population-average effects, whereas the mixed-effects model estimates individual random effects.

The widespread adoption of modern technology and digitization has created vast amounts of data in recent decades. Coupled with the general availability of digital services and advancements in storage technologies, this has led to a massive accumulation of data. This phenomenon has driven the necessity for big data analysis, which in turn has fueled the flourishing development of machine learning (ML). ML aims to construct computer models that automatically optimize their performance based on past experience to predict future outcomes. Conceptually, one can think of ML as having numerous candidate model settings and using a large amount of past experiential data to guide the computer in finding the setting that optimizes the performance indicators [9].

Currently, popular supervised ML models include eXtreme Gradient Boosting (XGBoost) [3], Random Forest [8], and the Support Vector Machine (SVM) [4]. As one of the fastest-growing technologies, ML has been applied by data scientists in various fields such as finance, marketing, computer vision, aerospace, and biomedicine, and every discipline is increasingly utilizing ML for prediction and decision support. Over the past two decades, ML has made significant progress and achievements, from academic research to the most popular commercial applications [16].

Thus, in the era of big data, ML provides advanced choices for data analysis in addition to traditional statistical methods. Numerous studies have compared statistical methods with ML, and some articles indicate that ML outperforms statistical methods [2, 14, 19]. However, limited research discusses more complex data structures, such as correlated or hierarchical ones. Therefore, we propose a novel algorithm, the Generalized Estimating Equations Boosting (GEEB) machine, which integrates the gradient boosting technique from ML into the GEE. With this hybrid algorithm, we aim to create a new ML model that deals with correlated data, avoids biased estimates, and provides more accurate predictions.

Materials and methods

Generalized Estimating Equations Boosting Machine

GEE

The core of the new machine is the GEE. Here, we briefly introduce the fundamentals of the GEE. Assume that \(y_{ij}\), \(i=1,\ldots,k\), \(j=1,\ldots,n_i\), represents the jth response of the ith subject, which has a vector of covariates \(x_{ij}\). There are \(n_i\) measurements on subject i, and the maximum number of measurements per subject is T. Let the responses of the ith subject be \(y_i=[y_{i1},\ldots,y_{in_i}]'\) with corresponding means \(\mu_i=[\mu_{i1},\ldots,\mu_{in_i}]'\).

As in generalized linear models (GLMs), the marginal mean \(\mu_{ij}\) of the response \(y_{ij}\) is related to a linear predictor through a link function, \(g(\mu_{ij})=x_{ij}'\beta\), and the variance of \(y_{ij}\) depends on the mean through a variance function \(\nu(\mu_{ij})\).

Solving the generalized estimating equations yields the estimate of the parameter vector:

$$S(\beta)=\sum_{i=1}^{k}\frac{\partial \mu_i'}{\partial \beta}V_i^{-1}\left(y_i-\mu_i(\beta)\right)=0$$

where \(V_i\) is the working covariance matrix of \(Y_i\).

The GEE method requires only the mean and covariance of \(Y_i\); it does not need the full specification of the joint distribution of the correlated responses. This feature of the GEE is desirable and leads to a convenient analysis, since the joint distribution of noncontinuous outcome variables involves high-order associations and is complicated to specify. In addition, the regression parameter estimates remain consistent even when the working covariance is incorrectly specified. However, the GEE approach can lead to biased estimates when missing responses depend on previous responses; weighted generalized estimating equations under the missing-at-random (MAR) assumption can then provide unbiased estimates.
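For readers who wish to fit such a marginal model directly, the sketch below uses the geepack package in R; the data frame longdata and the variables y, x1, x2, and id are illustrative placeholders rather than objects from this study.

    library(geepack)

    # Fit a marginal (population-average) model with an exchangeable working correlation.
    # Rows belonging to the same cluster are assumed to be contiguous in longdata.
    fit_gee <- geeglm(y ~ x1 + x2,
                      id     = id,              # cluster identifier column in longdata
                      family = gaussian,        # identity link, constant variance function
                      corstr = "exchangeable",  # working correlation structure
                      data   = longdata)
    summary(fit_gee)                            # coefficients with robust (sandwich) standard errors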

Working correlation matrix

Suppose \(R_i(\alpha)\) is an \(n_i\times n_i\) "working" correlation matrix specified by the vector of parameters \(\alpha\). The covariance matrix of \(Y_i\) is modeled as:

$$V_i=\varphi A_i^{1/2}W_i^{-1/2}R_i(\alpha)W_i^{-1/2}A_i^{1/2}$$

where \(A_i\) is an \(n_i\times n_i\) diagonal matrix whose jth diagonal element is \(\nu(\mu_{ij})\), and \(W_i\) is an \(n_i\times n_i\) diagonal matrix whose jth diagonal element is the weight \(w_{ij}\). If the analysis is unweighted, \(w_{ij}=1\) for all i and j. If \(R_i(\alpha)\) is the true correlation matrix of \(Y_i\), then \(V_i\) is the true covariance matrix of \(Y_i\).

In practice, the working correlation matrix is usually unknown and must be estimated within the iterative fitting process, using the current value of the parameter vector \(\beta\) to compute appropriate functions of the Pearson residuals \(e_{ij}=\frac{y_{ij}-\mu_{ij}}{\sqrt{\nu(\mu_{ij})/w_{ij}}}\).

If the working correlation matrix is the identity matrix I, the GEE reduces to the independence estimating equations. The common working correlation structures are tabulated in the SAS documentation [20, 21].
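To make the covariance formula concrete, the following sketch builds an exchangeable working correlation matrix and the implied working covariance \(V_i\) for a single cluster. It only illustrates the algebra above with made-up values; it is not code from any estimation routine, and all object names are illustrative.

    # Working covariance V_i = phi * A^{1/2} W^{-1/2} R(alpha) W^{-1/2} A^{1/2} for one cluster
    ni    <- 4                       # cluster size n_i
    alpha <- 0.3                     # exchangeable correlation parameter
    phi   <- 1.5                     # dispersion parameter
    nu_mu <- rep(1, ni)              # variance function nu(mu_ij); equals 1 for the Gaussian family
    w_i   <- rep(1, ni)              # observation weights (1 = unweighted)

    R_i       <- matrix(alpha, ni, ni); diag(R_i) <- 1   # exchangeable working correlation R_i(alpha)
    A_half    <- diag(sqrt(nu_mu))                       # A_i^{1/2}
    W_invhalf <- diag(1 / sqrt(w_i))                     # W_i^{-1/2}
    V_i       <- phi * A_half %*% W_invhalf %*% R_i %*% W_invhalf %*% A_half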

GEEB

GEEB has a hybrid design with GEE and gradient boosting. The strength of the GEEB algorithm lies in its ability to handle complex relationships in correlated data while optimizing the algorithm further through the gradient-boosting technique.

Gradient boosting is a prevailing ML algorithm that applies the idea of gradient descent to an ensemble of learners. In the gradient boosting framework, each iteration builds a new learner based on the prediction errors of the learner from the previous iteration. When the iterations meet the stopping rule, the predictions from all iterations are weighted and summed to obtain the final prediction. More specifically, because the loss function measures the discrepancy between the predicted and actual values, the goal of gradient boosting is to move progressively towards the minimum of the loss function by reducing the prediction errors in each iteration. The negative gradient of the loss function computed at the previous iteration and the learning rate determine the direction and step size of this movement in the current iteration. In other words, the algorithm minimizes the loss function and improves the overall prediction accuracy by updating the model along the negative gradient.

The GEEB algorithm has four components: an initial setting and three computational steps. The initial stage defines the input dataset and the loss function. The input dataset contains n samples with some input features (\(x_i\)) and a continuous output feature (\(y_i\)), represented as \(\left\{(x_i,y_i)\right\}_{i=1}^{n}\). When the dependent variable (\(y\)) is continuous, the algorithm defines the loss function as a scaled version of the squared error: \(L\left(y_i,F(x_i)\right)=\frac{1}{2}\left(y_i-F(x_i)\right)^2\). Here, \(L(\cdot,\cdot)\) represents the loss function, \(y_i\) denotes the actual outcome of the \(i\)th data point, and \(F(x_i)\) represents the model's predicted outcome for the \(i\)th data point. This modification of the loss function is crucial as it facilitates more straightforward computation of gradients in subsequent steps.
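The factor of 1/2 is convenient because the negative gradient of this loss with respect to the current prediction reduces exactly to the ordinary residual used in the boosting updates below:

$$-\frac{\partial L\left(y_i,F(x_i)\right)}{\partial F(x_i)}=-\frac{\partial}{\partial F(x_i)}\left[\frac{1}{2}\left(y_i-F(x_i)\right)^2\right]=y_i-F(x_i)$$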

After defining the input dataset and the loss function, the first step is to compute the initial prediction value of the model. The initial prediction is a constant chosen so that the iterations start at the most efficient point. This value, denoted \(F_0(x)\), is defined as \(F_0(x)=\arg\min_{F(x_i)}\sum_{i=1}^{n}L(y_i,F(x_i))\). Thus, with the loss function \(L\left(y_i,F(x_i)\right)=\frac{1}{2}\left(y_i-F(x_i)\right)^2\), the initial prediction is the mean of the observed outcome values, \(F_0(x)=\frac{\sum_{i=1}^{n}y_i}{n}\).
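This choice follows from minimizing the total loss over a single constant prediction \(c\); setting the derivative to zero yields the sample mean of the outcome:

$$\frac{d}{dc}\sum_{i=1}^{n}\frac{1}{2}\left(y_i-c\right)^2=-\sum_{i=1}^{n}\left(y_i-c\right)=0\quad\Longrightarrow\quad F_0(x)=c=\frac{1}{n}\sum_{i=1}^{n}y_i$$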

The second step of the algorithm involves iterative model updates repeated M times. Each iteration consists of five parts: (A), (B), (C), (D), and (E), and the iterations continue until the \(M\)th iteration is completed. Unlike conventional gradient boosting machines that incorporate all input features, the GEEB randomly selects some input features when building the model to reduce predictive errors. In part (A), a subset of the data is created from the dataset by randomly selecting some features to generate the model. This subset is denoted \(\left\{(x_i',y_i)\right\}_{i=1}^{n}\), where \(x_i'\) represents the selected features and \(y_i\) the target feature. Part (B) calculates the residuals between the true outcomes and the predictions on the subset from (A): \(r_{i,m}=-\left[\frac{\partial}{\partial F(x_i')}L\left(y_i,F(x_i')\right)\right]_{F(x_i')=F_{m-1}(x_i')}=y_i-F_{m-1}(x_i')\), \(i=1,\ldots,n\). In part (C), the residuals from (B) are used to fit the generalized estimating equations, obtaining the coefficients for this iteration. Part (D) uses the coefficients obtained in (C) to predict the residuals for the entire dataset in this iteration. Finally, part (E) updates the model: the predicted residuals for this iteration, denoted \(p_{i,m}\), are multiplied by the learning rate (\(\nu\)) and added to the previous overall prediction \(F_{m-1}(x_i)\), giving the prediction for this iteration, \(F_m(x_i)\).

After M iterations, the third step is to output the overall prediction results of the model, denoted as \({F}_{M}\left({x}_{i}\right)\).

The following presents the GEEB algorithm.

Algorithm (figure a): Generalized Estimating Equations Boosting Machine
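To make the three steps and parts (A)-(E) concrete, the following is a minimal R sketch of the training loop for a Gaussian outcome with a single clustering level, using geepack::geeglm as the GEE base learner. It illustrates the algorithm as described above; it is not the authors' released geebm() code, and all function and variable names are illustrative.

    library(geepack)

    # Minimal sketch of the GEEB training loop (Gaussian outcome, one clustering level).
    # 'data' must be sorted so that rows from the same cluster are contiguous.
    geeb_sketch <- function(data, response, cluster, features,
                            M = 100, feature_rate = 0.5, lrate = 0.1,
                            corstr = "exchangeable") {
      y     <- data[[response]]
      F_hat <- rep(mean(y), nrow(data))             # Step 1: F_0(x) = mean of the outcome
      fits  <- vector("list", M)

      for (m in seq_len(M)) {                       # Step 2: M boosting iterations
        # (A) randomly select a proportion of the input features
        sel <- sample(features, size = max(1, floor(feature_rate * length(features))))
        # (B) negative gradient of the 1/2 squared-error loss = current residuals
        dat_m            <- data[, sel, drop = FALSE]
        dat_m$r          <- y - F_hat
        dat_m$cluster_id <- data[[cluster]]
        # (C) fit a GEE to the residuals with the selected features
        fit <- geeglm(reformulate(sel, response = "r"), id = cluster_id,
                      family = gaussian, corstr = corstr, data = dat_m)
        # (D) predict the residuals for the entire dataset with the fitted coefficients
        p <- drop(model.matrix(reformulate(sel), dat_m) %*% coef(fit))
        # (E) update the overall prediction with the learning rate
        F_hat     <- F_hat + lrate * p
        fits[[m]] <- list(fit = fit, features = sel)
      }
      list(F0 = mean(y), boosters = fits, lrate = lrate, fitted = F_hat)  # Step 3: F_M(x)
    }

A full implementation such as the authors' geebm() additionally supports multiple nested clustering levels, optional standardization, and feature-importance output.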

Materials

This section describes the data source, the detailed background of the simulation studies, and the application to a real-world dataset, the Forest Fire Data.

Data source

The Institute of Digital Research and Education (IDRE) at the University of California, Los Angeles, published the HDP simulated data in July 2012 [1]. The HDP data are based on a large-scale lung-cancer-related study. The correlation arises from the hierarchical structure, which consists of three nested levels: patients are nested within doctors, and doctors are nested within hospitals. Researchers can adjust the numbers of hospitals, doctors, and patients according to their research requirements.

The simulated data include nine different outcomes. For this research, we select tumor size, which follows a Gaussian distribution, as the target output feature. The patient-related features include age (Age), marital status (Married), family history (FamilyHx), smoking history (SmokingHX), sex (Sex), cancer stage (CancerStage), length of stay in hospital (LengthofStay), white blood cell count (WBC), red blood cell count (RBC), body mass index (BMI), interleukin-6 (IL6), and C-reactive protein (CRP). At the doctor level, there are the doctor ID (DID), the doctor's experience (Experience), the quality of the school where the doctor trained (School), and the number of lawsuits (Lawsuits). Note that the "School" variable has two categories (top vs. average); due to its highly imbalanced distribution, a simulated sample may contain only one category, which introduces errors when estimating the GEE with the R package. Therefore, we did not include the "School" variable in the simulation studies. The hospital-related features include the hospital ID (HID) and Medicaid at the given hospital (Medicaid). Consequently, there are 17 predictors in the simulation study. Note that not all 17 features are related to the target response, so some of these features are noise for the predictive models.

Real-world data

Cortez and Morais [5] published the Forest Fire Data. This dataset covers the period from January 2000 to December 2003 and includes records of forest fires in the Montesinho Natural Park in northeastern Portugal. Multiple institutions collected the data, which encompass numerous variables, such as the Fire Weather Index (FWI) components [22] and spatial, temporal, and weather-related information.

We generated a new feature, "season," to construct the third hierarchical level. In this way, the day is nested within the month, and the month is nested within the season. Note that "season" was derived from the "month" variable. The four seasons are (1) Spring, from March to May; (2) Summer, from June to August; (3) Autumn, from September to November; and (4) Winter, from December to February. As a result, there are 14 variables (refer to Table 1), and the sample size is 517.

Table 1 The preprocessed Forest Fire Data attributes

Experiment

We examine the consistency and accuracy of (1) the new GEEB machine, (2) the statistical method GEE, (3) the SVM, and (4) XGBoost under different hierarchical structures and sample sizes. Regarding the hyperparameters of the SVM, the kernel is a radial basis function (RBF). The XGBoost hyperparameters are objective = 'reg:squarederror', nrounds = 50, and verbose = 0. Other hyperparameter settings yielded similar results. When developing the simulation studies, we tuned the SVM and XGBoost with different parameter settings, such as max_depth and learning_rate for XGBoost, and the results could be slightly better or worse. In each scenario, there are 1000 repetitions. Each repetition could be tuned to its own best parameter setting, but the comparisons are similar. Therefore, we used the most common settings for the SVM and XGBoost.
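For reference, these settings correspond to calls of roughly the following form, assuming the e1071 and xgboost R packages; x_train and y_train are illustrative objects holding the training predictors and outcome.

    library(e1071)
    library(xgboost)

    # SVM regression with a radial basis function (RBF) kernel
    svm_fit <- svm(x = x_train, y = y_train, kernel = "radial")

    # XGBoost regression with the hyperparameters stated above
    xgb_fit <- xgboost(data = as.matrix(x_train), label = y_train,
                       objective = "reg:squarederror", nrounds = 50, verbose = 0)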

Simulation studies

The core concept of the GEEB machine is the random selection of features. Although a validation set within the training data could identify the optimal proportion, we instead evaluate eight fixed proportions (30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%) of randomly selected features in the GEEB, denoted as Model 1 to Model 8. The new approach is all the more promising if the GEEB outperforms other methods without an optimal proportion obtained by tenfold cross-validation. Table 2 presents detailed model settings.

Table 2 Settings of Model 1–Model 8

Next, we examine the impact of sample size on the predictions. Three sample sizes were defined: (A) small, (B) medium, and (C) large. Because a random variation of \(\pm 1\) is applied to the numbers of doctors and patients, the sample size is approximated by its mean value. The three approximate sizes, with the minimum and maximum in brackets, are (A) small: 200 [72, 405], (B) medium: 500 [200, 720], and (C) large: 1000 [650, 1500]. Because the results are consistent from 72 to 1,500 patients, we did not increase the sample size beyond 1,500. Since small samples introduce more statistical issues than big data, the simulation study suggests that a minimum of 72 subjects is sufficient to implement the GEEB.

Additionally, we explore the impact of different hierarchical data structures and consider five scenarios for each of the three sample sizes: (1) a structure with a small number of hospitals, some doctors, and more patients, in a ratio of 1:3:5; (2) a structure with a small number of hospitals, more doctors, and an even larger number of patients, in a more disparate ratio of 1:5:9; (3) an equal number of hospitals, doctors, and patients, in a balanced ratio of 1:1:1; (4) a structure with many hospitals, some doctors, and a small number of patients, in a ratio of 5:3:1; and (5) a structure with even more hospitals, some doctors, and a few patients, in a more extreme ratio of 9:5:1. Tables 3, 4, 5 show a detailed summary of the hierarchical structures.

Table 3 Parameter settings of a small sample size
Table 4 Parameter settings of a medium sample size
Table 5 Parameter settings of a large sample size

Finally, the research framework diagram in Fig. 1 indicates the study flow. First, the input datasets undergo data preprocessing; each dataset is then split into an 80% training set used to build the models, and the remaining 20% forms the testing set used to evaluate predictive performance. Lastly, we record the predictive performance in every scenario.
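A minimal sketch of this 80/20 split in R, where dataset is an illustrative placeholder for a simulated or real data frame:

    set.seed(1)                                  # illustrative seed for reproducibility
    idx   <- sample(nrow(dataset), size = floor(0.8 * nrow(dataset)))
    train <- dataset[idx, ]                      # 80% training set for model building
    test  <- dataset[-idx, ]                     # 20% testing set for evaluating predictions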

Fig. 1: Research framework diagram

Regarding other hyperparameters, the learning rate of the GEEB model is set to 0.1, and the number of iterations is set to 100 since convergence takes many iterations. Lastly, the simulation of the HDP dataset is repeated 1000 times for each parameter setting.

The correlation matrix varies in every repetition of each scenario. Take scenario A1, for example: 11-13 patients are nested within 6 to 8 doctors, who are nested within two hospitals. At the doctor level, the dimension of the correlation matrix could be \(11\times 11\), \(12\times 12\), or \(13\times 13\); at the hospital level, it is \(6\times 6\), \(7\times 7\), or \(8\times 8\). With 1000 repetitions, scenario A1 alone yields 6000 correlation matrices, and the 15 scenarios (A1-A5, B1-B5, C1-C5) together generate many thousands more. Additional file 1: Table S1 displays one correlation matrix of scenario A1 at the doctor level, and Additional file 1: Table S2 shows an example dataset from scenario C1.

Application to a real-world data

In the Forest Fire Data, the preprocessing step standardized all input features. We split the data into 80% training and 20% testing to evaluate performance. Due to the stochastic nature of data splitting and feature selection, we report results both for a single run and averaged over 100 runs. Tables 13, 14 and Fig. 2 present the analysis results and feature importance.

Fig. 2: Visualization of feature importance of the Forest Fire Data (single run)

Evaluation metric

The Mean Squared Error (MSE) measures predictive performance since the output feature is Gaussian. The MSE is defined as:

$$MSE=\frac{1}{n}{\sum }_{i=1}^{n}{({y}_{i}-F\left({x}_{i}\right))}^{2}$$

The formula is the average squared difference between the true values (\(y_i\)) and the predicted values (\(F(x_i)\)) over a dataset of size n. A smaller MSE therefore indicates that the model's predictions are closer to the actual values, meaning better performance, and vice versa.
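Equivalently, in R, the metric can be computed as below, where y and y_hat are illustrative vectors of observed and predicted values:

    mse <- function(y, y_hat) mean((y - y_hat)^2)   # average squared prediction error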

Results

Computer simulations and the application to the Forest Fire Data were implemented in R version 4.2.1 [18]. A computer with an 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50 GHz and 16 GB of RAM on a 64-bit platform was used to run all experiments.

Simulation results

The new GEEB model performed well in the simulation results. The GEEB with all features selected is consistently superior to the benchmark GEE in Tables 6, 7, 8, and with a suitable random feature selection proportion, the GEEB improves further.

Table 6 Simulation results of GEEB and GEE with a small sample size
Table 7 Simulation results of GEEB and GEE with a medium sample size
Table 8 Simulation results of GEEB and GEE with a large sample size

We discovered that three factors impact model performance: the proportion of random feature selection, the sample size, and the hierarchical structure. The GEEB and GEE perform better in A1-A2, B1-B2, and C1-C2 in Tables 6, 7, 8; these scenarios have fewer hospitals, a moderate number of doctors, and more patients.

Models within the same hierarchical structure perform better as the dataset grows (B1 vs. A1, C1 vs. B1, etc.). Although the optimal feature selection proportion varies with sample size and hierarchical structure, Model 3 (with a 50% random feature selection) demonstrates consistently satisfying predictive results (Tables 6, 7, 8). Therefore, without tenfold cross-validation to search for the optimal ratio, we adopted Model 3 for the GEEB in Tables 9, 10, 11, 12. The GEEB with Model 3 shows a lower MSE than the SVM and XGBoost (Tables 9, 10, 11, 12).

Table 9 Simulation results of GEEB with Model 3, GEE, XGBoost, and SVM concerning the hierarchical structure
Table 10 Simulation results of GEEB with Model 3, GEE, SVM, and XGBoost with a small sample size
Table 11 Simulation results of GEEB with Model 3, GEE, SVM, and XGBoost with a medium sample size
Table 12 Simulation results of GEEB with Model 3, GEE, SVM, and XGBoost with a large sample size

In addition to the main results above, the subsequent sections provide details on three aspects. First, we explore the impact of different random feature selection proportions in the GEEB. Second, we examine the influence of sample size. Lastly, we compare the predictive ability of the GEEB, GEE, XGBoost, and SVM.

The proportion of random feature selection

In the small sample scenarios (A1 to A5), Table 6 shows that the GEEB consistently outperforms the GEE even when all features are included (Model 8). Therefore, the boosting technique improved the accuracy compared to the conventional statistical approach. The optimal model is Model 1, with a random feature selection proportion of 30% (\(\mathrm{MSE}_{M1,A1}=99.01835516\), \(\mathrm{MSE}_{M1,A2}=95.70521460\), \(\mathrm{MSE}_{M1,A3}=101.94608001\), \(\mathrm{MSE}_{M1,A4}=107.97686232\), \(\mathrm{MSE}_{M1,A5}=102.39713434\)).

Moving on to the medium sample scenarios (B1 to B5) in Table 7, Model 2 (40%) and Model 3 (50%) exhibit the most favorable GEEB results (\(\mathrm{MSE}_{M2,B1}=94.59880891\), \(\mathrm{MSE}_{M2,B2}=93.35856425\), \(\mathrm{MSE}_{M2,B3}=96.29459362\), \(\mathrm{MSE}_{M3,B4}=98.90918803\), \(\mathrm{MSE}_{M2,B5}=98.26241921\)). In Table 8, the large sample scenarios (C1 to C5) indicate that the optimal GEEB models are Model 3 (50%) and Model 4 (60%) (\(\mathrm{MSE}_{M4,C1}=93.71736127\), \(\mathrm{MSE}_{M3,C2}=93.48196280\), \(\mathrm{MSE}_{M3,C3}=94.90869049\), \(\mathrm{MSE}_{M3,C4}=96.02364853\), \(\mathrm{MSE}_{M4,C5}=95.51834122\)). These results demonstrate the impact of the proportion of random feature selection on the GEEB across various sample sizes.

In Tables 6, 7, 8, Model 8 (GEEB with all features) outperforms the GEE even without random feature selection. Furthermore, in the small sample scenarios (Table 6), the MSEs of Models 1 to 7 are smaller than those of Model 8, demonstrating that random feature selection further improves the GEEB. Depending on the sample size, the optimal random feature selection proportion falls between 30 and 60%. Tables 7 (B2-B4) and 8 (C1-C5) show a curved pattern: when the proportion is too small, Model 1 or 2 may have a higher MSE than Model 8 of the GEEB. For example, in B2, the best setting is Model 2 (40% of features, \(\mathrm{MSE}_{M2,B2}=93.35856425\)), whereas the lower selection proportion of Model 1 (30% of features, \(\mathrm{MSE}_{M1,B2}=93.53285239\)) performs worse than Model 2 and even underperforms the GEE (\(\mathrm{MSE}_{GEE,B2}=93.43940338\)). Similarly, in C1, the best setting is Model 4 (60% of features, \(\mathrm{MSE}_{M4,C1}=93.71736127\)), whereas the lower selection proportion of Model 1 (30% of features, \(\mathrm{MSE}_{M1,C1}=94.02394485\)) performs worse than Model 4 and even underperforms the GEE (\(\mathrm{MSE}_{GEE,C1}=93.72922513\)). We suspect that 30% or 40% of the features carry insufficient information for accurate predictions.

In conclusion, although each sample size has its own optimal range of random feature selection proportions, we recommend the following: (1) the random feature selection proportion is a hyperparameter of the proposed GEEB machine; the simulation studies indicate an optimal proportion between 30 and 60%, and one can find the optimal value through techniques such as a validation set or k-fold cross-validation. (2) A 50% selection proportion consistently demonstrates stable and excellent performance across all scenarios; when it is impossible to employ any validation technique, this research suggests setting the random feature selection proportion to 50%. Thus, the GEEB function in R has a default random feature selection proportion of 50%.

The following two sections (Tables 9, 10, 11, 12) will explore the impact of sample size and hierarchical structure using the GEEB with Model 3.

Sample size

In Table 9, all models, including the GEEB, GEE, SVM, and XGBoost, perform better as the sample size increases. The GEEB with Model 3 yields \(\mathrm{MSE}_{M3,A1}=100.08083977\) in the small sample A1, \(\mathrm{MSE}_{M3,B1}=94.62586832\) in the medium sample B1, and \(\mathrm{MSE}_{M3,C1}=93.71743381\) in the large sample C1, indicating reduced errors as the sample size increases. For the GEE, the MSE is \(\mathrm{MSE}_{GEE,A1}=100.53386987\) in A1, \(\mathrm{MSE}_{GEE,B1}=94.71798990\) in B1, and \(\mathrm{MSE}_{GEE,C1}=93.72922513\) in C1. For the SVM, it is \(\mathrm{MSE}_{SVM,A1}=112.12767754\) in A1, \(\mathrm{MSE}_{SVM,B1}=97.14783528\) in B1, and \(\mathrm{MSE}_{SVM,C1}=90.40457748\) in C1. For XGBoost, it is \(\mathrm{MSE}_{XGBoost,A1}=112.45269936\) in A1, \(\mathrm{MSE}_{XGBoost,B1}=97.25146238\) in B1, and \(\mathrm{MSE}_{XGBoost,C1}=89.93082567\) in C1. In summary, under the same hierarchical structure, increasing the sample size leads to a decreasing trend in MSE, indicating improved predictive ability. In addition, the SVM and XGBoost are more sensitive to sample size variations than the GEEB and GEE.

Hierarchical structure

Five types of hierarchical structures show the impact on the MSE. In Tables 10, 11, 12, there is a significant contrast between the first (ratio = 1:3:5) and fourth (ratio = 5:3:1), as well as between the second (ratio = 1:5:9) and fifth (ratio = 9:5:1) hierarchical structures across all scenarios with the same dataset size. All models perform better in the first and second hierarchical structures, where there are fewer hospitals, a moderate number of doctors, and the largest number of patients. In contrast, they do not perform well in the fourth and fifth hierarchical structures, where the data have more hospitals and very few patients. For the third hierarchical structure, with an equal number of hospitals, doctors, and patients, the predictive performance of all models lies between these two extremes.

Note that the second and fifth hierarchical structures have more extreme ratios, implying a greater disparity among the numbers of hospitals, doctors, and patients. These settings investigate whether the models would exhibit more extreme MSEs. However, relatively notable differences were observed only in the small sample scenario A; we did not see significant differences in the medium and large sample scenarios. The XGBoost and SVM are more sensitive to the hierarchical structure than the GEEB and GEE.

In the small sample under the first and fourth structures, the GEEB with Model 3 yields \(\mathrm{MSE}_{M3,A1}=100.08083977\) and \(\mathrm{MSE}_{M3,A4}=109.33152319\); in the medium sample, \(\mathrm{MSE}_{M3,B1}=94.62586832\) and \(\mathrm{MSE}_{M3,B4}=98.90918803\); in the large sample, \(\mathrm{MSE}_{M3,C1}=93.71743381\) and \(\mathrm{MSE}_{M3,C4}=96.02364853\). The GEE behaves similarly (\(\mathrm{MSE}_{GEE,A1}=100.53386987\), \(\mathrm{MSE}_{GEE,A4}=109.78894802\), \(\mathrm{MSE}_{GEE,B1}=94.71798990\), \(\mathrm{MSE}_{GEE,B4}=98.95471693\), \(\mathrm{MSE}_{GEE,C1}=93.72922513\), \(\mathrm{MSE}_{GEE,C4}=96.04458325\)). However, the SVM and XGBoost show larger differences across hierarchical structures. For instance, between the first and fourth structures, the SVM yields \(\mathrm{MSE}_{SVM,A1}=112.12767754\) and \(\mathrm{MSE}_{SVM,A4}=133.14030348\) in the small sample, \(\mathrm{MSE}_{SVM,B1}=97.14783528\) and \(\mathrm{MSE}_{SVM,B4}=121.15671344\) in the medium sample, and \(\mathrm{MSE}_{SVM,C1}=90.40457748\) and \(\mathrm{MSE}_{SVM,C4}=114.01629456\) in the large sample. The trend is similar for XGBoost (\(\mathrm{MSE}_{XGBoost,A1}=112.45269936\), \(\mathrm{MSE}_{XGBoost,A4}=128.82575872\), \(\mathrm{MSE}_{XGBoost,B1}=97.25146238\), \(\mathrm{MSE}_{XGBoost,B4}=110.17394078\), \(\mathrm{MSE}_{XGBoost,C1}=89.93082567\), \(\mathrm{MSE}_{XGBoost,C4}=99.61407890\)). Thus, we observe significant differences between the first and fourth structures for the SVM and XGBoost. These two ML models may not be well suited to hierarchical data, leading to their inferior performance in the fourth hierarchical structure, which has a relatively large number of clusters.

Furthermore, as the number of clusters increases, the predictive performance of the SVM and XGBoost clearly deteriorates in both the medium and large samples, indicating worse performance with more clusters and more pronounced inter-cluster correlations. In contrast, the GEEB and GEE demonstrate consistent and satisfying predictions.

The SVM and XGBoost can outperform the GEEB in scenarios C1 and C2 with larger sample sizes (Table 12). The reason may be that, with fewer clusters and relatively large data sizes, the SVM and XGBoost can overlook the inter-cluster correlation structure and treat the data as approximately independent.

Results of the Forest Fire Data

According to the simulation study, the GEEB model with 50% random feature selection demonstrates consistent and improved predictive performance. Therefore, we adopt the GEEB with Model 3 as the default model in real-world data analysis.

The Forest Fire Data analyses for each method are shown in Table 13, indicating that the GEEB with Model 3 exhibits the minimum MSE compared to the GEE, SVM, and XGBoost. The MSE of the GEE is approximately 4.5% higher than that of the GEEB, and the MSE of XGBoost is about 25.2% higher. Therefore, the GEEB offers a decent improvement over the benchmark statistical model for correlated data and over the most promising ML approaches, the SVM and XGBoost. The feature importance of the GEEB with Model 3 is in Table 14, and the visualization is in Fig. 2.

Table 13 Applications to the Forest Fire Data
Table 14 Feature importance of the Forest Fire Data (single run/averaged 100 runs)

Discussions

In this study, we propose a new ML strategy named the Generalized Estimating Equations Boosting (GEEB) machine. This method integrates the gradient boosting technique with the gold standard model for correlated data, the GEE. Computer simulations confirmed that the GEEB outperforms the GEE and, in most situations, performs better than the well-known SVM and XGBoost. The GEEB also demonstrates the best prediction for the Forest Fire Data. Therefore, our findings suggest that (1) the gradient boosting technique enables the GEEB to outperform the GEE model, and (2) although XGBoost and SVM are known for their excellent predictive ability, they may not perform well with hierarchical data, because treating subjects as independent fails to capture the correlation structure.

This research also provides the code that computes all research results. The geebm() is an R function that implements the GEEB machine. This function has seven arguments: formula, id, iteration, feature_rate, lrate, standardize, and data. Note that formula must be specified in the format "response ~ predictors" to list the predictors (input features) and response variable (output feature) in the dataset. id is a vector that identifies the clusters and can support multiple levels, arranged in the order of the hierarchical structure. iteration is an integer representing the number of iterations, with a default of 100. feature_rate represents the proportion of random feature selection; when set to 1, it uses all features, and by default it is set to 0.5, using half of the features. lrate is a hyperparameter for the learning rate, with a default value of 0.1. standardize determines whether features are standardized, and by default no standardization is performed. data is used to input the training dataset. An example call for training the model with the Forest Fire Data in this study is shown below.
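Formatted as an R script, the example call reads as follows; Dataset denotes the preprocessed Forest Fire training data frame.

    # Train the GEEB machine on the Forest Fire Data with the default settings
    geebm(area ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + wind + rain + day,
          id           = c("season", "month"),  # clustering levels in hierarchical order
          iteration    = 100,                   # number of boosting iterations
          feature_rate = 0.5,                   # proportion of randomly selected features
          lrate        = 0.1,                   # learning rate
          standardize  = T,                     # standardize the input features
          data         = Dataset)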

The GEEB is also inspired by the Random Forest, which combines the Bootstrap with random feature selection; however, incorporating the Bootstrap did not yield satisfying results. Our research aims to compare the GEEB with other benchmark ML and statistical models on correlated data. When deciding which ML models to include, we primarily considered models that are widely discussed and used in academia and industry and that frequently win data science competitions. Therefore, we included XGBoost, the SVM, and the Random Forest. However, when conducting the simulation studies, we discovered that the randomForest package in R cannot handle categorical predictors with more than 53 levels. Since each doctor within each hospital is treated as a separate category, and there are other categorical features such as sex and cancer stage, this limit is exceeded. Therefore, we had to exclude the Random Forest from the comparison, as it does not apply to this hierarchical dataset.

By integrating the concept of gradient boosting with the statistical model GEE as the base learner, combined with random feature selection, the proposed GEEB approach has several advantages. Compared to XGBoost, an ML model that also utilizes gradient boosting, the GEEB performs better in most scenarios because it handles correlated data more effectively, resulting in improved predictive performance. Furthermore, compared to using the GEE model alone, the 1000 simulations show that the GEEB achieves more accurate predictions.

Limitations

There are some limitations to this study. First, the simulated data are based on the publicly available HDP dataset from UCLA, and we did not further investigate the impact of the strength of the correlations among variables, from weak to high.

Second, this study investigates the prediction of tumor size. The underlying techniques of the GEEB, the GEE and gradient boosting, both support classification tasks; however, this research focused on regression tasks only, so the performance of the GEEB for other types of output features is unknown.

Here, we only roughly categorized the structures into three types: (1) fewer hospitals, some doctors, and more patients; (2) more hospitals, some doctors, and fewer patients; and (3) an equal number of hospitals, doctors, and patients. The study also considered varying sample sizes, including more extreme cases. Consequently, we examined five hierarchical structures: 1:3:5, 1:5:9, 1:1:1, 5:3:1, and 9:5:1. While this design provides initial insights, more detailed hierarchical structures could be explored in future work.

Future research topics

The corresponding theoretical work and simulation studies with dichotomous, ordinal, or nominal correlated outcomes are promising topics for future research.

Availability of data and materials

Not applicable. It is a computer simulation study.

Code availability

R codes are included as Additional file 2 for publication.

References

  1. Bruin J (2012). R advanced: simulating the hospital doctor patient dataset. https://stats.idre.ucla.edu/r/codefragments/mesimulation/. Accessed 27 Oct 2022.

  2. Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning; 2006.

  3. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.

  4. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.

  5. Cortez P, Morais AJR. A data mining approach to predict forest fires using meteorological data. 2007.

  6. Diggle P, Diggle PJ, Heagerty P, Liang K-Y, Zeger S. Analysis of longitudinal data. Oxford: Oxford University Press; 2002.

  7. Hardin JW, Hilbe JM. Generalized estimating equations. Boca Raton: CRC Press; 2012.

  8. Ho TK. Random decision forests. Proceedings of 3rd international conference on document analysis and recognition, IEEE. 1995.

  9. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science. 2015;349(6245):255–60.

  10. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–74.

  11. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.

  12. Lipsitz SR, Fitzmaurice GM, Orav EJ, Laird NM. Performance of generalized estimating equations in practical situations. Biometrics. 1994;50:270–8.

  13. Lipsitz SR, Kim K, Zhao L. Analysis of repeated categorical data using generalized estimating equations. Stat Med. 1994;13(11):1149–63.

  14. Louppe G, Wehenkel L, Sutera A, Geurts P. Understanding variable importances in forests of randomized trees. Adv Neural Inf Process Syst. 2013;26.

  15. McCulloch CE, Searle SR. Generalized, linear, and mixed models. Hoboken: Wiley; 2004.

  16. OpenAI. ChatGPT (Mar 14 version) [Large language model]. 2023. https://chat.openai.com/chat.

  17. Song PX-K. Correlated data analysis: modeling, analytics, and applications. Berlin: Springer; 2007.

  18. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2014. http://www.R-project.org/.

  19. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):18.

  20. SAS Institute Inc. SAS/ACCESS® 9.4 interface to ADABAS: reference. Cary, NC: SAS Institute Inc; 2013.

  21. Stokes ME, Davis CS, Koch GG. Categorical data analysis using SAS. 3rd ed. Cary: SAS Institute Inc; 2012.

  22. Taylor SW, Alexander ME. Science, technology, and human factors in fire danger rating: the Canadian experience. Int J Wildland Fire. 2006;15(1):121–35.

Funding

This work is supported by the National Science and Technology Council (Grant IDs: 111-2118-M-A49-005 and 112-2118-M-A49-003).

Author information

Authors and Affiliations

Authors

Contributions

YWW first draft manuscript, simulations, analyses, Tables, Figures, and R code. HCY provided critical comments on the methods and manuscript. YHC provided critical comments on the methods and manuscript. CYG proposed the research concept, supervised the project, additional simulations, modified and completed the manuscript, and revisions.

Corresponding author

Correspondence to Chao-Yu Guo.

Ethics declarations

Ethics approval and consent to participate

Not applicable. It is a computer simulation study.

Consent for publication

All authors read and approved the final manuscript for publication.

Competing interests

The authors declare that they have no conflicts of interest related to the subject matter or materials discussed in this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

An Example of A1. Table S2. An Example of C1.

Additional file 2.

R code implementing the geebm() function; its arguments and an example call are described in the Discussion section above.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, YW., Yang, HC., Chen, YH. et al. Generalized Estimating Equations Boosting (GEEB) machine for correlated data. J Big Data 11, 20 (2024). https://doi.org/10.1186/s40537-023-00875-5
