Early prediction of MODS interventions in the intensive care unit using machine learning



Multiple organ dysfunction syndrome (MODS) is one of the leading causes of death in critically ill patients. MODS is the result of a dysregulated inflammatory response that can be triggered by various causes. Owing to the lack of an effective treatment for patients with MODS, early identification and intervention are the most effective strategies. Therefore, we developed a variety of early warning models whose prediction results can be interpreted by Kernel SHapley Additive exPlanations (Kernel-SHAP) and reversed by diverse counterfactual explanations (DiCE). These models allow us to predict the probability of MODS 12 h in advance, quantify the risk factors, and automatically recommend relevant interventions.


We used various machine learning algorithms to complete the early risk assessment of MODS and used a stacked ensemble to improve prediction performance. The Kernel-SHAP algorithm was used to quantify the positive and negative factors contributing to individual prediction results, and the DiCE method was then used to automatically recommend interventions. Model training and testing were based on the MIMIC-III and MIMIC-IV databases; the sample features used in model training included the patients’ vital signs, laboratory test results, test reports, and data related to ventilator use.


The customizable model called SuperLearner, which integrated multiple machine learning algorithms, had the highest screening validity: its Youden index (YI), sensitivity, accuracy, and utility_score on the MIMIC-IV test set were 0.813, 0.884, 0.893, and 0.763, respectively, all of which were the maximum values among the eleven models. The area under the curve of the deep–wide neural network (DWNN) model on the MIMIC-IV test set was 0.960 and its specificity was 0.935, both of which were the maximum values among all models. Kernel-SHAP combined with SuperLearner showed that the minimum value of the Glasgow Coma Scale (GCS) in the current hour (OR = 0.609, 95% CI 0.606–0.612), the maximum MODS score corresponding to GCS in the past 24 h (OR = 2.632, 95% CI 2.588–2.676), and the maximum MODS score corresponding to creatinine in the past 24 h (OR = 3.281, 95% CI 3.267–3.295) were generally the most influential factors.


The MODS early warning model based on machine learning algorithms has considerable application value, and the prediction performance of SuperLearner is superior to that of SubSuperLearner, DWNN, and eight other common machine learning models. Considering that the attribution analysis of Kernel-SHAP is a static analysis of the prediction results, we introduce the DiCE algorithm to automatically recommend counterfactuals that reverse the prediction results, which will be an important step towards the practical application of automatic MODS early intervention.


Multiple organ dysfunction syndrome (MODS) is defined as an acute and potentially reversible dysfunction of two or more organs induced by various factors. The incidence of MODS in adult patients admitted to the ICU is 11–40% [1, 2]. MODS is very common in critically ill patients, with a mortality rate of 44–76% [3,4,5]. The MODS mortality rate is related to the number of affected organs and the severity of each organ dysfunction. When 2–4 organs fail, the mortality rate is 10–40%, whereas it is up to 50% in patients with cumulative five-organ failure and 100% in patients with cumulative seven-organ failure [6, 7].

MODS has a high mortality rate owing to the lack of effective treatment, so early warning and intervention in the development of MODS is of great clinical importance [8]. Bose et al. determined the tags per minute in the past 24 h according to the IPSCC and the MODS standard proposed by Proulx et al., added waveform data features extracted using the spectral clustering method, and used four algorithms to complete the early warning of MODS in children; the area under the ROC curve (AUC) of the random forest algorithm was ≥ 0.91, and the median early warning time was 22.7 h for the random forest model and 37 h for the XGBoost model [9,10,11,12]. In addition to conventional model evaluation standards such as AUC, a new standard for evaluating prediction performance, the utility score, has been proposed; it is based on the view that warnings issued too early or too late are not helpful [13]. Li et al. were the first to apply utility scores to the performance comparison of sepsis early warning models; inspired by the utility scores for sepsis, we proposed utility scores for MODS [14]. In addition, the sample labels were determined based on the MODS diagnostic criteria [15, 16], and the features for model training were derived from the clinical and scoring characteristics [17,18,19]. Characteristics usually refer to quantities computed mathematically from features, such as the mean and variance. In recent years, many studies have shown that stacked ensemble algorithms have greater predictive advantages in clinical decision support. Fan et al. used a stacked ensemble algorithm to classify normal and delayed hospitalizations in 1599 critically ill patients with spinal cord injuries [20]. They selected the three best-performing classifiers from 91 base classifiers and then combined them into a stacked ensemble model using logistic regression.
The AUC of the stacked ensemble model was 0.864, which was 6% higher than that of the non-ensemble learning classifier. Ko et al. developed a stacked ensemble algorithm called EDRnet based on 361 COVID-19 patients in Wuhan and applied the model to predict the death of 106 patients in three Korean medical institutions. The results demonstrated that EDRnet provided 100% sensitivity, 91% specificity, and 92% accuracy [21]. The stacked ensemble algorithms [20, 21] achieved high prediction performance and generalization ability because they fully utilized base classifiers such as XGBoost and lightGBM, which are well suited to large sample sizes with many features, and the Bayesian neural network algorithm, which is suitable for small sample sets and effectively prevents overfitting. By integrating different classifiers, their individual disadvantages could be avoided, and the generality of the stacked ensemble algorithm could be considerably improved [22,23,24].

We have been exploring how to integrate a customizable neural network algorithm and non-neural network algorithms into a stacked adaptive algorithm with higher prediction performance. First, we need to develop a neural network algorithm with high prediction performance. Generally, the deeper the neural network, the higher the prediction performance of the model; however, excessive depth often causes the gradients of the loss function with respect to the weights to vanish or diverge during backpropagation. To solve the problem of gradient divergence, a batch normalization layer was added to the DWNN model used in this study; the batch normalization layer normalizes the data before the input of each layer, which helps eliminate gradient divergence and accelerates model training, particularly for time-consuming stacked ensemble model training [25]. DWNN directly feeds the outputs of the middle layers to the “Concatenate” layer (Fig. 2), which solves the problem of vanishing weight gradients in the layers far from the output. The loss function can propagate the gradient directly to the farthest layer, so training is no longer influenced by the network depth [26]. Second, the stacked ensemble enables the integration of multiple models with sub-optimal predictive performance into a model with optimal performance. A reasonable integration of multiple models can improve the generalization ability of the model. We use the Q-learning algorithm to determine the specific learners used by the stacked ensemble [27]. Interpreting the prediction results of the stacked ensemble algorithm helps screen high-impact features and assists doctors in making intervention decisions. Kernel-SHAP is a combination of the linear LIME and Shapley value algorithms and can be applied to all machine learning models, but it cannot provide a scheme to reverse the outcome [28, 29]. Ramaravind et al. proposed the DiCE (Diverse Counterfactual Explanations) algorithm, which provides various counterfactuals to reverse a prediction result [30]. Jia et al. used the DiCE method to recommend reversal schemes for extubation failure in the ICU, thereby considerably reducing the risk of subjective intervention by doctors [31].

Compared with other relevant studies, our research has the following advantages. (1) Other scholars determine the composition of the stacked ensemble based on experience or by simple exhaustive search, which lacks theoretical support, whereas we use the Q-learning algorithm to determine it. (2) DiCE rests on two assumptions that other scholars have not tried to address; we propose practical methods such as rule screening, which greatly mitigate these defects of DiCE. (3) We propose the utility_score of MODS for the first time, which provides a fairer and more objective evaluation of model performance (Additional file 1: Section S1).

We mainly discussed the design of model development, data processing, and the idea behind creating the stacked ensemble algorithm. We also discussed the Q-table for Q-learning, the prediction results of the models, the analysis of risk factors for groups and individuals, how to integrate neural network and non-neural network models, how to use Kernel-SHAP in practical applications, and how to mitigate the limitations of the DiCE algorithm.


Research program

As shown in Fig. 1, the study population comprised 19,124 patients from 2001 to 2012 in the MIMIC-III data set and 10,520 patients from 2013 to 2018 in the MIMIC-IV data set. In both data sets, the included patients were ≥ 65 years old, were admitted to the ICU for the first time for over 24 h, had a missing feature rate of less than 30%, and had a clear outcome label. An entry is a sample whose candidate features come from one patient in an hourly time window; there were 2,389,841 entries for the 19,124 patients in MIMIC-III and 1,179,718 entries for the 10,520 patients in MIMIC-IV. The label of an entry is whether MODS occurred in the hourly window obtained by shifting the current window forward by 12 h. When the label is the occurrence of MODS, the entry is positive; otherwise, it is negative. We randomly selected 80% of the entries corresponding to the patients in MIMIC-III as the training set and used the remaining 20% as the internal validation set. The entries corresponding to the 10,520 patients in MIMIC-IV were used as the test set. After five-fold cross-validation training of the 11 models, including SuperLearner, the models were evaluated on the internal validation set and the test set using indicators that included AUC and accuracy. Finally, Kernel-SHAP and DiCE were used to interpret the prediction results on the test set and to recommend interventions.

Fig. 1 Flow chart of the research program

Feature selection and data processing

The candidate features were derived from the clinical features and scoring characteristics. For the cohort data of each patient, forward or backward filling was used to impute missing values of clinical features such as total bilirubin and creatinine. The scoring characteristics of the MODS organs are calculated from the clinical features, so they were not imputed separately (Additional file 1: Section S2). To accelerate the convergence of model training, the entries were standardized: the mean, μ, and standard deviation, σ, were obtained from the MIMIC-III training set, and the same μ and σ were then applied to normalize the MIMIC-III and MIMIC-IV test sets.
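The two preprocessing steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the study's pipeline: the function names, the toy creatinine series, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev

def fill_forward_backward(values):
    """Fill None gaps with the previous observation, then back-fill a leading gap."""
    filled = list(values)
    last = None
    for i, v in enumerate(filled):          # forward pass
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward pass for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled

def fit_standardizer(train_column):
    """Estimate mu and sigma on the training column only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant features
    return mu, sigma

def standardize(column, mu, sigma):
    """Apply previously fitted statistics to any split (train, validation, test)."""
    return [(v - mu) / sigma for v in column]

# Hypothetical hourly creatinine values for one patient (None = not measured).
creatinine = [None, 1.0, None, None, 2.0, None]
filled = fill_forward_backward(creatinine)

# Statistics come from the training split and are reused on the test split.
train = [1.0, 2.0, 3.0]
mu, sigma = fit_standardizer(train)
test_scaled = standardize([2.0, 4.0], mu, sigma)
```

Reusing the training-set μ and σ on the test data keeps information from the test set out of the fitted preprocessing.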

Machine learning

The deep neural network algorithm generally has a higher prediction performance than a single non-neural network algorithm, so we designed the MODS early warning algorithm, DWNN, strictly according to the requirements of neural network modeling (Fig. 2). In addition, we developed eight conventional non-neural network machine learning algorithms based on the same MIMIC-III training set: KNN, lightgbm, Decision Tree, Naïve Bayes, random forest, XGBoost, AdaBoosting, and Logistic Regression. These algorithms underwent five-fold cross-validation and parameter optimization. The results on the MIMIC-III test set showed that the DWNN model was one of the three best-performing models (Table 4). This study used the sklearn wrapper interface of keras to encapsulate DWNN as an sklearn classifier and then used the interface module from sklearn to integrate multiple models. The stacked ensemble model could then be passed directly to the Kernel-SHAP algorithm interface to produce the general and individual sample interpretations. The stacked ensemble enabled the integration of multiple models with sub-optimal predictive performance into a model with optimal performance. A limitation of the stacked ensemble algorithm is the difficulty of optimizing the integration framework. Two stacked ensemble schemes were used in this study (Fig. 3).

Fig. 2 Structure of DWNN

Fig. 3 Frameworks of stacked ensembles

Figure 3A is a two-layer stacked ensemble structure called SuperLearner, composed of base learners and a meta-learner, where base_1 to base_8 are the base learners and the predictive probabilities of the base learners are used as the input features of the meta-learner. Figure 3B is a customizable three-layer stacked ensemble structure called SubSuperLearner. The predictive probabilities of base1_1 and base1_2 are used as the input features of meta1_1. The predictive probabilities of base1_3 and base1_4 are used as the input features of meta1_2. The predictive probabilities of meta1_1, meta1_2, base2_1, and base2_2 are used as the input features of meta2. In Fig. 3A, B, each learner can be selected from nine algorithms (excluding SuperLearner and SubSuperLearner). If the exhaustive method were used to determine which algorithm each learner should be in order to maximize the AUC of SuperLearner or SubSuperLearner, 9^9 = 387,420,489 configurations would have to be evaluated. Exhaustive selection is therefore too time-intensive to be practical. Instead, we used the Q-learning algorithm with an ε-greedy strategy to determine each base learner and meta-learner [32]. The pseudo-code of the Q-learning algorithm and its detailed description are provided in Additional file 1: Section S3.

Statistical analysis

SPSS 20.0 software was used for statistical analysis. Measurement data that conformed to a normal distribution were expressed as mean ± standard deviation (\(\overline{x} \pm s\)), and inter-group comparisons were performed using the \(t\) test; count data were compared between groups using the \(\chi^2\) test. The dependent variable was whether patients had MODS, and the independent variables were the indices screened by univariate analysis of the factors influencing MODS. The related indices were screened using multivariate logistic regression analysis, and differences were considered statistically significant when P < 0.001.

We used Python 3.6 for the analysis and loaded third-party modules such as sklearn, XGBoost, torch, shap, and imblearn. The AUC, accuracy, sensitivity, specificity, YI, and utility_score of the SuperLearner, SubSuperLearner, and DWNN models on the internal validation set and test set were calculated. To reduce the random error of a single trial, each trial was repeated 10 times. For the definition of utility_score, please refer to Additional file 1: Section S1.
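For reference, the conventional screening indicators named above can be computed as in the following sketch. It uses the standard definitions (YI = sensitivity + specificity − 1; AUC as the fraction of positive/negative pairs ranked correctly); the counts and scores are made-up illustrative values, and the utility_score is omitted because its definition is given only in Additional file 1.

```python
def screening_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Youden index from a confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden_index": sensitivity + specificity - 1,
    }

def pairwise_auc(positive_scores, negative_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Illustrative counts and scores only; these are not results from the study.
metrics = screening_metrics(tp=80, fp=10, tn=90, fn=20)
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```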


Univariate factor logistic regression analysis between groups

The scores for the six organs of MODS can be calculated from clinical features (Additional file 1: Section S2). Candidate features include the mean, maximum, and minimum of all clinical features, such as total bilirubin and creatinine, as well as the scoring characteristics for the six organs of MODS within an hourly window. The total number of candidate features was 37. After univariate logistic regression analysis, 21 features with statistical significance (P < 0.001) were selected (Table 1). The factors with larger contribution values were currentMinGcs, gcs24HoursMods, and renal24HoursMods (Table 2).

Table 1 Comparison of characteristics between MODS and non-MODS groups
Table 2 Logistic regression analysis of factors affecting MODS occurrence

Q-table and ROC curves

From the 2,389,841 samples of MIMIC-III, 50,0000 samples were randomly selected for Q-learning training, and the StackingClassifier of the third-party module mlxtend was used to build the stacked ensemble model. The parameter values were \(\gamma = 0.85\), \(\varepsilon = 0.9\), and \(\alpha = 0.1\). Q-learning training was terminated after 5000 iterations, yielding the Q-tables of SuperLearner and SubSuperLearner (Fig. 4).
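The following sketch illustrates how tabular Q-learning with an ε-greedy policy could fill such a Q-table, using the stated values of γ, ε, and α. It is not the authors' procedure (their pseudo-code is in Additional file 1: Section S3). Several points are assumptions: states are taken to be the nine learner positions of SuperLearner, actions are the nine candidate algorithms, ε is treated as the exploration probability, and `toy_reward` is a hypothetical stand-in for the AUC of a trained candidate ensemble.

```python
import random

LEARNERS = ["KNN", "lightgbm", "DecisionTree", "NaiveBayes", "RandomForest",
            "XGBoost", "AdaBoosting", "LogisticRegression", "DWNN"]
POSITIONS = ["base_%d" % i for i in range(1, 9)] + ["meta"]
GAMMA, EPSILON, ALPHA = 0.85, 0.9, 0.1

def toy_reward(config):
    """Hypothetical reward in [0, 1]; the study uses the candidate model's AUC instead."""
    diversity = len(set(config[:-1])) / (len(config) - 1)
    meta_bonus = 1.0 if config[-1] == "LogisticRegression" else 0.0
    return 0.7 * diversity + 0.3 * meta_bonus

def train_q_table(episodes, seed=0):
    rng = random.Random(seed)
    # One row per position (state), one column per candidate learner (action).
    q = [[0.0] * len(LEARNERS) for _ in POSITIONS]
    for _ in range(episodes):
        config = []
        for s in range(len(POSITIONS)):
            row = q[s]
            if rng.random() < EPSILON:   # explore (assumed meaning of epsilon)
                a = rng.randrange(len(LEARNERS))
            else:                        # exploit the current estimate
                a = max(range(len(LEARNERS)), key=row.__getitem__)
            config.append(LEARNERS[a])
            is_last = s == len(POSITIONS) - 1
            # The reward is only known once every position has been filled.
            reward = toy_reward(config) if is_last else 0.0
            future = 0.0 if is_last else max(q[s + 1])
            row[a] += ALPHA * (reward + GAMMA * future - row[a])
    return q

q_table = train_q_table(episodes=5000)
```

Because the reward here depends on the whole configuration rather than on a single position, the table should be read as a heuristic ranking of learners per position, not as an exact value function.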

Fig. 4 Q-table of SuperLearner and SubSuperLearner

The trained Q-learning algorithm stores in the Q-table the selection information of the learners that obtained the maximum reward, so we only needed to find the learner corresponding to the maximum value of each row in the Q-table to determine which learner was selected for each rectangular box in Fig. 3A, B. As shown in Fig. 4, the color of AdaBoosting was the darkest in the row for state base_1, so the agent should select AdaBoosting as the base learner at the base_1 position. Thus, the SuperLearner structure is base_1 selecting AdaBoosting, base_2 selecting Naïve Bayes, base_3 selecting DWNN, base_4 selecting lightgbm, base_5 selecting KNN, base_6 selecting XGBoost, base_7 selecting decision tree, base_8 selecting random forest, and meta selecting logistic regression. As shown in Fig. 4B, the SubSuperLearner structure is base1_1 selecting DWNN, base1_2 selecting decision tree, base1_3 selecting lightgbm, base1_4 selecting AdaBoosting, base2_1 selecting random forest, base2_2 selecting XGBoost, meta1_1 selecting Naïve Bayes, meta1_2 selecting KNN, and meta2 selecting logistic regression.
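Reading the structure off a Q-table amounts to taking the column with the largest value in each row. A minimal sketch, with a small hypothetical table in place of the values in Fig. 4:

```python
def select_learners(q_rows, positions, learners):
    """Pick, for each position (row), the learner (column) with the largest Q-value."""
    selection = {}
    for position, row in zip(positions, q_rows):
        best_column = max(range(len(row)), key=row.__getitem__)
        selection[position] = learners[best_column]
    return selection

# Hypothetical 3 x 3 Q-table; the real tables in Fig. 4 have nine columns.
toy_learners = ["AdaBoosting", "NaiveBayes", "LogisticRegression"]
toy_positions = ["base_1", "base_2", "meta"]
toy_q = [
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.3],
    [0.2, 0.3, 0.7],
]
chosen = select_learners(toy_q, toy_positions, toy_learners)
```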

After SuperLearner and SubSuperLearner were determined, 10 independent trials were completed, and the ROC curves obtained on the MIMIC-IV test set are shown in Fig. 5. The sensitivity values corresponding to different specificities in Fig. 5 are listed in Table 3.

Fig. 5 ROC curves

Table 3 Sensitivity values corresponding to different specificities

As shown in Table 4, DWNN had the best performance among the nine algorithms (excluding SuperLearner and SubSuperLearner). Figure 5 shows the ROC curves of SuperLearner, SubSuperLearner, DWNN, and logistic regression, and the maximum AUC of DWNN was 0.9602. Table 3 shows that SuperLearner achieved the highest sensitivity when the specificity was ≥ 85%, and DWNN achieved the highest sensitivity when the specificity was ≤ 80%.

Table 4 Comparison of the prediction ability of each model

Model evaluation

The AUC, accuracy, sensitivity, specificity, YI, and utility_score of the various machine learning models are listed in Table 4. SuperLearner obtained the highest screening validity in terms of accuracy, sensitivity, YI, and utility_score, whereas DWNN achieved the maximum AUC and specificity.

Explanation and intervention of SuperLearner

SuperLearner is a stacked ensemble model that contains eight non-neural network algorithms and one deep neural network algorithm. The Kernel-SHAP algorithm could quantify the contribution of the general (population) factors (Fig. 6) and local (individual) factors of SuperLearner (Fig. 7).

Fig. 6 Contribution of group factors

Fig. 7 Contribution of individual factors

Figure 6A shows that currentMinGcs, gcs24HoursMods, and renal24HoursMods have the highest population contribution values, which is consistent with Table 2. Blue in Fig. 6B indicates that the observed value of the feature is small, and red indicates that it is large. The abscissa is the SHAP value. Generally, the larger the SHAP value, the greater the MODS risk. Figure 6B shows that currentMinGcs is negatively correlated with MODS, and the corresponding OR in Table 2 is also less than 1. The factors used to calculate MODS scores are positively correlated with MODS occurrence, and the corresponding OR values in Table 2 are also greater than 1.
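The link between an odds ratio and the direction of an association follows from the logistic regression identity OR = exp(β), so OR < 1 corresponds to a negative coefficient and OR > 1 to a positive one. The sketch below applies this to the three odds ratios reported in the abstract. Pairing them with the feature names currentMinGcs, gcs24HoursMods, and renal24HoursMods is our reading of the text, and the standard-error formula assumes a Wald interval on the log scale.

```python
import math

def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient implied by an odds ratio: beta = ln(OR)."""
    return math.log(odds_ratio)

def standard_error_from_ci(lower, upper, z=1.96):
    """Approximate SE of beta, assuming a symmetric 95% Wald interval on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# (OR, CI lower, CI upper) as reported; the feature-name mapping is an assumption.
reported = {
    "currentMinGcs": (0.609, 0.606, 0.612),
    "gcs24HoursMods": (2.632, 2.588, 2.676),
    "renal24HoursMods": (3.281, 3.267, 3.295),
}

betas = {name: log_odds_coefficient(or_) for name, (or_, _, _) in reported.items()}
std_errors = {name: standard_error_from_ci(lo, hi) for name, (_, lo, hi) in reported.items()}
```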

Causality cannot be derived directly from statistically determined risk factors. Therefore, we study the correlation between risk factors and the predicted outcomes, and we regard contributions as correlations. An entry is a sample. We used simple random sampling to select from the test set one sample with a predicted outcome of MODS and one sample with a predicted outcome of no MODS. In Fig. 7, the abscissa represents the risk factor contribution value (SHAP value), and the ordinate represents the risk factor (feature) with its observed value. If the contribution value is positive, the factor is positively correlated with the prediction result and is colored red; otherwise, the factor is negatively correlated with the prediction result and is colored blue. It should be noted that all samples in the test set are involved in the SHAP analysis, from which the risk factor contribution values of each sample are obtained. f(x) in Fig. 7 contains the sum of the SHAP values of all risk factors in the current sample (Additional file 1: Section S4); “E[f(x)] = 0.004” in Fig. 7 means that the mean value of f(x) over all samples, including the training set and the above test samples, is 0.004. As shown in Fig. 7A, only currentCardiovascularMods is an unfavorable factor, and the rest are favorable factors. It was the 229th hour of ICU admission (currentHour = 229) for patient A, who had a mild cardiovascular disease (currentCardiovascularMods = 1). However, the patient's consciousness was clear, and conversational and motor abilities were normal (currentMinGcs = 0 and gcs24HoursMods = 0) on the final day. Accordingly, patient A did not develop MODS at 241 h, which was consistent with the patient’s symptoms. Figure 7B shows that it was the 60th hour of ICU admission (currentHour = 60) for patient B. Patient B had three adverse factors: in the past 24 h, patient B was unconscious and had severe impairments of movement and the respiratory system (gcs24HoursMods = 8, currentMinGcs = 3, and respiratory24HoursMods = 2). Accordingly, patient B developed MODS at 72 h, which was consistent with the patient’s symptoms.
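The statement that f(x) collects the SHAP values of all risk factors reflects the additivity property of Shapley values: the contributions sum to the difference between the prediction for the sample and the baseline expectation. The sketch below checks this on a toy model by computing exact Shapley values through coalition enumeration against a single baseline sample. Kernel-SHAP approximates the same quantities by weighted regression over a background data set, so this is an illustration of the property rather than of the Kernel-SHAP algorithm; `toy_risk` and all feature values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample, with absent features set to the baseline."""
    n = len(x)

    def value(present):
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {i}) - value(present))
        phi.append(total)
    return phi

def toy_risk(features):
    """Hypothetical risk score with one interaction term; not the study's model."""
    gcs_score, renal_score, resp_score = features
    return 0.05 + 0.10 * gcs_score + 0.08 * renal_score + 0.02 * gcs_score * resp_score

x = [4, 1, 2]          # hypothetical organ scores for one entry
baseline = [0, 0, 0]   # hypothetical reference sample
phi = shapley_values(toy_risk, x, baseline)

# Additivity: baseline output plus all contributions reproduces the prediction.
reconstructed = toy_risk(baseline) + sum(phi)
```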

For patient B who required immediate intervention, we used the DiCE algorithm to automatically recommend counterfactuals for the doctors to select, one of which is shown in Table 5.

Table 5 Example of generated counterfactual for the specific patient in Fig. 7B

The dashes in Table 5 indicate that the factors remain unchanged. Measures such as verbal arousal, physical stimulation, and medication could be used to improve the patient's current mental clarity, speech, and motor ability, changing the currentMinGcs value to 15 and the gcs24HoursMods value from 3 to 2. In addition, respiratory24HoursMods was changed from 2 to 1 with ventilator use. After these modifications to the observed values, the predicted probability of MODS for the patient was reduced from 0.97 to 0.11. In principle, the patient would avoid MODS at the 72nd hour. Table 5 shows only one scheme; DiCE recommended multiple schemes, from which doctors would select the most cost-effective one according to the patient's actual situation.
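The idea of a counterfactual, a modified version of the sample for which the predicted outcome flips, can be illustrated with a brute-force search. This is not the DiCE algorithm, which optimizes for proximity and diversity; it only shows what a counterfactual is. The logistic risk model, its coefficients, the starting values, and the candidate grids are all hypothetical.

```python
import math
from itertools import product

def toy_mods_probability(sample):
    """Hypothetical logistic risk model; NOT the study's SuperLearner."""
    z = (-4.0
         + 0.9 * sample["gcs24HoursMods"]
         + 0.8 * sample["respiratory24HoursMods"]
         - 0.25 * (sample["currentMinGcs"] - 3))
    return 1.0 / (1.0 + math.exp(-z))

def find_counterfactuals(sample, candidate_values, threshold=0.5):
    """Enumerate value combinations and keep those predicted below the threshold."""
    names = list(candidate_values)
    results = []
    for combo in product(*(candidate_values[name] for name in names)):
        candidate = dict(sample)
        candidate.update(zip(names, combo))
        changed = {n: candidate[n] for n in names if candidate[n] != sample[n]}
        probability = toy_mods_probability(candidate)
        if changed and probability < threshold:
            results.append((len(changed), probability, changed))
    # Prefer fewer changed features, then lower predicted risk.
    return sorted(results, key=lambda item: (item[0], item[1]))

patient = {"currentMinGcs": 3, "gcs24HoursMods": 4, "respiratory24HoursMods": 2}
grid = {
    "currentMinGcs": [3, 9, 15],
    "gcs24HoursMods": [4, 3, 2],
    "respiratory24HoursMods": [2, 1],
}
counterfactuals = find_counterfactuals(patient, grid)
```

Each returned tuple lists how many features were changed, the new predicted probability, and the changed values, which mirrors the layout of Table 5, where unchanged factors are shown as dashes.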


In this study, we developed the SuperLearner algorithm, which combines non-neural network algorithms with a deep learning algorithm. We first used the "KerasClassifier" interface of tensorflow to wrap the customized DWNN model as a machine learning model usable by sklearn. We then used the "StackingClassifier" module of the third-party library "mlxtend" to build the candidate stacked ensemble models. The stacked ensemble enabled the integration of multiple models with sub-optimal predictive performance into a model with optimal performance. This study used Q-learning to determine the structures of SuperLearner and SubSuperLearner, with the AUC of the candidate model as the reward; YI or utility_score could also be used as the reward. This study is the first to propose using the utility_score of MODS to evaluate the prediction performance of MODS early warning models. We addressed the imbalance of sample categories by setting category weights. The EasyEnsemble method could also be used to make full use of the data, improve the classification ability of the model, and reduce its bias [33, 34]. In addition to the Q-learning algorithm, the genetic algorithm is also well suited to determining the structure of a stacked ensemble.

Attribution analysis of prediction results with Kernel-SHAP involves both supervised and unsupervised analyses. The user first uses the training set and the trained early warning model to train the Kernel-SHAP model; this process is the supervised analysis. In production applications, the samples are not labeled, and this process is the unsupervised analysis. We need to use not only the trained model to predict the results but also the trained Kernel-SHAP model to determine the individual risk factor contribution values corresponding to the prediction results. All production applications of the Shapley value algorithm require these steps. "E[f(x)] = 0.004" in Fig. 7 shows that the unlabeled samples from the production environment train Kernel-SHAP again. If the same sample is tested many times by Kernel-SHAP, this inevitably leads to model deviation in Kernel-SHAP. Considering the complexity of improving the algorithm, we directly use a backup of the trained Kernel-SHAP and restore it over the current Kernel-SHAP after predicting each individual sample from the production environment. In addition, if we can establish the relationship between the value of f(x), such as that in Fig. 7, and the predicted probability, we could use Kernel-SHAP alone to complete production applications without deploying the trained early warning models. So far, we can conclude that the larger f(x) is, the more likely MODS is to occur. This is an interesting direction and will be the focus of the next research step.

DiCE provides counterfactuals for reversing predicted outcomes while taking plausibility and diversity into account (Additional file 1: Section S5). The use of DiCE relies on two assumptions. First, DiCE assumes that there is no dependence between features. Second, DiCE assumes that the prediction results can be reversed as long as the counterfactuals are implemented within the early warning period, ignoring the time dimension. For assumption 1, multiple rules can be set, such as currentMinGcs <= 6, currentGcsMods = 4, etc., to conduct a second round of screening of the counterfactuals. Assumption 2 is difficult to address with the current algorithm; we can only suggest that counterfactuals be implemented for patients as soon as possible to increase the possibility of reversing the predicted outcome.
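The second round of rule screening can be sketched as a filter over the counterfactuals returned by DiCE. Interpreting the example rule as "when currentMinGcs is at most 6, currentGcsMods must equal 4" is our assumption about how the two conditions relate; the function names and candidate values are hypothetical.

```python
def rule_gcs_consistency(counterfactual):
    """Assumed reading of the rule: currentMinGcs <= 6 requires currentGcsMods == 4."""
    if counterfactual["currentMinGcs"] <= 6:
        return counterfactual["currentGcsMods"] == 4
    return True

def screen_counterfactuals(counterfactuals, rules):
    """Keep only counterfactuals that satisfy every dependency rule."""
    return [cf for cf in counterfactuals if all(rule(cf) for rule in rules)]

# Hypothetical candidates as they might come back from a counterfactual generator.
candidates = [
    {"currentMinGcs": 5, "currentGcsMods": 4},    # consistent with the rule
    {"currentMinGcs": 5, "currentGcsMods": 1},    # violates the rule
    {"currentMinGcs": 15, "currentGcsMods": 0},   # rule not triggered
]
kept = screen_counterfactuals(candidates, [rule_gcs_consistency])
```

Further dependencies between a clinical feature and its derived organ score could be added as extra functions in the `rules` list.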


In this study, the non-neural network algorithms and a customizable neural network algorithm were integrated into a two-layer stacked ensemble structure called SuperLearner and a three-layer stacked ensemble structure called SubSuperLearner. We found that the screening ability of the two stacked ensemble structures exceeded that of any single base learner. In terms of model performance evaluation, we added the utility_score of MODS, for the first time among MODS-related studies. To determine the base learners in the two stacked ensemble structures, we used Q-learning. In addition, we applied Kernel-SHAP to complete the attribution analysis of the prediction results of the stacked ensemble model and gave tips on the use of Kernel-SHAP in production applications. Considering that the attribution analysis of Kernel-SHAP is a static analysis of the prediction results, we introduced the DiCE algorithm to automatically recommend counterfactuals that reverse the prediction results, which will be an important step towards the practical application of fully automatic MODS early intervention.

Availability of data and materials

The datasets generated and analyzed during the current study are not publicly available for privacy reasons but anonymized data are available from the corresponding author on reasonable request.


  1. Bernard GR, Vincent JL, Laterre PF, et al. Efficacy and safety of recombinant human activated protein C for severe sepsis. N Engl J Med. 2001;344:699–709.

    Article  Google Scholar 

  2. Guidet B, Aegerter P, Gauzit R, Meshaka P, Dreyfuss D. CUB-rea study group incidence and impact of organ dysfunction associated with sepsis. Chest. 2005;127:942–51.

    Article  Google Scholar 

  3. Gourd NM, Nikitas N. Multiple organ dysfunction syndrome. J Intensive Care Med. 2020;35(12):1564–75.

    Article  Google Scholar 

  4. Barie PS, Hydo LJ. Epidemiology of multiple organ dysfunction syndrome in critical surgical illness. Surg Infect. 2000;1(3):173–85.

    Article  Google Scholar 

  5. Angus DC, Linde-Zwirble WT, Lidicker J, Clermont G, Carcillo J, Pinsky MR. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Crit Care Med. 2001;29(7):1303–10.

    Article  Google Scholar 

  6. Churpek MM, Zadravecz FJ, Winslow C, Howell MD, Edelson DP. Incidence and prognostic value of the systemic inflammatory response syndrome and organ dysfunctions in ward patients. Am J Respir Crit Care Med. 2015;192(8):958–64.

    Article  Google Scholar 

  7. Mayr VD, Dünser MW, Greil V, Jochberger S, Luckner G, Ulmer H, et al. Causes of death and determinants of outcome in critically ill patients. Crit Care. 2006;10(6):R154.

    Article  Google Scholar 

  8. Gourd NM, Nikitas N. Multiple organ dysfunction syndrome. J Intensive Care Med. 2020;35(12):1564–75.

    Article  Google Scholar 

  9. Bose SN, Greenstein JL, Fackler JC, Sarma SV, Winslow RL, Bembea MM. Early prediction of multiple organ dysfunction in the pediatric intensive care unit. Front Pediatr. 2021;9:711104.

    Article  Google Scholar 

  10. Goldstein B, Giroir B, Randolph A. International pediatric sepsis consensus conference: definitions for sepsis and organ dysfunction in pediatrics. Pediatr Crit Care Med. 2005;6(1):2–8.

    Article  Google Scholar 

  11. Proulx F, Joyal JS, Mariscalco MM, Leteurtre S, Leclerc F, Lacroix J. The pediatric multiple organ dysfunction syndrome. Pediatr Crit Care Med. 2009;10(1):12–22.

    Article  Google Scholar 

  12. Proulx F, Fayon M, Farrell CA, Lacroix J, Gauthier M. Epidemiology of sepsis and multiple organ dysfunction syndrome in children. Chest. 1996;109(4):1033–7.

    Article  Google Scholar 

  13. Reyna MA, Josef CS, Jeter R, Shashikumar SP, Westover MB, Nemati S, et al. Early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge 2019. Crit Care Med. 2020;48(2):210–7.

  14. Li X, Xu X, Xie F, Xu X, Sun Y, Liu X, et al. A time-phased machine learning model for real-time prediction of sepsis in critical care. Crit Care Med. 2020;48(10):e884–8.

  15. Karakike E, Scicluna BP, Roumpoutsou M, Mitrou I, Karampela N, Karageorgos A, Psaroulis K, Massa E, Pitsoulis A, Chaloulis P, Pappa E, Schrijver IT, Frantzeskaki F, Lada M, Dauby N, De Bels D, Floros I, Anisoglou S, Antoniadou E, Patrani M, Vlachogianni G, Mouloudi E, Antoniadou A, Grimaldi D, Roger T, Wiersinga WJ, Tsangaris I, Giamarellos-Bourboulis EJ. Effect of intravenous clarithromycin in patients with sepsis, respiratory and multiple organ dysfunction syndrome: a randomized clinical trial. Crit Care. 2022;26(1):183. PMCID: PMC9206755.

  16. Hazeldine J, Naumann DN, Toman E, Davies D, Bishop JRB, Su Z, Hampson P, Dinsdale RJ, Crombie N, Duggal NA, Harrison P, Belli A, Lord JM. Prehospital immune responses and development of multiple organ dysfunction syndrome following traumatic injury: a prospective cohort study. PLoS Med. 2017;14(7):e1002338.

  17. Cook R, Cook D, Tilley J, Lee K, Marshall J, Canadian Critical Care Trials Group. Multiple organ dysfunction: baseline and serial component scores. Crit Care Med. 2001;29(11):2046–50.

  18. Liu X, Hu P, Mao Z, Kuo P, Li P, Liu C, Hu J, Li D, Cao D, Mark RG, Celi LA, Zhang Z, Zhou F. Interpretable machine learning model for early prediction of mortality in elderly patients with multiple organ dysfunction syndrome (MODS): a multicenter retrospective study and cross validation. arXiv. 2020; abs/2001.10977.

  19. Accessed 16 March 2022.

  20. Fan G, Yang S, Liu H, Xu N, Chen Y, He J, et al. Machine learning-based prediction of prolonged intensive care unit stay for critical patients with spinal cord injury. Spine. 2022;47(9):E390–8.

  21. Ko H, Chung H, Kang WS, Park C, Kim DW, Kim SE, et al. An artificial intelligence model to predict the mortality of COVID-19 patients at hospital admission time using routine blood samples: development and validation of an ensemble model. J Med Internet Res. 2020;22(12):e25442.

  22. Kalagotla SK, Gangashetty SV, Giridhar K. A novel stacked ensemble technique for prediction of diabetes. Comput Biol Med. 2021;135:104554.

  23. Chiu CC, Wu CM, Chien TN, Kao LJ, Li C, Jiang HL. Applying an improved stacking ensemble model to predict the mortality of ICU patients with heart failure. J Clin Med. 2022;11(21):6460.

  24. Liang N, Wang C, Duan J, Xie X, Wang Y. Efficacy prediction of noninvasive ventilation failure based on the stacking ensemble algorithm and autoencoder. BMC Med Inform Decis Mak. 2022;22(1):27. PMCID: PMC8805397.

  25. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR; 2015. p. 448–56.

  26. Wang B, Bai Y, Yao Z, Li J, Dong W, Tu Y, et al. A multi-task neural network architecture for renal dysfunction prediction in heart failure patients with electronic health records. IEEE Access. 2019;7:178392–400.

  27. Ardulov V, Martinez VR, Somandepalli K, Zheng S, Salzman E, Lord C, et al. Robust diagnostic classification via Q-learning. Sci Rep. 2021;11(1):11730.

  28. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.

  29. Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 1135–44.

  30. Mothilal RK, Sharma A, Tan C. Explaining machine learning classifiers through diverse counterfactual explanations. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency; 2020. p. 607–17.

  31. Jia Y, Kaul C, Lawton T, Murray-Smith R, Habli I. Prediction of weaning from mechanical ventilation using convolutional neural networks. Artif Intell Med. 2021;117:102087.

  32. Ardulov V, Martinez VR, Somandepalli K, Zheng S, Salzman E, Lord C, Bishop S, Narayanan S. Robust diagnostic classification via Q-learning. Sci Rep. 2021;11(1):11730.

  33. Sun C, Cui H, Zhou W, Nie W, Wang X, Yuan Q. Epileptic seizure detection with EEG textural features and imbalanced classification based on easyensemble learning. Int J Neural Syst. 2019;29(10):1950021.

  34. Kang Q, Chen X, Li S, Zhou M. A noise-filtered under-sampling scheme for imbalanced classification. IEEE Trans Cybern. 2017;47(12):4263–74.



The authors thank Bullet Edits for the English language editing and review services.


This work was supported by the China PLA Scientific Key Grant (20-163-12-ZT-005-003-01), the China Key Scientific Grant Program (No. 2021YFC0122500), the National Science Foundation for Young Scientists of China (Grant No. 82100096) and the National Science Foundation for Young Scientists of Beijing (Grant No. 7214254).

Author information

Authors and Affiliations



CL, ZJ and PF wrote the manuscript. YH, HC and HB contributed to data analysis and produced the figures and tables. HB, LX and KX edited and revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Haibo Cheng, Lixin Xie or Kun Xiao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors read and approved the publication of the final manuscript.

Competing interests


Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Figure S1. Diagrams of utility of positive and negative predictions for MODS and non-MODS. Table S1. The modified multiple organ dysfunction syndrome (MODS) score. Table S2. The Q-table for SuperLearner. Table S3. The Q-table for SubSuperLearner.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Liu, C., Yao, Z., Liu, P. et al. Early prediction of MODS interventions in the intensive care unit using machine learning. J Big Data 10, 55 (2023).



  • MODS
  • Stacked ensemble
  • Feature interpretation
  • Decision recommendation