Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data

Abstract

Background

There is currently no consensus on the impact of class imbalance methods on the performance of clinical prediction models. We aimed to empirically investigate the impact of random oversampling and random undersampling, two commonly used class imbalance methods, on the internal and external validation performance of prediction models developed using observational health data.

Methods

We developed and externally validated prediction models for various outcomes of interest within a target population of people with pharmaceutically treated depression across four large observational health databases. We used three different classifiers (lasso logistic regression, random forest, XGBoost) and varied the target imbalance ratio. We evaluated the impact on model performance in terms of discrimination and calibration. Discrimination was assessed using the area under the receiver operating characteristic curve (AUROC) and calibration was assessed using calibration plots.

Results

We developed and externally validated a total of 1,566 prediction models. On internal and external validation, random oversampling and random undersampling generally did not result in higher AUROCs. Moreover, we found overestimated risks, although this miscalibration could largely be corrected by recalibrating the models towards the imbalance ratios in the original dataset.

Conclusions

Overall, we found that random oversampling or random undersampling generally does not improve the internal and external validation performance of prediction models developed in large observational health databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases.

Background

Many datasets used for clinical prediction modeling exhibit an unequal distribution between their outcome classes and are hence imbalanced; typically, only a small proportion of patients in a target population experiences a certain outcome of interest. In the machine learning literature, the term class imbalance problem has been used to describe a situation in which a classifier may not be suitable for imbalanced data. It has been suggested that a prediction model developed using imbalanced data may become biased towards the larger class (also referred to as the majority class) and may be more likely to misclassify the smaller class (also referred to as the minority class) [1]. As a result, various methods have been proposed to improve prediction performance when developing prediction models using imbalanced data [1, 2]. Such methods are also referred to as class imbalance methods.

In our previous systematic review on clinical prediction modeling using electronic health record (EHR) data, we found that class imbalance methods were increasingly applied in the period 2009–2019 [3]. However, there is currently no consensus on the impact of class imbalance methods on the performance of clinical prediction models. Several previous studies suggest that class imbalance methods may indeed improve performance of clinical prediction models [4, 5]. In contrast, a recent study focusing on logistic regression investigated random oversampling, random undersampling, and Synthetic Minority Oversampling Technique (SMOTE), and found that balancing data using these methods generally did not improve model discrimination [6]. These previous studies focused on low-dimensional datasets with smaller sample sizes; the impact of class imbalance methods on the performance of prediction models developed in large observational health databases is yet unclear. Observational health data typically contain information on thousands of features concerning health conditions and drugs that are routinely recorded in a patient’s medical history.

Additionally, to the best of our knowledge, no previous study investigating the impact of class imbalance methods has also assessed external validation performance. External validation refers to evaluating model performance on data from databases that were not used during model development, while internal validation refers to evaluating model performance on data from the same database that was used to train the model, for example by using a train/test split. Although good internal validation performance should be an initial requirement for a prediction model, model performance often drops on external validation. We are interested in whether class imbalance methods would result in models with better generalizability and robustness.

The aim of this study is to empirically investigate the impact of random oversampling and random undersampling, two commonly used class imbalance methods, on the internal and external validation performance of prediction models developed using observational health data. We developed and validated models for various outcomes within a target population of people with pharmaceutically treated depression across four large observational health databases. We used three different classifiers (lasso logistic regression, random forest, XGBoost) and varied the target imbalance ratio.

Methods

In this study, we developed and validated prediction models using the Patient-Level Prediction (PLP) framework from the Observational Health Data Sciences and Informatics (OHDSI) initiative [7]. To improve the interoperability of originally heterogeneous data sources, OHDSI uses the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), which transforms source data into a common format using a set of common terminologies, vocabularies, and coding schemes [8]. The OHDSI PLP framework in turn allows for standardized development and extensive validation of prediction models across observational health databases that are mapped to the OMOP CDM [8, 9].

Data description

We used four observational health databases: three large claims databases from the United States of America (USA) and one large EHR database from Germany with data mapped to the OMOP CDM. The databases are listed in Table 1 and a description of each database is provided in Additional file 1. Each site obtained institutional review board approval for the study or used de-identified data. Therefore, informed consent was not necessary at any site.

Table 1 Databases included in the study with data mapped to the OMOP CDM

For each database, we investigated 21 different outcomes of interest within a target population of people with pharmaceutically treated depression, as described in the OHDSI PLP framework paper [7]. For each of these 21 different outcomes, the prediction problem was defined as follows: “Amongst a target population of patients with pharmaceutically treated depression, which patients will develop < the outcome > during the 1-year time interval following the start of the depression treatment (the index event)?”. The aim of this study was not to obtain the best possible models for these prediction problems but to empirically investigate the impact of random oversampling and random undersampling on model performance. For consistency across the experiments and to reduce computational efforts, we sampled an initial study population of 100,000 patients from each database. Further inclusion criteria were then applied to obtain the final study population for each outcome of interest within each database [7]: (1) a minimum of 365 days of observation in the database prior to index, and (2) no record of the specific outcome of interest any time prior to index. Additional file 2: Table S1 provides the observed outcome event count and the observed outcome event proportion in the final study population for all prediction outcomes of interest and all databases. For some outcomes in IQVIA Germany, no outcome events were observed. Across all remaining study populations, the observed outcome event count ranged from 32 (0.03%) to 7,365 (10.44%).
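To make the cohort construction concrete, the two inclusion criteria amount to two simple filters on the sampled study population. The following Python sketch illustrates this under assumed, hypothetical table layouts (a `cohort` table with `person_id`, `index_date`, and `observation_start`, and an `outcomes` table with `person_id` and `outcome_date`); it is illustrative only and not the OHDSI PLP implementation.

```python
import pandas as pd

def apply_inclusion_criteria(cohort: pd.DataFrame, outcomes: pd.DataFrame) -> pd.DataFrame:
    """Apply the two inclusion criteria to a sampled study population:
    (1) at least 365 days of observation prior to the index date, and
    (2) no record of the outcome of interest any time prior to the index date."""
    # Criterion 1: at least 365 days of prior observation
    prior_observation_days = (cohort["index_date"] - cohort["observation_start"]).dt.days
    eligible = cohort[prior_observation_days >= 365]

    # Criterion 2: exclude patients with the outcome recorded before the index date
    merged = eligible.merge(outcomes, on="person_id", how="left")
    has_prior_outcome = merged["outcome_date"] < merged["index_date"]
    excluded_ids = merged.loc[has_prior_outcome, "person_id"].unique()
    return eligible[~eligible["person_id"].isin(excluded_ids)]
```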

Candidate predictors

Candidate predictors were extracted from data routinely recorded in the databases. These included binary indicators of 5-year age groups (0–4, 5–9, etc.) and sex, as well as a large set of binary indicators of recorded OMOP CDM concepts for health conditions and drug groups [10]. For health conditions and drug groups, we considered data from the 365 days prior to index. No feature selection methods were used for selecting candidate predictors prior to model training. The initial study population contained 13,207 candidate predictors in CCAE, 14,237 in MDCD, 13,499 in MDCR, and 6,494 in IQVIA Germany. A list of all the candidate predictors per database is available in Additional file 3.
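As an illustration of how such binary candidate predictors can be constructed, the sketch below builds indicator columns for 5-year age groups, sex, and concepts recorded in the 365 days before index. The long-format `records` table (with `person_id`, `concept_id`, `record_date`) and all column names are assumptions for illustration; in the study itself, feature extraction was performed within the OHDSI PLP framework.

```python
import pandas as pd

def build_binary_predictors(cohort: pd.DataFrame, records: pd.DataFrame) -> pd.DataFrame:
    """Binary candidate predictors: 5-year age group and sex indicators, plus
    indicators for concepts recorded in the 365 days prior to the index date."""
    # Age group (0-4 -> 0, 5-9 -> 1, ...) and sex indicators
    age_group = (cohort["age"] // 5).astype(int)
    X = pd.get_dummies(age_group, prefix="age_group").astype(int)
    X["sex_female"] = (cohort["sex"] == "F").astype(int)
    X.index = cohort["person_id"].values

    # Concept indicators restricted to the 365 days prior to index
    r = records.merge(cohort[["person_id", "index_date"]], on="person_id")
    days_before_index = (r["index_date"] - r["record_date"]).dt.days
    window = r[(days_before_index > 0) & (days_before_index <= 365)]
    concept_flags = pd.crosstab(window["person_id"], window["concept_id"]).clip(upper=1)

    return X.join(concept_flags.add_prefix("concept_"), how="left").fillna(0).astype(int)
```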

Handling of missing data

Observational health data rarely allow distinguishing between a feature being truly absent and a feature being missing (not recorded). In the observational health data used in this study, if a candidate predictor was not recorded in a patient’s history, the candidate predictor defaulted to a value of 0 (corresponding to not observed) for this patient. Age group and sex are required by the OMOP CDM and were always recorded.

Statistical analysis methods

For our experiments, we varied the prediction task, the sampling strategy, and the classifier, resulting in a total of 1,566 prediction models: 58 prediction tasks × 9 strategies (8 sampling strategies plus 1 control without sampling) × 3 classifiers. The details of what was varied are described in the rest of this section.

We refer to a combination of one of the 21 prediction problems and one of the four databases as a prediction task. For each prediction task, a random stratified subset of 75% of the patients in the final study population was used as a training set and the remaining 25% of the patients were used as a test set. To increase statistical power for our analysis, prediction tasks for which the test set contained fewer than 100 outcome events were excluded from further analysis [11]. This resulted in a total of 58 prediction tasks across the four databases; two outcomes (‘acute liver injury inpatient’ and ‘decreased libido’) were omitted from further analysis. The imbalance ratio (IR) is defined as the number of patients who do not experience the outcome (the majority class) divided by the number of patients who do experience the outcome (the minority class). An IR = 1 hence represents balanced data, while data with an IR > 100 are typically considered severely imbalanced [12]. The original IRs (\(IR_{original}\)) in the final study populations ranged from 8.6 to 245.3 with a median of 84.0 (Table 2).
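A minimal sketch of the stratified 75/25 split, the exclusion rule, and the imbalance ratio computation is given below, assuming `X` and `y` are the predictor matrix and binary outcome vector for one prediction task; it mirrors the study design but is not the OHDSI PLP code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def imbalance_ratio(y: np.ndarray) -> float:
    """IR = number of non-events (majority class) / number of events (minority class)."""
    return float(np.sum(y == 0)) / float(np.sum(y == 1))

# Random stratified 75/25 split of the final study population
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Prediction tasks with fewer than 100 outcome events in the test set were excluded
if int(y_test.sum()) < 100:
    print("Exclude this prediction task: fewer than 100 outcome events in the test set")

print(f"IR_original = {imbalance_ratio(y):.1f}")
```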

Table 2 Original imbalance ratios

First, we developed an original data model (without sampling strategy) for each of the 58 prediction tasks. We then investigated random oversampling and random undersampling: for random oversampling, data from the minority class were randomly replicated (with replacement) and added to the original dataset; for random undersampling, data from the majority class were randomly selected and removed from the original dataset. We randomly sampled towards a target IR: \(IR_{target} = \min(IR_{original}, x)\) with \(x \in \{20, 10, 2, 1\}\); the two sampling methods combined with the four target IRs resulted in a total of eight different sampling strategies.
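The two sampling strategies can be sketched with plain NumPy as follows; this is an illustrative re-implementation with hypothetical function and argument names, not the sampling code used in the OHDSI PLP framework.

```python
import numpy as np

rng = np.random.default_rng(42)

def resample_to_target_ir(X, y, ir_target, method="undersample"):
    """Randomly over- or undersample a binary-outcome dataset (NumPy arrays)
    towards a target imbalance ratio IR_target = n_non_events / n_events.
    In the study, IR_target = min(IR_original, x) with x in {20, 10, 2, 1}."""
    idx_event = np.flatnonzero(y == 1)      # minority class (outcome events)
    idx_non_event = np.flatnonzero(y == 0)  # majority class (non-events)

    if method == "undersample":
        # Randomly remove non-events until n_non_events ~= IR_target * n_events
        n_keep = min(int(round(ir_target * len(idx_event))), len(idx_non_event))
        kept_non_events = rng.choice(idx_non_event, size=n_keep, replace=False)
        idx = np.concatenate([idx_event, kept_non_events])
    else:
        # Randomly replicate events (with replacement) until n_events ~= n_non_events / IR_target
        n_events_needed = int(round(len(idx_non_event) / ir_target))
        n_extra = max(n_events_needed - len(idx_event), 0)
        extra_events = rng.choice(idx_event, size=n_extra, replace=True)
        idx = np.concatenate([idx_non_event, idx_event, extra_events])

    idx = rng.permutation(idx)
    return X[idx], y[idx]
```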

Three different classifiers were considered: L1-regularized logistic regression (also known as “lasso logistic regression” or “lasso”), random forest, and XGBoost. The algorithms were all implemented within the OHDSI PLP framework [7], with lasso logistic regression implemented using the glmnet R package [13], random forest using the Scikit-learn Python package [14], and XGBoost using the xgboost R package [15]. The model development and internal validation procedure is illustrated in Fig. 1. First, we performed hyperparameter tuning using threefold cross-validation (CV) on the training set [16]. The sampling strategy was only applied to the training folds within CV; it was not applied to the validation fold to allow for a realistic evaluation of the model during CV [17]. Next, the model was refit on the full training set using the tuned hyperparameters, and the final model was internally validated on the test set (i.e., the held out 25% of patients from the development database).
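The sketch below illustrates the key point of this procedure: during hyperparameter tuning, the sampling strategy is applied to the training folds only, while the validation fold keeps its original class distribution. It reuses the `resample_to_target_ir` sketch above and uses scikit-learn's L1-penalized logistic regression as a stand-in classifier; the study itself used the glmnet, scikit-learn, and xgboost implementations within the OHDSI PLP framework.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cv_auroc_with_sampling(X, y, ir_target, method, reg_c=1.0, n_splits=3):
    """Threefold CV in which sampling is applied to the training folds only."""
    aurocs = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X, y):
        # Apply the sampling strategy to the training fold only
        X_tr, y_tr = resample_to_target_ir(X[train_idx], y[train_idx], ir_target, method)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=reg_c, max_iter=1000)
        model.fit(X_tr, y_tr)
        # Evaluate on the untouched validation fold for a realistic performance estimate
        p_val = model.predict_proba(X[val_idx])[:, 1]
        aurocs.append(roc_auc_score(y[val_idx], p_val))
    return float(np.mean(aurocs))
```

After tuning, the model would be refit on the (resampled) full training set with the selected hyperparameters and then evaluated on the held-out test set, as described above.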

Fig. 1 Flow chart of the model development and internal validation procedure

Model evaluation

We evaluated model discrimination for each developed model using the area under the receiver operating characteristic curve (AUROC) with 95% confidence intervals [18]. The impact of the sampling strategy on model discrimination was then assessed as the difference from the original data model AUROC, i.e., the internal AUROC difference \(= \mathrm{AUROC}_{sampled,\,internal} - \mathrm{AUROC}_{original,\,internal}\), where \(\mathrm{AUROC}_{original,\,internal}\) is the AUROC on internal validation of the original data model for which no sampling strategy was applied. A positive AUROC difference therefore means that the sampling strategy resulted in a higher AUROC than when no sampling strategy was applied, while a negative AUROC difference means that the sampling strategy resulted in a lower AUROC. We also evaluated discrimination using the maximum F1-score across all prediction thresholds for each model.
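The two discrimination measures can be computed as in the short sketch below; the function names are illustrative, and the inputs are the observed outcomes and predicted risks on the validation data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

def auroc_difference(y_true, p_sampled, p_original):
    """AUROC difference = AUROC of the sampled-data model minus AUROC of the original data model."""
    return roc_auc_score(y_true, p_sampled) - roc_auc_score(y_true, p_original)

def max_f1_score(y_true, p_pred):
    """Maximum F1-score over all prediction thresholds."""
    precision, recall, _ = precision_recall_curve(y_true, p_pred)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return float(np.max(f1))
```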

The impact of the sampling strategy on model calibration (in the moderate sense) was assessed using plots of the mean predicted risks against the observed outcome event proportions, grouped by percentiles of each model’s predicted risks [19, 20]. Without sampling, the mean predicted risks and the observed outcome event proportions are expected to be equal on internal validation. However, when random oversampling or random undersampling is applied, the outcome proportion in the data used to train the classifier is modified, resulting in a mismatch between the predicted risks and the observed outcome event proportions, and thus miscalibration is expected. We investigated whether this miscalibration could be corrected by recalibrating the models towards the original IRs, and we assessed the calibration plots both before and after recalibration [21]. Recalibration towards the original IR was done by adding a correction term to the logit (log-odds) of the estimated probabilities to obtain the logit of the recalibrated probabilities as follows:

$$\ln\left(\mathrm{odds}\left(\mathrm{probability}_{i,\,recalibrated}\right)\right)=\ln\left(\mathrm{odds}\left(\mathrm{probability}_{i,\,estimated}\right)\right)+C \quad \text{for individual } i$$

with \(C=\ln\left(\frac{\mathrm{odds}\left(\pi_{original}\right)}{\mathrm{odds}\left(\pi_{target}\right)}\right)\), where \(\pi_{original}=\frac{1}{IR_{original}+1}\) and \(\pi_{target}=\frac{1}{IR_{target}+1}\).
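This recalibration amounts to shifting every predicted risk by the constant C on the log-odds scale, as in the following sketch (illustrative only, not the code used in the study). For example, a model trained after undersampling to \(IR_{target} = 1\) would have its predicted risks shifted downwards towards the much lower outcome proportion implied by \(IR_{original}\).

```python
import numpy as np

def recalibrate_to_original_ir(p_estimated, ir_original, ir_target):
    """Recalibrate predicted risks from a model trained at IR_target back towards
    IR_original by adding C = ln(odds(pi_original) / odds(pi_target)) to the log-odds."""
    p = np.asarray(p_estimated, dtype=float)
    pi_original = 1.0 / (ir_original + 1.0)
    pi_target = 1.0 / (ir_target + 1.0)

    correction = np.log((pi_original / (1.0 - pi_original)) /
                        (pi_target / (1.0 - pi_target)))
    logit = np.log(p / (1.0 - p))
    return 1.0 / (1.0 + np.exp(-(logit + correction)))
```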

We also investigated whether the best sampling strategy in terms of AUROC could be identified prior to evaluating the performance on the test set by selecting the sampling strategy with the highest AUROC during CV. We considered the eight different sampling strategies as well as the option of no sampling strategy, i.e., the original data model, and assessed the internal AUROC difference. We tested whether the median AUROC differences between the original data model and the selected model were significantly different from 0 using the Wilcoxon signed-rank test (\(p < 0.05\)).
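A sketch of this selection step and the subsequent significance test is given below; the data structures and function names are hypothetical, and `scipy.stats.wilcoxon` performs the signed-rank test on the paired AUROC differences across prediction problems.

```python
import numpy as np
from scipy.stats import wilcoxon

def select_best_strategy(cv_auroc_by_strategy: dict) -> str:
    """Pick the strategy (including 'none', the original data model) with the highest CV AUROC."""
    return max(cv_auroc_by_strategy, key=cv_auroc_by_strategy.get)

def median_difference_test(auroc_selected, auroc_original, alpha=0.05):
    """Wilcoxon signed-rank test of whether the median internal AUROC difference
    (selected model minus original data model) differs from zero."""
    differences = np.asarray(auroc_selected) - np.asarray(auroc_original)
    _, p_value = wilcoxon(differences)
    return float(np.median(differences)), p_value, p_value < alpha
```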

Finally, we investigated whether the sampling strategy resulted in better generalizability and robustness by externally validating each developed model across the other databases [22]. To increase statistical power for our analysis, external validation tasks for which the external validation dataset contained fewer than 100 outcome events were excluded from further analysis [11]. We evaluated the impact of the sampling strategy on model discrimination using the external AUROC difference \(= \mathrm{AUROC}_{sampled,\,external} - \mathrm{AUROC}_{original,\,external}\), where \(\mathrm{AUROC}_{original,\,external}\) is the AUROC on external validation of the original data model for which no sampling strategy was applied. The impact of the sampling strategy on model calibration was assessed in the same way as on internal validation.

Detailed definitions of the inclusion criteria and outcomes, including code lists, as well as the analytical source code and example code used for the analysis, are available at: https://github.com/mi-erasmusmc/RandomSamplingPrediction.

Results

For each of the 58 prediction tasks and each of the three classifiers, we developed one original data model (no sampling strategy applied) and eight models with a sampling strategy applied. We hence developed and externally validated a total of 1,566 prediction models. The AUROCs of the original data models ranged from 0.58 to 0.87 (Table 3).

Table 3 Original data model AUROCs (with 95% confidence intervals)

First, we investigated the impact on model discrimination in terms of AUROC difference for each sampling strategy and classifier on internal validation (Fig. 2). We can see that although there were some cases with a positive AUROC difference, indicating that the sampling strategy resulted in a higher AUROC compared to when no sampling strategy was applied, random oversampling and random undersampling generally did not improve the AUROC. For lasso logistic regression and XGBoost, the impact of random sampling on model discrimination was relatively small, with a maximum absolute difference in AUROC below 0.06. However, for random oversampling with random forest, we observed a larger impact on model discrimination; the AUROC differences had a wider range, with the largest difference around − 0.3. Moreover, we investigated the AUROC differences for each sampling strategy and classifier by number of outcome events on internal validation (Fig. 3). It appears that overall, and in particular for random oversampling with random forest, the impact of random sampling on the AUROC shows more variation when the number of outcome events is lower. We also investigated the impact on model discrimination in terms of difference in maximum F1-score for each sampling strategy and classifier on internal validation (Additional file 4). We found that random oversampling and random undersampling generally did not improve the maximum F1-score.

Fig. 2 Internal AUROC differences across all prediction problems and databases for each sampling strategy and classifier. A positive difference means the original data model had a lower AUROC; a negative difference means the original data model had a higher AUROC

Fig. 3 Internal AUROC differences across all prediction problems and databases for each sampling strategy and classifier by number of outcome events. A positive difference means the original data model had a lower AUROC; a negative difference means the original data model had a higher AUROC

Figure 4 shows that model calibration on internal validation clearly deteriorated for all sampling strategies, for all three classifiers. More specifically, the calibration plots indicate increased overestimation for random oversampling or random undersampling towards smaller target IRs, compared to the original data model. This is in line with expectations, since the models with smaller target IRs were trained using increased outcome proportions. To investigate whether this miscalibration could be corrected, we recalibrated the models towards the original IRs. Figure 5 shows that after recalibration, the calibration plots resembled those of the original data models, although for random oversampling with random forest the models appeared to underestimate risks instead. The same calibration plots per outcome of interest are available in Additional file 5.

Fig. 4 Calibration plots across all prediction problems and databases for each sampling strategy and classifier on internal validation prior to recalibration towards the original imbalance ratios

Fig. 5 Calibration plots across all prediction problems and databases for each sampling strategy and classifier on internal validation after recalibration towards the original imbalance ratios

Next, we were interested in whether the best sampling strategy in terms of AUROC could be identified prior to evaluating the performance on the test set by selecting the sampling strategy with the highest AUROC during CV. The option of no sampling strategy, i.e., the original data model, was also considered. Table 4 shows the resulting median AUROC differences across all prediction problems for each database and classifier on internal validation. Only for random forest did we find positive median AUROC differences, but these were not significantly different from zero. Hence, selecting the sampling strategy based on the highest AUROC during CV generally did not improve the test AUROC.

Table 4 Median internal AUROC differences (with interquartile range) across all prediction problems for each database and classifier when choosing the sampling strategy with the highest AUROC during CV

Finally, we investigated the impact of random sampling on external validation performance by assessing the external AUROC differences across all prediction tasks for each sampling strategy and classifier (Fig. 6). The results were consistent with internal validation; generally, random oversampling and random undersampling did not improve the AUROC on external validation compared to when no sampling strategy was applied. For random oversampling with random forest, we found more variation and larger drops in external validation AUROC. We also found that random oversampling and random undersampling generally did not improve the maximum F1-score on external validation compared to when no sampling strategy was applied (Additional file 4). The calibration plots on external validation before and after recalibration are available in Additional file 6. Consistent with internal validation, the calibration plots prior to recalibration indicate increased overestimation for random oversampling or random undersampling towards smaller target IRs. However, after recalibration, the calibration plots mostly resembled those of the original data models.

Fig. 6 External AUROC differences across all prediction problems and databases for each sampling strategy and classifier. A positive difference means the original data model had a lower AUROC; a negative difference means the original data model had a higher AUROC

Discussion

In this study, we empirically investigated the impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. We developed models for various outcomes of interest within a target population of people with pharmaceutically treated depression. We varied the classifier (lasso logistic regression, random forest, XGBoost), the sampling strategy (random oversampling, random undersampling), and the target IR (1, 2, 10, 20), and applied each combination across 58 prediction tasks (each a combination of a prediction problem and one of the four databases). Overall, we found that random oversampling or random undersampling towards different imbalance ratios generally does not improve the performance of prediction models developed in large observational health databases.

On internal validation, the impact of random sampling on model discrimination in terms of an increase or a decrease in the AUROC appeared limited for most models. Only the models for random oversampling with random forest showed more variation in AUROC difference and generally a substantial decrease in the AUROC compared to the original data model. The combination of oversampling with random forest and a target IR of 1 showed the largest drop in test AUROC. The impact on the AUROC appeared to vary more for datasets with a lower number of outcome events. Inspection of the calibration plots allowed us to investigate the impact of random sampling on model calibration. When random oversampling or random undersampling is applied, the outcome proportion in the data used to train the classifier is modified, and we therefore expected miscalibration. In line with our expectations, both random oversampling and random undersampling resulted in overestimated risks. We found that this miscalibration could largely be corrected by recalibrating the models towards the original IRs; after recalibration, the calibration plots resembled those of the original data models, although for random oversampling with random forest the recalibrated models appeared to underestimate risks instead. This highlights that it is important to be aware of the impact of random sampling on model calibration. In our previous systematic review, we found that calibration was often not evaluated at all [3]; we consider it likely that many researchers applying random sampling are not aware of the impact on model calibration and the consequent need for recalibration. For example, several recently published papers on clinical prediction modelling applied random sampling to balance the data used for model development without assessing calibration [23,24,25].

Most previous studies that investigated the impact of class imbalance methods on the performance of clinical prediction models only evaluated model discrimination using threshold-specific measures such as sensitivity, specificity, and positive predictive value. Thresholds are typically carefully selected within the specific clinical context, which makes it difficult to compare models based on threshold-specific measures. We were interested in investigating the impact of random oversampling and random undersampling on model performance across various outcomes of interest and therefore evaluated model discrimination using the AUROC, which provides a summary measure across all possible thresholds; this makes it difficult for us to directly compare our findings with previous literature. We are not aware of any previous study that has systematically identified a positive impact of random oversampling and random undersampling on the performance of prediction models developed in large observational health databases. One previous study investigated various class imbalance methods using data of cancer patients and suggests that a higher test AUROC could be found amongst these class imbalance methods compared to when no class imbalance method was applied [5]. However, the authors did not consistently identify the same method that would result in a higher AUROC, and it is unclear from this study whether the best class imbalance method could be identified prior to evaluating the performance on the test set. Additionally, calibration was not assessed. We investigated whether the best sampling strategy in terms of AUROC could be identified prior to evaluating the internal validation performance by selecting the sampling strategy with the highest AUROC during CV, and we generally found no improvement in the test AUROC.

Our findings were in line with a recent study focusing on logistic regression that found that completely balancing the data did not result in models with better performance [6]. More specifically, the authors found in a simulation study and a case study that random oversampling, random undersampling, and SMOTE did not improve model discrimination in terms of AUROC. Different from this previous study, our study investigated the impact of random oversampling and random undersampling on model performance for multiple imbalance ratios and multiple classifiers, using large and high-dimensional datasets from multiple observational health databases, and evaluated both internal and external validation. Our findings therefore allow us to extend the findings for random oversampling and random undersampling from this previous study to models developed using lasso logistic regression, random forest and XGBoost in large observational health databases. The authors similarly highlighted the miscalibration resulting from random sampling. SMOTE was proposed for continuous features and our datasets only contained binary features as candidate predictors; we were therefore not able to investigate SMOTE using our data [26].

Finally, we investigated the impact of random oversampling and random undersampling on external validation performance. To the best of our knowledge, no previous study has investigated whether random sampling would result in models with better generalizability and robustness by assessing external validation performance across various databases. Consistent with internal validation, the models for random oversampling with random forest showed more variation in AUROC difference on external validation and generally a substantial decrease in the AUROC. Otherwise, the AUROC differences were relatively small. Overall, the results suggest that random oversampling and random undersampling do not result in models with better generalizability and robustness.

A potential limitation of our study is that our results were based on outcomes of interest within a target population of people with pharmaceutically treated depression; we cannot guarantee that these findings will generalize across all prediction problems. Another limitation is that our results did not account for AUROC uncertainty that may occur due to a low outcome event count in the test set. Nevertheless, to the best of our knowledge, this is the first study that has empirically investigated the impact of random oversampling and random undersampling on the internal and external validation performance of prediction models developed in large observational health databases. By developing and validating models using data mapped to the OMOP CDM, we were able to develop a total of 1,566 prediction models and empirically investigate the impact of random oversampling and random undersampling on internal and external validation performance across four databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases. Future research could extend our analysis to other class imbalance methods.

Conclusions

In this study, we empirically investigated the impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. We developed models for various outcomes of interest within a target population of people with pharmaceutically treated depression across four large observational health databases. Overall, we found that random oversampling or random undersampling towards different imbalance ratios generally does not improve the performance of prediction models developed in large observational health databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases.

Availability of data and materials

The data that support the findings of this study are available from IBM and IQVIA, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. The IBM data that support the findings of this study are available from IBM MarketScan Research Databases (contact at: https://www.ibm.com/products/marketscan-research-databases/databases). The IQVIA data that support the findings of this study are available from IQVIA (contact at: https://www.iqvia.com/solutions/real-world-evidence/real-world-data-and-insights).

Abbreviations

AUROC:

Area Under the Receiver Operating Characteristic Curve

CCAE:

IBM MarketScan® Commercial Claims and Encounters Database

CDM:

Common Data Model

CV:

Cross-validation

EHR:

Electronic Health Record

IQVIA Germany:

IQVIA Disease Analyser Germany EMR

IR:

Imbalance Ratio

MDCD:

IBM MarketScan® Multi-State Medicaid Database

MDCR:

IBM MarketScan® Medicare Supplemental Database

OHDSI:

Observational Health Data Sciences and Informatics

OMOP:

Observational Medical Outcomes Partnership

PLP:

Patient-Level Prediction

SMOTE:

Synthetic Minority Oversampling Technique

USA:

United States of America

References

  1. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.

  2. Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Comput Surv. 2016;49(2):Article 31.

  3. Yang C, Kors JA, Ioannou S, John LH, Markus AF, Rekkas A, et al. Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review. J Am Med Inform Assoc. 2022;29:983–9.

  4. Liu J, Wong ZSY, So HY, Tsui KL. Evaluating resampling methods and structured features to improve fall incident report identification by the severity level. J Am Med Inform Assoc. 2021;28(8):1756–64.

  5. Fotouhi S, Asadi S, Kattan MW. A comprehensive data level analysis for cancer diagnosis on imbalanced data. J Biomed Inform. 2019;90:103089.

  6. van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. arXiv preprint arXiv:2202.09101. 2022.

  7. Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969–75.

  8. Khalid S, Yang C, Blacketer C, Duarte-Salles T, Fernández-Bertolín S, Kim C, et al. A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data. Comput Methods Programs Biomed. 2021;211: 106394.

  9. Reps JM, Williams RD, You SC, Falconer T, Minty E, Callahan A, et al. Feasibility and evaluation of a large-scale external validation approach for patient-level prediction in an international data network: validation of models predicting stroke in female patients newly diagnosed with atrial fibrillation. BMC Med Res Methodol. 2020;20(1):102.

  10. Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19(1):54–60.

  11. Collins GS, Ogundimu EO, Altman DG. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med. 2016;35(2):214–26.

  12. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.

  13. Friedman JH, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.

  14. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.

  15. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; San Francisco, California, USA: Association for Computing Machinery; 2016. p. 785–94.

  16. Reps JM, Ryan P, Rijnbeek P. Investigating the impact of development and internal validation design when training prognostic models using a retrospective cohort in big US observational healthcare data. BMJ Open. 2021;11(12): e050146.

  17. Blagus R, Lusa L. Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models. BMC Bioinform. 2015;16:363.

  18. Sun X, Xu W. Fast implementation of DeLong’s Algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process Lett. 2014;21(11):1389–93.

  19. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Bossuyt P, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.

  20. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167–76.

  21. Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer; 2008.

  22. Williams RD, Reps JM, Kors JA, Ryan PB, Steyerberg E, Verhamme KM, et al. Using iterative pairwise external validation to contextualize prediction model performance: a use case predicting 1-year heart failure risk in patients with diabetes across five data sources. Drug Saf. 2022;45(5):563–70.

  23. Chiew CJ, Liu N, Wong TH, Sim YE, Abdullah HR. Utilizing machine learning methods for preoperative prediction of postsurgical mortality and intensive care unit admission. Ann Surg. 2020;272(6):1133–9.

  24. Liu L, Ni Y, Zhang N, Nick Pratap J. Mining patient-specific and contextual data with machine learning technologies to predict cancellation of children’s surgery. Int J Med Inform. 2019;129:234–41.

  25. Makino M, Yoshimoto R, Ono M, Itoko T, Katsuki T, Koseki A, et al. Artificial intelligence predicts the progression of diabetic kidney disease using big data machine learning. Sci Rep. 2019;9(1):11862.

  26. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.

Acknowledgements

Not applicable.

Funding

This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement No 806968. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA.

Author information

Authors and Affiliations

Authors

Contributions

CY conducted investigation and formal analysis of the study and developed the initial manuscript. EAF, JAK, JMR, and PRR guided conceptualization, investigation, and formal analysis of the study. All authors reviewed, edited, and approved the manuscript before submission.

Corresponding author

Correspondence to Cynthia Yang.

Ethics declarations

Ethics approval and consent to participate

All patient data included in this study were deidentified. The New England Institutional Review Board (IRB) determined that studies conducted in these data are exempt from study-specific IRB review, as these studies do not qualify as human subjects research. No experiments were conducted on humans in this study. The research methods were conducted in accordance with appropriate guidelines.

Consent for publication

Not applicable.

Competing interests

JMR is an employee of Janssen Research and Development and a shareholder of Johnson and Johnson. CY, EAF, JAK and PRR work for a research group that received unconditional research grants from Janssen Research and Development, none of which relate to the content of this work.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Database descriptions.

Additional file 2.

Outcome event counts and proportions.

Additional file 3.

Candidate predictors per database.

Additional file 4.

Differences in maximum F1-score.

Additional file 5.

Internal calibration plots per outcome.

Additional file 6.

External calibration plots per database.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Yang, C., Fridgeirsson, E.A., Kors, J.A. et al. Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. J Big Data 11, 7 (2024). https://doi.org/10.1186/s40537-023-00857-7
