Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data

Background: There is currently no consensus on the impact of class imbalance methods on the performance of clinical prediction models. We aimed to empirically investigate the impact of random oversampling and random undersampling, two commonly used class imbalance methods, on the internal and external validation performance of prediction models developed using observational health data. Methods: We developed and externally validated prediction models for various outcomes of interest within a target population of people with pharmaceutically treated depression across four large observational health databases. We used three different classifiers (lasso logistic regression, random forest, XGBoost) and varied the target imbalance ratio. We evaluated the impact on model performance in terms of discrimination and calibration. Discrimination was assessed using the area under the receiver operating characteristic curve (AUROC) and calibration was assessed using calibration plots. Results: We developed and externally validated a total of 1,566 prediction models. On internal and external validation, random oversampling and random undersampling generally did not result in higher AUROCs. Moreover, we found overestimated risks, although this miscalibration could largely be corrected by recalibrating the models towards the imbalance ratios in the original dataset. Conclusions: Overall, we found that random oversampling or random undersampling generally does not improve the internal and external validation performance of prediction models developed in large observational health databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases.


Background
Many datasets used for clinical prediction modeling exhibit an unequal distribution between their outcome classes and are hence imbalanced; typically, only a small proportion of patients in a target population experiences a certain outcome of interest.
In the machine learning literature, the term class imbalance problem has been used to describe a situation in which a classifier may not be suitable for imbalanced data. It has been suggested that a prediction model developed using imbalanced data may become biased towards the larger class (also referred to as the majority class) and may be more likely to misclassify the smaller class (also referred to as the minority class) [1]. As a result, various methods have been proposed to improve prediction performance when developing prediction models using imbalanced data [1,2]. Such methods are also referred to as class imbalance methods.
In our previous systematic review on clinical prediction modeling using electronic health record (EHR) data, we found that class imbalance methods were increasingly applied in the period 2009-2019 [3]. However, there is currently no consensus on the impact of class imbalance methods on the performance of clinical prediction models. Several previous studies suggest that class imbalance methods may indeed improve performance of clinical prediction models [4,5]. In contrast, a recent study focusing on logistic regression investigated random oversampling, random undersampling, and Synthetic Minority Oversampling Technique (SMOTE), and found that balancing data using these methods generally did not improve model discrimination [6]. These previous studies focused on low-dimensional datasets with smaller sample sizes; the impact of class imbalance methods on the performance of prediction models developed in large observational health databases is still unclear. Observational health data typically contain information on thousands of features concerning health conditions and drugs that are routinely recorded in a patient's medical history.
Additionally, to the best of our knowledge, no previous study investigating the impact of class imbalance methods has also assessed external validation performance. External validation refers to evaluating the model performance on data from databases that were not used during model development, while internal validation refers to evaluating the model performance on data from the same database that was used to train the model, for example by using a train-test split. Although good internal validation performance should be an initial requirement for a prediction model, model performance often drops on external validation. We are interested in whether class imbalance methods would result in models with better generalizability and robustness.
The aim of this study is to empirically investigate the impact of random oversampling and random undersampling, two commonly used class imbalance methods, on the internal and external validation performance of prediction models developed using observational health data. We developed and validated models for various outcomes within a target population of people with pharmaceutically treated depression across four large observational health databases. We used three different classifiers (lasso logistic regression, random forest, XGBoost) and varied the target imbalance ratio.

Methods
In this study, we developed and validated prediction models using the Patient-Level Prediction (PLP) framework from the Observational Health Data Sciences and Informatics (OHDSI) initiative [7]. To improve the interoperability of originally heterogeneous data sources, OHDSI uses the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), which transforms source data into a common format using a set of common terminologies, vocabularies, and coding schemes [8]. The OHDSI PLP framework in turn allows for standardized development and extensive validation of prediction models across observational health databases that are mapped to the OMOP CDM [8,9].

Data description
We used four observational health databases: three large claims databases from the United States of America (USA) and one large EHR database from Germany with data mapped to the OMOP CDM. The databases are listed in Table 1 and a description of each database is provided in Additional file 1. Each site obtained institutional review board approval for the study or used de-identified data. Therefore, informed consent was not necessary at any site.
For each database, we investigated 21 different outcomes of interest within a target population of people with pharmaceutically treated depression, as described in the OHDSI PLP framework paper [7]. For each of these 21 different outcomes, the prediction problem was defined as follows: "Amongst a target population of patients with pharmaceutically treated depression, which patients will develop <the outcome> during the 1-year time interval following the start of the depression treatment (the index event)?" The aim of this study was not to obtain the best possible models for these prediction problems but to empirically investigate the impact of random oversampling and random undersampling on model performance. For consistency across the experiments and to reduce computational efforts, we sampled an initial study population of 100,000 patients from each database. Further inclusion criteria were then applied to obtain the final study population for each outcome of interest within each database [7]: (1) a minimum of 365 days of observation in the database prior to index, and (2) no record of the specific outcome of interest any time prior to index. Additional file 2: Table S1 provides the observed outcome event count and the observed outcome event proportion in the final study population for all prediction outcomes of interest and all databases. For some outcomes in IQVIA Germany, no outcome events were observed. Across all remaining study populations, the observed outcome event count ranged from 32 (0.03%) to 7,365 (10.44%).

Candidate predictors
Candidate predictors were extracted from data routinely recorded in the databases. These included binary indicators of 5-year age groups (0-4, 5-9, etc.) and sex, as well as a large set of binary indicators of recorded OMOP CDM concepts for health conditions and drug groups [10].

Handling of missing data
Observational health data rarely indicate whether an unrecorded feature is truly absent or simply missing.
In the observational health data used in this study, if a candidate predictor was not recorded in a patient's history, the candidate predictor defaulted to a value of 0 (corresponding to not observed) for this patient. Age group and sex are required by the OMOP CDM and were always recorded.

Statistical analysis methods
For our experiments, we varied the prediction task, the sampling strategy, and the classifier, resulting in a total of 58 (prediction tasks) × 9 (8 sampling strategies + 1 control) × 3 (classifiers) = 1,566 prediction models. The details of what was varied are described in the rest of this section. We refer to a combination of one of the 21 prediction problems and one of the four databases as a prediction task. For each prediction task, a random stratified subset of 75% of the patients in the final study population was used as a training set and the remaining subset of 25% of the patients was used as a test set. To increase statistical power for our analysis, prediction tasks for which the test set contained fewer than 100 outcome events were excluded from further analysis [11]. This resulted in a total of 58 prediction tasks across the four databases; two outcomes ('acute liver injury inpatient' and 'decreased libido') were omitted from further analysis. The imbalance ratio (IR) is defined as the number of patients who do not experience the outcome (the majority class) divided by the number of patients who do experience the outcome (the minority class). An IR = 1 hence represents balanced data, while data with an IR > 100 are typically considered severely imbalanced [12]. The original IRs ($\mathrm{IR}_{\text{original}}$) in the final study populations ranged from 8.6 to 245.3 with a median of 84.0 (Table 2).
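As an illustration of the IR definition, the following minimal Python sketch computes the IR from a vector of binary outcome labels and from an outcome event proportion. The function names are hypothetical and are not part of the OHDSI PLP framework; the sketch assumes that the outcome is coded as 1 and non-outcome as 0.

```python
def imbalance_ratio(labels):
    """IR = patients without the outcome / patients with the outcome."""
    n_outcome = sum(1 for y in labels if y == 1)
    n_no_outcome = sum(1 for y in labels if y == 0)
    if n_outcome == 0:
        raise ValueError("IR is undefined when no outcome events are observed")
    return n_no_outcome / n_outcome


def ir_from_proportion(outcome_proportion):
    """Convert an outcome event proportion p into IR = (1 - p) / p."""
    return (1.0 - outcome_proportion) / outcome_proportion


# Toy data (not study data): 10 outcome events among 1,000 patients.
toy_labels = [1] * 10 + [0] * 990
print(imbalance_ratio(toy_labels))           # 99.0
print(round(ir_from_proportion(0.1044), 1))  # 8.6
```

An outcome event proportion of 10.44% thus corresponds to an IR of about 8.6, which is consistent with the lowest original IR reported above.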
First, we developed an original data model (without sampling strategy) for each of the 58 prediction tasks. We then investigated random oversampling and random undersampling: for random oversampling, data from the minority class were randomly replicated (with replacement) and added to the original dataset; for random undersampling, data from the majority class were randomly selected and removed from the original dataset.
We randomly sampled towards a target IR, $\mathrm{IR}_{\text{target}} = \min(\mathrm{IR}_{\text{original}}, x)$ with $x \in \{20, 10, 2, 1\}$; combined with the two sampling methods, this resulted in a total of eight different sampling strategies.
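The two sampling methods and the target IR can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the OHDSI PLP framework: the function names are hypothetical, the outcome is assumed to be coded as 1 (minority class) and 0 (majority class), and class sizes are rounded to the nearest integer.

```python
import random


def target_ir(ir_original, x):
    """IR_target = min(IR_original, x): never sample towards more imbalance."""
    return min(ir_original, x)


def split_classes(labels):
    """Return row indices of the minority (1) and majority (0) class."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    return minority, majority


def random_undersample(labels, ir_target, rng):
    """Randomly remove majority rows until majority / minority is about ir_target."""
    minority, majority = split_classes(labels)
    n_keep = min(len(majority), round(ir_target * len(minority)))
    return sorted(minority + rng.sample(majority, n_keep))


def random_oversample(labels, ir_target, rng):
    """Randomly replicate minority rows (with replacement) until the IR is about ir_target."""
    minority, majority = split_classes(labels)
    n_minority = max(len(minority), round(len(majority) / ir_target))
    extra = rng.choices(minority, k=n_minority - len(minority))
    return sorted(minority + majority + extra)


rng = random.Random(42)
labels = [1] * 10 + [0] * 1000                 # toy data, IR_original = 100
ir_t = target_ir(ir_original=100, x=20)        # 20
under = random_undersample(labels, ir_t, rng)  # 10 events + 200 non-events
over = random_oversample(labels, ir_t, rng)    # 50 events + 1000 non-events
print(len(under), len(over))                   # 210 1050
```

Both functions return row indices into the original dataset, so the same indices can be used to subset the predictor matrix and the outcome vector. When the original IR is already below x, the target IR equals the original IR and the data are left unchanged.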
Three different classifiers were considered: L1-regularized logistic regression (also known as "lasso logistic regression" or "lasso"), random forest, and XGBoost. The algorithms were all implemented within the OHDSI PLP framework [7], with lasso logistic regression implemented using the glmnet R package [13], random forest using the Scikit-learn Python package [14], and XGBoost using the xgboost R package [15]. The model development and internal validation procedure is illustrated in Fig. 1. First, we performed hyperparameter tuning using threefold cross-validation (CV) on the training set [16]. The sampling strategy was only applied to the training folds within CV; it was not applied to the validation fold to allow for a realistic evaluation of the model during CV [17]. Next, the model was refit on the full training set using the tuned hyperparameters, and the final model was internally validated on the test set (i.e., the held out 25% of patients from the development database).
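The point that sampling is applied to the training folds only can be made concrete with a small sketch. The code below is an illustration under stated assumptions rather than the OHDSI PLP implementation: the helper names are hypothetical, the fold assignment is a simple stratified round-robin, and the example sampler undersamples to a target IR of 1. Model fitting is omitted; the sketch only shows which rows would be used for fitting and for validation.

```python
import random


def stratified_folds(labels, k, rng):
    """Assign row indices to k folds, spreading each class evenly over the folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def undersample_to_balance(train_labels, rng):
    """Example sampler: positions of a balanced (IR = 1) random undersample."""
    minority = [p for p, y in enumerate(train_labels) if y == 1]
    majority = [p for p, y in enumerate(train_labels) if y == 0]
    return minority + rng.sample(majority, min(len(minority), len(majority)))


def cv_splits(labels, k, sampler, rng):
    """Yield (train_idx, valid_idx); the sampler only ever sees training rows."""
    folds = stratified_folds(labels, k, rng)
    for v in range(k):
        valid_idx = sorted(folds[v])  # validation fold keeps the original IR
        train_idx = [i for f in range(k) if f != v for i in folds[f]]
        picked = sampler([labels[i] for i in train_idx], rng)
        yield sorted(train_idx[p] for p in picked), valid_idx


rng = random.Random(0)
labels = [1] * 30 + [0] * 300  # toy data, IR_original = 10
splits = list(cv_splits(labels, k=3, sampler=undersample_to_balance, rng=rng))
for train_idx, valid_idx in splits:
    n_train_events = sum(labels[i] for i in train_idx)
    n_valid_events = sum(labels[i] for i in valid_idx)
    # Each line prints: 40 20 110 10
    print(len(train_idx), n_train_events, len(valid_idx), n_valid_events)
```

In each split the training rows are balanced (20 events among 40 rows), while the validation fold retains the original outcome proportion (10 events among 110 rows), so performance estimated during CV reflects the data distribution on which the model would actually be applied.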

Model evaluation
We evaluated model discrimination for each developed model using the area under the receiver operating characteristic curve (AUROC) with 95% confidence intervals [18]. The impact of the sampling strategy on model discrimination was then assessed using the difference from the original data model AUROC, calculated as internal AUROC difference = $\mathrm{AUROC}_{\text{sampled, internal}} - \mathrm{AUROC}_{\text{original, internal}}$, where $\mathrm{AUROC}_{\text{original, internal}}$ is the AUROC on internal validation of the original data model, for which no sampling strategy was applied. A positive AUROC difference therefore means that the sampling strategy resulted in an increased AUROC compared to when no sampling strategy was applied, while a negative AUROC difference means that the sampling strategy resulted in a decreased AUROC. We also evaluated discrimination using the maximum F1-score across all prediction thresholds for each model. The impact of the sampling strategy on model calibration (in the moderate sense) was assessed using plots of the mean predicted risks against the observed outcome event proportions, grouped by percentiles of each model's predicted risks [19,20]. Without sampling, the mean predicted risks and the observed outcome event proportions are expected to be equal on internal validation. However, when random oversampling or random undersampling is applied, the outcome proportion in the data used to train the classifier is modified, resulting in a mismatch between the predicted risks and the observed outcome event proportions, and thus miscalibration is expected. We investigated whether this miscalibration could be corrected by recalibrating the models towards the original IRs, and we assessed the calibration plots both before and after recalibration [21]. Recalibration towards the original IR was done by adding a correction term $C$ to the log odds of the estimated probability for individual $i$: $\ln \mathrm{odds}(\mathrm{probability}_{i,\text{recalibrated}}) = \ln \mathrm{odds}(\mathrm{probability}_{i,\text{estimated}}) + C$.
We also investigated whether the best sampling strategy in terms of AUROC could be identified prior to evaluating the performance on the test set by selecting the sampling strategy with the highest AUROC during CV. We considered the eight different sampling strategies as well as the option of no sampling strategy, i.e., the original data model, and assessed the internal AUROC difference. We tested whether the median AUROC differences between the original data model and the selected model were significantly different from 0 using the Wilcoxon signed-rank test (p < 0.05).
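The recalibration towards the original IR can be sketched as an intercept shift on the log-odds scale. The value of the correction term is not spelled out in this section; the sketch below assumes the commonly used prior correction, in which the correction equals the log of the ratio between the outcome odds in the original data and the outcome odds in the sampled data. Since the outcome odds equal 1/IR, this gives C = ln(IR target / IR original). The function names are hypothetical.

```python
import math


def logit(p):
    """Log odds of a probability p, which must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def correction_term(ir_original, ir_target):
    """Assumed prior correction: C = ln(odds_original / odds_sampled).

    With outcome odds = 1 / IR this simplifies to ln(IR_target / IR_original),
    which is negative whenever sampling increased the outcome proportion.
    """
    return math.log(ir_target / ir_original)


def recalibrate(p_estimated, ir_original, ir_target):
    """Shift the log odds of an estimated risk by the correction term C."""
    c = correction_term(ir_original, ir_target)
    return inv_logit(logit(p_estimated) + c)


# Toy example: a model trained on balanced data (IR_target = 1) predicts 0.5,
# i.e. the outcome proportion of the sampled data. With IR_original = 99 the
# recalibrated risk is the original outcome proportion, about 0.01.
print(recalibrate(0.5, ir_original=99, ir_target=1))
```

Under this assumption, a prediction equal to the outcome proportion in the sampled training data is mapped back to the outcome proportion in the original data, and no shift is applied when the target IR equals the original IR.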
Finally, we investigated whether the sampling strategy resulted in better generalizability and robustness by externally validating each developed model across the other databases [22]. To increase statistical power for our analysis, external validation tasks for which the external validation dataset contained fewer than 100 outcome events were excluded from further analysis [11]. We evaluated the impact of the sampling strategy on model discrimination using the external AUROC difference = $\mathrm{AUROC}_{\text{sampled, external}} - \mathrm{AUROC}_{\text{original, external}}$, where $\mathrm{AUROC}_{\text{original, external}}$ is the AUROC on external validation of the original data model, for which no sampling strategy was applied. The impact of the sampling strategy on model calibration was assessed in the same way as on internal validation.
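The AUROC difference used on internal and external validation can be illustrated with a small, dependency-free Python sketch. The AUROC is computed here from its pairwise definition (the proportion of outcome and non-outcome pairs that are ranked correctly, with ties counted as one half), which is simple but quadratic in the number of patients and therefore only suitable for illustration. The function names and the toy predictions are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (outcome, non-outcome) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC requires at least one patient in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


def auroc_difference(labels, scores_sampled, scores_original):
    """AUROC of the sampled data model minus AUROC of the original data model."""
    return auroc(labels, scores_sampled) - auroc(labels, scores_original)


# Toy validation set with predictions from two hypothetical models.
labels = [0, 0, 1, 1]
scores_original = [0.10, 0.40, 0.35, 0.80]
scores_sampled = [0.10, 0.30, 0.35, 0.80]
print(auroc(labels, scores_original))                             # 0.75
print(auroc_difference(labels, scores_sampled, scores_original))  # 0.25
```

A positive value indicates that the sampled data model ranked patients better than the original data model on the same validation set. Because the AUROC depends only on the ranking of the predicted risks, adding a constant to the log odds of every prediction leaves the AUROC unchanged.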
Detailed inclusion criteria and outcome definitions, including code lists, as well as the analytical source code that was used for the analysis, including example code, are available at: https://github.com/mi-erasmusmc/RandomSamplingPrediction.

Results
For each of the 58 prediction tasks and each of the three classifiers, we developed one original data model, for which no sampling strategy was applied, and eight models for which a sampling strategy was applied. We hence developed and externally validated a total of 1,566 prediction models. The original AUROCs ranged from 0.58 to 0.87 (Table 3).
First, we investigated the impact on model discrimination in terms of AUROC difference for each sampling strategy and classifier on internal validation (Fig. 2). Although there were some cases with a positive AUROC difference, indicating that the sampling strategy resulted in a higher AUROC compared to when no sampling strategy was applied, random oversampling and random undersampling generally did not improve the AUROC. For lasso logistic regression and XGBoost, the impact of random sampling on model discrimination was relatively small, with a maximum absolute difference in AUROC below 0.06. However, for random oversampling with random forest, we observed a larger impact on model discrimination; the AUROC differences had a wider range, with the largest difference around −0.3. Moreover, we investigated the AUROC differences for each sampling strategy and classifier by number of outcome events on internal validation (Fig. 3). It appears that overall, and in particular for random oversampling with random forest, the impact of random sampling on the AUROC shows more variation when the number of outcome events is lower. We also investigated the impact on model discrimination in terms of difference in maximum F1-score for each sampling strategy and classifier on internal validation (Additional file 4). We found that random oversampling and random undersampling generally did not improve the maximum F1-score.
Figure 4 shows that model calibration on internal validation clearly deteriorated for all sampling strategies, for all three classifiers. More specifically, the calibration plots indicate increased overestimation for random oversampling or random undersampling towards smaller target IRs, compared to the original data model. This is in line with expectations, since the models with smaller target IRs were trained using increased outcome proportions. To investigate whether this miscalibration could be corrected, we recalibrated the models towards the original IRs. Figure 5 shows that after recalibration, the calibration plots resembled those of the original data models, although for random oversampling with random forest the recalibrated models appeared to underestimate risks.
Next, we were interested in whether the best sampling strategy in terms of AUROC could be identified prior to evaluating the performance on the test set by selecting the sampling strategy with the highest AUROC during CV. The option of no sampling strategy, i.e., the original data model, was also considered. Table 4 shows the resulting median AUROC differences across all prediction problems for each database and classifier on internal validation. Only for random forest did we find positive median AUROC differences, but these were not significantly different from zero. Hence, selecting the sampling strategy with the highest AUROC during CV generally did not improve the test AUROC.
Finally, we investigated the impact of random sampling on external validation performance by assessing the external AUROC differences across all prediction tasks for each sampling strategy and classifier (Fig. 6). The results were consistent with internal validation; generally, random oversampling and random undersampling did not improve the AUROC on external validation compared to when no sampling strategy was applied. For random oversampling with random forest, we found more variation and larger drops in external validation AUROC. We also found that random oversampling and random undersampling generally did not improve the maximum F1-score on external validation compared to when no sampling strategy was applied (Additional file 4). The calibration plots on external validation before and after recalibration are available in Additional file 6. Consistent with internal validation, the calibration plots prior to recalibration indicate increased overestimation for random oversampling or random undersampling towards smaller target IRs. However, after recalibration, the calibration plots mostly resembled those of the original data models.

Fig. 4 Calibration plots across all prediction problems and databases for each sampling strategy and classifier on internal validation prior to recalibration towards the original imbalance ratios

Fig. 5 Calibration plots across all prediction problems and databases for each sampling strategy and classifier on internal validation after recalibration towards the original imbalance ratios

Discussion
In this study, we empirically investigated the impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. We developed models for various outcomes of interest within a target population of people with pharmaceutically treated depression. We varied the classifier (lasso logistic regression, random forest, XGBoost), the sampling strategy (random oversampling, random undersampling), and the target IR (20, 10, 2, or 1) and applied each combination across 58 prediction tasks (each a combination of a prediction problem and one of the four databases). Overall, we found that random oversampling or random undersampling towards different imbalance ratios generally does not improve the performance of prediction models developed in large observational health databases. On internal validation, the impact of random sampling on model discrimination in terms of an increase or a decrease in the AUROC appeared limited for most models. Only the models for random oversampling with random forest showed more variation in AUROC difference and generally a substantial decrease in the AUROC compared to the original data model. The combination of oversampling with random forest and a target IR of 1 showed the largest drop in test AUROC. The impact on the AUROC appeared to vary more for datasets with a lower number of outcome events.
Inspection of the calibration plots allowed us to investigate the impact of random sampling on model calibration. When random oversampling or random undersampling is applied, the outcome proportion in the data used to train the classifier is modified, and we therefore expected miscalibration. In line with our expectations, both random oversampling and random undersampling resulted in overestimated risks. We found that this miscalibration could largely be corrected by recalibrating the models towards the original IRs; after recalibration, the calibration plots resembled those of the original data models, although for random oversampling with random forest the recalibrated models appeared to underestimate risks instead. This highlights that it is important to be aware of the impact of random sampling on model calibration. In our previous systematic review, we found that calibration was often not evaluated at all [3]; we consider it likely that many researchers applying random sampling are not aware of the impact on model calibration and the consequent need for recalibration. For example, several recently published papers on clinical prediction modeling applied random sampling to balance the data used for model development without assessing calibration [23-25].
Most previous studies that investigated the impact of class imbalance methods on the performance of clinical prediction models only evaluated model discrimination using threshold-specific measures such as sensitivity, specificity, and positive predictive value. Thresholds are typically carefully selected within the specific clinical context, which makes it difficult to compare models based on threshold-specific measures. We were interested in investigating the impact of random oversampling and random undersampling on model performance across various outcomes of interest and therefore evaluated model discrimination using the AUROC, which provides a summary measure across all possible thresholds; this makes it difficult for us to directly compare our findings with previous literature. We are not aware of any previous study that has systematically identified a positive impact of random oversampling and random undersampling on the performance of prediction models developed in large observational health databases. One previous study investigated various class imbalance methods using data of cancer patients and suggests that a higher test AUROC could be found amongst these class imbalance methods compared to when no class imbalance method was applied [5]. However, the authors did not consistently identify the same method that would result in a higher AUROC, and it is unclear from this study whether the best class imbalance method could be identified prior to evaluating the performance on the test set. Additionally, calibration was not assessed. We investigated whether the best sampling strategy in terms of AUROC could be identified prior to evaluating the internal validation performance by selecting the sampling strategy with the highest AUROC during CV, and we generally found no improvement in the test AUROC.
Our findings were in line with a recent study focusing on logistic regression that found that completely balancing the data did not result in models with better performance [6]. More specifically, the authors found in a simulation study and a case study that random oversampling, random undersampling, and SMOTE did not improve model discrimination in terms of AUROC. Different from this previous study, our study investigated the impact of random oversampling and random undersampling on model performance for multiple imbalance ratios and multiple classifiers, using large and high-dimensional datasets from multiple observational health databases, and evaluated both internal and external validation. Our findings therefore allow us to extend the findings for random oversampling and random undersampling from this previous study to models developed using lasso logistic regression, random forest and XGBoost in large observational health databases. The authors similarly highlighted the miscalibration resulting from random sampling. SMOTE was proposed for continuous features and our datasets only contained binary features as candidate predictors; we were therefore not able to investigate SMOTE using our data [26].
Finally, we investigated the impact of random oversampling and random undersampling on external validation performance. To the best of our knowledge, no previous study has investigated whether random sampling would result in models with better generalizability and robustness by assessing external validation performance across various databases. We found that consistent with internal validation, on external validation the models for random oversampling with random forest showed more variation in AUROC difference and generally a substantial decrease in the AUROC. Otherwise, the AUROC differences were relatively small. Overall, the results suggest that random oversampling and random undersampling do not result in models with better generalizability and robustness.
A potential limitation of our study is that our results were based on outcomes of interest within a target population of people with pharmaceutically treated depression; we cannot guarantee that these findings will generalize across all prediction problems. A second potential limitation is that our results did not account for AUROC uncertainty that may occur due to a low outcome event count in the test set. Nevertheless, to the best of our knowledge, this is the first study that has empirically investigated the impact of random oversampling and random undersampling on the internal and external validation performance of prediction models developed in large observational health databases. By developing and validating models using data mapped to the OMOP CDM, we were able to develop a total of 1,566 prediction models and empirically investigate the impact of random oversampling and random undersampling on internal and external validation performance across four databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases. Future research could extend our research to other class imbalance methods.

Conclusions
In this study, we empirically investigated the impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. We developed models for various outcomes of interest within a target population of people with pharmaceutically treated depression across four large observational health databases. Overall, we found that random oversampling or random undersampling towards different imbalance ratios generally does not improve the performance of prediction models developed in large observational health databases. Based on our findings, we do not recommend applying random oversampling or random undersampling when developing prediction models in large observational health databases.

Fig. 1 Flow chart of the model development and internal validation procedure

Fig. 2 Internal AUROC differences across all prediction problems and databases for each sampling strategy and classifier. A positive difference means that the original data model had a lower AUROC, and a negative difference means that the original data model had a higher AUROC

Fig. 3 Internal AUROC differences across all prediction problems and databases for each sampling strategy and classifier by number of outcome events. A positive difference means that the original data model had a lower AUROC, and a negative difference means that the original data model had a higher AUROC

Fig. 6 External AUROC differences across all prediction problems and databases for each sampling strategy and classifier. A positive difference means that the original data model had a lower AUROC, and a negative difference means that the original data model had a higher AUROC

Table 1 Databases included in the study with data mapped to the OMOP CDM

Table 2 Original imbalance ratios

Table 4 Median internal AUROC differences (with interquartile range) across all prediction problems for each database and classifier when choosing the sampling strategy with the highest AUROC during CV