Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data

Journal of Big Data

Table 4 Median internal AUROC differences (interquartile range in parentheses) across all prediction problems, for each database and classifier, when the sampling strategy with the highest AUROC during cross-validation (CV) is chosen

Database	Number of prediction tasks	Lasso	Random forest	XGBoost	All classifiers
CCAE	14	−0.0025 (0.0073)	0.0001 (0.0106)	0 (0.0067)	−0.0004 (0.0076)
MDCD	17	−0.0004 (0.0044)	0 (0.0062)	0 (0.0071)	0 (0.0068)
MDCR	19	0.0000 (0.0052)	0.0037 (0.0143)	0 (0.0057)	0 (0.0075)
IQVIA Germany	8	−0.0011 (0.0048)	0.0012 (0.0095)	−0.0045 (0.0204)	−0.0010 (0.0098)
All databases	58	−0.0004 (0.0053)	0.0008 (0.0099)	0 (0.0074)	0 (0.0081)
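
One way a single cell of Table 4 could be computed is sketched below. This is a minimal illustration, not the authors' code. It assumes that each difference is the internal test AUROC of the CV-selected sampling strategy minus that of a no-sampling baseline, which the caption does not state explicitly. The strategy names, the function names, and all AUROC values are hypothetical and are not taken from the study.

```python
from statistics import median, quantiles


def select_by_cv(cv_auroc):
    """Return the sampling strategy with the highest AUROC during CV."""
    return max(cv_auroc, key=cv_auroc.get)


def auroc_difference(cv_auroc, test_auroc, baseline="none"):
    """Internal AUROC of the CV-selected strategy minus the baseline AUROC.

    Assumption: the baseline is a model trained without any sampling.
    """
    chosen = select_by_cv(cv_auroc)
    return test_auroc[chosen] - test_auroc[baseline]


def median_and_iqr(differences):
    """Median and interquartile range (Q3 - Q1) of the differences."""
    q1, _, q3 = quantiles(differences, n=4, method="inclusive")
    return median(differences), q3 - q1


# Hypothetical prediction tasks for one database and one classifier.
# Each task has a CV AUROC and an internal test AUROC per strategy.
tasks = [
    {"cv": {"none": 0.80, "over": 0.81, "under": 0.79},
     "test": {"none": 0.780, "over": 0.785, "under": 0.770}},
    {"cv": {"none": 0.75, "over": 0.74, "under": 0.73},
     "test": {"none": 0.740, "over": 0.735, "under": 0.720}},
    {"cv": {"none": 0.70, "over": 0.69, "under": 0.71},
     "test": {"none": 0.700, "over": 0.690, "under": 0.690}},
]

differences = [auroc_difference(t["cv"], t["test"]) for t in tasks]
med, iqr = median_and_iqr(differences)
print(f"{med:.4f} ({iqr:.4f})")
```

In this sketch, a task for which the no-sampling baseline has the highest CV AUROC contributes a difference of exactly 0, so a median of 0 would appear whenever the baseline is selected for the middle-ranked task or tasks.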