Design matters in patient-level prediction: evaluation of a cohort vs. case-control design when developing predictive models in observational healthcare datasets

Reps, Jenna M.; Ryan, Patrick B.; Rijnbeek, Peter R.; Schuemie, Martijn J.

doi:10.1186/s40537-021-00501-2

Journal of Big Data

Table 1 The potential issues with the case-control designs

Issue	Description	Issue in cohort design?	Issue in case-control design?
Subjective data extraction methodology choices	The design requires subjective methodology choices that may differ between researchers	Not if problem is well defined with specified target population, outcome and time-at-risk	Yes—matching choice can differ (e.g., matching criteria, matching ratio, whether to remove unmatched cases)
Selection bias	Data used to train model may not be representative of target population	Potentially if the database has a bias	Potentially due to poor matching design and if the database has a bias
Covariate issue/protopathic bias [13]	Includes problematic covariates that are precursors of the outcome (e.g., symptoms/tests of outcome)	Potentially if the target population index date is chosen incorrectly. Easily solved by improving target population criteria or adding a gap between index and time-at-risk (e.g., predict outcome 60 days to 365 days after index)	Potentially an issue if using data around outcome record (e.g., 1 day before) for feature engineering. Can be difficult to solve.
Performance metric bias	Optimistic performance reported due to under-sampling non-outcomes	No	Potentially if matching ratio not representative of true outcome ratio (e.g., precision will be higher in case-control data with outcome class over-represented)
Miscalibration issue	The predicted risk does not match the true risk	Yes (moderate chance)—if the outcome proportion changes over time or the machine learning model does not calibrate well	Yes (high chance)—if the outcome proportion is not representative due to over-representing the outcome class or the machine learning model does not calibrate well
Ill-defined time to apply model	No clear point in time for clinical implementation of model (where the performance has been assessed)	No—index well defined by target population criteria	Yes—no clear index as design is centered around outcome (which is unknown at the point in time the model will be applied)
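
The performance metric bias and miscalibration rows can be illustrated with a small worked example. The sketch below is not taken from the article: the function names, the sensitivity and specificity values, and the 1% outcome proportion are hypothetical choices for illustration. It shows how precision changes when the same classifier is evaluated on 1:1 matched case-control data versus a cohort with a rare outcome, and how a predicted risk could be shifted to the true outcome proportion using a standard odds-based prior correction, which assumes a logistic-type model and sampling that depends only on the outcome.

```python
def precision(sensitivity, specificity, prevalence):
    """Precision (PPV) implied by fixed sensitivity/specificity at a given outcome proportion."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def recalibrate_risk(p_sample, sample_prevalence, true_prevalence):
    """Shift a risk estimated at the sampled outcome proportion to the true one.

    Assumption (not from the article): odds-based prior correction, which
    rescales the predicted odds by the ratio of true to sampled outcome odds.
    """
    odds = p_sample / (1.0 - p_sample)
    true_odds = true_prevalence / (1.0 - true_prevalence)
    sample_odds = sample_prevalence / (1.0 - sample_prevalence)
    corrected = odds * (true_odds / sample_odds)
    return corrected / (1.0 + corrected)


# Hypothetical classifier: 80% sensitivity, 90% specificity.
ppv_case_control = precision(0.8, 0.9, prevalence=0.5)   # 1:1 matched data
ppv_cohort = precision(0.8, 0.9, prevalence=0.01)        # 1% outcome proportion

print(f"precision in 1:1 case-control data: {ppv_case_control:.3f}")
print(f"precision in cohort (1% outcome):   {ppv_cohort:.3f}")
print(f"risk 0.5 from 1:1 data, corrected:  {recalibrate_risk(0.5, 0.5, 0.01):.3f}")
```

Under these assumed values the same classifier has a precision of about 0.89 on the matched data and about 0.07 in the cohort, which is the kind of optimism the table describes. The correction step is only a sketch; it does not address the other issues listed, such as the ill-defined index date.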
