
Table 1 The potential issues with the case-control designs

From: Design matters in patient-level prediction: evaluation of a cohort vs. case-control design when developing predictive models in observational healthcare datasets

| Issue | Description | Issue in cohort design? | Issue in case-control design? |
| --- | --- | --- | --- |
| Subjective data extraction methodology choices | The design requires subjective methodology choices that may differ between researchers | Not if the problem is well defined with a specified target population, outcome and time-at-risk | Yes: matching choices can differ (e.g., matching criteria, matching ratio, whether to remove unmatched cases) |
| Selection bias | Data used to train the model may not be representative of the target population | Potentially, if the database has a bias | Potentially, due to poor matching design and if the database has a bias |
| Covariate issue/protopathic bias [13] | Includes problematic covariates that are precursors of the outcome (e.g., symptoms/tests of outcome) | Potentially, if the target population index date is chosen incorrectly. Easily solved by improving the target population criteria or adding a gap between index and time-at-risk (e.g., predict outcome 60 days to 365 days after index) | Potentially an issue if using data around the outcome record (e.g., 1 day before) for feature engineering. Can be difficult to solve. |
| Performance metric bias | Optimistic performance reported due to under-sampling non-outcomes | No | Potentially, if the matching ratio is not representative of the true outcome ratio (e.g., precision will be higher in case-control data with the outcome class over-represented) |
| Miscalibration issue | The predicted risk does not match the true risk | Yes (moderate chance): if the outcome proportion changes over time or the machine learning model does not calibrate well | Yes (high chance): if the outcome proportion is not representative due to over-representing the outcome class or the machine learning model does not calibrate well |
| Ill-defined time to apply model | No clear point in time for clinical implementation of the model (where the performance has been assessed) | No: index is well defined by the target population criteria | Yes: no clear index, as the design is centered around the outcome (which is unknown at the point in time the model will be applied) |
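The "Performance metric bias" and "Miscalibration issue" rows can be illustrated with a little arithmetic. The sketch below is not taken from the article: the function names, the sensitivity of 0.8, the specificity of 0.9, and the 1% true outcome proportion are all assumptions chosen for illustration. It also assumes that sampling depends only on the outcome class, which a real matched design (matching on covariates) may violate.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Precision (positive predictive value) of a classifier with fixed
    sensitivity and specificity, evaluated at a given outcome proportion."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def adjust_risk(predicted_risk, sample_prevalence, target_prevalence):
    """Rescale a predicted risk from the sampled outcome proportion to the
    target population's outcome proportion by shifting the odds.

    All inputs must lie strictly between 0 and 1. Assumes (hypothetically)
    that non-outcomes were under-sampled at random within their class.
    """
    sample_odds = sample_prevalence / (1.0 - sample_prevalence)
    target_odds = target_prevalence / (1.0 - target_prevalence)
    odds = predicted_risk / (1.0 - predicted_risk)
    adjusted_odds = odds * target_odds / sample_odds
    return adjusted_odds / (1.0 + adjusted_odds)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not results from the article).
    sens, spec = 0.8, 0.9
    matched = precision_at_prevalence(sens, spec, 0.5)      # 1:1 matched data
    population = precision_at_prevalence(sens, spec, 0.01)  # 1% true outcome rate
    print(f"precision in 1:1 case-control data: {matched:.3f}")
    print(f"precision at 1% outcome proportion: {population:.3f}")

    # A risk of 0.5 learned on balanced data, mapped back to a 1% population.
    print(f"adjusted risk: {adjust_risk(0.5, 0.5, 0.01):.3f}")
```

With these assumed numbers, the same classifier reports a precision of roughly 0.89 on 1:1 matched data but roughly 0.07 at a 1% outcome proportion, and a predicted risk of 0.5 on the balanced sample corresponds to about 0.01 in the target population. This is the kind of gap the case-control column of the table warns about.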