Rare events bias of logistic regression
Logistic regression is one of the most commonly used statistical methods for estimating prognostic models that relate a binary outcome (with levels event and non-event) to a number of explanatory variables. A low prevalence of events, which is frequently encountered in clinical and epidemiological studies, leads to underestimation of the estimated event probabilities (rare events bias). We explain that the rare events bias must be distinguished from the small sample bias of the regression coefficients, which explains why bias-corrected estimators, such as Firth's bias correction, cannot remove the rare events bias. We show that the rare events bias is more pronounced when the number of explanatory variables approaches or exceeds the number of events (high-dimensional data). The rare events bias is explained for maximum likelihood estimation as well as for penalized estimation using LASSO, ridge and Firth-type penalization. We also explain why the intuitive solution of weighting the samples amplifies the rare events bias, whereas under-sampling the non-events is effective in removing it.
The lecture slides will be in English.