Variable selection in linear regression models: Choosing the best subset is not always the best choice

This article is free to access.

Abstract

We consider the question of variable selection in linear regressions, in the sense of identifying the correct direct predictors (those variables that have nonzero coefficients given all candidate predictors). Best subset selection (BSS) is often considered the “gold standard,” with its use being restricted only by its NP-hard nature. Alternatives such as the least absolute shrinkage and selection operator (Lasso) or the Elastic net (Enet) have become methods of choice in high-dimensional settings. A recent proposal represents BSS as a mixed-integer optimization problem, so that large problems have become computationally feasible. We present an extensive neutral comparison assessing the ability of BSS to select the correct direct predictors, compared with forward stepwise selection (FSS), Lasso, and Enet. The simulation considers a range of settings that are challenging regarding dimensionality (number of observations and variables), signal-to-noise ratios, and correlations between predictors. As a fair measure of performance, we primarily used the best possible F1-score for each method; results were confirmed by alternative performance measures and by practical criteria for choosing the tuning parameters and subset sizes. Surprisingly, it was only in settings where the signal-to-noise ratio was high and the variables were uncorrelated that BSS reliably outperformed the other methods, even in low-dimensional settings. Furthermore, FSS performed almost identically to BSS. Our results shed new light on the usual presumption that BSS is, in principle, the best choice for selecting the correct direct predictors. Especially for correlated variables, alternatives like Enet are faster and appear to perform better in practical settings.
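
To make the "best possible F1-score" concrete, the following minimal Python sketch scores each subset that a method proposes along its path (one subset per subset size or tuning-parameter value) against the true direct predictors and keeps the maximum. The abstract does not spell out the formula, so the standard F1 definition (harmonic mean of precision and recall) is assumed here. The function names and the example path are hypothetical and not taken from the article.

```python
def f1_score_selection(selected, true_support):
    """Standard F1-score of a selected variable set against the true direct predictors."""
    selected, true_support = set(selected), set(true_support)
    true_positives = len(selected & true_support)
    if true_positives == 0:
        # Covers the empty selection and selections with no correct variable.
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(true_support)
    return 2 * precision * recall / (precision + recall)


def best_possible_f1(candidate_subsets, true_support):
    """Highest F1-score over all subsets a method proposes along its path."""
    return max(f1_score_selection(s, true_support) for s in candidate_subsets)


# Hypothetical example: variables 0, 1 and 2 are the true direct predictors.
true_support = {0, 1, 2}
# Hypothetical nested path, e.g. from a stepwise procedure (one subset per size).
path = [{0}, {0, 3}, {0, 3, 1}, {0, 3, 1, 2}, {0, 3, 1, 2, 4}]

print(best_possible_f1(path, true_support))
```

Because the maximum is taken with knowledge of the true predictors, this quantity is only available in simulations. It measures how good a method's candidate subsets can be, independent of how the final tuning parameter or subset size is chosen in practice.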

Cite

Hanke, M., Dijkstra, L., Foraita, R., & Didelez, V. (2024). Variable selection in linear regression models: Choosing the best subset is not always the best choice. Biometrical Journal, 66(1). https://doi.org/10.1002/bimj.202200209
