When is the first spurious variable selected by sequential regression procedures?
Publication:4561008
DOI: 10.1093/BIOMET/ASY032; zbMATH Open: 1499.62282; arXiv: 1708.03046; OpenAlex: W2962872139; Wikidata: Q129805624; Scholia: Q129805624; MaRDI QID: Q4561008; FDO: Q4561008
Authors: Weijie J. Su
Publication date: 10 December 2018
Published in: Biometrika
Abstract: Applied statisticians use sequential regression procedures to produce a ranking of explanatory variables and, in settings of low correlations between variables and strong true effect sizes, expect that variables at the very top of this ranking are truly relevant to the response. In a regime of certain sparsity levels, however, three examples of sequential procedures (forward stepwise, the lasso, and least angle regression) are shown to include the first spurious variable unexpectedly early. We derive a rigorous, sharp prediction of the rank of the first spurious variable for these three procedures, demonstrating that the first spurious variable occurs earlier and earlier as the regression coefficients become denser. This counterintuitive phenomenon persists for statistically independent Gaussian random designs and an arbitrarily large magnitude of the true effects. We gain a better understanding of the phenomenon by identifying the underlying cause and then leverage the insights to introduce a simple visualization tool termed the double-ranking diagram to improve on sequential methods. As a byproduct of these findings, we obtain the first provable result certifying the exact equivalence between the lasso and least angle regression in the early stages of solution paths beyond orthogonal designs. This equivalence can seamlessly carry over many important model selection results concerning the lasso to least angle regression.
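The rank of the first spurious variable described in the abstract can be explored with a small simulation. The following is only a minimal sketch, not the paper's method or code: it assumes an independent standard Gaussian design, unit-variance noise, a common effect size for the first k variables, and forward stepwise selection without an intercept. The function name `first_spurious_rank` and its parameters are hypothetical and chosen for illustration.

```python
import math
import random


def first_spurious_rank(n, p, k, effect, seed=0):
    """Simulate y = X beta + noise and run forward stepwise selection.

    Returns the 1-based rank at which the first null (spurious) variable
    enters the model, or None if no null variable is ever selected.
    Assumption: variables 0..k-1 are the true signals, all with the same
    coefficient `effect`; the remaining p - k variables are null.
    """
    rng = random.Random(seed)
    # Independent standard Gaussian design, stored column by column.
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    # Response: sum of the k signal columns times `effect`, plus N(0, 1) noise.
    resid = [
        effect * sum(cols[j][i] for j in range(k)) + rng.gauss(0.0, 1.0)
        for i in range(n)
    ]
    active = set(range(p))
    for rank in range(1, min(n, p) + 1):
        # Columns are kept orthogonal to all selected variables, so the
        # largest normalized inner product with the residual identifies
        # the variable whose inclusion reduces the RSS the most.
        best_j, best_score, best_norm = None, -1.0, 0.0
        for j in active:
            norm = math.sqrt(sum(v * v for v in cols[j]))
            if norm < 1e-10:
                continue  # column is (numerically) spanned by the model
            score = abs(sum(r * v for r, v in zip(resid, cols[j]))) / norm
            if score > best_score:
                best_j, best_score, best_norm = j, score, norm
        if best_j is None:
            break
        if best_j >= k:
            return rank  # first spurious variable enters here
        # Project the selected direction out of the residual and of
        # every remaining column (Gram-Schmidt step).
        q = [v / best_norm for v in cols[best_j]]
        active.remove(best_j)
        c = sum(r * qi for r, qi in zip(resid, q))
        resid = [r - c * qi for r, qi in zip(resid, q)]
        for j in active:
            c = sum(v * qi for v, qi in zip(cols[j], q))
            cols[j] = [v - c * qi for v, qi in zip(cols[j], q)]
    return None


# Compare a sparse setting with denser ones at a fixed, strong effect size.
for k in (5, 20, 40):
    print(k, first_spurious_rank(n=100, p=200, k=k, effect=10.0))
```

By construction the returned rank is at most k + 1, which is attained when all true variables are selected before any null one. Comparing the returned rank with k + 1 across several values of k and several seeds gives a rough empirical view of how early the first spurious variable appears; the dimensions used here are arbitrary and far smaller than an asymptotic analysis would call for.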
Full work available at URL: https://arxiv.org/abs/1708.03046
Recommendations
- Selection of regression and autoregression models with initial ordering of variables
- Spurious regression and lurking variables
- On the selection of regression variables
- Selection of the regression model order
- Variable selection in seemingly unrelated regressions with random predictors
- Some variable selection procedures in multivariate linear regression models
- On some variable selection procedures based on data for regression models
Mathematics Subject Classification
- Ridge regression; shrinkage estimators (Lasso) (62J07)
- Sequential statistical analysis (62L10)
- Paired and multiple comparisons; multiple testing (62J15)
Cited In (3)