Preselection in lasso-type analysis for ultra-high dimensional genomic exploration

0Citations
Citations of this article
3Readers
Mendeley users who have this article in their library.
Get full text

Abstract

We address the issue of variable preselection in high-dimensional penalized regression, such as the lasso, a commonly used approach to variable selection and prediction in genomics. Preselection - to start with a manageable set of covariates - is becoming increasingly necessary for enabling advanced analysis tasks to be carried out on data sets of huge size created by high throughput technologies. Preselection of features to be included in multivariate analyses based on simple univariate ranking is a natural strategy that has often been implemented despite its potential bias. We demonstrate this bias and propose a way to correct it. Starting with a sequential implementation of the lasso with increasing lists of predictors, we exploit a property of the set of corresponding cross-validation curves, a pattern that we call "freezing". The ranking of the predictors to be included sequentially is based on simple measures of associations with the outcome, which can be pre-computed in an efficient way for ultra high dimensional data sets externally to the penalized regression implementation. We demonstrate by simulation that our sequential approach leads in a vast majority of cases to a safe and efficient way of focusing the lasso analysis on a smaller and manageable number of predictors. In situations where the lasso performs well, we need typically less than 20 % of the variables to recover the same solution as if using the full set of variables. We illustrate the applicability of our strategy in the context of a genome-wide association study and on microarray genomic data where we need just 2:5 % and 13 % of the variables respectively. Finally we include an example where 260 million gene-gene interactions are ranked and we are able to recover the lasso solution using only 1 % of these. Freezing offers great potential for extending the applicability of penalized regressions to current and upcoming ultra high dimensional problems in bioinformatics. Its applicability is not limited to the standard lasso but is a generic property of many penalized approaches.

Cite

CITATION STYLE

APA

Bergersen, L. C., Ahmed, I., Frigessi, A., Glad, I. K., & Richardson, S. (2016). Preselection in lasso-type analysis for ultra-high dimensional genomic exploration. In Abel Symposia (Vol. 11, pp. 37–66). Springer Heidelberg. https://doi.org/10.1007/978-3-319-27099-9_3

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free