Random forest Gini importance favours SNPs with large minor allele frequency: Impact, sources and recommendations

Anne Laure Boulesteix; Andreas Bender; Justo Lorenzo Bermejo; Carolin Strobl

Journal Article, Open Access

Briefings in Bioinformatics (2012) 13(3) 292-304

DOI: 10.1093/bib/bbr053

101 Citations

96 Readers

Abstract

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favoured by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present article is threefold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation based); (ii) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants; and (iii) to summarize these results and previously investigated properties of random forest VIMs in the context of genetic association studies and to make practical recommendations regarding the choice of the random forest and variable importance type. © The Author 2011. Published by Oxford University Press.
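As a rough illustration of how a Gini-based split criterion can favour common variants even when no SNP is associated with the phenotype, the following toy sketch may help. It is not the authors' simulation design: the genotype coding (0/1/2 under Hardy-Weinberg equilibrium), the sample size, the number of replicates and all function names are assumptions made for this example, and it looks only at a single root-node split rather than a full forest.

```python
import random


def gini(cases, n):
    """Gini impurity 2p(1-p) of a node holding `cases` cases among `n` samples."""
    if n == 0:
        return 0.0
    p = cases / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, labels):
    """Largest impurity decrease over the two ordered splits of a 0/1/2 SNP."""
    n = len(labels)
    parent = gini(sum(labels), n)
    best = 0.0
    for threshold in (0, 1):  # split: genotype <= threshold vs. > threshold
        left = [y for g, y in zip(genotypes, labels) if g <= threshold]
        right = [y for g, y in zip(genotypes, labels) if g > threshold]
        if not left or not right:
            continue  # this split is unavailable (one child is empty)
        children = (
            len(left) * gini(sum(left), len(left))
            + len(right) * gini(sum(right), len(right))
        ) / n
        best = max(best, parent - children)
    return best


def mean_null_decrease(maf, n_samples=200, n_reps=1000, seed=0):
    """Average best Gini decrease for a SNP that is independent of the phenotype."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Phenotype: fair coin flip, unrelated to the genotype (null scenario).
        labels = [rng.randint(0, 1) for _ in range(n_samples)]
        # Genotype: number of minor alleles, two independent draws per sample.
        genotypes = [
            int(rng.random() < maf) + int(rng.random() < maf)
            for _ in range(n_samples)
        ]
        total += best_gini_decrease(genotypes, labels)
    return total / n_reps


if __name__ == "__main__":
    for maf in (0.05, 0.25, 0.5):
        print(f"MAF={maf:.2f}  mean null Gini decrease={mean_null_decrease(maf):.5f}")
```

In this sketch a SNP with a large minor allele frequency usually offers two well-populated candidate splits, whereas a rare SNP often offers only one usable split, so the maximum impurity decrease obtained by chance tends to be larger for the common SNP. This is only one plausible ingredient of the preference described in the abstract.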

Cite

Boulesteix, A. L., Bender, A., Bermejo, J. L., & Strobl, C. (2012). Random forest Gini importance favours SNPs with large minor allele frequency: Impact, sources and recommendations. Briefings in Bioinformatics, 13(3), 292–304. https://doi.org/10.1093/bib/bbr053
