Fuzzy forests: Extending random forest feature selection for correlated, high-dimensional data

Abstract

In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking the importance of features in high-dimensional classification and regression problems. Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable importance in the presence of highly correlated features, especially when the number of features, p, is much larger than the sample size, n (p ≫ n). We introduce our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests works by taking advantage of the network structure between features. First, the features are partitioned into separate modules such that the correlation within modules is high and the correlation between modules is low. The package fuzzyforest allows for easy use of the package WGCNA (weighted gene coexpression network analysis, alternatively known as weighted correlation network analysis) to form modules of features such that the modules are roughly uncorrelated. Then recursive feature elimination random forests (RFE-RFs) are applied to each module separately. From the features that survive this screening step, a final group is selected and ranked using one last round of RFE-RFs. This procedure results in a ranked variable importance list whose size is pre-specified by the user. The selected features can then be used to construct a predictive model.
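The three stages described in the abstract (partition features into modules, run RFE-RF within each module, then run a final RFE-RF on the survivors) can be sketched roughly as follows. This is an illustrative Python sketch, not the fuzzyforest R package or its API. It makes several assumptions: modules are formed by thresholding absolute correlations and taking connected components, which is a simple stand-in for WGCNA; importance is scikit-learn's impurity-based `feature_importances_`, which may differ from the importance measure the package uses; and the names `correlation_modules`, `rfe_rf`, `fuzzy_forest_sketch`, `keep_per_module`, `final_size`, `threshold`, and `drop_fraction` are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def correlation_modules(X, threshold=0.5):
    """Partition features into modules (assumed stand-in for WGCNA):
    connected components of the graph that links two features when
    their absolute correlation exceeds `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    p = corr.shape[0]
    labels = -np.ones(p, dtype=int)
    current = 0
    for start in range(p):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over correlated neighbours
            i = stack.pop()
            for j in np.flatnonzero(corr[i] > threshold):
                if labels[j] < 0:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return [np.flatnonzero(labels == m) for m in range(current)]


def rfe_rf(X, y, features, keep, drop_fraction=0.25, seed=0):
    """Recursive feature elimination with random forests: fit a forest,
    drop the least important fraction of features, and repeat until at
    most `keep` remain. Returns the survivors, most important first."""
    features = np.asarray(features)
    while True:
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X[:, features], y)
        order = np.argsort(rf.feature_importances_)[::-1]
        features = features[order]
        if len(features) <= keep:
            return features
        n_next = max(keep, int(len(features) * (1 - drop_fraction)))
        features = features[:n_next]


def fuzzy_forest_sketch(X, y, keep_per_module=2, final_size=3, threshold=0.5):
    """Screening within each module, then one final selection round."""
    modules = correlation_modules(X, threshold)
    survivors = np.concatenate(
        [rfe_rf(X, y, m, keep_per_module) for m in modules]
    )
    return modules, rfe_rf(X, y, survivors, final_size)


# Synthetic demo: two blocks of five correlated features each;
# the response depends only on feature 0.
rng = np.random.default_rng(0)
n = 200
z1, z2 = rng.normal(size=(2, n))
X = np.column_stack(
    [z1 + 0.3 * rng.normal(size=n) for _ in range(5)]
    + [z2 + 0.3 * rng.normal(size=n) for _ in range(5)]
)
y = 3 * X[:, 0] + rng.normal(size=n)

modules, selected = fuzzy_forest_sketch(X, y)
print("module sizes:", [len(m) for m in modules])
print("selected features (ranked):", selected)
```

Screening inside each module keeps the forests small and stops one block of correlated features from crowding out the others before the final ranking. The per-module and final list sizes are fixed up front, mirroring the pre-specified list size the abstract mentions.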

Cite

Conn, D., Ngun, T., Li, G., & Ramirez, C. M. (2019). Fuzzy forests: Extending random forest feature selection for correlated, high-dimensional data. Journal of Statistical Software, 91(9). https://doi.org/10.18637/jss.v091.i09
