Combining Multiple Feature-Ranking Techniques and Clustering of Variables for Feature Selection

Abstract

Feature selection aims to eliminate redundant or irrelevant variables from input data in order to reduce computational cost, provide a better understanding of the data, and improve prediction accuracy. The majority of existing filter methods rely on a single feature-ranking technique, which may overlook important assumptions about the underlying regression function linking the input variables with the output. In this paper, we propose a novel feature selection framework that combines clustering of variables with multiple feature-ranking techniques to select an optimal feature subset. Different feature-ranking methods typically select different subsets, because each method makes its own assumption about the regression function linking the input variables with the output. We therefore employ multiple feature-ranking methods with disjoint assumptions about the regression function. The proposed approach has a feature-ranking module to identify relevant features and a clustering module to eliminate redundant features. First, input variables are ranked using the regression coefficients obtained by training $L_1$-regularized Logistic Regression, Support Vector Machine, and Random Forest models. Features ranked below a certain threshold are filtered out. The remaining features are grouped into clusters using an exemplar-based clustering algorithm, which identifies the data points that best exemplify the data and associates each data point with an exemplar. We use both linear correlation coefficients and information gain to measure the association between a data point and its corresponding exemplar. From each cluster, the highest-ranked feature is selected as a delegate, and the delegates from the three ranked lists are combined into the final feature set by taking their union. Empirical results over a number of real-world data sets confirm the hypothesis that combining features selected by multiple heterogeneous methods yields a more robust feature set and improves prediction accuracy. Among the feature selection approaches evaluated, linear correlation-based multi-filter feature selection achieved the best classification accuracy: 98.7%, 100%, 92.3%, and 100% on the Ionosphere, Wisconsin Breast Cancer, Sonar, and Wine data sets, respectively.
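To make the pipeline concrete, below is a minimal sketch of the framework the abstract describes. It assumes scikit-learn implementations, affinity propagation as the exemplar-based clusterer, and a binary classification task; the threshold parameter `keep_fraction`, the model hyperparameters, and the helper names `rank_features` and `select_features` are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the multi-filter pipeline (assumptions: scikit-learn
# models, affinity propagation as the exemplar-based clusterer, binary
# classification, and an illustrative rank-percentile threshold).
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def rank_features(X, y):
    """One importance vector per heterogeneous ranker (binary y assumed)."""
    lr = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
    svm = LinearSVC(penalty="l1", dual=False).fit(X, y)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return [np.abs(lr.coef_).ravel(),     # L1 logistic regression weights
            np.abs(svm.coef_).ravel(),    # L1 linear SVM weights
            rf.feature_importances_]      # random forest importances

def select_features(X, y, keep_fraction=0.5):
    """Union of per-cluster delegates from all three ranked lists."""
    selected = set()
    for scores in rank_features(X, y):
        # 1) Relevance filter: drop features ranked below the threshold.
        kept = np.where(scores >= np.quantile(scores, 1 - keep_fraction))[0]
        if kept.size < 2:                 # nothing left to cluster
            selected.update(int(i) for i in kept)
            continue
        # 2) Redundancy filter: cluster the surviving features, using
        #    absolute linear correlation between feature columns as the
        #    similarity (the paper also uses information gain).
        sim = np.abs(np.corrcoef(X[:, kept].T))
        labels = AffinityPropagation(affinity="precomputed",
                                     random_state=0).fit_predict(sim)
        # 3) From each cluster, keep the highest-ranked feature (delegate).
        for c in np.unique(labels):
            members = kept[labels == c]
            selected.add(int(members[np.argmax(scores[members])]))
    # 4) Final feature set: union of delegates from all three rankers.
    return sorted(selected)
```

For example, with `X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)`, `select_features(X, y)` returns the column indices of the selected subset, which can then be fed to any downstream classifier. Since the $L_1$-regularized models are scale-sensitive, standardizing the features beforehand is advisable.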

Citation (APA)

Haq, A. U., Zhang, D., Peng, H., & Rahman, S. U. (2019). Combining Multiple Feature-Ranking Techniques and Clustering of Variables for Feature Selection. IEEE Access, 7, 151482–151492. https://doi.org/10.1109/ACCESS.2019.2947701
