Discovering Networks of Interdependent Features in High-Dimensional Problems

Abstract

The availability of very large data sets in the Life Sciences, provided earlier by technological breakthroughs such as microarrays and more recently by various forms of sequencing, has created both challenges in analyzing these data and new opportunities. A promising yet underdeveloped approach to Big Data, not limited to the Life Sciences, is the use of feature selection and classification to discover interdependent features. Traditionally, classifiers have been developed for the best quality of supervised classification. In our experience, more often than not, rather than obtaining the best possible supervised classifier, the Life Scientist needs to know which features contribute best to classifying observations (objects, samples) into distinct classes and what the interdependencies are between the features that describe the observations. Our underlying hypothesis is that the interdependent features and rule networks reflect not only some syntactical properties of the data and classifiers but may also convey meaningful clues about true interactions in the modeled biological system. In this chapter we further develop our methods of Monte Carlo Feature Selection and Interdependency Discovery (MCFS and MCFS-ID, respectively), which are particularly well suited for high-dimensional problems, i.e., those where each observation is described by very many features, often many more features than there are observations. Such problems are abundant in Life Science applications. Specifically, we define Inter-Dependency Graphs (termed, somewhat confusingly, ID Graphs), which are directed graphs of interactions between features, extracted by aggregating information from the classification trees constructed by the MCFS algorithm. We then proceed to model interactions at a finer level with rule networks. We discuss some of the properties of the ID Graphs and make a first attempt at validating our hypothesis on a large gene expression data set for CD4+ T-cells.
MCFS-ID and ROSETTA, together with the Ciruvis approach, offer a new methodology for analyzing Big Data: from feature selection, through identification of feature interdependencies and rule-based classification according to decision classes, to the construction of rule networks. Our preliminary results confirm that MCFS-ID is applicable to the identification of interacting features that are functionally relevant, while rule networks offer a complementary picture with a finer resolution of the interdependencies at the level of feature-value pairs.
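
To make the idea of an ID Graph more concrete, the following is a minimal sketch of aggregating directed feature-to-feature edges from a collection of classification trees. It is not the MCFS-ID implementation: the trees are hand-written toy structures rather than trees learned from data, the feature names (`geneA`, `geneB`, `geneC`) and the function names are hypothetical, and the edge weight is assumed to be a plain count of parent-to-child split pairs, whereas the actual method may weight edges differently.

```python
from collections import Counter

# Assumption: a toy tree is (feature_name, left_subtree, right_subtree),
# and a leaf is None. These stand in for trees a classifier would learn.

def parent_child_edges(tree):
    """Yield (parent_feature, child_feature) for each split directly below another split."""
    if tree is None:
        return
    feature, left, right = tree
    for child in (left, right):
        if child is not None:
            yield (feature, child[0])
            yield from parent_child_edges(child)

def build_id_graph(trees):
    """Aggregate directed edges over many trees; the weight is a simple count (assumption)."""
    weights = Counter()
    for tree in trees:
        weights.update(parent_child_edges(tree))
    return weights

# Hypothetical trees, as if built on different random subsets of features.
trees = [
    ("geneA", ("geneB", None, None), ("geneC", None, None)),
    ("geneA", ("geneB", ("geneC", None, None), None), None),
    ("geneB", ("geneC", None, None), None),
]

graph = build_id_graph(trees)
for (parent, child), weight in sorted(graph.items()):
    print(f"{parent} -> {child}: {weight}")
```

Edges that recur across many trees accumulate larger weights, which is the sense in which aggregation over trees can surface candidate interdependencies between features.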

Dramiński, M., Dąbrowski, M. J., Diamanti, K., Koronacki, J., & Komorowski, J. (2016). Discovering Networks of Interdependent Features in High-Dimensional Problems. In Studies in Big Data (Vol. 16, pp. 285–304). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-26989-4_12
