This paper identifies a criterion for choosing an optimal set of selected features, or rejected null hypotheses, in high-dimensional data analysis. The method is designed for dimension reduction with multiple hypothesis testing, as used in the filtering stage of big-data pipelines and in exploratory research to identify significant associations among many predictor variables and a few outcomes. The novelty of the proposed method is that the selected p-value threshold is insensitive to dependency within features and between features and the outcome. The method neither requires a predetermined significance level nor relies on a presumed threshold for the false discovery rate. Using the presented method, the optimal p-value for a powerful yet parsimonious model is chosen; for every resulting set of rejected hypotheses, the researcher can then report traditional measures of statistical accuracy such as the expected number of false positives and the false discovery rate. The upper limit on the number of rejected hypotheses (or selected features) is determined by finding the maximum difference between the expected number of true rejections and the expected number of false rejections over all possible sets of rejected hypotheses. Methods for choosing an optimal number of selected features, such as piecewise regression, are then applied to form a parsimonious model. The paper reports the results of implementing the proposed methods in a novel example of non-parametric analysis of high-dimensional ordinal survey data.
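As a rough illustration of the maximization criterion described above, the following Python sketch ranks features by p-value and, under the common assumption that null p-values are uniform on [0, 1], estimates the expected number of false rejections at each candidate cutoff, then picks the cutoff that maximizes the gap between expected true and expected false rejections. The uniform-null approximation, the function name `select_threshold`, and the simulated data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_threshold(p_values):
    """Pick a p-value cutoff by maximizing the gap between expected
    true and expected false rejections (illustrative sketch only).

    Assumes null p-values are uniform on [0, 1], so rejecting the k
    smallest of m p-values at threshold t = p_(k) yields roughly
    m * t expected false rejections out of k total rejections.
    """
    p = np.sort(np.asarray(p_values))
    m = p.size
    k = np.arange(1, m + 1)             # rejections at each candidate cutoff
    expected_false = m * p              # E[false rejections] under uniform null
    expected_true = k - expected_false  # remaining rejections attributed to signal
    gap = expected_true - expected_false
    best = int(np.argmax(gap))          # index of the maximizing cutoff
    return p[best], best + 1            # chosen threshold, number of features kept

# Example: 950 null p-values mixed with 50 strongly significant ones.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=950), rng.beta(0.1, 10.0, size=50)])
threshold, n_selected = select_threshold(pvals)
print(f"threshold={threshold:.4f}, features selected={n_selected}")
```

Given the selected set, conventional accuracy summaries follow directly in this sketch: the expected number of false positives is `m * threshold`, and the estimated false discovery rate is that quantity divided by `n_selected`.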
CITATION STYLE
Ghaseminejad Tafreshi, A. H. (2018). A non-parametric maximum for number of selected features: objective optima for FDR and significance threshold with application to ordinal survey analysis. Journal of Big Data, 5(1). https://doi.org/10.1186/s40537-018-0128-5