Microarray gene expression data contains missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete expression data matrix. Current methods for estimating the MVs include sample mean and K-nearest neighbors (KNN). Whether the accuracy of estimation (imputation) methods depends on the actual gene expression has not been thoroughly investigated. Under this setting, we examine how the accuracy depends on the actual expression level and propose new methods that provide improvements in accuracy relative to the current methods in certain ranges of gene expression. In particular, we propose regression methods, namely multiple imputation via ordinary least squares (OLS) and missing value prediction using partial least squares (PLS). Mean estimation of MVs ignores the observed correlation structure of the genes and is highly inaccurate. Estimating MVs using KNN, a method which incorporates pairwise gene expression information, provides substantial improvement in accuracy on average. However, the accuracy of KNN across the wide range of observed gene expression is unlikely to be uniform and this is revealed by evaluating accuracy as a function of the expression level.
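The comparison described above can be illustrated with a small simulation. The following is a minimal sketch and not the authors' implementation. It assumes a simulated genes-by-samples matrix, reads "sample mean" as the mean of a gene's observed values, uses a simple Euclidean-distance KNN restricted to complete genes, and scores accuracy by root mean squared error within quartiles of the true expression level. All names (`mean_impute`, `knn_impute`, `rmse_by_level`) and all simulation settings are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log-expression matrix (genes x samples); every setting here is illustrative.
n_genes, n_samples, n_clusters = 1000, 20, 5
baseline = rng.uniform(6.0, 14.0, size=n_genes)            # gene-specific expression level
profiles = rng.normal(0.0, 1.0, size=(n_clusters, n_samples))
cluster = rng.integers(0, n_clusters, size=n_genes)         # co-expressed gene groups
truth = baseline[:, None] + profiles[cluster] + 0.3 * rng.normal(size=(n_genes, n_samples))

# Hide 5% of the entries at random to create missing values.
mask = rng.random(truth.shape) < 0.05
observed = np.where(mask, np.nan, truth)


def mean_impute(X):
    """Fill each missing entry with the mean of the observed values of the same gene (row)."""
    filled = X.copy()
    row_means = np.nanmean(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = row_means[rows]
    return filled


def knn_impute(X, k=5):
    """Fill missing entries with the average of the k nearest complete genes.

    Distance is Euclidean, computed over the samples observed for the target gene.
    """
    filled = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    candidates = X[complete]
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        dist = np.sqrt(((candidates[:, ~miss] - X[i, ~miss]) ** 2).mean(axis=1))
        nearest = np.argsort(dist)[:k]
        filled[i, miss] = candidates[nearest][:, miss].mean(axis=0)
    return filled


def rmse_by_level(true_vals, est_vals, n_bins=4):
    """RMSE of the estimates within quantile bins of the true expression level."""
    edges = np.quantile(true_vals, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, true_vals, side="right") - 1, 0, n_bins - 1)
    return np.array([
        np.sqrt(np.mean((est_vals[idx == b] - true_vals[idx == b]) ** 2))
        for b in range(n_bins)
    ])


mean_filled = mean_impute(observed)
knn_filled = knn_impute(observed, k=5)

# Accuracy on the hidden entries only, overall and by expression level.
rmse_mean = np.sqrt(np.mean((mean_filled[mask] - truth[mask]) ** 2))
rmse_knn = np.sqrt(np.mean((knn_filled[mask] - truth[mask]) ** 2))
rmse_mean_by_level = rmse_by_level(truth[mask], mean_filled[mask])
rmse_knn_by_level = rmse_by_level(truth[mask], knn_filled[mask])

print("overall RMSE (mean, KNN):", rmse_mean, rmse_knn)
print("RMSE by expression quartile, mean:", rmse_mean_by_level)
print("RMSE by expression quartile, KNN: ", rmse_knn_by_level)
```

Reporting the error per expression-level bin, rather than a single overall figure, is what makes any non-uniformity in accuracy visible. The simulated data here are not designed to reproduce the paper's findings.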
CITATION STYLE
Nguyen, D. V., Wang, N., & Carroll, R. J. (2004). Evaluation of Missing Value Estimation for Microarray Data. Journal of Data Science, 2(4), 347–370. https://doi.org/10.6339/jds.2004.02(4).170