Multivariate data sets frequently contain missing observations scattered throughout. Many machine learning algorithms assume that there is no particular significance in the fact that an observation has a missing attribute value. A common approach to coping with missing values is to replace each missing value with some plausible value and then analyse the completed data set using standard methods. We evaluate the effect that several commonly used imputation methods have on the accuracy of classifiers in supervised learning. The effect is assessed in simulations performed on several classical data sets in which observations have been made missing at random in varying proportions. Our analysis finds that imputation using hot deck, iterative robust model-based imputation (IRMI), factorial analysis for mixed data (FAMD) and random forest imputation (missForest) performs similarly regardless of the amount of missing data and yields the highest mean percentage of observations correctly classified. The other methods investigated did not perform as well.
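The evaluation design described above can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: it uses scikit-learn's Iris data, a single MCAR missingness proportion (20%), and two imputers available in scikit-learn (mean imputation and `IterativeImputer`, a model-based method) as stand-ins for the R-based methods studied in the paper (hot deck, IRMI, FAMD, missForest). The random seeds and the 5-fold cross-validation setup are assumptions for the sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Step 1: make 20% of attribute values missing completely at random (MCAR).
X_miss = X.copy()
mask = rng.random(X.shape) < 0.20
X_miss[mask] = np.nan

# Step 2: impute, then classify; compare mean cross-validated accuracy
# across imputation methods.
for name, imputer in [
    ("mean", SimpleImputer(strategy="mean")),
    ("iterative", IterativeImputer(random_state=0)),
]:
    clf = make_pipeline(imputer, RandomForestClassifier(random_state=0))
    scores = cross_val_score(clf, X_miss, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

In the paper, this loop would be repeated over several missingness proportions and several classical data sets, with the percentage of correctly classified observations averaged over simulation runs.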
CITATION STYLE
Hunt, L. A. (2017). Missing data imputation and its effect on the accuracy of classification. In Studies in Classification, Data Analysis, and Knowledge Organization (Vol. 0, pp. 3–14). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-55723-6_1