Missing data imputation and its effect on the accuracy of classification

Abstract

Multivariate data sets frequently have missing observations scattered throughout. Many machine learning algorithms assume that there is no particular significance in the fact that an observation has an attribute value missing. A common way of coping with missing values is to replace each one with some plausible value and then analyse the completed data set using standard methods. We evaluate the effect that some commonly used imputation methods have on the accuracy of classifiers in supervised learning. The effect is assessed in simulations performed on several classical datasets where observations have been made missing at random in different proportions. Our analysis finds that imputation using hot deck, iterative robust model-based imputation (IRMI), factorial analysis for mixed data (FAMD) and Random Forest Imputation (MissForest) performs similarly regardless of the amount of missing data and yields the highest mean percentage of observations correctly classified. The other methods investigated did not perform as well.
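
The kind of simulation the abstract describes can be illustrated with a minimal Python sketch. It is not the author's code and does not use the methods studied in the paper. As assumptions made purely for illustration, it uses scikit-learn's `SimpleImputer` (mean) and `KNNImputer` as stand-in imputation methods, the Iris data as a stand-in "classical dataset", a decision tree as the classifier, and cell-wise masking completely at random. The function names `mask_completely_at_random` and `evaluate` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def mask_completely_at_random(X, proportion, rng):
    """Return a float copy of X with roughly `proportion` of its cells set to NaN."""
    X_missing = np.asarray(X, dtype=float).copy()
    mask = rng.random(X_missing.shape) < proportion
    # Assumption: keep at least one observed value in every row by
    # un-masking one randomly chosen cell in any fully masked row.
    full_rows = mask.all(axis=1)
    keep_cols = rng.integers(0, X_missing.shape[1], size=full_rows.sum())
    mask[full_rows, keep_cols] = False
    X_missing[mask] = np.nan
    return X_missing


def evaluate(proportions=(0.1, 0.2, 0.3), seed=0):
    """Mean 5-fold cross-validated accuracy per (imputer name, missing proportion)."""
    X, y = load_iris(return_X_y=True)
    rng = np.random.default_rng(seed)
    # Stand-in imputers (assumption); the paper studies other methods.
    imputers = {
        "mean": lambda: SimpleImputer(strategy="mean"),
        "knn": lambda: KNNImputer(n_neighbors=5),
    }
    results = {}
    for p in proportions:
        X_missing = mask_completely_at_random(X, p, rng)
        for name, make_imputer in imputers.items():
            # The imputer sits inside the pipeline, so it is fitted on the
            # training folds only during cross-validation.
            model = make_pipeline(
                make_imputer(), DecisionTreeClassifier(random_state=seed)
            )
            scores = cross_val_score(model, X_missing, y, cv=5)
            results[(name, p)] = scores.mean()
    return results


if __name__ == "__main__":
    for (name, p), accuracy in sorted(evaluate().items()):
        print(f"{name:>4}  missing={p:.0%}  accuracy={accuracy:.3f}")
```

Placing the imputer inside the pipeline is a design choice of this sketch: it keeps the imputation from seeing the held-out fold, so the reported accuracy reflects the combined imputation and classification step.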

Cite

Hunt, L. A. (2017). Missing data imputation and its effect on the accuracy of classification. In Studies in Classification, Data Analysis, and Knowledge Organization (Vol. 0, pp. 3–14). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-55723-6_1
