Multivariate data sets frequently contain missing observations scattered throughout. Many machine learning algorithms assume that there is no particular significance in the fact that an observation has a missing attribute value. A common approach to coping with missing values is to replace each missing value with some plausible value and then analyse the completed data set using standard methods. We evaluate the effect that several commonly used imputation methods have on the accuracy of classifiers in supervised learning. The effect is assessed in simulations performed on several classical data sets in which observations have been made missing at random in varying proportions. Our analysis finds that imputation using hot deck, iterative robust model-based imputation (IRMI), factorial analysis for mixed data (FAMD) and random forest imputation (missForest) performs similarly regardless of the amount of missing data and yields the highest mean percentage of observations correctly classified. The other methods investigated did not perform as well.
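The evaluation design described above can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: it uses scikit-learn's Iris data, a single MCAR missingness proportion (20%), and two imputers available in scikit-learn (mean imputation and `IterativeImputer`, a model-based method) as stand-ins for the R-based methods studied in the paper (hot deck, IRMI, FAMD, missForest). The random seeds and the 5-fold cross-validation setup are assumptions for the sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Step 1: make 20% of attribute values missing completely at random (MCAR).
X_miss = X.copy()
mask = rng.random(X.shape) < 0.20
X_miss[mask] = np.nan

# Step 2: impute, then classify; compare mean cross-validated accuracy
# across imputation methods.
for name, imputer in [
    ("mean", SimpleImputer(strategy="mean")),
    ("iterative", IterativeImputer(random_state=0)),
]:
    clf = make_pipeline(imputer, RandomForestClassifier(random_state=0))
    scores = cross_val_score(clf, X_miss, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

In the paper, this loop would be repeated over several missingness proportions and several classical data sets, with the percentage of correctly classified observations averaged over simulation runs.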
CITATION STYLE
Hunt, L. A. (2017). Missing data imputation and its effect on the accuracy of classification. In Studies in Classification, Data Analysis, and Knowledge Organization (Vol. 0, pp. 3–14). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-55723-6_1