Handling missing data in trees: Surrogate splits or statistical imputation?

Abstract

In many data mining applications, some of the data values are missing, sometimes a considerable share of them. Although missing data occur frequently, most data mining algorithms handle them in a rather ad hoc way or simply ignore the problem. We investigate simulation-based data augmentation as a way to handle missing data; it works by filling in (imputing) one or more plausible values for each missing value. One advantage of this approach is that the imputation phase is separated from the analysis phase, so that different data mining algorithms can be applied to the completed data sets. We compare imputation with surrogate splits, as used in CART, for handling missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of the predictive accuracy of the resulting models. Averaging over M > 1 models, one fitted to each of M imputed data sets, yields even better results, because it benefits from variance reduction in much the same way as procedures such as bagging.
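
The workflow the abstract describes (impute M times, fit one tree per completed data set, average the predictions) can be sketched as follows. This is a minimal illustration under loud assumptions, not the method of the paper: imputation here is a plain random draw from the observed values of the same column, standing in for simulation-based data augmentation, and a one-split stump stands in for a full CART-style tree. The function names (`impute_once`, `fit_stump`, `fit_on_imputations`, `predict_average`) and the toy data are hypothetical. The sketch also assumes that every column has at least one observed value and that rows passed to prediction are complete.

```python
import random
from statistics import mean

def impute_once(rows, rng):
    """Return a completed copy of rows. Each None is replaced by a random draw
    from the observed values of its column (assumed stand-in for a proper
    imputation model)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[rng.choice(observed[j]) if v is None else v
             for j, v in enumerate(r)] for r in rows]

def fit_stump(X, y):
    """Fit a one-split tree on 0/1 labels by minimizing squared error.
    Returns (column, threshold, left_mean, right_mean)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({r[j] for r in X}):
            left = [yi for r, yi in zip(X, y) if r[j] <= t]
            right = [yi for r, yi in zip(X, y) if r[j] > t]
            if not left or not right:
                continue  # a split must send cases to both sides
            ml, mr = mean(left), mean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: always predict the overall mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    """Predict the mean label of the side of the split the row falls on."""
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_on_imputations(rows, y, M, seed=0):
    """Impute M times and fit one stump per completed data set."""
    rng = random.Random(seed)
    return [fit_stump(impute_once(rows, rng), y) for _ in range(M)]

def predict_average(models, row):
    """Average the predictions of the M models for one complete row."""
    return mean([predict_stump(s, row) for s in models])

# Toy data (hypothetical): None marks a missing value.
rows = [[1.0, None], [2.0, 5.0], [None, 6.0],
        [8.0, 1.0], [9.0, None], [10.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

models = fit_on_imputations(rows, y, M=5)
print(predict_average(models, [1.5, 5.5]), predict_average(models, [9.5, 1.5]))
```

Because the imputation step only produces completed data sets, `fit_stump` could be swapped for any other learner without touching `impute_once`, which is the separation of imputation and analysis that the abstract points to. Setting `M=1` gives single imputation; larger `M` averages over the variation between imputations.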

Cite

Feelders, A. (1999). Handling missing data in trees: Surrogate splits or statistical imputation? In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1704, pp. 329–334). Springer Verlag. https://doi.org/10.1007/978-3-540-48247-5_38
