Missing data imputation through machine learning algorithms

Michael B. Richman; Theodore B. Trafalis; Indra Adrianto

Book Chapter

Springer Netherlands, (2009), 153-169

DOI: 10.1007/978-1-4020-9119-3_7

42 Citations

65 Readers

Abstract

How to address missing data is an issue most researchers face. Computerized algorithms have been developed to ingest rectangular data sets, where the rows represent observations and the columns represent variables. These data matrices contain elements whose values are real numbers. In many data sets, some of the elements of the matrix are not observed. Quite often, missing observations arise from instrument failures, values that have not passed quality control criteria, etc. That leads to a quandary for the analyst using techniques that require a full data matrix. The first decision an analyst must make is whether the actual underlying values would have been observed if there had not been an instrument failure, an extreme value, or some unknown reason. Since many programs expect complete data and the most economical way to achieve this is by deleting the observations with missing data, most often the analysis is performed on a subset of available data. This situation can become extreme in cases where a substantial portion of the data are missing or, worse, in cases where many variables exist with a seemingly small percentage of missing data. In such cases, large amounts of available data are discarded by deleting observations with one or more pieces of missing data. The importance of this problem arises as the investigator is interested in making inferences about the entire population, not just those observations with complete data. Before embarking on an analysis of the impact of missing data on the first two moments of data distributions, it is helpful to discuss if there are patterns in the missing data. Quite often, understanding the way data are missing helps to illuminate the reason for the missing values. In the case of a series of gridpoints, all gridpoints but one may have complete data. If the gridpoint with missing data is considered important, some technique to fill in the missing values may be sought.
Spatial interpolation techniques have been developed that are accurate in most situations (e.g., Barnes 1964; Julian 1984; Spencer and Gao 2004). Contrast this type of missing data pattern to another situation where a series of variables (e.g., temperature, precipitation, station pressure, relative humidity) are measured at a single location. Perhaps all but one of the variables is complete over a set of observations, but the last variable has some missing data. In such cases, interpolation techniques are not the logical alternative; some other method is required. Such problems are not unique to the environmental sciences. In the analysis of agriculture data, patterns of missing data have been noted for nearly a century (Yates 1933). Dodge (1985) discusses the use of least squares estimation to replace missing data in univariate analysis. © 2009 Springer Netherlands.
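
The abstract's remark that deleting incomplete observations discards large amounts of data, even when each variable has only a small percentage of missing values, can be illustrated with a short simulation. The following Python sketch is not taken from the chapter. It assumes that values go missing independently of one another at the same rate in every variable, and the function name `complete_case_fraction`, the 20 variables, and the 5% rate are hypothetical choices made for illustration. Under that assumption the expected share of fully observed rows is (1 - rate)^variables, which is roughly 36% for these settings.

```python
import math
import random


def complete_case_fraction(n_obs, n_vars, miss_rate, seed=0):
    """Simulate a data matrix and return the share of rows with no missing value.

    Assumption (not from the chapter): each element is missing independently
    with probability `miss_rate`.
    """
    rng = random.Random(seed)
    complete = 0
    for _ in range(n_obs):
        # Build one observation; NaN marks an unobserved element.
        row = [
            math.nan if rng.random() < miss_rate else rng.gauss(0.0, 1.0)
            for _ in range(n_vars)
        ]
        # Deleting incomplete observations keeps only rows without any NaN.
        if not any(math.isnan(value) for value in row):
            complete += 1
    return complete / n_obs


# Hypothetical settings: 20 variables, each with 5% of its values missing.
observed = complete_case_fraction(n_obs=10_000, n_vars=20, miss_rate=0.05)
expected = (1 - 0.05) ** 20  # share of complete rows under independence

print(f"simulated share of complete rows: {observed:.3f}")
print(f"expected share of complete rows:  {expected:.3f}")
```

With these settings, close to two thirds of the observations would be dropped before any analysis begins, although 95% of all individual values are present.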

Cite

CITATION STYLE

APA

Richman, M. B., Trafalis, T. B., & Adrianto, I. (2009). Missing data imputation through machine learning algorithms. In Artificial Intelligence Methods in the Environmental Sciences (pp. 153–169). Springer Netherlands. https://doi.org/10.1007/978-1-4020-9119-3_7
