Missing Value Imputation for Mixed Data via Gaussian Copula

Yuxuan Zhao; Madeleine Udell

Conference Proceedings (Open Access)

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2020) 636-646

DOI: 10.1145/3394486.3403106

Abstract

Missing data imputation forms the first critical step of many data analysis pipelines. The challenge is greatest for mixed data sets, including real, Boolean, and ordinal data, where standard techniques for imputation fail basic sanity checks: for example, the imputed values may not follow the same distributions as the data. This paper proposes a new semiparametric algorithm to impute missing values, with no tuning parameters. The algorithm models mixed data as a Gaussian copula. This model can fit arbitrary marginals for continuous variables and can handle ordinal variables with many levels, including Boolean variables as a special case. We develop an efficient approximate EM algorithm to estimate copula parameters from incomplete mixed data. The resulting model reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show the superiority of our proposed algorithm to state-of-the-art imputation algorithms for mixed data.
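
To make the modeling idea concrete, here is a minimal sketch of Gaussian copula imputation. It is not the authors' algorithm: it assumes continuous columns only, estimates the latent correlation from complete rows rather than with the approximate EM procedure the abstract describes, and leaves out ordinal and Boolean columns entirely. The function names `to_latent`, `from_latent`, and `copula_impute` are hypothetical and chosen for illustration. Each column is mapped to normal scores through its empirical CDF, missing latent entries are filled with their conditional mean given the observed ones, and the result is mapped back through the empirical quantile function, so imputed values stay within the observed range of their column.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for both marginal transforms


def to_latent(col):
    """Map the observed entries of one column to normal scores (NaN stays NaN)."""
    z = np.full(col.shape, np.nan)
    obs = ~np.isnan(col)
    n = obs.sum()
    # Rank-based empirical CDF, scaled by n + 1 so values stay strictly in (0, 1).
    ranks = np.searchsorted(np.sort(col[obs]), col[obs], side="right")
    u = ranks / (n + 1.0)
    z[obs] = [_STD_NORMAL.inv_cdf(p) for p in u]
    return z


def from_latent(z_value, col):
    """Map a latent value back to the data scale via the empirical quantile."""
    observed = col[~np.isnan(col)]
    return float(np.quantile(observed, _STD_NORMAL.cdf(z_value)))


def copula_impute(X):
    """Impute NaNs in a continuous data matrix with a Gaussian copula sketch."""
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([to_latent(X[:, j]) for j in range(X.shape[1])])

    # Simplifying assumption: latent correlation from complete rows only.
    # This needs enough complete rows and no perfectly collinear columns.
    complete = ~np.isnan(Z).any(axis=1)
    sigma = np.corrcoef(Z[complete], rowvar=False)

    X_imp = X.copy()
    for i in np.where(~complete)[0]:
        miss = np.isnan(Z[i])
        obs = ~miss
        if not obs.any():
            # Nothing observed in this row: fall back to the latent mean of zero.
            z_miss = np.zeros(miss.sum())
        else:
            # Conditional mean of a zero-mean Gaussian:
            # E[z_miss | z_obs] = Sigma_mo Sigma_oo^{-1} z_obs
            z_miss = sigma[np.ix_(miss, obs)] @ np.linalg.solve(
                sigma[np.ix_(obs, obs)], Z[i, obs]
            )
        for j, zj in zip(np.where(miss)[0], z_miss):
            X_imp[i, j] = from_latent(zj, X[:, j])
    return X_imp


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = np.exp(x1 + 0.3 * rng.normal(size=200))  # skewed, non-Gaussian marginal
    data = np.column_stack([x1, x2])
    data[:20, 1] = np.nan  # hide some entries of the second column
    print(copula_impute(data)[:5])
```

Because the back-transform uses each column's own empirical quantiles, the sketch needs no tuning parameters, and an imputed value can never fall outside the values already observed in that column.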

Citation (APA)

Zhao, Y., & Udell, M. (2020). Missing Value Imputation for Mixed Data via Gaussian Copula. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 636–646). Association for Computing Machinery. https://doi.org/10.1145/3394486.3403106
