BACKGROUND: Fitting mixture models is challenging when the dataset to be clustered contains missing values.

METHODS AND RESULTS: We propose a dynamic clustering algorithm based on a multivariate Gaussian mixture model that efficiently imputes missing values to generate a "pseudo-complete" dataset. Cluster parameters and missing values are estimated by maximum likelihood using an expectation-maximization algorithm, and multivariate individuals are assigned to clusters by Bayesian posterior probability. A simulation showed that the proposed method converges quickly and estimates missing values accurately. The algorithm was further validated on Fisher's Iris dataset, the Yeast Cell-cycle Gene-expression dataset, and the CIFAR-10 image dataset. The results indicate that its clustering accuracy is high and comparable to that obtained from a complete dataset without missing values. Furthermore, it yielded a lower misjudgment rate than clustering after deleting missing data or after imputing missing values by mean replacement.

CONCLUSION: Our missing-value imputation clustering algorithm is feasible and, in certain situations, superior to both of these alternative clustering approaches.
Xiao, J., Xu, Q., Wu, C., Gao, Y., Hua, T., & Xu, C. (2016). Performance evaluation of missing-value imputation clustering based on a multivariate Gaussian mixture model. PLoS ONE, 11(8). https://doi.org/10.1371/journal.pone.0161112
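
The approach summarized in the abstract (EM estimation of a Gaussian mixture, imputation of missing entries into a pseudo-complete dataset, and assignment by posterior probability) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `fit_gmm_with_missing`, its parameters, the random initialization, and the small ridge term `reg` are all assumptions made for illustration. The sketch also simplifies the M-step by omitting the conditional-covariance correction that an exact EM treatment of missing data would include, and it assumes every row has at least one observed value.

```python
import numpy as np

def fit_gmm_with_missing(X, k, n_iter=50, seed=0, reg=1e-6):
    """Illustrative sketch: cluster rows of X (NaN = missing) with a Gaussian mixture.

    Returns hard labels, the pseudo-complete data, and posterior probabilities.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)

    # Initialization only: column-mean fill, random rows as means, shared covariance.
    X_hat = np.where(miss, np.nanmean(X, axis=0), X)
    means = X_hat[rng.choice(n, size=k, replace=False)]
    covs = np.stack([np.cov(X_hat, rowvar=False) + reg * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log posterior (up to a constant) from the observed coordinates only.
        log_r = np.empty((n, k))
        cond = np.zeros((n, k, d))  # conditional means of the missing coordinates
        for i in range(n):
            o, m = ~miss[i], miss[i]
            for j in range(k):
                S_oo = covs[j][np.ix_(o, o)]
                diff = X[i, o] - means[j][o]
                sol = np.linalg.solve(S_oo, diff)
                _, logdet = np.linalg.slogdet(S_oo)
                log_r[i, j] = np.log(weights[j]) - 0.5 * (
                    diff @ sol + logdet + o.sum() * np.log(2 * np.pi)
                )
                if m.any():
                    # E[x_m | x_o, component j] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
                    cond[i, j, m] = means[j][m] + covs[j][np.ix_(m, o)] @ sol
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # Impute: posterior-weighted blend of the per-component conditional means.
        X_hat[miss] = np.einsum("ij,ijd->id", r, cond)[miss]

        # M-step on the pseudo-complete data (simplified; see the note above).
        Nk = r.sum(axis=0) + 1e-10
        weights = Nk / Nk.sum()
        means = (r.T @ X_hat) / Nk[:, None]
        for j in range(k):
            c = X_hat - means[j]
            covs[j] = (r[:, j, None] * c).T @ c / Nk[j] + reg * np.eye(d)

    return r.argmax(axis=1), X_hat, r
```

Calling `labels, X_hat, resp = fit_gmm_with_missing(X, k=3)` on an array with `NaN` entries returns a cluster label per row, the imputed dataset with observed entries left unchanged, and the posterior membership probabilities. Results depend on the random initialization, so in practice several seeds would likely be compared.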