Multiple imputation inference for missing values in distributed datasets using Apache Spark

Sathish Kaliamoorthy; S. Mary Saira Bhanu

Conference Proceedings


Communications in Computer and Information Science (2018), vol. 906, pp. 24-33

DOI: 10.1007/978-981-13-1813-9_3

2 citations

4 readers


Abstract

Big data refers to large volumes of data, both structured and unstructured. Because no single machine can hold such a dataset, big data is partitioned into smaller chunks and distributed across multiple machines for quick and efficient analysis. However, these datasets are generally incomplete because they contain many missing values. Missing values are a serious impediment to data analysis, and multiple imputation is a preferred method for handling them. Existing multiple imputation implementations in statistical software packages all rely on in-memory processing and are unsuitable when the data is distributed, so a way of performing multiple imputation on distributed data is needed. The goal of this work is to implement a multiple imputation algorithm for missing values using fuzzy clustering on a distributed computing system built with Apache Spark. The results show that the multiple imputation algorithm outperforms traditional imputation techniques for missing values in a distributed computing system in terms of imputation accuracy.
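The abstract does not give the algorithm itself, so the following is only a minimal sketch of how fuzzy clustering could drive multiple imputation for a single record. It is not the authors' implementation. It assumes the cluster centroids have already been fitted (for example by fuzzy c-means), uses the standard fuzzy c-means membership formula, and fills each missing value by drawing a cluster with probability equal to the record's membership in it. All function names, the fuzzifier default m = 2, and the sampling scheme are assumptions made for illustration.

```python
import math
import random


def fuzzy_memberships(observed, centroids, m=2.0):
    """Fuzzy c-means membership of one record in each cluster.

    observed  : dict {feature_index: value} holding the non-missing features
    centroids : list of centroid vectors (assumed to be fitted beforehand)
    m         : fuzzifier, must be > 1 (m = 2 is an assumed default)
    """
    # Distance to each centroid, measured on the observed features only.
    dists = [
        math.sqrt(sum((v - c[i]) ** 2 for i, v in observed.items()))
        for c in centroids
    ]
    # A record lying exactly on one or more centroids belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


def impute_once(record, centroids, rng, m=2.0):
    """Return a copy of `record` with every None filled from one sampled cluster."""
    observed = {i: v for i, v in enumerate(record) if v is not None}
    u = fuzzy_memberships(observed, centroids, m)
    # Sample a cluster index with probability equal to its membership degree.
    k = rng.choices(range(len(centroids)), weights=u)[0]
    return [centroids[k][i] if v is None else v for i, v in enumerate(record)]


def multiple_impute(record, centroids, n_imputations=5, seed=0, m=2.0):
    """Produce several completed versions of one record (hypothetical helper)."""
    rng = random.Random(seed)
    return [impute_once(record, centroids, rng, m) for _ in range(n_imputations)]


def pool_point_estimates(estimates):
    """Pooled point estimate: the mean of the per-imputation estimates."""
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [10.0, 10.0]]
    completed = multiple_impute([2.0, None], centroids, n_imputations=5)
    print(completed)
    print(pool_point_estimates([row[1] for row in completed]))
```

Because each record is imputed independently once the centroids are known, a function like this could in principle be applied partition by partition in a distributed setting, with the centroids shared across workers. How the paper actually distributes the computation on Apache Spark is not stated in the abstract.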


Citation (APA)

Kaliamoorthy, S., & Bhanu, S. M. S. (2018). Multiple imputation inference for missing values in distributed datasets using Apache Spark. In Communications in Computer and Information Science (Vol. 906, pp. 24–33). Springer Verlag. https://doi.org/10.1007/978-981-13-1813-9_3
