Using data to build a better EM: EM* for big data

Hasan Kurban; Mark Jenne; Mehmet M. Dalkilic

Journal Article (Open Access)

International Journal of Data Science and Analytics (2017) 4(2) 83-97

DOI: 10.1007/s41060-017-0062-1

11 Citations

17 Readers

Abstract

Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed with big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the expectation maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a nonlinear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that our EM* algorithm outperforms the EM-T algorithm on large real-world and synthetic data sets. Lastly, we conclude with some theoretical underpinnings that explain why EM* is successful.
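
The abstract gives only the high-level idea, so the following is a minimal illustrative sketch and not the authors' EM* algorithm. It assumes a one-dimensional Gaussian mixture, uses Python's heapq as the heap, and treats a point's largest responsibility as its "confidence". The function name heap_em_sketch and the keep_fraction parameter are hypothetical. After each E-step, only the least confident points stay active and are revisited; the remaining points keep their cached responsibilities.

```python
import heapq
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for the point x."""
    dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    if total == 0.0:  # every density underflowed; fall back to uniform
        return [1.0 / len(dens)] * len(dens)
    return [d / total for d in dens]


def heap_em_sketch(data, k=2, iters=20, keep_fraction=0.5, seed=0):
    """Hypothetical heap-guided EM: revisit only the hardest points.

    Assumption (not from the paper): a point is "hard" when its largest
    responsibility is small, and keep_fraction of the active points stay
    active after each iteration.
    """
    rng = random.Random(seed)
    n = len(data)
    means = rng.sample(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = [[1.0 / k] * k for _ in data]  # cached responsibilities
    active = list(range(n))               # indices revisited this iteration
    visits = 0                            # number of E-step evaluations

    for _ in range(iters):
        # E-step restricted to the active points.
        heap = []
        for i in active:
            resp[i] = responsibilities(data[i], weights, means, variances)
            visits += 1
            # Min-heap keyed by confidence: hardest points surface first.
            heapq.heappush(heap, (max(resp[i]), i))

        # M-step from cached responsibilities of all points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            nj = max(nj, 1e-12)  # guard against an empty component
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)

        # Keep only the least confident points for the next iteration.
        n_keep = max(1, int(len(heap) * keep_fraction))
        active = [i for _, i in heapq.nsmallest(n_keep, heap)]

    return means, variances, weights, visits
```

In this sketch the visits counter stays well below iters * len(data), because the active set shrinks every iteration. The price is that points frozen early keep responsibilities computed under early parameter estimates, so the quality of the fit depends on the initialization and on keep_fraction.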

Cite

Kurban, H., Jenne, M., & Dalkilic, M. M. (2017). Using data to build a better EM: EM* for big data. International Journal of Data Science and Analytics, 4(2), 83–97. https://doi.org/10.1007/s41060-017-0062-1
