Accurate recasting of parameter estimation algorithms using sufficient statistics for efficient parallel speed-up: Demonstrated for center-based data clustering algorithms

Bin Zhang; Meichun Hsu; George Forman

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2000) 1910 243-254

DOI: 10.1007/3-540-45372-5_24

9 Citations

20 Readers

Abstract

Fueled by advances in computer technology and online business, data collection is rapidly accelerating, and so is the importance of its analysis, data mining. Increasing database sizes strain the scalability of many data mining algorithms. Data clustering is one of the fundamental techniques in data mining solutions. The many clustering algorithms that have been developed face new challenges with growing data sets. Algorithms with quadratic or higher computational complexity, such as agglomerative algorithms, drop out quickly. More efficient algorithms, such as K-Means and EM with linear cost per iteration, still need work to scale up to large data sets. This paper shows that many parameter estimation algorithms, including K-Means, K-Harmonic Means and EM, can be recast without approximation in terms of Sufficient Statistics, yielding superior speed-up efficiency. Estimates using today’s workstations and local area network technology suggest efficient speed-up to several hundred computers, leading to effective scale-up for clustering hundreds of gigabytes of data. Parallel clustering has been implemented in a parallel programming language, ZPL. Experimental results show above 90% utilization.
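To make the idea of recasting in terms of sufficient statistics concrete, here is a minimal Python sketch of one K-Means iteration. It is not the authors' ZPL implementation. It assumes the standard K-Means update, for which the per-cluster point counts and coordinate sums are enough to recompute the centers. The function names (`local_stats`, `merge_stats`, `update_centers`, `kmeans_step`) are hypothetical, and the "workers" are simulated by looping over data chunks in one process.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Per-cluster point counts and per-cluster coordinate sums.
Stats = Tuple[List[int], List[List[float]]]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_stats(chunk: Sequence[Point], centers: Sequence[Point]) -> Stats:
    """Statistics one (simulated) worker computes from its own chunk only."""
    dim = len(centers[0])
    counts = [0] * len(centers)
    sums = [[0.0] * dim for _ in centers]
    for point in chunk:
        k = nearest(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return counts, sums


def merge_stats(parts: Sequence[Stats]) -> Stats:
    """Global statistics are the element-wise sum of the local ones."""
    counts = list(parts[0][0])
    sums = [list(row) for row in parts[0][1]]
    for part_counts, part_sums in parts[1:]:
        for k in range(len(counts)):
            counts[k] += part_counts[k]
            for d in range(len(sums[k])):
                sums[k][d] += part_sums[k][d]
    return counts, sums


def update_centers(stats: Stats, old_centers: Sequence[Point]) -> List[List[float]]:
    """New center = coordinate sum / count; empty clusters keep their old center."""
    counts, sums = stats
    return [
        [v / counts[k] for v in sums[k]] if counts[k] else list(old_centers[k])
        for k in range(len(counts))
    ]


def kmeans_step(chunks: Sequence[Sequence[Point]], centers: Sequence[Point]) -> List[List[float]]:
    """One iteration: local statistics per chunk, one merge, then the update."""
    merged = merge_stats([local_stats(chunk, centers) for chunk in chunks])
    return update_centers(merged, centers)
```

Calling `kmeans_step` with the data in a single chunk or split across several chunks applies the same update, because only the small per-cluster statistics are combined. In floating point the two results can differ by rounding in the summation order. In a real parallel setting only these statistics, not the data points, would need to cross the network on each iteration.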

Cite

Zhang, B., Hsu, M., & Forman, G. (2000). Accurate recasting of parameter estimation algorithms using sufficient statistics for efficient parallel speed-up: Demonstrated for center-based data clustering algorithms. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1910, pp. 243–254). Springer Verlag. https://doi.org/10.1007/3-540-45372-5_24
