k-means is one of the most widely used clustering algorithms. However, in massive-data clustering tasks, traditional data mining approaches, and existing clustering mechanisms in particular, fail to withstand malicious attacks by adversaries with arbitrary background knowledge. This can violate individuals’ privacy and leak information through system resources and clustering outputs when untrusted code is run directly on the original data. To address this issue, this paper proposes a novel and effective hybrid k-means clustering algorithm that preserves differential privacy in Spark, namely Differential Privacy Hybrid k-means (DPHKMS). We combine Particle Swarm Optimization and Cuckoo search to select better initial cluster centroids within the big data computing platform Apache Spark. Furthermore, DPHKMS is implemented and theoretically proved to satisfy ε-differential privacy with a deterministic privacy budget allocation under the Laplace mechanism. Finally, experimental results on challenging benchmark data sets demonstrate that DPHKMS, while guaranteeing availability and scalability, significantly improves on existing variants of k-means and consistently outperforms state-of-the-art ones in terms of privacy preservation, verifying the effectiveness and advantages of incorporating heuristic swarm intelligence.
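To make the Laplace-mechanism step concrete, the following is a minimal single-machine sketch of a differentially private Lloyd-style k-means update. It is not the DPHKMS implementation: it omits Spark and the Particle Swarm Optimization and Cuckoo search initialization, and it takes the initial centroids as given without charging any privacy budget for choosing them. The function names (`laplace`, `dp_kmeans`), the uniform split of ε across iterations, the assumption that every coordinate lies in [0, 1], and the add-or-remove-one-record notion of neighboring data sets are all illustrative assumptions rather than details from the paper.

```python
import random


def laplace(scale, rng):
    """Draw one sample from Laplace(0, scale)."""
    # The difference of two i.i.d. Exp(1) draws is Laplace(0, 1); rescale it.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_kmeans(points, centroids, epsilon, iterations, rng):
    """Lloyd-style k-means with Laplace-perturbed cluster counts and sums.

    Assumes every coordinate of every point lies in [0, 1].
    """
    dim = len(points[0])
    # Assumption: the total budget is split uniformly across iterations.
    eps_iter = epsilon / iterations
    # Adding or removing one point changes one cluster's count by 1 and its
    # coordinate sums by at most dim in L1 norm, so the sensitivity is dim + 1.
    scale = (dim + 1) / eps_iter
    centroids = [list(c) for c in centroids]

    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centroids]
        counts = [0] * len(centroids)

        # Assignment step: attach each point to its nearest centroid.
        for p in points:
            j = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]

        # Update step: only noisy counts and noisy sums are used.
        for j in range(len(centroids)):
            noisy_count = counts[j] + laplace(scale, rng)
            if noisy_count < 1.0:
                continue  # noisy count unusable; keep the previous centroid
            centroids[j] = [
                # Clamping to [0, 1] is post-processing and costs no budget.
                min(1.0, max(0.0, (sums[j][d] + laplace(scale, rng)) / noisy_count))
                for d in range(dim)
            ]
    return centroids
```

Under these assumptions each iteration satisfies (ε / iterations)-differential privacy, and sequential composition over all iterations gives ε in total. A call might look like `dp_kmeans(data, initial_centroids, epsilon=1.0, iterations=5, rng=random.Random())`, where `data` is a list of equal-length tuples scaled to [0, 1].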
CITATION STYLE
Gao, Z. Q., & Zhang, L. J. (2018). DPHKMS: An efficient hybrid clustering preserving differential privacy in spark. In Lecture Notes on Data Engineering and Communications Technologies (Vol. 6, pp. 367–377). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-319-59463-7_37