On Saving Outliers for Better Clustering over Noisy Data

Abstract

Clustering is often distracted by errors, which are frequently observed in almost all areas, ranging from online questionnaires to sensor readings in IoT. The dirty data values not only make themselves (the corresponding tuples) outlying, but also mislead the clustering of the remaining tuples, e.g., by mistakenly splitting a cluster into two or distorting the cluster center. The reason is that traditional clustering methods either simply ignore the outliers, as DBSCAN does, or assign them to the closest clusters anyway, as K-Means does. In this paper, we propose to save the outliers for better clustering. The idea is to adjust the erroneous values of an outlier (often minimally) so that it appears normal. That is, the tuples after value adjustment are no longer outlying, and thus will be clustered without distracting others. Outlier saving by value adjustment is designed to work with any clustering method (e.g., DBSCAN or K-Means). Our technical contributions include: (1) showing the NP-hardness of the outlier saving problem for clustering, (2) deriving lower and upper bounds of the optimal solutions, and (3) devising an approximation algorithm with performance guarantees referring to the aforesaid bounds. Experiments on datasets with real-world outliers demonstrate the higher accuracy of our proposal compared to state-of-the-art approaches. Remarkably, we show that the adjusted data with outlier saving indeed significantly improve clustering, as well as other applications such as classification and record matching.
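
To make the idea of saving an outlier by value adjustment concrete, here is a minimal sketch in Python. It is not the approximation algorithm of the paper, which the abstract does not spell out. It is a naive brute-force heuristic under stated assumptions: a tuple counts as an outlier when fewer than `min_neighbors` other tuples lie within distance `eps` (a DBSCAN-style density test), and an outlier is repaired by copying as few attribute values as possible from a non-outlying tuple. All function and parameter names are hypothetical.

```python
from itertools import combinations
from math import dist


def neighbor_count(point, others, eps):
    """Count the tuples in `others` within Euclidean distance eps of `point`."""
    return sum(1 for o in others if dist(point, o) <= eps)


def split_outliers(data, eps, min_neighbors):
    """Return (inlier indices, outlier indices) under the density test."""
    inliers, outliers = [], []
    for i, p in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if neighbor_count(p, rest, eps) >= min_neighbors:
            inliers.append(i)
        else:
            outliers.append(i)
    return inliers, outliers


def save_outlier(point, inlier_points, eps, min_neighbors):
    """Adjust as few attributes of `point` as possible so it stops being outlying.

    Assumption for illustration: replacement values are borrowed from
    inlier tuples. Ties on the number of changed attributes are broken
    by the smallest distance moved.
    """
    n_attrs = len(point)
    for k in range(1, n_attrs + 1):  # try changing 1 attribute, then 2, ...
        best = None
        for attrs in combinations(range(n_attrs), k):
            for donor in inlier_points:
                candidate = list(point)
                for a in attrs:
                    candidate[a] = donor[a]
                if neighbor_count(candidate, inlier_points, eps) >= min_neighbors:
                    cost = dist(point, candidate)
                    if best is None or cost < best[0]:
                        best = (cost, candidate)
        if best is not None:
            return best[1]
    return list(point)  # no adjustment found; leave the tuple unchanged


# Toy data: four tuples near (1, 1) and one tuple whose second value is dirty.
data = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [1.0, 9.0]]
inlier_idx, outlier_idx = split_outliers(data, eps=0.5, min_neighbors=2)
inlier_points = [data[i] for i in inlier_idx]
saved = save_outlier(data[outlier_idx[0]], inlier_points, eps=0.5, min_neighbors=2)
print(outlier_idx, saved)
```

On this toy data only the second attribute of the dirty tuple is changed, so the tuple can be clustered with the others while its presumably correct first value is kept. The search is exponential in the number of attributes, which is consistent with the NP-hardness mentioned in the abstract and is why a practical method needs bounds and approximation.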

Cite

Song, S., Gao, F., Huang, R., & Wang, Y. (2021). On Saving Outliers for Better Clustering over Noisy Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1692–1704). Association for Computing Machinery. https://doi.org/10.1145/3448016.3457271