A new efficient data cleansing method

Li Zhao; Sung Sam Yuan; Sun Peng; Ling Tok Wang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2002) 2453 484-493

DOI: 10.1007/3-540-46146-9_48

9 Citations

8 Readers

Abstract

One of the most important tasks in data cleansing is detecting and removing duplicate records. This task has two main components: detection and comparison. A detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary and expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN with LCSS saves about 39% of comparisons, and integrating SNM-INOUT with LCSS saves about 56%.
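The split between detection and comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's LCSS, SNM-IN, or SNM-INOUT, whose definitions the abstract does not give. It assumes the plain sorted neighborhood method (SNM) for detection and a hypothetical similarity score for comparison: the longest common subsequence length divided by the length of the longer string. The function names, the window size, and the 0.8 threshold are all illustrative assumptions.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization (not taken from the paper): LCS length / longer length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def sorted_neighborhood(records, key=lambda r: r, window=3, threshold=0.8):
    """Plain SNM sketch: sort by key, then compare each record only with the
    next (window - 1) records. Returns matched pairs and the comparison count."""
    ordered = sorted(records, key=key)
    pairs = []
    comparisons = 0
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            comparisons += 1  # each call to the comparison method is the costly step
            if lcs_similarity(rec, other) >= threshold:
                pairs.append((rec, other))
    return pairs, comparisons


records = ["john smith", "mary jones", "jon smith", "zed", "mary jone"]
pairs, comparisons = sorted_neighborhood(records, window=2)
```

The comparison counter makes the cost visible: the detection step controls how many times the comparison method runs, which is the quantity the paper's reported savings refer to.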

Cite

Zhao, L., Yuan, S. S., Peng, S., & Wang, L. T. (2002). A new efficient data cleansing method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2453, pp. 484–493). Springer Verlag. https://doi.org/10.1007/3-540-46146-9_48
