A new efficient data cleansing method

Abstract

One of the most important tasks in data cleansing is detecting and removing duplicate records. This task has two main components: detection and comparison. A detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary and expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN with LCSS saves about 39% of comparisons, and integrating SNM-INOUT with LCSS saves about 56%.
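
To make the two components named in the abstract concrete, the following is a minimal Python sketch of a comparison function based on the longest common subsequence, plugged into a plain sorted-neighbourhood (SNM) pass. It is an illustration under stated assumptions, not the authors' implementation: the normalisation by the longer string, the threshold of 0.8, the window size, and the names lcs_length, lcs_similarity and snm_duplicates are all hypothetical. The abstract does not define LCSS exactly, nor does it describe how SNM-IN and SNM-INOUT skip comparisons, so neither variant is reproduced here.

```python
from typing import Callable, List, Sequence, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1].

    Assumption: normalise by the longer string. The abstract does not
    give the exact formula used by LCSS.
    """
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def snm_duplicates(
    records: Sequence[str],
    key: Callable[[str], str] = lambda r: r,
    window: int = 3,
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> Tuple[List[Tuple[str, str]], int]:
    """Plain sorted-neighbourhood pass.

    Sort the records by key, then compare each record only with the
    window - 1 records that precede it in sorted order. Returns the
    pairs judged duplicate and the number of comparisons performed.
    """
    ordered = sorted(records, key=key)
    pairs = []
    comparisons = 0
    for i, rec in enumerate(ordered):
        for other in ordered[max(0, i - window + 1):i]:
            comparisons += 1
            if lcs_similarity(other, rec) >= threshold:
                pairs.append((other, rec))
    return pairs, comparisons
```

Calling snm_duplicates(["john smith", "jon smith", "mary jones", "alice"], window=2) compares only adjacent records in sorted order. The comparison counter is returned because the number of comparisons is the cost the abstract reports savings on; any variant that skips comparisons would be measured against this baseline count.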

Zhao, L., Yuan, S. S., Peng, S., & Wang, L. T. (2002). A new efficient data cleansing method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2453, pp. 484–493). Springer Verlag. https://doi.org/10.1007/3-540-46146-9_48
