One of the most important tasks in data cleansing is detecting and removing duplicate records. This task has two main components: detection and comparison. A detection method decides which records will be compared, and a comparison method determines whether two compared records are duplicates. Comparisons account for a large share of data cleansing time. We find that if a comparison method satisfies certain properties, many unnecessary and expensive comparisons can be avoided. In this paper, we first propose a new comparison method, LCSS, based on the longest common subsequence, and show that it possesses the desired properties. We then propose two new detection methods, SNM-IN and SNM-INOUT, which are variants of the popular detection method SNM. A performance study on real and synthetic databases shows that integrating SNM-IN with LCSS saves about 39% of comparisons, and integrating SNM-INOUT with LCSS saves about 56%.
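To make the split between detection and comparison concrete, the following is a minimal sketch and not the paper's implementation. It assumes SNM refers to the sorted neighborhood method (sort the records by a key, then compare only records that fall within a sliding window), and it uses a longest-common-subsequence ratio as the comparison. The normalization in `lcs_similarity`, the `window` and `threshold` parameters, and all function names are illustrative assumptions; the abstract does not define LCSS, SNM-IN, or SNM-INOUT in enough detail to reproduce them here.

```python
from typing import Callable, List, Sequence, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length divided by the longer string's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def sorted_neighborhood(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int = 3,
    threshold: float = 0.8,
) -> Tuple[List[Tuple[str, str]], int]:
    """Plain sliding-window detection: compare each record only with the
    window - 1 records that precede it in sorted order.

    Returns the pairs judged duplicate and the number of comparisons made.
    """
    ordered = sorted(records, key=key)
    pairs = []
    comparisons = 0
    for i, rec in enumerate(ordered):
        for other in ordered[max(0, i - window + 1):i]:
            comparisons += 1
            if lcs_similarity(key(other), key(rec)) >= threshold:
                pairs.append((other, rec))
    return pairs, comparisons


if __name__ == "__main__":
    names = ["mary jonse", "john smith", "zed", "jon smith", "mary jones"]
    duplicates, n_comparisons = sorted_neighborhood(names, key=str.lower, window=2)
    print(duplicates, n_comparisons)
```

The comparison counter is returned so that the cost of a detection strategy can be measured directly. A variant that skips comparisons whose outcome can be inferred from earlier ones would report a smaller count on the same input.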
Zhao, L., Yuan, S. S., Peng, S., & Wang, L. T. (2002). A new efficient data cleansing method. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2453, pp. 484–493). Springer Verlag. https://doi.org/10.1007/3-540-46146-9_48