Cleansing data for mining and warehousing


Abstract

Given the rapid growth of data, it is important to extract, mine, and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the "garbage in, garbage out" principle: "dirty" data files are prevalent owing to incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, multiple records may refer to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques for pre-processing records before sorting them, so that potentially matching records are brought into a close neighbourhood. Based on these techniques, we implement a data cleansing system that can detect and remove more duplicate records than existing methods.
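The abstract describes a sorted-neighbourhood style pipeline: normalize (pre-process) each record, sort on a key built from the normalized fields so that likely duplicates land near each other, then compare only records inside a small sliding window. The following is a minimal sketch of that idea, not the paper's actual method: the field names (`name`, `address`), the toy abbreviation table, the window size, and the use of `difflib.SequenceMatcher` as the matching rule are all illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; the paper's pre-processing rules
# are more elaborate than this.
ABBREVIATIONS = {"rd": "road", "st": "street", "dept": "department"}

def normalize(value: str) -> str:
    """Lower-case, strip punctuation, and expand abbreviations so that
    differently formatted values compare equal."""
    tokens = re.sub(r"[^\w\s]", " ", value.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def sort_key(record: dict) -> str:
    """Build a key from normalized fields; sorting on it brings
    potentially matching records into a close neighbourhood."""
    return normalize(record["name"]) + "|" + normalize(record["address"])

def similar(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Crude string-similarity test, standing in for the paper's
    record-matching rules."""
    return SequenceMatcher(None, sort_key(a), sort_key(b)).ratio() >= threshold

def detect_duplicates(records: list[dict], window: int = 3) -> list[tuple[int, int]]:
    """Sort records on the normalized key, then compare each record
    only with its neighbours inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similar(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs

if __name__ == "__main__":
    data = [
        {"name": "J. Smith", "address": "12 Main St"},
        {"name": "J Smith", "address": "12 Main Street"},
        {"name": "A. Jones", "address": "5 Oak Rd"},
    ]
    print(detect_duplicates(data))  # -> [(0, 1)]
```

The windowed comparison is what makes the approach scale: after sorting, each record is compared with only a constant number of neighbours rather than with every other record, so better pre-processing (which determines sort order) directly improves how many true duplicates fall inside the window.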

Citation (APA)

Lee, M. L., Lu, H., Ling, T. W., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. In Lecture Notes in Computer Science (Vol. 1677, pp. 751–760). Springer-Verlag. https://doi.org/10.1007/3-540-48309-8_70
