Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the "garbage in, garbage out" principle. "Dirty" data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought into a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
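The pre-process/sort/compare approach the abstract describes can be sketched as a sorted-neighbourhood pass: derive a key from each record, sort on it so likely duplicates become neighbours, then compare only records within a small sliding window. The key function, the string-similarity test, and the record layout below are illustrative assumptions, not the paper's actual algorithm.

```python
# Sketch of a sorted-neighbourhood duplicate detector.
# Assumptions (not from the paper): each record is a (name, zip_code)
# tuple, the sort key is surname + first initial + zip code, and
# similarity is a plain string ratio on the name field.
from difflib import SequenceMatcher


def sort_key(record):
    # Assumed key: normalised surname, first initial, zip code.
    name, zip_code = record
    parts = name.lower().split()
    return (parts[-1], parts[0][0], zip_code)


def similar(a, b, threshold=0.8):
    # Assumed matcher: character-level similarity on the name field.
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio() >= threshold


def find_duplicates(records, window=3):
    # Sort so potential matches land near each other, then compare
    # each record only with the next (window - 1) records.
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similar(rec, other):
                pairs.append((rec, other))
    return pairs


records = [
    ("John Smith", "12345"),
    ("Alice Jones", "99999"),
    ("Jon Smith", "12345"),
]
print(find_duplicates(records))
```

With a window of size 3, "John Smith" and "Jon Smith" sort adjacently and are flagged as a candidate duplicate pair, while "Alice Jones" is never compared against distant records, which is the source of the method's efficiency over all-pairs comparison.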
CITATION STYLE
Lee, M. L., Lu, H., Ling, T. W., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1677, pp. 751–760). Springer Verlag. https://doi.org/10.1007/3-540-48309-8_70