Text joins for data cleansing and integration in an RDBMS

Abstract

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data, or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching can be performed effectively via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. In this paper, we propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
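
To make the idea of a text join concrete, the following is a minimal sketch of the exact (not the sampling-based) variant, expressed as plain SQL over token-weight tables. Several choices here are assumptions for illustration and are not taken from the abstract: tokens are lowercased whitespace-separated words, weights are a simple tf-idf variant computed separately for each relation, SQLite (through Python's `sqlite3` module) stands in for the RDBMS, and the names `unit_tfidf`, `text_join`, `r_weights`, and `s_weights` are hypothetical.

```python
import math
import sqlite3
from collections import Counter


def unit_tfidf(strings):
    """Return one {token: weight} dict per string, scaled to unit L2 length.

    Assumption: tokens are lowercased whitespace-separated words and the
    idf term is log(1 + n / df), which is always positive.
    """
    docs = [Counter(s.lower().split()) for s in strings]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    vectors = []
    for d in docs:
        raw = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def text_join(r_strings, s_strings, threshold):
    """Exact text join: all (r_id, s_id, similarity) with cosine >= threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE r_weights (tid INTEGER, token TEXT, weight REAL)")
    conn.execute("CREATE TABLE s_weights (tid INTEGER, token TEXT, weight REAL)")
    for table, strings in (("r_weights", r_strings), ("s_weights", s_strings)):
        rows = [
            (tid, tok, w)
            for tid, vec in enumerate(unit_tfidf(strings))
            for tok, w in vec.items()
        ]
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    # Because every vector has unit length, the sum of weight products over
    # shared tokens is exactly the cosine similarity of the two strings.
    query = """
        SELECT r.tid, s.tid, SUM(r.weight * s.weight) AS sim
        FROM r_weights AS r
        JOIN s_weights AS s ON r.token = s.token
        GROUP BY r.tid, s.tid
        HAVING SUM(r.weight * s.weight) >= ?
        ORDER BY r.tid, s.tid
    """
    result = conn.execute(query, (threshold,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    r = ["AT&T Labs Research", "Columbia University"]
    s = ["Columbia University", "AT&T Labs"]
    for r_id, s_id, sim in text_join(r, s, 0.8):
        print(r[r_id], "<->", s[s_id], round(sim, 3))
```

The join touches every pair of tuples that share at least one token, which suggests why an exact answer can be expensive on large relations and why an approximate, sampling-based strategy is attractive.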

Cite

Gravano, L., Ipeirotis, P. G., Koudas, N., & Srivastava, D. (2003). Text joins for data cleansing and integration in an RDBMS. In Proceedings - International Conference on Data Engineering (pp. 729–731). https://doi.org/10.1109/ICDE.2003.1260850
