Text joins for data cleansing and integration in an RDBMS

  • Gravano L
  • Ipeirotis P
  • Koudas N
 et al. 

Abstract

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
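
To make the matching step in the abstract concrete, the following is a minimal sketch of an exact text join under cosine similarity. It is an illustration only, not the paper's method: it runs in memory rather than inside an RDBMS, it does not implement the sampling-based approximation, and the q-gram tokenizer, the tf-idf weighting formula, the threshold value, and all function names (`qgrams`, `tfidf_vectors`, `cosine`, `text_join`) are assumptions that do not come from the abstract.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Split a string into overlapping lowercase q-grams (assumed tokenizer)."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def tfidf_vectors(strings, q=3):
    """Return one unit-length tf-idf vector (dict: token -> weight) per string."""
    token_counts = [Counter(qgrams(s, q)) for s in strings]
    n = len(strings)
    # Document frequency: in how many strings each token appears.
    doc_freq = Counter(tok for counts in token_counts for tok in counts)
    vectors = []
    for counts in token_counts:
        # Assumed weighting: tf * log(1 + n / df), which is always positive.
        weights = {tok: tf * math.log(1 + n / doc_freq[tok])
                   for tok, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            vectors.append({})
        else:
            vectors.append({tok: w / norm for tok, w in weights.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def text_join(left, right, threshold=0.5, q=3):
    """Exact text join: all (left, right, score) pairs with score >= threshold."""
    vectors = tfidf_vectors(left + right, q)
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    matches = []
    for a, x in zip(left, left_vecs):
        for b, y in zip(right, right_vecs):
            score = cosine(x, y)
            if score >= threshold:
                matches.append((a, b, score))
    return matches


if __name__ == "__main__":
    for a, b, score in text_join(["Microsoft Corp"], ["Microsft Corp", "Oracle Inc"]):
        print(f"{a!r} ~ {b!r}: {score:.3f}")
```

The nested loop compares every pair of strings, so the cost grows with the product of the two input sizes. That quadratic cost is one way to see why the abstract calls an exact text join expensive and proposes an approximate strategy instead.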

Authors

  • Luis Gravano

  • Panagiotis G. Ipeirotis

  • Nick Koudas

  • Divesh Srivastava
