Declarative data cleaning: Language, model, and algorithms

Helena Galhardas; Daniela Florescu; Dennis Shasha; Eric Simon; Cristian Augustin Saita

Conference Proceedings

VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases (2001) 371-380

243 Citations

108 Readers

Abstract

The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. This holds regardless of the application: relational database joining, web-related, or scientific. In all cases, existing ETL (Extraction, Transformation, Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge is the design and implementation of a data flow graph that effectively and efficiently generates clean data. Needed improvements to the current state of the art include (i) a clear separation between the logical specification of data transformations and their physical implementation, (ii) an explanation of the reasoning behind cleaning results, and (iii) interactive facilities to tune a data cleaning program. This paper presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
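
The separation between logical specification and physical implementation mentioned in point (i) can be illustrated with a small sketch. This is not the language or the algorithms of the paper: `MatchSpec`, `nested_loop_match`, `blocked_match`, and the first-letter blocking key are hypothetical names chosen for illustration, and Python's `difflib.SequenceMatcher` stands in for whatever similarity function a real system would use. The logical part states only which records count as duplicates; two interchangeable physical strategies then evaluate it, and each result carries its similarity score as a minimal explanation of why the pair was matched.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for a real metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@dataclass(frozen=True)
class MatchSpec:
    """Logical specification (hypothetical): WHAT counts as a duplicate."""
    similarity: Callable[[str, str], float]
    threshold: float


Match = Tuple[int, int, float]  # (index of left record, index of right record, score)


def nested_loop_match(records: List[str], spec: MatchSpec) -> List[Match]:
    """Physical strategy 1: compare every pair of records."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = spec.similarity(records[i], records[j])
            if score >= spec.threshold:
                matches.append((i, j, score))
    return matches


def blocked_match(records: List[str], spec: MatchSpec,
                  key: Callable[[str], str]) -> List[Match]:
    """Physical strategy 2: compare only records that share a blocking key.

    Cheaper than the nested loop, but it can miss pairs whose keys differ.
    """
    blocks: Dict[str, List[int]] = {}
    for index, record in enumerate(records):
        blocks.setdefault(key(record), []).append(index)
    matches = []
    for indices in blocks.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                score = spec.similarity(records[i], records[j])
                if score >= spec.threshold:
                    matches.append((i, j, score))
    return sorted(matches)


if __name__ == "__main__":
    titles = [
        "Declarative data cleaning",
        "declarative data cleaning.",
        "An unrelated title",
    ]
    spec = MatchSpec(similarity=string_similarity, threshold=0.9)
    print(nested_loop_match(titles, spec))
    print(blocked_match(titles, spec, key=lambda r: r[:1].lower()))
```

The same `MatchSpec` is passed unchanged to both functions, so the choice of strategy becomes a tuning decision rather than a change to the specification. The blocked variant trades completeness for speed, which is the kind of trade-off a user would want to inspect interactively.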

Cite

Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C. A. (2001). Declarative data cleaning: Language, model, and algorithms. In VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases (pp. 371–380). Morgan Kaufmann.
