Abstract
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. This holds regardless of the application: relational database joining, web-related, or scientific. In all cases, existing ETL (Extraction, Transformation, Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge is the design and implementation of a data flow graph that effectively and efficiently generates clean data. Needed improvements to the current state of the art include (i) a clear separation between the logical specification of data transformations and their physical implementation, (ii) an explanation of the reasoning behind cleaning results, and (iii) interactive facilities to tune a data cleaning program. This paper presents a language, an execution model, and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results assess the proposed framework for data cleaning.
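The separation between logical specification and physical implementation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's language or its algorithms: the sample records, the `similar` predicate, the 0.8 threshold, and the first-letter blocking key are all assumptions chosen for illustration. The matching criterion is stated once, and two interchangeable execution strategies evaluate it.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical dirty bibliographic strings (illustrative, not from the paper's data set).
records = [
    "Galhardas H. Declarative data cleaning. VLDB 2001",
    "Galhardas, H. Declarative Data Cleaning, VLDB 2001.",
    "Shasha D. Tuning database systems. 1992",
]

def similar(a, b, threshold=0.8):
    """Logical specification: two records match if their lowercased text
    is similar enough. The threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def match_nested_loop(recs, predicate):
    """Physical plan 1: evaluate the predicate on every pair of records."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(recs), 2)
        if predicate(a, b)
    }

def match_blocked(recs, predicate, key=lambda r: r[:1].lower()):
    """Physical plan 2: evaluate the predicate only within groups of records
    that share a blocking key (here, an assumed first-letter key)."""
    blocks = {}
    for i, r in enumerate(recs):
        blocks.setdefault(key(r), []).append(i)
    pairs = set()
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if predicate(recs[i], recs[j]):
                pairs.add((i, j))
    return pairs

nested = match_nested_loop(records, similar)
blocked = match_blocked(records, similar)
```

On this sample both strategies return the same pairs, while the blocked variant evaluates the predicate on fewer candidates. In general, blocking can miss matches whose keys differ, which is one reason a cleaning framework would want the choice of strategy to be explicit and tunable.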
Cite
Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C. A. (2001). Declarative data cleaning: Language, model, and algorithms. In VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases (pp. 371–380). Morgan Kaufmann.