Leveraging the common cause of errors for constraint-based data cleansing


Abstract

This study describes a statistically motivated approach to constraint-based data cleansing that derives the cause of errors from the distribution of conflicting tuples. In real-world dirty data, errors are often not randomly distributed; rather, they tend to occur only under certain conditions, such as when a transaction is handled by a particular operator or when the weather is rainy. By leveraging such common conditions, or "cause conditions", the algorithm resolves multi-tuple conflicts quickly and accurately in realistic settings where the error distribution is skewed. We present complexity analyses of the problem, identifying two subproblems that are NP-complete, and then introduce, for each subproblem, heuristics that work in sub-polynomial time. The algorithms are tested on three sets of data and rules. The experiments show that, compared with state-of-the-art methods for data cleansing based on Conditional Functional Dependencies (CFDs) and Functional Dependencies (FDs), the proposed algorithm scales better with data size, is the only method that outputs complete repairs, and is more accurate when the error distribution is skewed.
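The core idea can be illustrated with a toy sketch: group tuples by an FD's left-hand side, treat the minority side of each conflicting group as suspect, and look for an attribute value shared by the suspects as a candidate cause condition. This is an illustrative simplification under assumed column names (`zip`, `city`, `operator`) and data, not the paper's actual algorithm.

```python
from collections import Counter, defaultdict

# Toy relation with the FD zip -> city. Errors were injected only into rows
# handled by operator "op2" -- a hypothetical "cause condition"; the schema
# and values here are illustrative, not from the paper.
rows = [
    {"zip": "10001", "city": "NYC",    "operator": "op1"},
    {"zip": "10001", "city": "NYC",    "operator": "op1"},
    {"zip": "10001", "city": "Boston", "operator": "op2"},  # conflict
    {"zip": "02101", "city": "Boston", "operator": "op1"},
    {"zip": "02101", "city": "Boston", "operator": "op1"},
    {"zip": "02101", "city": "NYC",    "operator": "op2"},  # conflict
]

# Group tuples by the FD's left-hand side.
groups = defaultdict(list)
for r in rows:
    groups[r["zip"]].append(r)

# Collect the minority side of each conflicting group: under a skewed error
# distribution, these tuples are the likely errors.
minority = []
for g in groups.values():
    if len({x["city"] for x in g}) > 1:
        majority_city = Counter(x["city"] for x in g).most_common(1)[0][0]
        minority += [x for x in g if x["city"] != majority_city]

# The attribute value most common among the suspect tuples is a candidate
# cause condition.
cause = Counter(x["operator"] for x in minority).most_common(1)[0][0]

# Repair: for tuples matching the cause condition, restore the majority
# value of their group, yielding a complete repair of every conflict.
for g in groups.values():
    majority_city = Counter(x["city"] for x in g).most_common(1)[0][0]
    for x in g:
        if x["operator"] == cause:
            x["city"] = majority_city
```

With the data above, the condition `operator = "op2"` is identified as the cause, and both conflicting tuples are repaired to their group's majority city.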

Citation (APA)

Hoshino, A., Nakayama, H., Ito, C., Kanno, K., & Nishimura, K. (2015). Leveraging the common cause of errors for constraint-based data cleansing. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9441, pp. 164–176). Springer Verlag. https://doi.org/10.1007/978-3-319-25660-3_14
