Leveraging the common cause of errors for constraint-based data cleansing


Abstract

This study describes a statistically motivated approach to constraint-based data cleansing that derives the cause of errors from the distribution of conflicting tuples. In real-world dirty data, errors are often not randomly distributed; rather, they tend to occur only under certain conditions, such as when a transaction is handled by a particular operator or when the weather is rainy. By leveraging such common conditions, or "cause conditions", the algorithm resolves multi-tuple conflicts quickly and accurately in realistic settings where the error distribution is skewed. We present complexity analyses of the problem, identifying two subproblems that are NP-complete, and then introduce, for each subproblem, heuristics that work in sub-polynomial time. The algorithms are tested on three sets of data and rules. The experiments show that, compared with state-of-the-art methods for data cleansing based on Conditional Functional Dependencies (CFDs) and Functional Dependencies (FDs), the proposed algorithm scales better with data size, is the only method that outputs complete repairs, and is more accurate when the error distribution is skewed.
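The core idea can be illustrated with a toy sketch: group tuples by an FD's left-hand side, treat the minority side of each conflicting group as suspect, and look for an attribute value shared by the suspects as a candidate cause condition. This is an illustrative simplification under assumed column names (`zip`, `city`, `operator`) and data, not the paper's actual algorithm.

```python
from collections import Counter, defaultdict

# Toy relation with the FD zip -> city. Errors were injected only into rows
# handled by operator "op2" -- a hypothetical "cause condition"; the schema
# and values here are illustrative, not from the paper.
rows = [
    {"zip": "10001", "city": "NYC",    "operator": "op1"},
    {"zip": "10001", "city": "NYC",    "operator": "op1"},
    {"zip": "10001", "city": "Boston", "operator": "op2"},  # conflict
    {"zip": "02101", "city": "Boston", "operator": "op1"},
    {"zip": "02101", "city": "Boston", "operator": "op1"},
    {"zip": "02101", "city": "NYC",    "operator": "op2"},  # conflict
]

# Group tuples by the FD's left-hand side.
groups = defaultdict(list)
for r in rows:
    groups[r["zip"]].append(r)

# Collect the minority side of each conflicting group: under a skewed error
# distribution, these tuples are the likely errors.
minority = []
for g in groups.values():
    if len({x["city"] for x in g}) > 1:
        majority_city = Counter(x["city"] for x in g).most_common(1)[0][0]
        minority += [x for x in g if x["city"] != majority_city]

# The attribute value most common among the suspect tuples is a candidate
# cause condition.
cause = Counter(x["operator"] for x in minority).most_common(1)[0][0]

# Repair: for tuples matching the cause condition, restore the majority
# value of their group, yielding a complete repair of every conflict.
for g in groups.values():
    majority_city = Counter(x["city"] for x in g).most_common(1)[0][0]
    for x in g:
        if x["operator"] == cause:
            x["city"] = majority_city
```

With the data above, the condition `operator = "op2"` is identified as the cause, and both conflicting tuples are repaired to their group's majority city.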

Citation (APA)

Hoshino, A., Nakayama, H., Ito, C., Kanno, K., & Nishimura, K. (2015). Leveraging the common cause of errors for constraint-based data cleansing. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9441, pp. 164–176). Springer Verlag. https://doi.org/10.1007/978-3-319-25660-3_14
