Effective and efficient data cleaning for entity matching

Abstract

As a key data-integration step, entity matching (EM) identifies tuples referring to the same real-world entities in disparate data sources. In many cases, the EM quality can be improved by repairing incorrect values in the data; at the same time, it is well known that the time costs of data cleaning by human experts could be prohibitive. In this paper, we focus on the time-consuming human-in-the-loop data-cleaning problem for relational EM, by recommending to human experts a time-efficient order in which values of attributes could be cleaned in the given data. Our proposed domain-independent cleaning framework aims to save human users' time, by guiding them in cleaning the EM inputs in an attribute order that is as conducive to maximizing EM accuracy as possible within a given constraint on the time they spend on cleaning. In guiding the cleaning process, our attribute-recommendation methods discover and take advantage of information provided by the data, and also use feedback from the EM engine. Our preliminary experimental results suggest that the proposed approach leads to measurable speedup, for a variety of time constraints, in the improvement of EM accuracy over the baseline approach, in which domain experts choose the sequence in which to clean the attributes of the inputs.
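
To make the idea of a time-constrained attribute order concrete, here is a minimal sketch. It is not the authors' method, which the abstract does not spell out; it swaps in a simple greedy heuristic that ranks attributes by estimated accuracy gain per minute of expert time. The names `Attribute`, `est_gain`, `est_cost`, and `recommend_order`, as well as all the numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    est_gain: float  # hypothetical estimate of EM-accuracy gain from cleaning this attribute
    est_cost: float  # hypothetical estimate of expert time in minutes (must be > 0)


def recommend_order(attrs, budget):
    """Greedy heuristic (an assumption, not the paper's algorithm):
    take attributes by highest estimated gain per minute, skipping any
    attribute whose estimated cost no longer fits the remaining budget."""
    order = []
    remaining = budget
    ranked = sorted(attrs, key=lambda a: a.est_gain / a.est_cost, reverse=True)
    for attr in ranked:
        if attr.est_cost <= remaining:
            order.append(attr.name)
            remaining -= attr.est_cost
    return order


# Made-up attributes and estimates, for illustration only.
attrs = [
    Attribute("title", est_gain=0.10, est_cost=20),
    Attribute("price", est_gain=0.02, est_cost=5),
    Attribute("brand", est_gain=0.06, est_cost=10),
]

print(recommend_order(attrs, budget=30))  # → ['brand', 'title']
```

In a setting like the one the abstract describes, the gain estimates would presumably be refreshed from the data and from EM-engine feedback after each cleaning step rather than fixed up front as they are here.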

Cite

Ao, J., & Chirkova, R. (2019). Effective and efficient data cleaning for entity matching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery. https://doi.org/10.1145/3328519.3329127
