An automatic blocking strategy for XML duplicate detection

  • Leitão L
  • Calado P

Abstract

Duplicate detection consists of finding objects that, despite having different representations in a database, correspond to the same real-world entity. This is typically achieved by comparing all objects to each other, which can be infeasible for large datasets. Blocking strategies have been devised to reduce the number of objects to compare, at the cost of losing some duplicates. However, these strategies typically rely on user knowledge to discover a set of parameters that optimize the comparisons while minimizing the loss. Also, they do not usually optimize the comparison between each pair of objects. In this paper, we propose a blocking method that combines two optimization strategies: one to select which objects to compare and another to optimize pair-wise object comparisons. In addition, we propose a machine learning approach to determine the required parameters without the need for user intervention. Experiments performed on several datasets show that we are able not only to effectively determine the optimization parameters, but also to significantly improve efficiency while maintaining an acceptable loss of recall.
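To illustrate the general idea of blocking described in the abstract, the following is a minimal sketch of plain key-based blocking. It is not the method proposed in the paper: the function names `blocking_key` and `candidate_pairs`, the three-character title prefix used as a key, and the sample records are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the lowercased title.
    # A real system would need a carefully chosen (or learned) key.
    return record["title"].strip().lower()[:3]


def candidate_pairs(records, key=blocking_key):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    # Only records that share a block are compared with each other.
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


# Illustrative records (made up for this sketch).
records = [
    {"title": "The Matrix"},
    {"title": "the matrix (1999)"},
    {"title": "Matrix, The"},
    {"title": "Toy Story"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}: {sorted(pairs)}")
```

With this key, only records 0 and 1 land in the same block, so one comparison replaces six. Record 2 ("Matrix, The") plausibly refers to the same film but falls into a different block and is never compared, which is the kind of recall loss the abstract mentions as the cost of blocking.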

Cite

Leitão, L., & Calado, P. (2013). An automatic blocking strategy for XML duplicate detection. ACM SIGAPP Applied Computing Review, 13(2), 42–53. https://doi.org/10.1145/2505420.2505424
