An Efficient Multi-Phase Blocking Strategy for Entity Resolution in Big Data

  • et al.

Abstract

Entity Resolution (ER) is the process of identifying records that refer to the same real-world entity. It plays a key role in many applications, such as data warehousing, data integration, and business intelligence. Comparing every record with every other record is infeasible, especially for a large dataset. Blocking techniques have been introduced to overcome this problem. In this paper, we propose a novel Efficient Multi-Phase Blocking Strategy (EMPBS) for resolving duplicates in big data. To our knowledge, some state-of-the-art blocking techniques (e.g., Q-grams) may produce overlapping blocks, which cause redundant comparisons and hence increase time complexity. Our proposed blocking strategy produces disjoint blocks and has lower time complexity than Q-grams and standard blocking techniques. In addition, EMPBS is general and places no restrictions on the type of blocking keys. EMPBS consists of three phases. The first phase generates three efficient single blocking keys. The second phase takes the output of the first phase as input and constructs compound keys, each formed by concatenating two single blocking keys. The three compound blocking keys produced by this phase are the input to the last phase, which generates the Efficient Multi-Phase Blocking Key (EMPBK). EMPBK is constructed as the union of two compound blocking keys. The implementation of EMPBS shows promising results in terms of Reduction Ratio (RR): it achieves a higher RR than a single blocking key while maintaining nearly the same precision and recall. EMPBS reduced the average number of comparisons by about 84% relative to a single blocking key. To evaluate EMPBS, we developed a Duplicate Generation tool (DupGen) that accepts a clean semi-structured file as input and generates labeled duplicate records according to certain criteria.
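
The three phases and the Reduction Ratio can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not say how the single keys are chosen, which keys are paired, or exactly how the union is formed, so everything here is an assumption made for illustration: the toy records, the prefix-based single keys, the choice of compound keys, and the reading of "union" as the set union of the candidate pairs produced by two compound keys. RR is computed with the usual definition, 1 minus the number of candidate pairs divided by the number of all possible pairs.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; field names and values are invented for illustration only.
records = [
    {"id": 1, "first": "john",  "last": "smith", "city": "cairo"},
    {"id": 2, "first": "jon",   "last": "smith", "city": "cairo"},
    {"id": 3, "first": "john",  "last": "smyth", "city": "giza"},
    {"id": 4, "first": "mary",  "last": "jones", "city": "giza"},
    {"id": 5, "first": "maria", "last": "jones", "city": "giza"},
    {"id": 6, "first": "john",  "last": "jones", "city": "cairo"},
]

# Phase 1 (assumed): three single blocking keys, here simple prefixes.
def key_first(rec):
    return rec["first"][:2]

def key_last(rec):
    return rec["last"][:2]

def key_city(rec):
    return rec["city"]

# Phase 2: a compound key is the concatenation of two single keys.
def compound(key_a, key_b):
    def key(rec):
        return key_a(rec) + "|" + key_b(rec)
    return key

def candidate_pairs(recs, key_fn):
    """Group records by key; return the set of within-block id pairs."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[key_fn(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def reduction_ratio(n_candidates, n_records):
    """RR = 1 - candidate pairs / all possible pairs (deduplication case)."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - n_candidates / total_pairs

first_last = compound(key_first, key_last)
last_city = compound(key_last, key_city)

# Baseline: one single blocking key.
single_pairs = candidate_pairs(records, key_first)

# Phase 3 (assumed reading): union of the pairs from two compound keys.
# A set keeps each pair once, so no pair is compared twice.
empbk_pairs = candidate_pairs(records, first_last) | candidate_pairs(records, last_city)

rr_single = reduction_ratio(len(single_pairs), len(records))
rr_empbk = reduction_ratio(len(empbk_pairs), len(records))
print(f"single key: {len(single_pairs)} pairs, RR = {rr_single:.3f}")
print(f"EMPBK:      {len(empbk_pairs)} pairs, RR = {rr_empbk:.3f}")
```

On this toy data the compound keys yield fewer candidate pairs than the single key, so RR rises. The size of the gain depends entirely on the data and the keys chosen, and the sketch says nothing about precision or recall.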

Cite

El-ghafar, R. M. A., El-Bastawissy, A. H., … Gheith, M. H. (2020). An Efficient Multi-Phase Blocking Strategy for Entity Resolution in Big Data. International Journal of Innovative Technology and Exploring Engineering, 9(9), 254–263. https://doi.org/10.35940/ijitee.i7070.079920
