Overlapped hashing: A novel scalable blocking technique for entity resolution in big-data era


Abstract

Entity resolution is a critical process for enabling big data integration. It aims to identify records that refer to the same real-world entity within one or across several data sources. Over time, entity resolution has become increasingly challenging because of the continuous growth in data volume and variety. Blocking techniques have therefore been developed to address these limitations by partitioning datasets into “blocks” of records. This partitioning step allows the blocks to be processed in parallel, with entity resolution methods applied within each block individually. Current blocking techniques fall into two main categories: efficient or effective. The effective category comprises techniques that target the accuracy and quality of results. The efficient category, on the other hand, comprises techniques that are fast but report low accuracy. No existing technique has succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique to fill this gap, achieving high efficiency at no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. It is worth mentioning that canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. The proposed technique is named overlapped hashing. Extensive simulation studies conducted on a benchmark dataset demonstrate that both concepts can be combined in one technique while avoiding their drawbacks. The results show outstanding performance in terms of scalability, efficiency, and effectiveness, and promise a step forward in the entity resolution field.
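
To make the idea of blocking concrete, here is a minimal Python sketch of hash-style blocking in which a record may land in more than one block. It is not the authors' overlapped hashing algorithm, which the abstract does not specify. The key function `blocking_keys` (first three letters of each name token), the helper names, and the toy records are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(name):
    """Hypothetical key function: first three letters of each token, lowercased."""
    return {token[:3].lower() for token in name.split() if token}


def build_blocks(records):
    """Map each blocking key to the set of record ids that produce it.

    A record with several keys is placed in several blocks, so blocks overlap.
    """
    blocks = defaultdict(set)
    for record_id, name in records.items():
        for key in blocking_keys(name):
            blocks[key].add(record_id)
    return blocks


def candidate_pairs(blocks):
    """Collect record pairs that share at least one block.

    Only these pairs would be handed to a (more expensive) matching step.
    """
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Toy data, invented for this example.
records = {
    1: "John Smith",
    2: "Jon Smith",
    3: "Smith John",
    4: "Mary Jones",
}

blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
total_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs out of {total_pairs} possible")
```

Because each block is independent, the per-block comparisons could be distributed across workers. Letting records fall into several blocks trades some extra comparisons for a lower chance of separating true matches, such as records 1 and 3 here, whose tokens appear in a different order.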

Cite

Khalil, R., Shawish, A., & Elzanfaly, D. (2019). Overlapped hashing: A novel scalable blocking technique for entity resolution in big-data era. In Advances in Intelligent Systems and Computing (Vol. 858, pp. 427–441). Springer Verlag. https://doi.org/10.1007/978-3-030-01174-1_32
