Exploring hybrid parallel systems for probabilistic record linkage

Abstract

Record linkage is a technique widely used to bring together records stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on whether common key attributes exist across all the data sources involved. The probabilistic approach is very time-consuming because of the number of records that must be compared, especially in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present algorithmic optimizations that provide high accuracy and improve performance by selecting the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
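
To make the cost of the probabilistic approach concrete, the following is a minimal sequential sketch in Python, not the authors' implementation. It assumes a Dice coefficient over character bigrams as the similarity measure, which the abstract does not specify, and the names `bigrams`, `dice`, `link`, the field names, and the 0.8 threshold are all hypothetical. The nested comparison of every record on one side with every record on the other is the quadratic workload that a multicore and multi-GPU system would parallelize.

```python
from itertools import product


def bigrams(text):
    """Return the set of character bigrams of a lower-cased, trimmed string."""
    s = text.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient of two strings based on shared bigrams (assumed measure)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def link(left, right, fields, threshold=0.8):
    """Compare every left record with every right record.

    The cost is len(left) * len(right) comparisons, which is why the
    probabilistic approach becomes expensive for large inputs.
    """
    matches = []
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        # Average the per-field similarity; the threshold is a hypothetical cut-off.
        score = sum(dice(a[f], b[f]) for f in fields) / len(fields)
        if score >= threshold:
            matches.append((i, j, round(score, 3)))
    return matches


# Illustrative records only; they do not come from the paper's data.
left = [{"name": "Maria Silva", "city": "Salvador"}]
right = [
    {"name": "Maria Sliva", "city": "Salvador"},
    {"name": "Joao Souza", "city": "Recife"},
]
matches = link(left, right, fields=["name", "city"])
```

Here the misspelled "Maria Sliva" is still paired with "Maria Silva" because no exact key is required, which is the situation where deterministic linkage would fail.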

Cite

Boratto, M., Alonso, P., Pinto, C., Melo, P., Barreto, M., & Denaxas, S. (2019). Exploring hybrid parallel systems for probabilistic record linkage. Journal of Supercomputing, 75(3), 1137–1149. https://doi.org/10.1007/s11227-018-2328-3