An efficient algorithm for de-duplication of demographic data


Abstract

This paper proposes an efficient algorithm for de-duplicating records based on demographic information that contains two name strings per individual, viz. GivenName and Surname. The algorithm consists of two stages: enrolment and de-duplication. In both stages, every name string is reduced to a generic name string using phonetic reduction rules. Several name strings may therefore share the same generic name, and many individuals may share the same name. A generic name, together with all the name strings that reduce to it and the Ids of the individuals bearing them, forms a bin. At the enrolment stage, the demographic database is built efficiently as an array of bins, each bin being a singly linked list. At the de-duplication stage, the query name strings are reduced, and all neighbouring bins of the reduced name strings are searched to determine the top k best matches. The algorithm is evaluated on a large demographic database of 485,136 individuals. The phonetic reduction rules are observed to reduce both name strings by more than 90%. Experimental results show a very high hit rate at a low penetration rate. © 2012 Springer-Verlag.

Cite

Kaushik, V. D., Bendale, A., Nigam, A., & Gupta, P. (2012). An efficient algorithm for de-duplication of demographic data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7389 LNCS, pp. 602–609). https://doi.org/10.1007/978-3-642-31588-6_77
