An efficient algorithm for de-duplication of demographic data

Vandana Dixit Kaushik; Amit Bendale; Aditya Nigam; Phalguni Gupta

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7389 LNCS 602-609

DOI: 10.1007/978-3-642-31588-6_77

Abstract

This paper proposes an efficient algorithm for de-duplicating individuals based on demographic information consisting of two name strings, viz. GivenName and Surname. The algorithm has two stages: enrolment and de-duplication. In both stages, all name strings are reduced to generic name strings with the help of phonetic reduction rules. Several name strings may therefore share the same generic name, and many individuals may share the same name. A generic name, together with all of its name strings and their Ids, forms a bin. At the enrolment stage, a database of demographic information is created efficiently as an array of bins, where each bin is a singly linked list. At the de-duplication stage, the query name strings are reduced and all neighbouring bins of the reduced name strings are searched to determine the top k best matches. To evaluate the performance of the proposed algorithm, a large demographic database of 4,85,136 individuals is considered. The phonetic reduction rules are observed to reduce both name strings by more than 90%. Experimental results show a very high hit rate at a low penetration rate. © 2012 Springer-Verlag.
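The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the phonetic reduction rules, the definition of a "neighbouring" bin, or the scoring used for the top k matches, so the rules in `RULES`, the `threshold` parameter, and the use of `difflib.SequenceMatcher` are all assumptions made for illustration. A dictionary of Python lists stands in for the array of singly linked lists, and only a single name string is handled rather than the GivenName and Surname pair.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical reduction rules; the paper's actual rules are not given in the abstract.
RULES = [
    (r"ph", "f"),          # treat "ph" as "f"
    (r"[aeiou]+", "a"),    # collapse every run of vowels to a single "a"
    (r"(.)\1+", r"\1"),    # collapse repeated letters
]


def reduce_name(name):
    """Reduce a name string to a generic name string (assumed rules)."""
    s = re.sub(r"[^a-z]", "", name.lower())
    for pattern, replacement in RULES:
        s = re.sub(pattern, replacement, s)
    return s


class NameIndex:
    """Bins keyed by generic name; each bin holds (name string, Id) pairs."""

    def __init__(self):
        self.bins = defaultdict(list)

    def enrol(self, person_id, name):
        # Enrolment stage: reduce the name and append it to its bin.
        self.bins[reduce_name(name)].append((name, person_id))

    def neighbouring_bins(self, generic, threshold=0.8):
        # Assumption: a bin is a neighbour if its generic name is similar enough.
        # This linear scan over all bins is for clarity only.
        return [g for g in self.bins
                if SequenceMatcher(None, generic, g).ratio() >= threshold]

    def top_k(self, name, k=3):
        # De-duplication stage: reduce the query, gather candidates from
        # neighbouring bins, and rank them by similarity to the query.
        generic = reduce_name(name)
        candidates = []
        for g in self.neighbouring_bins(generic):
            for original, person_id in self.bins[g]:
                score = SequenceMatcher(None, name.lower(), original.lower()).ratio()
                candidates.append((score, person_id, original))
        candidates.sort(key=lambda c: (-c[0], c[1]))
        return candidates[:k]


index = NameIndex()
for pid, surname in enumerate(["Gupta", "Goopta", "Nigam", "Kaushik"], start=1):
    index.enrol(pid, surname)
matches = index.top_k("Guptaa", k=2)  # list of (score, Id, name string)
```

In this sketch only the bins near the query's generic name are scored, which is the idea behind a low penetration rate: most enrolled records are never compared against the query.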

Cite

Kaushik, V. D., Bendale, A., Nigam, A., & Gupta, P. (2012). An efficient algorithm for de-duplication of demographic data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7389 LNCS, pp. 602–609). https://doi.org/10.1007/978-3-642-31588-6_77
