Abstract
Record linkage is the process of identifying which records within or across databases refer to the same entity. Min-hash based Locality Sensitive Hashing (LSH) is commonly used in record linkage as a blocking technique to reduce the number of records to be compared. However, when applied to large databases, min-hash LSH can yield highly skewed block size distributions and many redundant record pair comparisons, only a few of which correspond to true matches (records that refer to the same entity). Furthermore, min-hash LSH is highly sensitive to its parameters and requires trial and error to determine the optimal trade-off between blocking quality and the efficiency of the record pair comparison step. In this paper, we present a novel method to improve the scalability and robustness of min-hash LSH for linking large population databases by exploiting temporal and spatial information available in personal data, and by filtering record pairs based on block sizes and min-hash similarity. Our evaluation on three real-world data sets shows that our method can improve the efficiency of record pair comparison by 75% to 99%, while the final average linkage precision can be improved by 28% at the cost of a 4% reduction in average recall.
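For readers unfamiliar with the baseline technique, the following is a minimal sketch of generic min-hash LSH blocking, not the method proposed in the paper. The function names, the use of character bigrams, the SHA-1 based hash family, and the `max_block_size` cut-off are all assumptions made for illustration; the paper's temporal and spatial constraints are not modelled here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def qgrams(value, q=2):
    """Return the set of character q-grams of a string (illustrative tokeniser)."""
    s = value.lower()
    grams = {s[i:i + q] for i in range(len(s) - q + 1)}
    return grams or {s}  # fall back to the whole string if it is shorter than q


def minhash_signature(tokens, num_hashes):
    """Compute one min-hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return tuple(signature)


def lsh_blocks(records, num_bands, rows_per_band):
    """Split each signature into bands; records sharing a band value share a block."""
    blocks = defaultdict(list)
    for rec_id, value in records.items():
        sig = minhash_signature(qgrams(value), num_bands * rows_per_band)
        for band in range(num_bands):
            start = band * rows_per_band
            blocks[(band, sig[start:start + rows_per_band])].append(rec_id)
    return blocks


def candidate_pairs(blocks, max_block_size=None):
    """Collect unique record pairs, optionally skipping oversized blocks.

    max_block_size is a hypothetical filter used only to illustrate how very
    large blocks inflate the number of comparisons.
    """
    pairs = set()
    for ids in blocks.values():
        if max_block_size is not None and len(ids) > max_block_size:
            continue
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records (invented for this example)
records = {"a": "john smith", "b": "john smith", "c": "zzzz qqqq"}
blocks = lsh_blocks(records, num_bands=4, rows_per_band=2)
pairs = candidate_pairs(blocks)
```

In this sketch, more bands with fewer rows each make blocking more permissive (more candidate pairs), while fewer bands with more rows make it stricter. That trade-off is the kind of parameter sensitivity the abstract refers to.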
Nanayakkara, C., & Christen, P. (2022). Locality Sensitive Hashing with Temporal and Spatial Constraints for Efficient Population Record Linkage. In International Conference on Information and Knowledge Management, Proceedings (pp. 4354–4358). Association for Computing Machinery. https://doi.org/10.1145/3511808.3557631