Theoretical limits of microclustering for record linkage

Abstract

There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine-scale entity resolution is inaccurate.

Citation (APA)

Johndrow, J. E., Lum, K., & Dunson, D. B. (2018). Theoretical limits of microclustering for record linkage. Biometrika, 105(2), 431–446. https://doi.org/10.1093/biomet/asy003
