Entity resolution (ER), the problem of identifying and linking records that refer to the same real-world entity in structured and unstructured data, is a primary task in data integration. Accurate and efficient ER has a major practical impact on applications across commercial, security, and scientific domains. Scalable ER techniques have recently received enormous attention because of the increasing need to combine large-scale datasets. However, a shortage of training and ground-truth data impedes the development and testing of ER algorithms. Good public datasets, especially those containing personal information, are limited in this area and usually small. Because of privacy and confidentiality concerns, testing algorithms or techniques on real datasets is challenging in ER research. Simulation is one technique for generating synthetic datasets whose characteristics resemble those of real data, so that algorithms can be tested on them. Many existing ER simulation tools lack support for generating large-scale data and suffer from complexity, poor scalability, and the limitations of resampling. In this work, we propose a simple, inexpensive, and fast synthetic data generation tool. In its first stage, the tool generates only entity names, but these are commonly used as identification keys in ER algorithms. We avoid detailed simulation of entity names by using a simple vector representation, which delivers simplicity and efficiency. In this paper, we discuss how to simulate simple vectors that approximate the properties of entity names. We describe the overall construction of the tool, which is based on data analysis of a namespace containing entity names collected from a real environment.
Herath, S., Roughan, M., & Glonek, G. (2021). Generating Name-Like Vectors for Testing Large-Scale Entity Resolution. IEEE Access, 9, 145288–145300. https://doi.org/10.1109/ACCESS.2021.3122451