Abstract
The quality of record de-duplication is a key factor in the decision-making process. Correct identification of duplicates in a dataset provides a strong foundation for inference. Blocking is a popular technique in de-duplication. In the traditional de-duplication process, the blocking key is chosen by a domain expert. In real-time systems, automating blocking key generation is a primary requirement. The objective of this paper is blocking key generation without any human intervention. The proposed Automated Token Formation (ATF) algorithm generates blocking keys in a fully automated way. For all datasets used in the experiments (Cora, Restaurant, and FEBRL), the attributes shortlisted by ATF are nearly the same as those chosen manually. The tokens produced by ATF give results 20% poorer than manual tokens on the Cora dataset, while on the other two datasets the results match those of manual tokens. To improve result quality, ATF is modified into the Semi-Automated Token Formation (SATF) algorithm, a semi-automated approach that requires training data. SATF performs better than both the manual tokens and the tokens produced by ATF.
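To make the role of a blocking key concrete, the following is a minimal sketch of generic blocking, not the ATF or SATF algorithm from the paper. The field names (`surname`, `city`), the three-character prefix rule, and the helper names `blocking_key` and `candidate_pairs` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, fields=("surname", "city"), prefix_len=3):
    """Build a key from the lowercase prefix of each chosen field.

    The fields and prefix length are hypothetical; in a manual setup a
    domain expert would pick them.
    """
    return "".join(
        str(record.get(f, "")).strip().lower()[:prefix_len] for f in fields
    )


def candidate_pairs(records, key_fn=blocking_key):
    """Group records by blocking key and pair records only within a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)

    pairs = []
    for ids in blocks.values():
        # Only records sharing a key are compared with each other.
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


records = [
    {"surname": "Smith", "city": "Boston"},
    {"surname": "Smithe", "city": "Boston"},
    {"surname": "Jones", "city": "Austin"},
    {"surname": "smith", "city": "Bostan"},
]

pairs = candidate_pairs(records)
print(pairs)
```

With these four toy records, only the records that share a key are paired, so fewer comparisons are needed than the six an exhaustive pairwise scan would require. A poorly chosen key would instead separate true duplicates into different blocks, which is why key selection matters.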
Cite
Wangikar, V., Deshmukh, S., & Bhirud, S. (2020). A Semi-Automated Record De-Duplication Technique for a Data Warehouse Environment. International Journal of Innovative Technology and Exploring Engineering, 9(3), 2914–2920. https://doi.org/10.35940/ijitee.b6265.019320