A Semi-Automated Record De-Duplication Technique for a Data Warehouse Environment


Abstract

The quality of record de-duplication is a key factor in the decision-making process. Correctly identifying duplicates in a dataset provides a strong foundation for inference. Blocking is a popular technique in de-duplication. In the traditional de-duplication process, the blocking key is chosen by a domain expert. In real-time systems, automating blocking key generation is a primary requirement. The objective of this paper is to generate blocking keys without any human intervention. The proposed Automated Token Formation (ATF) algorithm is a fully automated method for blocking key generation. For all the datasets tested (Cora, Restaurant, and FEBRL), the attributes shortlisted by ATF are almost the same as those selected manually. The tokens produced by ATF gave results about 20% worse than the manual tokens on the Cora dataset, while on the other two datasets the results matched those of the manual tokens. To improve the quality of the results, ATF is modified into the Semi-Automated Token Formation (SATF) algorithm, a semi-automated approach that requires training data. SATF outperforms both the manual tokens and the tokens produced by ATF.
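To make the role of a blocking key concrete, the following is a minimal Python sketch of generic key-based blocking. It is not the ATF or SATF algorithm, whose details the abstract does not give. The function names, the prefix-concatenation key, the `prefix_len` parameter, and the sample records are all assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, attributes, prefix_len=3):
    """Build a key by joining the lowercased prefix of each chosen attribute.

    Hypothetical key scheme for illustration; not the paper's token format.
    """
    return "".join(
        str(record.get(attr, "")).strip().lower()[:prefix_len]
        for attr in attributes
    )


def candidate_pairs(records, attributes, prefix_len=3):
    """Group records by blocking key and return index pairs within each block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec, attributes, prefix_len)].append(idx)

    pairs = set()
    for members in blocks.values():
        # Only records sharing a key are compared, which avoids
        # the full all-pairs comparison over the dataset.
        pairs.update(combinations(members, 2))
    return pairs


# Made-up sample records (not taken from Cora, Restaurant, or FEBRL).
records = [
    {"name": "Acme Diner", "city": "Springfield"},
    {"name": "ACME Dinner", "city": "Springfield"},
    {"name": "Blue Cafe", "city": "Shelbyville"},
]

pairs = candidate_pairs(records, ["name", "city"])
print(pairs)
```

In this sketch the attribute list passed to `candidate_pairs` stands in for the choice that a domain expert makes in the manual process; selecting those attributes automatically is the problem the paper addresses.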

Cite

Wangikar, V., Deshmukh, S., & Bhirud, S. (2020). A Semi-Automated Record De-Duplication Technique for a Data Warehouse Environment. International Journal of Innovative Technology and Exploring Engineering, 9(3), 2914–2920. https://doi.org/10.35940/ijitee.b6265.019320
