Fast Semantic Duplicate Detection Techniques in Databases

Ibrahim Moukouop Nguena; Amolo-Makama Ophélie Carmen Richeline

Journal Article, Open Access

Journal of Software Engineering and Applications (2017) 10(06) 529-545

DOI: 10.4236/jsea.2017.106029

Abstract

Semantic duplicates in databases are today an important data quality challenge that leads to bad decisions. Large databases sometimes contain tens of thousands of duplicates, which makes automatic deduplication necessary. This requires a detection method reliable enough to find as many duplicates as possible and efficient enough to run in a reasonable time. This paper proposes and compares, on real data, effective duplicate detection methods for the automatic deduplication of files based on names, working with French or English texts and with the names of people or places, in Africa or in the West. After establishing a more complete classification of semantic duplicates than the usual classifications, we introduce several methods for detecting duplicates whose observed average complexity is less than O(2n). Through a simple model, we highlight a global efficacy rate that combines precision and recall. We propose a new metric distance between records, as well as rules for automatic duplicate detection. Analyses conducted on a database containing real data for an administration in Central Africa, and on a known standard database containing names of restaurants in the USA, have shown better results than those of known methods, with lower complexity.

Cite

Nguena, I. M., & Richeline, A.-M. O. C. (2017). Fast Semantic Duplicate Detection Techniques in Databases. Journal of Software Engineering and Applications, 10(06), 529–545. https://doi.org/10.4236/jsea.2017.106029
