Semantic deduplication in databases

ISSN: 2278-3075

Abstract

Semantic duplicates pose a challenge for the quality management of large datasets such as medical databases and recommendation systems, and the sheer number of duplicates in such databases makes deduplication necessary. Deduplication is a capacity-optimization technique used to substantially improve storage efficiency. It requires identifying copies with an approach that is robust enough to find as many duplicates as possible yet efficient enough to run in reasonable time. A similarity-based data deduplication scheme is proposed that combines Content Defined Chunking (CDC) with a Bloom filter. These methods look inside the files to determine which portions of the data are duplicates, yielding better storage space savings. The Bloom filter is a probabilistic data structure used mainly to reduce search time. To further enhance performance, Locality Sensitive Hashing (LSH) and Word2Vec are used to identify semantic similarity between chunks; within LSH, the Levenshtein distance algorithm measures the similarity between chunks in the repository. Deduplication based on semantic similarity checking improves storage utilization and effectively reduces computation overhead.
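For a concrete sense of how such a pipeline fits together, the sketch below is a minimal illustration, not the authors' implementation: the chunking rule, parameters, and function names are assumptions, and the Word2Vec/LSH semantic stage is omitted for brevity. It combines content-defined chunking, a Bloom filter for fast exact-duplicate checks, and Levenshtein distance for near-duplicate detection.

import hashlib

class BloomFilter:
    """Probabilistic membership test: no false negatives, tunable false-positive rate."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, data):
        # Derive num_hashes independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, data):
        for pos in self._positions(data):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, data):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(data))

def cdc_chunks(data, mask=0x1FFF, min_size=2048, max_size=16384):
    """Content-defined chunking (simplified): cut a chunk when a rolling hash
    matches the mask, so boundaries follow content rather than fixed offsets.
    Mask and size bounds here are illustrative, not the paper's settings."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) ^ byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def levenshtein(a, b):
    """Edit distance between two chunks via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def deduplicate(data, store, bloom, sim_threshold=0.9):
    """Keep only chunks that are neither exact nor near duplicates of stored ones."""
    kept = 0
    for chunk in cdc_chunks(data):
        fp = hashlib.sha256(chunk).digest()
        # Bloom filter answers "definitely new" cheaply; only hits fall through
        # to the exact store lookup.
        if bloom.might_contain(fp) and fp in store:
            continue  # exact duplicate already stored
        # Near-duplicate check against stored chunks of comparable length.
        near_dup = any(
            1 - levenshtein(chunk, old) / max(len(chunk), len(old)) >= sim_threshold
            for old in store.values()
            if abs(len(old) - len(chunk)) <= 0.1 * len(chunk)
        )
        if not near_dup:
            bloom.add(fp)
            store[fp] = chunk
            kept += 1
    return kept

In a full system the near-duplicate scan would be restricted to candidate chunks retrieved from LSH buckets (with Word2Vec supplying the semantic representation) rather than a linear pass over the store, which is the role the paper assigns to its LSH stage.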

Citation (APA)

Anju, K. S., Sadhik, M. S., & Varghese, S. M. (2019). Semantic deduplication in databases. International Journal of Innovative Technology and Exploring Engineering, 8(6), 581–585.
