Abstract
The presence of semantic duplicates poses a challenge for quality management of large datasets such as medical databases and recommendation systems. The huge number of duplicates in large databases necessitates deduplication, a capacity-optimization technique used to dramatically enhance storage efficiency. This requires identifying the copies with an approach robust enough to find as many duplicates as possible, yet efficient enough to run in reasonable time. A similarity-based data deduplication scheme is proposed that combines Content Defined Chunking (CDC) with a Bloom filter. These methods look inside the files to determine which portions of the data are duplicates, yielding better storage space savings. The Bloom filter is a probabilistic data structure used mainly to reduce search time. To enhance the performance of the system, Locality Sensitive Hashing (LSH) and Word2Vec are also employed; these two techniques identify semantic similarity between chunks. Within LSH, the Levenshtein distance algorithm measures the similarity between chunks in the repository. Deduplication based on semantic similarity checking improves storage utilization and effectively reduces computation overhead.
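To make the pipeline concrete, the sketch below (not taken from the paper) illustrates three of the components the abstract names: a simple content-defined chunker using a byte-wise rolling hash, a Bloom filter for fast membership tests on chunk fingerprints, and a Levenshtein-based near-duplicate check. The boundary mask, Bloom filter size, hash count, and 0.9 similarity threshold are illustrative assumptions; the Word2Vec and LSH components would require additional models and are omitted here.

```python
import hashlib

# Content-defined chunking (illustrative): a byte-wise rolling sum stands in
# for a production Rabin fingerprint; boundary parameters are assumptions.
def cdc_chunks(data: bytes, mask: int = 0x3F, min_len: int = 16, max_len: int = 256):
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_len and (rolling & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Minimal Bloom filter: k hash positions derived from salted SHA-256 digests.
class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for salt in range(self.k):
            digest = hashlib.sha256(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Levenshtein distance turned into a similarity ratio for near-duplicate
# detection between chunks; the 0.9 threshold is an assumption.
def levenshtein(a: bytes, b: bytes) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_near_duplicate(a: bytes, b: bytes, threshold: float = 0.9) -> bool:
    dist = levenshtein(a, b)
    return 1 - dist / max(len(a), len(b), 1) >= threshold

# Usage sketch: store a chunk only if the Bloom filter has not seen its
# fingerprint and no already-stored chunk is a near duplicate.
store, bloom = [], BloomFilter()
for chunk in cdc_chunks(b"example data " * 50):
    fp = hashlib.sha256(chunk).digest()
    if bloom.might_contain(fp):                                  # probable exact duplicate
        continue
    if any(is_near_duplicate(chunk, kept) for kept in store):    # near-duplicate match
        continue
    bloom.add(fp)
    store.append(chunk)
```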
CITATION STYLE
Anju, K. S., Sadhik, M. S., & Varghese, S. M. (2019). Semantic deduplication in databases. International Journal of Innovative Technology and Exploring Engineering, 8(6), 581–585.