Abstract
The presence of semantic duplicates poses a challenge for quality management of large datasets such as medical databases and recommendation systems. The huge number of duplicates in large databases necessitates deduplication, a capacity-optimization technique used to dramatically enhance storage efficiency. This requires identifying the copies with an approach robust enough to find as many duplicates as possible, yet efficient enough to run in reasonable time. A similarity-based data deduplication scheme is proposed that combines Content Defined Chunking (CDC) with a Bloom filter. These methods look inside the files to determine which portions of the data are duplicates, yielding better storage space savings. The Bloom filter is a probabilistic data structure used mainly to reduce search time. To enhance the performance of the system, Locality Sensitive Hashing (LSH) and Word2Vec are also employed; these two techniques identify semantic similarity between chunks. Within LSH, the Levenshtein distance algorithm measures the similarity between chunks in the repository. Deduplication based on semantic similarity checking improves storage utilization and effectively reduces computation overhead.
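To make the pipeline concrete, the sketch below (not taken from the paper) illustrates three of the components the abstract names: a simple content-defined chunker using a byte-wise rolling hash, a Bloom filter for fast membership tests on chunk fingerprints, and a Levenshtein-based near-duplicate check. The boundary mask, Bloom filter size, hash count, and 0.9 similarity threshold are illustrative assumptions; the Word2Vec and LSH components would require additional models and are omitted here.

```python
import hashlib

# Content-defined chunking (illustrative): a byte-wise rolling sum stands in
# for a production Rabin fingerprint; boundary parameters are assumptions.
def cdc_chunks(data: bytes, mask: int = 0x3F, min_len: int = 16, max_len: int = 256):
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_len and (rolling & mask) == 0) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Minimal Bloom filter: k hash positions derived from salted SHA-256 digests.
class BloomFilter:
    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for salt in range(self.k):
            digest = hashlib.sha256(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Levenshtein distance turned into a similarity ratio for near-duplicate
# detection between chunks; the 0.9 threshold is an assumption.
def levenshtein(a: bytes, b: bytes) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_near_duplicate(a: bytes, b: bytes, threshold: float = 0.9) -> bool:
    dist = levenshtein(a, b)
    return 1 - dist / max(len(a), len(b), 1) >= threshold

# Usage sketch: store a chunk only if the Bloom filter has not seen its
# fingerprint and no already-stored chunk is a near duplicate.
store, bloom = [], BloomFilter()
for chunk in cdc_chunks(b"example data " * 50):
    fp = hashlib.sha256(chunk).digest()
    if bloom.might_contain(fp):                                  # probable exact duplicate
        continue
    if any(is_near_duplicate(chunk, kept) for kept in store):    # near-duplicate match
        continue
    bloom.add(fp)
    store.append(chunk)
```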
CITATION STYLE
Anju, K. S., Sadhik, M. S., & Varghese, S. M. (2019). Semantic deduplication in databases. International Journal of Innovative Technology and Exploring Engineering, 8(6), 581–585.