Efficient data deduplication for big data storage systems


Abstract

For efficient chunking, we propose a Differential Evolution (DE) based approach, TTTD-P, which optimizes Two Thresholds Two Divisors (TTTD) Content Defined Chunking (CDC) to reduce the number of computing operations by replacing TTTD's multiple divisor operations with a single dynamic, optimally tuned divisor D and an optimal threshold value. To reduce chunk-size variance, the TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points; however, the extra divisor decreases chunking throughput. Asymmetric Extremum (AE) chunking significantly improves throughput by using the local extreme value in a variable-sized asymmetric window, which overcomes the boundary-shift problem of Rabin and TTTD while achieving nearly the same deduplication ratio (DR). We therefore propose DE-based TTTD-P optimized chunking to maximize chunking throughput with an increased DR, together with a scalable bucket indexing approach that reduces the hash-value judgment time needed to identify and declare redundant chunks: about 16 times faster than Rabin CDC, 5 times faster than AE CDC, and 1.6 times faster than FastCDC on the Hadoop Distributed File System (HDFS).
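To make the chunking mechanics concrete, the sketch below illustrates TTTD-style content-defined chunking with two thresholds (T_min, T_max) and two divisors (D, D′). It is a minimal illustration under stated assumptions, not the paper's implementation: the byte-shift fingerprint stands in for a Rabin rolling hash, and the parameter defaults (`t_min`, `t_max`, `d`, `d_backup`) are hypothetical rather than the DE-optimized values the paper derives. Because D′ < D, the backup divisor fires more often, so a fallback cut point usually exists before a forced cut at T_max; that is the variance/throughput trade-off described above.

```python
def tttd_chunks(data: bytes,
                t_min: int = 2048,      # T_min: minimum chunk size (hypothetical)
                t_max: int = 16384,     # T_max: forced-cut size (hypothetical)
                d: int = 4096,          # primary divisor D
                d_backup: int = 1024):  # backup divisor D', matches more often
    """Split data into variable-sized chunks, TTTD style (toy sketch)."""
    chunks, start = [], 0
    while start < len(data):
        end = min(start + t_max, len(data))
        fp, backup, cut = 0, -1, end - 1   # default: forced cut at T_max
        for i in range(start, end):
            fp = ((fp << 1) + data[i]) & 0xFFFFFFFF  # toy stand-in for a Rabin rolling hash
            if i - start + 1 < t_min:
                continue                   # ignore cut points below T_min
            if fp % d_backup == d_backup - 1:
                backup = i                 # remember a D' fallback cut point
            if fp % d == d - 1:
                cut = i                    # primary divisor D fired: cut here
                break
        else:
            if backup >= 0:
                cut = backup               # forced cut: prefer the D' fallback
        chunks.append(data[start:cut + 1])
        start = cut + 1
    return chunks
```

Likewise, a minimal sketch of the bucket-indexing idea: fingerprints are grouped into buckets keyed by their leading bits, so a duplicate check scans one small bucket instead of the whole fingerprint table. The class name `BucketIndex`, the SHA-1 fingerprint, and the 8-bit bucket key are illustrative assumptions; the paper's scalable bucket index runs over HDFS, not this in-memory toy.

```python
import hashlib

class BucketIndex:
    """Toy bucket index: 2^bucket_bits buckets keyed by the fingerprint's
    leading bits, so lookups touch only one small bucket."""

    def __init__(self, bucket_bits: int = 8):   # 8 bits -> 256 buckets (assumed)
        self.shift = 16 - bucket_bits           # assumes bucket_bits <= 16
        self.buckets = [set() for _ in range(1 << bucket_bits)]

    def is_duplicate(self, chunk: bytes) -> bool:
        fp = hashlib.sha1(chunk).digest()       # chunk fingerprint
        bucket = self.buckets[int.from_bytes(fp[:2], "big") >> self.shift]
        if fp in bucket:
            return True                         # redundant chunk: store a reference only
        bucket.add(fp)
        return False                            # new chunk: store its data

# Example: chunk repetitive data, then keep only the unique chunks.
index = BucketIndex()
unique = [c for c in tttd_chunks(b"abc" * 100000) if not index.is_duplicate(c)]
```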

Citation (APA)

Kumar, N., Shobha, & Jain, S. C. (2019). Efficient data deduplication for big data storage systems. In Advances in Intelligent Systems and Computing (Vol. 714, pp. 351–371). Springer Singapore. https://doi.org/10.1007/978-981-13-0224-4_32
