For efficient chunking, we propose a Differential Evolution (DE)-based approach that optimizes Two Thresholds Two Divisors (TTTD-P) Content-Defined Chunking (CDC), reducing the number of computing operations by using a single dynamic optimal divisor D with an optimal threshold value, thereby exploiting the multi-operation nature of TTTD. To reduce chunk-size variance, the TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points; however, the additional divisor decreases chunking throughput. Asymmetric Extremum (AE) chunking significantly improves throughput by using the local extreme value in a variable-sized asymmetric window, which overcomes the boundary-shift problem of Rabin and TTTD while achieving a nearly identical deduplication ratio (DR). We therefore propose DE-based TTTD-P optimized chunking to maximize chunking throughput with an increased DR, together with a scalable bucket-indexing approach that reduces the hash-value lookup time for identifying and declaring redundant chunks by about 16 times compared with Rabin CDC, 5 times compared with AE CDC, and 1.6 times compared with FastCDC on the Hadoop Distributed File System (HDFS).
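The TTTD scheme described above can be sketched in a few lines: scan the data with a rolling hash, cut when the hash matches the main divisor D (after a minimum size T_min), remember the last position matching the easier backup divisor D′, and fall back to that backup position when the maximum size T_max is reached. This is a minimal illustrative sketch only; the rolling hash and the divisor/threshold values below are toy placeholders, not the paper's DE-optimized parameters.

```python
def tttd_chunks(data, t_min=460, t_max=2800, d=540, d_prime=270):
    """Return TTTD-style chunk boundaries (end offsets) for `data` (bytes)."""
    chunks = []
    start = 0        # start offset of the current chunk
    backup = -1      # last position matching the backup divisor D'
    h = 0            # toy rolling-hash state (not Rabin; illustrative only)
    i = start
    n = len(data)
    while i < n:
        h = (h * 31 + data[i]) & 0xFFFFFFFF
        size = i - start + 1
        if size >= t_min:                       # never cut below T_min
            if h % d_prime == d_prime - 1:      # backup divisor: easier match
                backup = i
            if h % d == d - 1:                  # main divisor: cut here
                chunks.append(i + 1)
                start, backup, h = i + 1, -1, 0
            elif size >= t_max:                 # hit T_max: use backup if any
                cut = backup + 1 if backup >= 0 else i + 1
                chunks.append(cut)
                start, backup, h = cut, -1, 0
                i = cut - 1                     # resume scanning after the cut
        i += 1
    if start < n:                               # trailing partial chunk
        chunks.append(n)
    return chunks
```

Every chunk except possibly the last then has a size between T_min and T_max; the backup divisor D′ keeps the forced-cut case from producing maximum-size chunks too often, which is exactly the variance reduction the abstract attributes to TTTD.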
Citation:
Kumar, N., Shobha, & Jain, S. C. (2019). Efficient data deduplication for big data storage systems. In Advances in Intelligent Systems and Computing (Vol. 714, pp. 351–371). Springer Verlag. https://doi.org/10.1007/978-981-13-0224-4_32