For efficient chunking, we propose a Differential Evolution (DE)-based approach that optimizes Two Thresholds Two Divisors (TTTD-P) Content-Defined Chunking (CDC), reducing the number of computing operations by using a single dynamic optimal divisor D with an optimal threshold value, thereby exploiting the multi-operation nature of TTTD. To reduce chunk-size variance, the TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points; however, the additional divisor decreases chunking throughput. Asymmetric Extremum (AE) chunking significantly improves throughput by using the local extreme value in a variable-sized asymmetric window, which overcomes the boundary-shift problem of Rabin and TTTD while achieving a nearly identical deduplication ratio (DR). We therefore propose DE-based TTTD-P optimized chunking to maximize chunking throughput with an increased DR, together with a scalable bucket-indexing approach that reduces the hash-value lookup time for identifying and declaring redundant chunks by about 16 times compared with Rabin CDC, 5 times compared with AE CDC, and 1.6 times compared with FastCDC on the Hadoop Distributed File System (HDFS).
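The TTTD scheme described above can be sketched in a few lines: scan the data with a rolling hash, cut when the hash matches the main divisor D (after a minimum size T_min), remember the last position matching the easier backup divisor D′, and fall back to that backup position when the maximum size T_max is reached. This is a minimal illustrative sketch only; the rolling hash and the divisor/threshold values below are toy placeholders, not the paper's DE-optimized parameters.

```python
def tttd_chunks(data, t_min=460, t_max=2800, d=540, d_prime=270):
    """Return TTTD-style chunk boundaries (end offsets) for `data` (bytes)."""
    chunks = []
    start = 0        # start offset of the current chunk
    backup = -1      # last position matching the backup divisor D'
    h = 0            # toy rolling-hash state (not Rabin; illustrative only)
    i = start
    n = len(data)
    while i < n:
        h = (h * 31 + data[i]) & 0xFFFFFFFF
        size = i - start + 1
        if size >= t_min:                       # never cut below T_min
            if h % d_prime == d_prime - 1:      # backup divisor: easier match
                backup = i
            if h % d == d - 1:                  # main divisor: cut here
                chunks.append(i + 1)
                start, backup, h = i + 1, -1, 0
            elif size >= t_max:                 # hit T_max: use backup if any
                cut = backup + 1 if backup >= 0 else i + 1
                chunks.append(cut)
                start, backup, h = cut, -1, 0
                i = cut - 1                     # resume scanning after the cut
        i += 1
    if start < n:                               # trailing partial chunk
        chunks.append(n)
    return chunks
```

Every chunk except possibly the last then has a size between T_min and T_max; the backup divisor D′ keeps the forced-cut case from producing maximum-size chunks too often, which is exactly the variance reduction the abstract attributes to TTTD.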
Citation:
Kumar, N., Shobha, & Jain, S. C. (2019). Efficient data deduplication for big data storage systems. In Advances in Intelligent Systems and Computing (Vol. 714, pp. 351–371). Springer Verlag. https://doi.org/10.1007/978-981-13-0224-4_32