Abstract
Data deduplication is widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often adopted in data deduplication systems for its ability to detect and remove duplicate data in modified files. However, the CDC algorithm is compute-intensive and inherently sequential. Efforts to accelerate it by segmenting a file and running the algorithm independently on each segment in parallel come at the cost of a substantial degradation of the deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC scheme that enables (almost) full parallelism in chunking a file without compromising the deduplication ratio. Further, SS-CDC exploits the instruction-level SIMD parallelism available in today's processors. As a case study, using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments with real-world datasets show that, compared to existing parallel CDC methods, which achieve at most a 7.7× speedup on an 8-core processor while degrading the deduplication ratio by up to 40%, SS-CDC achieves up to a 25.6× speedup with no loss of deduplication ratio.
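To make the chunking step concrete, below is a minimal illustrative sketch of sequential content-defined chunking using a Gear-like rolling hash, one common CDC variant. This is not the paper's SS-CDC algorithm; all names, the mask width, and the size limits here are assumptions chosen for illustration. A boundary is declared wherever the low bits of the rolling hash are zero, so chunk cut points depend on content rather than fixed offsets, which is what lets modified files still share unchanged chunks.

```python
# Illustrative CDC sketch (NOT the paper's SS-CDC): a Gear-like rolling
# hash declares a chunk boundary when the hash's low bits are all zero.
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table
MASK = (1 << 13) - 1          # boundary test mask -> ~8 KiB average chunks
U64 = 0xFFFFFFFFFFFFFFFF      # keep the hash in 64 bits

def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 65536):
    """Return a list of (offset, length) content-defined chunks."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & U64       # roll the hash forward one byte
        length = i - start + 1
        # Cut when the hash hits the boundary condition (after min_size),
        # or force a cut at max_size to bound chunk length.
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append((start, length))
            start, h = i + 1, 0              # reset hash at each boundary
    if start < len(data):
        chunks.append((start, len(data) - start))  # trailing partial chunk
    return chunks
```

Because the hash is reset at each boundary and the cut condition is content-dependent, the inherent serial dependency the abstract mentions is visible here: each boundary decision depends on where the previous boundary landed, which is why naive segment-parallel variants can pick different boundaries and lose deduplication ratio.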
Citation
Ni, F., Lin, X., & Jiang, S. (2019). SS-CDC: A two-stage parallel content-defined chunking for deduplicating backup storage. In SYSTOR 2019 - Proceedings of the 12th ACM International Systems and Storage Conference (pp. 86–96). Association for Computing Machinery, Inc. https://doi.org/10.1145/3319647.3325834