SS-CDC: A two-stage parallel content-defined chunking for deduplicating backup storage

18 citations · 10 Mendeley readers

Abstract

Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its ability to detect and remove duplicate data in modified files. However, the CDC algorithm is compute-intensive and inherently sequential. Efforts to accelerate it by segmenting a file and running the algorithm independently on each segment in parallel come at the cost of a substantial degradation of the deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC that enables (almost) full parallelism in chunking a file without compromising the deduplication ratio. Further, SS-CDC exploits the instruction-level SIMD parallelism available in today's processors. As a case study, using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments with real-world datasets show that, compared to existing parallel CDC methods, which achieve at most a 7.7× speedup on an 8-core processor while degrading the deduplication ratio by up to 40%, SS-CDC achieves up to a 25.6× speedup with no loss of deduplication ratio.
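For readers unfamiliar with CDC, the following is a minimal sketch of the two-stage idea the abstract describes, assuming a Gear-style rolling hash. Every name and parameter below (the gear table, the boundary mask, the 2 KiB/64 KiB chunk-size limits, the segment size) is an illustrative assumption, not the paper's actual implementation. Note also that for per-segment candidates to match a sequential scan exactly, adjacent segments would have to overlap by the rolling-hash window size; that detail is omitted here for brevity.

import random

random.seed(0)  # fixed seed so the illustrative gear table is reproducible
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random values

MASK = (1 << 13) - 1    # boundary condition; ~8 KiB average chunks (assumed)
MIN_CHUNK = 2 * 1024    # assumed minimum chunk size
MAX_CHUNK = 64 * 1024   # assumed maximum chunk size
U64 = (1 << 64) - 1

def stage1_candidates(segment, base):
    """Stage 1: scan one segment and record every offset whose rolling hash
    satisfies the boundary condition. A pure function of segment content,
    so segments can be scanned fully in parallel (cores and/or SIMD lanes)."""
    h = 0
    candidates = []
    for i, byte in enumerate(segment):
        h = ((h << 1) + GEAR[byte]) & U64    # Gear-style rolling hash
        if h & MASK == 0:
            candidates.append(base + i + 1)  # cut point falls after this byte
    return candidates

def stage2_select(candidates, length):
    """Stage 2: one cheap sequential pass over the candidate offsets
    (not over the file bytes) that enforces min/max chunk sizes."""
    boundaries, last = [], 0
    for c in candidates:
        if c - last < MIN_CHUNK:
            continue                     # too close to the previous cut
        while c - last > MAX_CHUNK:
            last += MAX_CHUNK            # force a cut at the size cap
            boundaries.append(last)
        boundaries.append(c)
        last = c
    while length - last > MAX_CHUNK:     # cap the tail as well
        last += MAX_CHUNK
        boundaries.append(last)
    if last < length:
        boundaries.append(length)        # final, possibly short, chunk
    return boundaries

# Usage: stage 1 runs per segment (each iteration is independent and could
# be dispatched to a worker); stage 2 merges the candidates sequentially.
data = bytes(random.getrandbits(8) for _ in range(1 << 20))  # 1 MiB of noise
SEGMENT = 128 * 1024                     # assumed per-worker segment size
candidates = []
for off in range(0, len(data), SEGMENT):
    candidates.extend(stage1_candidates(data[off:off + SEGMENT], off))
chunks = stage2_select(candidates, len(data))

The property this sketch illustrates is that stage 1 depends only on segment content, so it parallelizes across cores and SIMD lanes, while stage 2 is a short sequential pass over candidate offsets rather than over the file's bytes, which is why the overall chunking can run (almost) fully in parallel without changing the chunk boundaries a sequential scan would produce.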

Citation (APA)

Ni, F., Lin, X., & Jiang, S. (2019). SS-CDC: A two-stage parallel content-defined chunking for deduplicating backup storage. In SYSTOR 2019 - Proceedings of the 12th ACM International Systems and Storage Conference (pp. 86–96). Association for Computing Machinery, Inc. https://doi.org/10.1145/3319647.3325834
