Deduplication technology has been increasingly used to reduce the storage cost. In practice, the duplicate detection upon large on-disk index incurs unavoidable and significant overheads in write operations. Most existing deduplication methods perform single-pass processing, while pay little attention to develop highly parallel methods for the emerging parallel processors. In this paper, we present the design of G-Paradex, a novel deduplication framework that can significantly reduce the duplicate detecting time. Utilizing a prefix tree to organize the chunk fingerprints, G-Paradex is able to do fast deduplicating by using GPU to search the target tree in parallel. Leveraging the inherent chunk locality in writing data stream, we group consecutive chunks and extract the handprints into the prefix tree, aiming at shrinking the index size and reducing the on-disk accesses. Our experimental evaluation based on real-world datasets demonstrate that, compared with the traditional single-pass method, G-aparadex achieves a speedup of 2-4X for duplicate detecting. © 2013 Springer-Verlag.
CITATION STYLE
Lin, B., Liao, X., Li, S., Wang, Y., Huang, H., & Wen, L. (2013). G-paradex: GPU-based parallel indexing for fast data deduplication. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8299 LNCS, pp. 91–103). https://doi.org/10.1007/978-3-642-45293-2_7
Mendeley helps you to discover research relevant for your work.