Temporal shingling for version identification in web archives

Abstract

Building and preserving archives of the evolving Web has been an important problem in research. Given the huge volume of content that is added or updated daily, identifying the right versions of pages to store in the archive is an important building block of any large-scale archival system. This paper presents temporal shingling, an extension of the well-established shingling technique for measuring how similar two snapshots of a page are. This novel method considers the lifespan of shingles to differentiate between important updates that should be archived and transient changes that may be ignored. Extensive experiments demonstrate the tradeoff between archive size and version coverage, and show that the novel method yields better archive coverage at smaller sizes than existing techniques. © 2010 Springer-Verlag Berlin Heidelberg.
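
The abstract names the technique but does not spell out the algorithm, so the following is only a minimal sketch under stated assumptions, not the paper's method. It shows classic w-shingling with Jaccard similarity for comparing two snapshots, plus one possible reading of "lifespan": a shingle counts as stable if it appears in at least min_lifespan consecutive snapshots. The function names, the whitespace tokenizer, the window size w, and the min_lifespan threshold are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, w: int = 3) -> Set[Shingle]:
    """Return the set of w-token shingles of a text (whitespace tokenization)."""
    tokens = text.split()
    if len(tokens) < w:
        # Short texts yield a single shingle holding all tokens (or nothing).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stable_shingles(snapshots: List[str], index: int,
                    min_lifespan: int = 2, w: int = 3) -> Set[Shingle]:
    """Shingles of snapshots[index] whose run of consecutive snapshots
    containing them (and including index) has length >= min_lifespan.

    Assumption: lifespan is measured in number of consecutive snapshots,
    which is an illustrative simplification.
    """
    sets = [shingles(s, w) for s in snapshots]
    stable: Set[Shingle] = set()
    for sh in sets[index]:
        start = index
        while start > 0 and sh in sets[start - 1]:
            start -= 1  # extend the run backwards
        end = index
        while end < len(sets) - 1 and sh in sets[end + 1]:
            end += 1  # extend the run forwards
        if end - start + 1 >= min_lifespan:
            stable.add(sh)
    return stable
```

Under these assumptions, an archiver could compare the stable shingle sets of two snapshots with jaccard and store a new version only when the similarity drops below a chosen threshold, so that short-lived content such as rotating banners does not trigger a new version.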

Cite

Schenkel, R. (2010). Temporal shingling for version identification in web archives. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5993 LNCS, pp. 508–519). Springer Verlag. https://doi.org/10.1007/978-3-642-12275-0_44
