A Framework for Web Archiving and Guaranteed Retrieval


Abstract

As of today, ‘web.archive.org’ has more than 338 billion web pages archived. How many of those pages are 100% retrievable? How many pages were left out or ignored just because they had some compatibility issue? How many of them were in vernacular languages and encoded in different formats (before Unicode was standardized)? And that is only the content type text; consider the other MIME types, which are encoded and decoded with different algorithms. The root cause lies in the fundamental representation of digital data: a sequence of 0s and 1s does not make sense unless it is decoded properly. The browsers that could render a page properly at the time of archiving may since have become obsolete, or the browsers and their platforms may have been upgraded far beyond the point of recognizing old formats. We studied various works related to data preservation and web archiving, and propose a new framework that stores the exact client browser details (user-agent) in the WARC record and uses them to load the corresponding browser on the client side and render the archived content.
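
The idea in the last sentence of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the header name `X-Client-User-Agent`, the helper functions, and the browser labels are all hypothetical, chosen only to show how a user-agent could travel with a WARC-style record and later drive the choice of rendering browser.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

# Hypothetical extension field for this sketch; it is not defined by the WARC standard.
UA_FIELD = "X-Client-User-Agent"


def build_record(target_uri: str, payload: bytes, user_agent: str) -> bytes:
    """Serialize a minimal WARC-style 'resource' record that carries the user-agent."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        (UA_FIELD, user_agent),
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A blank line separates the header from the content block,
    # and two CRLFs terminate the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def read_user_agent(record: bytes) -> Optional[str]:
    """Return the stored user-agent, or None if the record does not carry one."""
    head, _, _ = record.partition(b"\r\n\r\n")
    for line in head.decode("utf-8").split("\r\n")[1:]:  # skip the version line
        name, _, value = line.partition(":")
        if name.strip().lower() == UA_FIELD.lower():
            return value.strip()
    return None


def pick_browser(user_agent: str) -> str:
    """Map a user-agent string to a coarse browser family (illustrative labels only)."""
    if "MSIE" in user_agent or "Trident" in user_agent:
        return "internet-explorer"
    if "Firefox" in user_agent:
        return "firefox"
    if "Chrome" in user_agent:  # checked before Safari: Chrome UAs also contain "Safari"
        return "chrome"
    if "Safari" in user_agent:
        return "safari"
    return "unknown"


if __name__ == "__main__":
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    record = build_record("http://example.com/", b"<html></html>", ua)
    print(pick_browser(read_user_agent(record) or ""))
```

At replay time, a client following this sketch would read the stored user-agent first and only then decide which browser (or emulated browser) to launch for rendering; how that browser is obtained and run is outside the scope of the example.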

Cite

Devendran, A., & Arunkumar, K. (2020). A Framework for Web Archiving and Guaranteed Retrieval. In Advances in Intelligent Systems and Computing (Vol. 1016, pp. 205–215). Springer. https://doi.org/10.1007/978-981-13-9364-8_16
