Web archive profiling through fulltext search

Abstract

An archive profile is a high-level summary of a web archive’s holdings that can be used to route Memento queries to the appropriate archives. It can be created by generating summaries from CDX files (the indexes of web archives), an approach we explored in earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering the holdings of an archive involve sampling-based approaches, such as fulltext keyword searching to learn the URIs present in the responses, or looking up a sample set of URIs to see which of them are present in the archive. Fulltext-search-based discovery and profiling is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make routing decisions for 80% of the requests correctly while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
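
To make the idea of a random search walk concrete, the following is a minimal Python sketch. It is not the RSM as defined in the paper: the `search` callable, the in-memory `ARCHIVE`, and the rule for choosing the next query word (uniformly at random from unused words seen in earlier results) are all assumptions made for illustration. The number of queries issued serves here as a stand-in for search cost.

```python
import random

def random_search_walk(search, seed_word, max_queries, rng=None):
    """Toy random search walk over a fulltext search interface.

    `search(word)` is a hypothetical callable returning (uri, text)
    pairs; it stands in for an archive's fulltext search API.
    Returns the set of discovered URIs and the number of queries issued.
    """
    rng = rng or random.Random()
    discovered = set()          # URIs learned from search responses
    vocabulary = {seed_word}    # words available as future queries
    used = set()                # words already queried
    word = seed_word
    for _ in range(max_queries):
        used.add(word)
        for uri, text in search(word):
            discovered.add(uri)
            vocabulary.update(text.lower().split())
        candidates = sorted(vocabulary - used)
        if not candidates:      # nothing new left to ask
            break
        word = rng.choice(candidates)
    return discovered, len(used)

# Hypothetical three-document archive used only for this illustration.
ARCHIVE = {
    "http://example.com/a": "web archive profile",
    "http://example.com/b": "profile routing memento",
    "http://example.com/c": "memento aggregator query",
}

def toy_search(word):
    """Return every (uri, text) whose text contains the query word."""
    return [(u, t) for u, t in ARCHIVE.items() if word in t.split()]

uris, cost = random_search_walk(toy_search, "web", max_queries=20,
                                rng=random.Random(0))
fraction = len(uris) / len(ARCHIVE)  # share of holdings discovered
```

In this toy setting the walk stops once every word it has seen has been queried. A real profiler would more likely stop at a query budget or a target share of holdings, which is the cost-versus-coverage trade-off the abstract describes.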

Cite

Alam, S., Nelson, M. L., Van de Sompel, H., & Rosenthal, D. S. H. (2016). Web archive profiling through fulltext search. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9819 LNCS, pp. 121–132). Springer Verlag. https://doi.org/10.1007/978-3-319-43997-6_10
