Web archive profiling through fulltext search

Sawood Alam; Michael L. Nelson; Herbert Van de Sompel; David S.H. Rosenthal

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016) 9819 LNCS 121-132

DOI: 10.1007/978-3-319-43997-6_10

7 Citations

4 Readers

Abstract

An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from the CDX files (the indexes of web archives), an approach we explored in an earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering the holdings of an archive involve sampling-based approaches, such as fulltext keyword searching to learn the URIs present in the responses, or looking up a sample set of URIs to see which of those are present in the archive. Fulltext-search-based discovery and profiling is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make routing decisions for 80% of the requests correctly while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
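The abstract does not say how the RSM chooses its queries, so the following is only a minimal sketch of one way a random search walk over a fulltext index could discover holdings. Everything in it is an assumption made for illustration: the toy `ARCHIVE` index, the `search` helper, the example URIs, and the policy of drawing the next keyword at random from words seen in earlier results. It is not the authors' implementation.

```python
import random

# Toy stand-in for an archive's fulltext index: URI -> page text.
# All URIs and texts are invented for illustration.
ARCHIVE = {
    "http://example.com/a": "web archive memento routing",
    "http://example.com/b": "archive profile summary holdings",
    "http://example.org/c": "random search walk holdings",
    "http://example.org/d": "memento query routing profile",
    "http://example.net/e": "fulltext keyword search walk",
}


def search(keyword):
    """Hypothetical fulltext search: (uri, text) pairs whose text contains the keyword."""
    return [(uri, text) for uri, text in ARCHIVE.items() if keyword in text.split()]


def random_search_walk(seed_word, max_queries, rng):
    """Discover URIs by repeatedly searching for a randomly chosen, not yet used word.

    Returns the set of discovered URIs and the number of queries issued,
    which serves here as a simple proxy for search cost.
    """
    discovered = set()
    vocabulary = {seed_word}  # words available as future queries
    used = set()              # words already queried
    queries = 0
    while queries < max_queries:
        candidates = sorted(vocabulary - used)
        if not candidates:
            break  # nothing new to ask; the walk has stalled
        word = rng.choice(candidates)
        used.add(word)
        queries += 1
        for uri, text in search(word):
            discovered.add(uri)
            vocabulary.update(text.split())  # harvest words for later queries
    return discovered, queries


if __name__ == "__main__":
    found, cost = random_search_walk("archive", max_queries=3, rng=random.Random(0))
    print(f"discovered {len(found)} of {len(ARCHIVE)} URIs using {cost} queries")
```

Capping `max_queries` mimics stopping the walk once a chosen share of the holdings has been discovered; the discovered URIs could then be summarized into a profile.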

Cite

Alam, S., Nelson, M. L., Van de Sompel, H., & Rosenthal, D. S. H. (2016). Web archive profiling through fulltext search. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9819 LNCS, pp. 121–132). Springer Verlag. https://doi.org/10.1007/978-3-319-43997-6_10