What's changed? Measuring document change in web crawling for search engines

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute to poor search results, while unnecessary refreshing is expensive. However, some changes, such as those in images, advertisements, and headers, are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy. © Springer-Verlag Berlin Heidelberg 2003.
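
The "more than twenty words change" rule mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how changed words are counted, so the code below assumes a bag-of-words comparison (words added plus words removed, ignoring order and case) over text that has already been extracted from the page. The function names and the tokenizer are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters or digits, compared case-insensitively.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def word_counts(text):
    """Return a multiset (Counter) of the lowercased words in text."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def changed_word_count(old_text, new_text):
    """Count words removed from the old version plus words added in the new one.

    Assumption: word order is ignored, so a pure reordering counts as no change.
    """
    old, new = word_counts(old_text), word_counts(new_text)
    removed = old - new  # Counter subtraction keeps only positive counts
    added = new - old
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old_text, new_text, threshold=20):
    """Flag a document as changed when more than `threshold` words differ."""
    return changed_word_count(old_text, new_text) > threshold
```

Under these assumptions, a swapped advertisement line of a few words stays below the threshold, while a rewritten paragraph exceeds it. A real crawler would first need the stored and fetched versions of the page text, which is outside this sketch.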

Cite

Ali, H., & Williams, H. E. (2003). What’s changed? Measuring document change in web crawling for search engines. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2857, 28–42. https://doi.org/10.1007/978-3-540-39984-1_3
