What's changed? Measuring document change in web crawling for search engines

Halil Ali; Hugh E. Williams

Journal Article

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2003) 2857 28-42

DOI: 10.1007/978-3-540-39984-1_3

Abstract

To provide fast, scalable search facilities, web search engines store collections locally. The collections are gathered by crawling the Web. A problem with crawling is determining when to revisit resources because they have changed: stale documents contribute towards poor search results, while unnecessary refreshing is expensive. However, some changes (such as those in images, advertisements, and headers) are unlikely to affect query results. In this paper, we investigate measures for determining whether documents have changed and should be recrawled. We show that content-based measures are more effective than the traditional approach of using HTTP headers. Refreshing based on HTTP headers typically recrawls 16% of the collection each day, but users do not retrieve the majority of refreshed documents. In contrast, refreshing documents when more than twenty words change recrawls 22% of the collection but updates documents more effectively. We conclude that our simple measures are an effective component of a web crawling strategy. © Springer-Verlag Berlin Heidelberg 2003.
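
The abstract's "more than twenty words change" criterion can be illustrated with a minimal sketch. The abstract does not say how the paper tokenizes documents or counts changed words, so this example assumes lowercase whitespace tokenization and a bag-of-words difference; the names `words_changed`, `should_recrawl`, and `threshold` are hypothetical and not taken from the paper.

```python
from collections import Counter


def words_changed(old_text: str, new_text: str) -> int:
    """Count words added plus words removed between two document versions.

    Assumption: words are lowercase, whitespace-separated tokens, and
    word order is ignored (bag-of-words comparison).
    """
    old_counts = Counter(old_text.lower().split())
    new_counts = Counter(new_text.lower().split())
    removed = old_counts - new_counts  # words lost in the new version
    added = new_counts - old_counts    # words gained in the new version
    return sum(removed.values()) + sum(added.values())


def should_recrawl(old_text: str, new_text: str, threshold: int = 20) -> bool:
    """Return True when more than `threshold` words differ."""
    return words_changed(old_text, new_text) > threshold
```

Under these assumptions, a page whose only difference is a rotated advertisement of a few words stays below the threshold and is not refreshed, while a substantially rewritten page is.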

Citation (APA)

Ali, H., & Williams, H. E. (2003). What’s changed? Measuring document change in web crawling for search engines. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2857, 28–42. https://doi.org/10.1007/978-3-540-39984-1_3
