Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. Current approaches to URL ordering based on link structure are expensive and/or miss many good pages, particularly in social network environments. In this paper, we present a novel URL ordering system that relies on a cooperative approach between crawlers and web servers based on file system and Web log information. In particular, we develop algorithms based on file timestamps and on internal and external request counts from Web logs. By using this change and popularity information for URL ordering, we are able to retrieve high-quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available. We perform our experiments on two data sets using the Web logs from university and CiteSeer websites. On these data sets, we achieve statistically significant improvements of 57.2% and 65.7% in the ordering of high-quality pages (as indicated by Google's PageRank) over that of a breadth-first search crawl, while increasing the number of unique pages gathered by skipping unchanged or deleted pages. © 2012 Springer-Verlag Berlin Heidelberg.
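The abstract's core idea, combining change information (file timestamps) with popularity information (internal and external request counts from Web logs) to prioritize URLs, can be illustrated with a minimal sketch. This is not the paper's actual algorithm; the scoring function, the external-hit weight, and the freshness window are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class UrlInfo:
    url: str
    mtime: float        # file timestamp, seconds since epoch (change information)
    internal_hits: int  # requests referred from the site's own pages (Web log)
    external_hits: int  # requests referred from outside the site (Web log)

def crawl_order(urls, now, max_age=86_400 * 30):
    """Sort URLs so that recently changed, popular pages are crawled first.

    A real cooperative crawler could skip pages whose timestamps predate
    the last crawl entirely; this sketch only demotes them to the end.
    """
    def score(u):
        # Linearly decaying freshness: 1.0 for a just-modified page,
        # 0.0 for anything older than max_age (an assumed 30-day window).
        freshness = max(0.0, 1.0 - (now - u.mtime) / max_age)
        # Popularity from the Web log; the 2x weight on external
        # requests is an illustrative assumption, not from the paper.
        popularity = u.internal_hits + 2 * u.external_hits
        return freshness * popularity
    return [u.url for u in sorted(urls, key=score, reverse=True)]
```

For example, a page untouched for years scores zero freshness and falls to the back of the queue regardless of its log counts, while a recently modified, frequently requested page is fetched early in the crawl.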
Chandramouli, A., Gauch, S., & Eno, J. (2012). A cooperative approach to web crawler URL ordering. Advances in Intelligent and Soft Computing, 98, 343–357. https://doi.org/10.1007/978-3-642-23187-2_22