Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. Current approaches to URL ordering based on link structure are expensive and/or miss many good pages, particularly in social network environments. In this paper, we present a novel URL ordering system that relies on a cooperative approach between crawlers and web servers based on file system and Web log information. In particular, we develop algorithms based on file timestamps and on internal and external request counts from Web logs. By using this change and popularity information for URL ordering, we are able to retrieve high-quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available. We perform our experiments on two data sets using the Web logs from university and CiteSeer websites. On these data sets, we achieve statistically significant improvements of 57.2% and 65.7% in the ordering of high-quality pages (as indicated by Google's PageRank) over that of a breadth-first search crawl, while increasing the number of unique pages gathered by skipping unchanged or deleted pages. © 2012 Springer-Verlag Berlin Heidelberg.
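The abstract's core idea, combining change information (file timestamps) with popularity information (internal and external request counts from Web logs) to prioritize URLs, can be illustrated with a minimal sketch. This is not the paper's actual algorithm; the scoring function, the external-hit weight, and the freshness window are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class UrlInfo:
    url: str
    mtime: float        # file timestamp, seconds since epoch (change information)
    internal_hits: int  # requests referred from the site's own pages (Web log)
    external_hits: int  # requests referred from outside the site (Web log)

def crawl_order(urls, now, max_age=86_400 * 30):
    """Sort URLs so that recently changed, popular pages are crawled first.

    A real cooperative crawler could skip pages whose timestamps predate
    the last crawl entirely; this sketch only demotes them to the end.
    """
    def score(u):
        # Linearly decaying freshness: 1.0 for a just-modified page,
        # 0.0 for anything older than max_age (an assumed 30-day window).
        freshness = max(0.0, 1.0 - (now - u.mtime) / max_age)
        # Popularity from the Web log; the 2x weight on external
        # requests is an illustrative assumption, not from the paper.
        popularity = u.internal_hits + 2 * u.external_hits
        return freshness * popularity
    return [u.url for u in sorted(urls, key=score, reverse=True)]
```

For example, a page untouched for years scores zero freshness and falls to the back of the queue regardless of its log counts, while a recently modified, frequently requested page is fetched early in the crawl.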
Chandramouli, A., Gauch, S., & Eno, J. (2012). A cooperative approach to web crawler URL ordering. Advances in Intelligent and Soft Computing, 98, 343–357. https://doi.org/10.1007/978-3-642-23187-2_22