A cooperative approach to web crawler URL ordering


Abstract

Uniform Resource Locator (URL) ordering algorithms are used by Web crawlers to determine the order in which to download pages from the Web. Current approaches to URL ordering based on link structure are expensive, miss many good pages, or both, particularly in social network environments. In this paper, we present a novel URL ordering system that relies on a cooperative approach between crawlers and Web servers, based on file system and Web log information. In particular, we develop algorithms that use file timestamps and internal and external request counts from Web logs. By using this change and popularity information for URL ordering, we retrieve high-quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available. We evaluate our approach on two data sets, using Web logs from a university website and from CiteSeer. On these data sets, we achieve statistically significant improvements of 57.2% and 65.7% in the ordering of high-quality pages (as indicated by Google's PageRank) over a breadth-first search crawl, while increasing the number of unique pages gathered by skipping unchanged or deleted pages. © 2012 Springer-Verlag Berlin Heidelberg.
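The abstract does not give the exact scoring formula, but the core idea can be sketched: a cooperating server exports per-URL file timestamps and Web log hit counts, and the crawler skips deleted or unchanged pages and ranks the remainder by popularity. The following is a minimal Python sketch of that idea; the names (`ServerMetadata`, `order_frontier`, `external_weight`) and the weighted-sum score are illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from typing import Optional
import heapq

@dataclass
class ServerMetadata:
    """Per-URL information a cooperating Web server could export.
    Field names are hypothetical, not taken from the paper."""
    mtime: Optional[float]   # file-system timestamp; None if the file was deleted
    internal_hits: int       # Web-log requests arriving from the same site
    external_hits: int       # Web-log requests arriving from other sites

def order_frontier(frontier: dict[str, ServerMetadata],
                   last_crawl: float,
                   external_weight: float = 2.0) -> list[str]:
    """Rank crawlable URLs by a popularity score, dropping URLs whose
    files were deleted or are unchanged since the previous crawl."""
    scored: list[tuple[float, str]] = []
    for url, meta in frontier.items():
        if meta.mtime is None:        # deleted: avoid the request entirely
            continue
        if meta.mtime <= last_crawl:  # unchanged since last crawl: skip
            continue
        # Weight external requests more heavily than internal ones, on the
        # intuition that off-site traffic signals broader popularity.
        score = external_weight * meta.external_hits + meta.internal_hits
        heapq.heappush(scored, (-score, url))  # max-heap via negated score
    return [heapq.heappop(scored)[1] for _ in range(len(scored))]

# Example: three candidate URLs with server-supplied metadata.
frontier = {
    "http://example.edu/a.html": ServerMetadata(1700000500.0, 10, 40),
    "http://example.edu/b.html": ServerMetadata(1700000600.0, 90, 5),
    "http://example.edu/c.html": ServerMetadata(None, 3, 3),  # deleted
}
print(order_frontier(frontier, last_crawl=1700000000.0))
# ['http://example.edu/b.html', 'http://example.edu/a.html']
```

The deleted page (`c.html`) is never requested, and `b.html` outranks `a.html` because its combined weighted count (100) exceeds the other's (90); how the paper actually combines internal and external counts is not specified in the abstract.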

Citation (APA)

Chandramouli, A., Gauch, S., & Eno, J. (2012). A cooperative approach to web crawler URL ordering. Advances in Intelligent and Soft Computing, 98, 343–357. https://doi.org/10.1007/978-3-642-23187-2_22
