Scale-adaptable recrawl strategies for DHT-based distributed web crawling system

Xiao Xu; Weizhe Zhang; Hongli Zhang; Binxing Fang

Conference Proceedings (Open Access)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2010) 6289 LNCS 91-105

DOI: 10.1007/978-3-642-15672-4_9

Abstract

A large-scale distributed Web crawling system that uses voluntarily contributed personal computing resources allows small companies to build their own search engines at very low cost. The biggest challenge for such a system is implementing functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such function is incremental crawling, which requires recrawling each Web site according to how frequently its content is updated. However, recrawl intervals calculated solely from the change frequency of Web sites may not match the system's real-time capacity, which leads to inefficient use of resources. Building on our previous work on a DHT-based Web crawling system, this paper proposes two scale-adaptable recrawl strategies that address this issue. The proposed methods are evaluated through simulations based on real Web datasets and show satisfactory results. © 2010 Springer-Verlag.

Citation (APA)

Xu, X., Zhang, W., Zhang, H., & Fang, B. (2010). Scale-adaptable recrawl strategies for DHT-based distributed web crawling system. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6289 LNCS, pp. 91–105). https://doi.org/10.1007/978-3-642-15672-4_9
