Exploring web partition in DHT-based distributed web crawling

Xiao Xu; Weizhe Zhang; Hongli Zhang; Binxing Fang

Journal Article

IEICE Transactions on Information and Systems (2010) E93-D(11) 2907-2921

DOI: 10.1587/transinf.E93.D.2907

Abstract

The basic requirements of a distributed Web crawling system are short download time, low communication overhead, and balanced load, all of which depend largely on the system's Web partition strategy. In this paper, we propose a DHT-based distributed Web crawling system and several DHT-based Web partition methods. First, a new system model based on a DHT method called the Content Addressable Network (CAN) is proposed. Second, based on this model, a network-distance-based Web partition is implemented to reduce the crawler-crawlee network distance in a fully distributed manner. Third, by exploiting locality in the link space, we propose the concept of link-based Web partition to reduce the communication overhead of the system. This method reduces not only the number of inter-links exchanged among the crawlers but also the cost of routing on the DHT overlay. To combine the benefits of these two Web partition methods, we then propose two distributed multi-objective Web partition methods. Finally, all the methods proposed in this paper are compared with existing system models in simulated experiments on different datasets and at different system scales. In most cases, the new methods outperform the existing models. Copyright © 2010 The Institute of Electronics, Information and Communication Engineers.
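As a rough illustration of what a Web partition over a CAN-style coordinate space might look like, the sketch below hashes each URL's host to a point in the unit square and assigns it to the crawler whose zone contains that point. It then counts inter-links, that is, links whose two endpoints are owned by different crawlers. This is not the method from the paper: the fixed grid of zones, the SHA-1 host hash, and the names `host_point`, `owner`, and `count_inter_links` are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlparse


def host_point(url):
    """Hash a URL's host to a point in the unit square [0, 1) x [0, 1).

    Assumption: SHA-1 over the lowercased host; any uniform hash would do.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return x, y


def owner(url, grid):
    """Return the id of the crawler whose zone contains the URL's point.

    Assumption: zones form a fixed grid x grid tiling, a simplification
    of CAN zones, which in a real overlay split and merge as nodes join.
    """
    x, y = host_point(url)
    col = int(x * grid)
    row = int(y * grid)
    return row * grid + col


def count_inter_links(links, grid):
    """Count links whose source and target belong to different crawlers."""
    return sum(1 for src, dst in links if owner(src, grid) != owner(dst, grid))
```

Because the hash is taken over the host only, all pages of one site land on the same crawler, so links within a site never count as inter-links. A plain hash like this ignores both network distance and link locality, which are the two properties the partition methods in the abstract aim to improve.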

Citation (APA)

Xu, X., Zhang, W., Zhang, H., & Fang, B. (2010). Exploring web partition in DHT-based distributed web crawling. IEICE Transactions on Information and Systems, E93-D(11), 2907–2921. https://doi.org/10.1587/transinf.E93.D.2907
