A fast community based algorithm for generating web crawler seeds set

  • Daneshpajouh S
  • Nasir M
  • Ghodsi M

Abstract

In this paper, we present a new, fast algorithm for generating the seed set for web crawlers. A typical crawler starts from a fixed set of URLs, such as DMOZ links, and then continues crawling from the URLs found in those pages. A crawler should download more good pages in fewer iterations, where crawled pages are good if they have high PageRank and come from different communities. Our algorithm, which is based on the HITS algorithm, generates the crawler's seed set in O(n) running time. Starting from the generated seed set, a crawler can download high-quality web pages from different communities in fewer iterations.
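
The abstract names HITS as the basis of the method but does not spell out the procedure. As background only, the following is a minimal sketch of the standard HITS hub/authority iteration on a small link graph. It is not the authors' seed-generation algorithm: the function name `hits`, the dictionary-based graph representation, and the fixed iteration count are all assumptions made for illustration.

```python
from math import sqrt


def hits(graph, iterations=50):
    """Standard HITS power iteration (illustrative sketch, not the paper's method).

    graph: dict mapping each page to the list of pages it links to.
    Returns (hub, auth), two dicts of L2-normalized scores.
    """
    # Collect every page that appears as a source or as a link target.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub update: sum the authority scores of the pages each page links to.
        new_hub = {n: sum(auth[d] for d in graph.get(n, ())) for n in nodes}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return hub, auth
```

For example, with the hypothetical graph `{"a": ["c"], "b": ["c"], "c": []}`, page `c` ends up as the only authority, while `a` and `b` receive equal hub scores. Pages with high hub scores link to many good authorities, which is one reason hub scores are a natural ingredient when choosing crawl starting points; how the paper actually selects seeds from these scores is not described in the abstract.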

Author-supplied keywords

  • communities
  • crawl quality metric
  • crawling
  • hits
  • hyperlink analysis
  • seed quality metric
  • web graph


Find this document

  • PUI: 354006375
  • SCOPUS: 2-s2.0-58049180768
  • SGR: 58049180768
  • ISBN: 9789898111265

Authors

  • S Daneshpajouh

  • M M Nasir

  • M Ghodsi
