Topic-sensitive hidden-web crawling

Panagiotis Liakos; Alexandros Ntoulas

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2012) 7651 LNCS 538-551

DOI: 10.1007/978-3-642-35063-4_39

4 Citations

5 Readers

Abstract

A constantly growing amount of high-quality information is stored in pages coming from the Hidden Web. Such pages are accessible only through a query interface that a Hidden-Web site provides and may span a variety of topics. In order to provide centralized access to the Hidden Web, previous works have focused on query generation techniques that aim at downloading all content of a given Hidden-Web site at minimum cost. In certain settings, however, we are interested in downloading only a specific part of such a site. For example, in a news database, a user may be interested in retrieving only sports articles and not politics articles. In this case, we need to make the best use of our resources by downloading only the portion of the Hidden-Web site that we are interested in. In this paper, we study how to build a topically focused Hidden-Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding category. To this end, we present query generation techniques that take into account the topic that we are interested in. We propose a number of different crawling policies and experimentally evaluate them with data from two popular sites. © 2012 Springer-Verlag.

Citation (APA)

Liakos, P., & Ntoulas, A. (2012). Topic-sensitive hidden-web crawling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7651 LNCS, pp. 538–551). https://doi.org/10.1007/978-3-642-35063-4_39
