The focused crawler is a topic-driven document-collecting crawler that has been proposed as a promising alternative for maintaining up-to-date Web document indices in search engines. A major problem in previous focused crawlers is the risk of missing highly relevant documents that are linked from off-topic documents. This problem stems mainly from a lack of consideration of the structural information within a document: traditional weighting methods such as TF-IDF, as employed in document classification, score the page as a whole and can therefore cause it. To improve the performance of focused crawlers, this paper proposes a locality-based document segmentation scheme for determining the relevance of a document to a specific topic. We segment a document into a set of sub-documents using contextual features around its hyperlinks, and use this information to decide whether the crawler should fetch the documents linked from hyperlinks in an off-topic document. © Springer-Verlag Berlin Heidelberg 2005.
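The idea of locality-based segmentation can be sketched as follows. This is an illustrative reconstruction, not the paper's exact algorithm: the `[url]` link markup, the fixed-size token window, the cosine scoring, and the fetch threshold are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter
from math import sqrt

def link_contexts(text, window=10):
    """Segment a page into per-link sub-documents: the words within
    `window` tokens on either side of each [url] marker.
    (The [url] markup is a stand-in for parsed <a href> anchors.)"""
    tokens = text.split()
    contexts = []
    for i, tok in enumerate(tokens):
        m = re.fullmatch(r"\[(\S+)\]", tok)
        if m:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = [t.lower() for t in tokens[lo:i] + tokens[i + 1:hi]]
            contexts.append((m.group(1), ctx))
    return contexts

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

topic = ["crawler", "focused", "web", "search", "index"]
page = ("Our cooking blog also hosts a page on building a focused web "
        "crawler for search engines [http://example.org/crawler] with "
        "many recipes for pasta and dessert elsewhere on the site.")

for url, ctx in link_contexts(page, window=6):
    score = cosine(ctx, topic)
    # Fetch the link if its local context is on-topic, even though the
    # page as a whole would score poorly under whole-document weighting.
    print(url, round(score, 3), "fetch" if score > 0.1 else "skip")
```

On this mostly off-topic page, the sub-document around the hyperlink is strongly on-topic, so a crawler scoring link contexts would still fetch it, which is the failure mode of whole-page TF-IDF that the paper targets.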
Citation:
Yang, J., Kang, J., & Choi, J. (2005). A focused crawler with document segmentation. In Lecture Notes in Computer Science (Vol. 3578, pp. 94–101). Springer Verlag. https://doi.org/10.1007/11508069_13