Identifying content blocks from Web documents


Abstract

Intelligent information processing systems, such as digital libraries and search engines, index web pages according to their informative content. However, web pages also contain non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative "primary content blocks" from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the "primary content blocks" based on their features. Neither algorithm requires supervised learning, yet both identify the "primary content blocks" with high precision and recall. On several thousand web pages obtained from 15 different websites, our algorithms significantly outperform the entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time. © Springer-Verlag Berlin Heidelberg 2005.
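
The abstract reports results in terms of precision and recall. As a rough illustration of how these two measures could be computed at the block level, here is a minimal sketch. It is not the authors' evaluation code: the function name `precision_recall`, the block identifiers, and the example labels are all hypothetical, and the sketch assumes each page has already been split into blocks with a hand-labeled set of primary content blocks.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of block ids.

    predicted: blocks an algorithm marked as primary content (assumed input)
    actual:    blocks a human labeled as primary content (assumed input)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page: block ids and labels are made up for illustration.
predicted_blocks = {"article-body", "headline", "sidebar"}
actual_blocks = {"article-body", "headline", "byline"}

precision, recall = precision_recall(predicted_blocks, actual_blocks)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up example, two of the three predicted blocks are correct and two of the three true content blocks are found, so both measures come out at about 0.67.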

Cite

Debnath, S., Mitra, P., & Giles, C. L. (2005). Identifying content blocks from Web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3488 LNAI, pp. 285–293). Springer-Verlag. https://doi.org/10.1007/11425274_30
