Extracting the latent hierarchical structure of web documents

Abstract

The hierarchical structure of a document plays an important role in understanding the relationships between its contents. However, this structure is not always explicitly represented in web documents through the available HTML hierarchical tags. Headings are nonetheless usually differentiated from 'normal' text in terms of presentation, which provides an implicit structure discernible to a human reader. An important pre-processing step for applications that need to operate at the hierarchical level is therefore to extract this implicitly represented hierarchical structure. This paper presents an algorithm for heading detection and heading level detection that makes use of various visual presentation cues. Results of evaluating this algorithm are also reported. © Springer-Verlag Berlin Heidelberg 2009.

Cite

El-Shayeb, M. A., El-Beltagy, S. R., & Rafea, A. (2009). Extracting the latent hierarchical structure of web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4879 LNCS, pp. 305–313). https://doi.org/10.1007/978-3-642-01350-8_28
