Extracting the latent hierarchical structure of web documents

Michael A. El-Shayeb; Samhaa R. El-Beltagy; Ahmed Rafea

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2009), Vol. 4879 LNCS, pp. 305–313

DOI: 10.1007/978-3-642-01350-8_28

Abstract

The hierarchical structure of a document plays an important role in understanding the relationships between its contents. However, such a structure is not always explicitly represented in web documents through the available HTML hierarchical tags. Headings, however, are usually differentiated from 'normal' text in a document in terms of presentation, thus providing an implicit structure discernible by a human reader. As such, an important pre-processing step for applications that need to operate on the hierarchical level is to extract the implicitly represented hierarchical structure. In this paper, an algorithm for heading detection and heading level detection that makes use of various visual presentations is presented. Results of evaluating this algorithm are also reported. © Springer-Verlag Berlin Heidelberg 2009.

Cite

El-Shayeb, M. A., El-Beltagy, S. R., & Rafea, A. (2009). Extracting the latent hierarchical structure of web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4879 LNCS, pp. 305–313). https://doi.org/10.1007/978-3-642-01350-8_28
