The hierarchical structure of a document plays an important role in understanding the relationships between its contents. However, this structure is not always explicitly represented in web documents through the available HTML hierarchical tags. Headings are nonetheless usually differentiated from 'normal' text in terms of presentation, which provides an implicit structure discernible by a human reader. An important pre-processing step for applications that need to operate at the hierarchical level is therefore to extract this implicitly represented hierarchical structure. This paper presents an algorithm for heading detection and heading level detection that makes use of various visual presentation cues. Results of evaluating this algorithm are also reported. © Springer-Verlag Berlin Heidelberg 2009.
CITATION STYLE
El-Shayeb, M. A., El-Beltagy, S. R., & Rafea, A. (2009). Extracting the latent hierarchical structure of web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4879 LNCS, pp. 305–313). https://doi.org/10.1007/978-3-642-01350-8_28