Automatic annotation of content-rich HTML documents: Structural and semantic analysis

Saikat Mukherjee; Guizhen Yang; I. V. Ramakrishnan

Journal Article, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 2870 (2003), pp. 533-549

DOI: 10.1007/978-3-540-39718-2_34

28 Citations

37 Readers

Abstract

Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically annotating HTML documents with semantic labels. Exploiting a key observation that semantically related items exhibit consistency in presentation style as well as spatial locality in template-based content-rich HTML documents, we have developed a novel framework for automatically partitioning such documents into semantic structures. Our framework tightly couples structural analysis of documents with semantic analysis incorporating domain ontologies and lexical databases such as WordNet. We present experimental evidence of the effectiveness of our techniques on a large collection of HTML documents from various news portals. © Springer-Verlag Berlin Heidelberg 2003.


Mukherjee, S., Yang, G., & Ramakrishnan, I. V. (2003). Automatic annotation of content-rich HTML documents: Structural and semantic analysis. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2870, 533–549. https://doi.org/10.1007/978-3-540-39718-2_34
