A case-based recognition of semantic structures in HTML documents: An automated transformation from HTML to XML

Masayuki Umehara; Koji Iwanuma; Hidetomo Nabeshima

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2002), Vol. 2412, pp. 141-147

DOI: 10.1007/3-540-45675-9_24

Abstract

The recognition and extraction of semantic/logical structures in HTML documents are substantially important and difficult tasks for intelligent document processing. In this paper, we show that alignment is appropriate for recognizing characteristic semantic/logical structures of a series of HTML documents, within a framework of case-based reasoning. That is, given a series of HTML documents and a sample transformation from an HTML document into an XML format, then the alignment can identify semantic/logical structures in the remaining HTML documents of the series, by matching the text-block sequence of the remaining document with the one of the sample transformation. Several important properties of texts, such as continuity and sequentiality, can naturally be utilized by the alignment. The alignment technology can significantly improve the ability of the case-based transformation method which transforms a spatial/temporal series of HTML documents into machine-readable XML formats. Throughout experimental evaluations, we show that the case-based method with alignment achieved a highly accurate transformation of HTML documents into XML.
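To make the idea of aligning text-block sequences concrete, the following is a minimal sketch and not the authors' implementation. It assumes a standard global (Needleman-Wunsch style) alignment, and every name in it (`similarity`, `align`, `transfer_tags`, the gap penalty of -0.5, and the character-overlap score from Python's `difflib`) is a hypothetical choice made for illustration. The abstract does not specify the scoring function or the alignment algorithm the paper uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Hypothetical block similarity in [-1, 1], based on character overlap."""
    return 2.0 * SequenceMatcher(None, a, b).ratio() - 1.0


def align(sample, target, gap=-0.5, sim=similarity):
    """Globally align two text-block sequences.

    Returns index pairs (i, j) meaning sample[i] is aligned with target[j].
    Blocks aligned against a gap are omitted from the result.
    """
    n, m = len(sample), len(target)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Candidate order matters: on ties, max() keeps the first one,
            # so a diagonal (block-to-block) move is preferred over a gap.
            candidates = [
                (score[i - 1][j - 1] + sim(sample[i - 1], target[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def transfer_tags(sample_blocks, sample_tags, target_blocks):
    """Carry the sample's XML tag names over to the aligned target blocks."""
    return [
        (sample_tags[i], target_blocks[j])
        for i, j in align(sample_blocks, target_blocks)
    ]
```

In this sketch the sample transformation plays the role of the case: `sample_tags[i]` is the XML element name assigned to `sample_blocks[i]`, and `transfer_tags` reuses those names for whichever blocks of another document in the series align with them. Because the alignment is monotone, it respects the order of blocks (sequentiality), and a missing block in the target is absorbed as a gap instead of shifting all later labels.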

Cite

APA

Umehara, M., Iwanuma, K., & Nabeshima, H. (2002). A case-based recognition of semantic structures in HTML documents: An automated transformation from HTML to XML. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2412, pp. 141–147). Springer Verlag. https://doi.org/10.1007/3-540-45675-9_24
