Learning methods for graph models of document structure


Abstract

This chapter focuses on the structure-based classification of websites according to their hypertext type or genre. A website usually consists of several web pages. Its structure is given by the hyperlinks between those pages, which form a directed graph. In order to represent the logical structure of a website, the underlying graph structure is represented as a so-called directed Generalized Tree (GT), in which a rooted spanning tree represents the logical core structure of the site. The remaining arcs are classified as reflexive, lateral, and vertical up- and downward arcs with respect to this kernel tree. We consider unsupervised and supervised approaches for learning classifiers from a given web corpus. Quantitative Structure Analysis (QSA) is based on describing GTs using a series of attributes that characterize their structural complexity, and employs feature selection combined with unsupervised learning techniques. Kernel methods, the second class of approaches we consider, focus on typical substructures characterizing the classes. We present a series of tree, graph and GT kernels that are suitable for solving the problem and discuss the problem of scalability. All learning approaches are evaluated using a web corpus containing classified websites. © 2011 Springer-Verlag Berlin Heidelberg.
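
The arc classification described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function name classify_arcs, the parent-map representation of the kernel tree, the label strings, and the toy website are all assumptions made for illustration. The sketch takes the kernel tree as given (the abstract does not say how the spanning tree is chosen) and labels every hyperlink relative to it.

```python
def classify_arcs(arcs, parent):
    """Label each arc (u, v) relative to a rooted kernel tree.

    `parent` maps every page to its parent in the kernel tree (root -> None).
    Assumption: `parent` describes a valid tree, i.e. it contains no cycles.
    """
    def ancestors(node):
        # Collect all proper ancestors of `node` by walking up to the root.
        found = set()
        node = parent.get(node)
        while node is not None:
            found.add(node)
            node = parent.get(node)
        return found

    labels = {}
    for u, v in arcs:
        if u == v:
            labels[(u, v)] = "reflexive"   # page links to itself
        elif parent.get(v) == u:
            labels[(u, v)] = "kernel"      # arc belongs to the spanning tree
        elif v in ancestors(u):
            labels[(u, v)] = "up"          # links back to an ancestor
        elif u in ancestors(v):
            labels[(u, v)] = "down"        # skips ahead to a deeper descendant
        else:
            labels[(u, v)] = "lateral"     # connects different branches
    return labels


# Hypothetical four-page site; the kernel tree is home -> {about, blog}, blog -> post.
parent = {"home": None, "about": "home", "blog": "home", "post": "blog"}
arcs = [
    ("home", "about"), ("home", "blog"), ("blog", "post"),
    ("post", "home"), ("home", "post"), ("about", "blog"), ("blog", "blog"),
]
labels = classify_arcs(arcs, parent)
```

Passing the kernel tree in as a parent map keeps the labeling independent of how the tree is extracted. Counts of each label per site are one plausible kind of structural attribute, though the abstract does not list the attributes QSA actually uses.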

Geibel, P., Mehler, A., & Kühnberger, K. U. (2011). Learning methods for graph models of document structure. Studies in Computational Intelligence, 370, 267–298. https://doi.org/10.1007/978-3-642-22613-7_14
