This chapter addresses the structure-based classification of websites by hypertext type, or genre. A website usually consists of several web pages, and the hyperlinks between them define a directed graph. To capture the logical structure of a website, this graph is represented as a directed Generalized Tree (GT): a rooted spanning tree models the logical core structure of the site, and the remaining arcs are classified, relative to this kernel tree, as reflexive, lateral, or vertical (upward or downward) arcs. We consider unsupervised and supervised approaches for learning classifiers from a given web corpus. Quantitative Structure Analysis (QSA) describes GTs by a series of attributes that characterize their structural complexity and combines feature selection with unsupervised learning techniques. Kernel methods, the second class of approaches we consider, focus on typical substructures that characterize the classes. We present a series of tree, graph, and GT kernels suited to this problem and discuss their scalability. All learning approaches are evaluated on a web corpus of classified websites. © 2011 Springer-Verlag Berlin Heidelberg.
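The arc classification behind a Generalized Tree can be illustrated with a minimal Python sketch. It is not the chapter's implementation. It assumes that the kernel tree is already given as a parent map, that an upward arc points to a proper ancestor in the kernel tree, that a downward arc points to a descendant that is not a direct child, and that a lateral arc connects two pages where neither is an ancestor of the other. The function name `classify_arcs`, the page names, and the use of label counts as structural attributes are hypothetical and chosen only for illustration.

```python
from collections import Counter


def classify_arcs(arcs, parent):
    """Label each arc of a site graph relative to a given kernel tree.

    arcs   -- iterable of (source, target) hyperlinks between pages
    parent -- dict mapping every page to its kernel-tree parent (root -> None)
    """
    def is_proper_ancestor(a, b):
        # Walk from b towards the root and look for a on the way.
        node = parent[b]
        while node is not None:
            if node == a:
                return True
            node = parent[node]
        return False

    labels = {}
    for u, v in arcs:
        if u == v:
            labels[(u, v)] = "reflexive"   # page links to itself
        elif parent[v] == u:
            labels[(u, v)] = "kernel"      # arc of the spanning tree
        elif is_proper_ancestor(v, u):
            labels[(u, v)] = "up"          # target is an ancestor of source
        elif is_proper_ancestor(u, v):
            labels[(u, v)] = "down"        # target is a non-child descendant
        else:
            labels[(u, v)] = "lateral"     # pages lie on different branches
    return labels


# Hypothetical four-page site (assumption, not taken from the chapter's corpus).
parent = {"index": None, "about": "index", "blog": "index", "post": "blog"}
arcs = [
    ("index", "about"), ("index", "blog"), ("blog", "post"),
    ("post", "index"),
    ("index", "post"),
    ("about", "blog"),
    ("blog", "blog"),
]

labels = classify_arcs(arcs, parent)
counts = Counter(labels.values())  # arc-type counts as simple structural attributes
```

The per-type arc counts are only one conceivable example of attributes describing structural complexity; the abstract does not specify which attributes QSA actually uses.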
Geibel, P., Mehler, A., & Kühnberger, K. U. (2011). Learning methods for graph models of document structure. Studies in Computational Intelligence, 370, 267–298. https://doi.org/10.1007/978-3-642-22613-7_14