The bag-of-words model is widely accepted as the first choice for representing the content of web documents. It benefits from low time complexity, but at the cost of ignoring document structure. There is thus a trade-off between the scope of a document model and its computational complexity. In this chapter, we present a model of content and structure learning that tackles this trade-off, with a focus on delimiting documents as instances of webgenres. We present and evaluate a two-level algorithm of hypertext zoning that integrates the genre-related classification of web documents with their segmentation. In addition, we present an algorithm of hypertext sounding for the thematic demarcation of web documents. © 2011 Springer-Verlag Berlin Heidelberg.
Mehler, A., & Waltinger, U. (2011). Integrating content and structure learning: A model of hypertext zoning and sounding. Studies in Computational Intelligence, 370, 299–329. https://doi.org/10.1007/978-3-642-22613-7_15