The categorization of natural language texts is a well established research field in computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector space model is used in terms of a bag of words approach. That is, lexical features are extracted from input texts in order to train some categorization model and, thus, to attribute, for example, authorship or topic categories. Parallel to these approaches there has been some effort in performing text categorization not in terms of lexical, but of structural features of document structure. More specifically, quantitative text characteristics have been computed in order to derive a sort of structural text signature which nevertheless allows reliable text categorizations (Kelih & Grzybek 2005; Pieper 1975). This "bag of features" approach regains attention when it comes to categorizing websites and other document types whose structure is far away from the simplicity of tree-like structures. Here we present a novel approach to structural classifiers which systematically computes structural signatures of documents. In summary, we present a text categorization algorithm which in the absence of any lexical features nevertheless performs a remarkably good classification even if the classes are thematically defined.
CITATION STYLE
Pustylnikov, O., & Mehler, A. (2008). Structural differentiae of text types - A quantitative model. In Studies in Classification, Data Analysis, and Knowledge Organization (pp. 655–662). Kluwer Academic Publishers. https://doi.org/10.1007/978-3-540-78246-9_77
Mendeley helps you to discover research relevant for your work.