Structural differentiae of text types - A quantitative model

Olga Pustylnikov; Alexander Mehler

Conference Proceedings

Studies in Classification, Data Analysis, and Knowledge Organization (2008) 655-662

DOI: 10.1007/978-3-540-78246-9_77

Abstract

The categorization of natural language texts is a well-established research field in computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector space model is used in the form of a bag-of-words approach. That is, lexical features are extracted from input texts in order to train a categorization model and thus to attribute, for example, authorship or topic categories. Parallel to these approaches, there has been some effort to perform text categorization in terms of structural rather than lexical features of documents. More specifically, quantitative text characteristics have been computed in order to derive a kind of structural text signature which nevertheless allows reliable text categorization (Kelih & Grzybek 2005; Pieper 1975). This "bag of features" approach regains attention when it comes to categorizing websites and other document types whose structure is far more complex than a simple tree. Here we present a novel approach to structural classifiers which systematically computes structural signatures of documents. In summary, we present a text categorization algorithm which, in the absence of any lexical features, nevertheless classifies remarkably well even if the classes are thematically defined.
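To make the idea of a structural signature concrete, the following is a minimal sketch in Python. It is not the authors' method: the abstract does not say which structural features the paper computes, so the four features used here (paragraph count, mean sentences per paragraph, mean and standard deviation of words per sentence), the naive sentence splitting, and the nearest-centroid classifier are all assumptions chosen for illustration. The function names `structural_signature` and `nearest_centroid` are hypothetical.

```python
import re
from statistics import mean, pstdev


def structural_signature(text):
    """Return a purely structural feature vector for a text.

    Illustrative assumption: the features are paragraph count, mean
    sentences per paragraph, and mean / population standard deviation
    of words per sentence. Word identities are never inspected.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences_per_paragraph = []
    words_per_sentence = []
    for paragraph in paragraphs:
        # Naive sentence split on runs of '.', '!' or '?'.
        sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
        sentences_per_paragraph.append(len(sentences))
        words_per_sentence.extend(len(s.split()) for s in sentences)
    if not words_per_sentence:
        # Empty input: return an all-zero signature.
        return (0.0, 0.0, 0.0, 0.0)
    return (
        float(len(paragraphs)),
        mean(sentences_per_paragraph),
        mean(words_per_sentence),
        pstdev(words_per_sentence),
    )


def nearest_centroid(signature, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance.

    `centroids` maps a class label to a signature of the same length.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(signature, centroids[label]))
```

In use, one would compute a centroid per text type from labeled training documents (for example, by averaging their signatures component-wise) and pass the resulting dictionary to `nearest_centroid`. In practice the features would also need scaling so that no single component dominates the distance.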

Cite

Pustylnikov, O., & Mehler, A. (2008). Structural differentiae of text types - A quantitative model. In Studies in Classification, Data Analysis, and Knowledge Organization (pp. 655–662). Kluwer Academic Publishers. https://doi.org/10.1007/978-3-540-78246-9_77
