Structural differentiae of text types - A quantitative model

Abstract

The categorization of natural language texts is a well-established research field in computational and quantitative linguistics (Joachims 2002). In most cases, the vector space model is applied as a bag-of-words approach: lexical features are extracted from the input texts to train a categorization model that assigns, for example, authorship or topic categories. In parallel, there have been efforts to categorize texts by structural rather than lexical features. More specifically, quantitative text characteristics have been computed to derive a kind of structural text signature that nevertheless allows reliable text categorization (Kelih & Grzybek 2005; Pieper 1975). This "bag of features" approach is regaining attention for the categorization of websites and other document types whose structure is considerably more complex than a simple tree. Here we present a novel approach to structural classifiers that systematically computes structural signatures of documents. In summary, we present a text categorization algorithm that uses no lexical features yet classifies remarkably well, even when the classes are thematically defined.
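To make the idea of a structural signature concrete, the following minimal sketch computes a small feature vector from the shape of a document alone and assigns the class with the closest centroid. It is an illustration under stated assumptions, not the algorithm of the paper: the nested-list document representation, the five features, the function names structural_signature and nearest_class, and the nearest-centroid rule are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def structural_signature(document):
    """Return a purely structural feature vector for a document.

    Assumed (hypothetical) representation: a document is a list of
    paragraphs, a paragraph is a list of sentences, and a sentence is a
    list of tokens. Only lengths are used; token identity is never read.
    """
    sentences_per_paragraph = [len(paragraph) for paragraph in document]
    tokens_per_sentence = [
        len(sentence) for paragraph in document for sentence in paragraph
    ]
    return (
        float(len(document)),             # number of paragraphs
        mean(sentences_per_paragraph),    # mean paragraph length
        pstdev(sentences_per_paragraph),  # spread of paragraph length
        mean(tokens_per_sentence),        # mean sentence length
        pstdev(tokens_per_sentence),      # spread of sentence length
    )


def nearest_class(signature, centroids):
    """Pick the label whose centroid is closest in squared Euclidean distance.

    This nearest-centroid rule is a stand-in classifier for illustration only.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(signature, centroids[label]))


# Two paragraphs: the first has two sentences, the second has one.
doc = [[["a", "b"], ["c", "d", "e", "f"]], [["g", "h", "i"]]]
signature = structural_signature(doc)
```

Because the signature depends only on counts, replacing every token with a placeholder leaves it unchanged, which is the sense in which such a classifier works without lexical features.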

Pustylnikov, O., & Mehler, A. (2008). Structural differentiae of text types - A quantitative model. In Studies in Classification, Data Analysis, and Knowledge Organization (pp. 655–662). Kluwer Academic Publishers. https://doi.org/10.1007/978-3-540-78246-9_77
