Automatic text categorization requires the construction of appropriate surrogates for documents within a text collection. The surrogates, often called document vectors, are used to train learning systems for categorizing unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented together with two different state-of-the-art classifiers: Kohonen's unsupervised SOFM and Vapnik's supervised SVM. The methods are tested using two 'gold standard' document collections and one data set from a 'real-world' news stream. There appears to be an optimal size both for the number of document vectors and for the dimensionality of each vector that gives the best compromise between categorization accuracy and training time. The performance of each of the classifiers was computed for five different surrogate vector models: the first two surrogates were created with the tfidf and weirdness measures respectively, the third surrogate was created purely on the basis of high-frequency words in the training corpus, and the fourth vector model was created from a standardised terminology database. Finally, the fifth surrogate (used for evaluation purposes) was based on a random selection of words from the training corpus. © Springer-Verlag Berlin Heidelberg 2006.
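The two feature-selection measures compared in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: tf-idf is shown in its common tf × log(N/df) form, and weirdness as the ratio of a term's relative frequency in a specialist (training) corpus to its relative frequency in a general reference corpus, with add-one smoothing assumed here to avoid division by zero. The toy corpora are invented for demonstration.

```python
import math
from collections import Counter

# Hypothetical toy corpora: a specialist training corpus and a general reference corpus.
specialist_docs = [
    "stock market shares rise",
    "market volatility hits shares",
    "shares fall as market dips",
]
general_docs = [
    "the cat sat on the mat",
    "rain is expected tomorrow",
    "market stalls sell fruit",
]

def tfidf_scores(docs):
    """Corpus-level tf-idf per term: tf summed over all docs, idf = log(N / df)."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

def weirdness_scores(special_docs, gen_docs):
    """Weirdness: relative frequency in the specialist corpus divided by
    relative frequency in the general corpus (add-one smoothed)."""
    sp = Counter(t for d in special_docs for t in d.split())
    gen = Counter(t for d in gen_docs for t in d.split())
    n_sp, n_gen = sum(sp.values()), sum(gen.values())
    return {t: (sp[t] / n_sp) / ((gen[t] + 1) / (n_gen + len(sp))) for t in sp}

# Domain terms absent from the general corpus ("shares") score higher on
# weirdness than words common to both ("market"); terms appearing in every
# specialist document get a tf-idf of zero, so the two measures select
# different words for the surrogate vector.
w = weirdness_scores(specialist_docs, general_docs)
t = tfidf_scores(specialist_docs)
```

A surrogate document vector would then be built from the top-k terms under whichever measure is chosen, which is the design choice the paper evaluates.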
CITATION STYLE
Manomaisupat, P., Vrusias, B., & Ahmad, K. (2006). Categorization of large text collections: Feature selection for training neural networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4224 LNCS, pp. 1003–1013). Springer Verlag. https://doi.org/10.1007/11875581_120