Tagging with small training corpora

Abstract

The analysis of textual data may start by classifying words using a predefined tag set. However, assigning part-of-speech tags to words in unrestricted text (POS tagging) remains a problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words). Producing such text demands linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, when one wants to change to another text genre, the same problem must be faced again. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS taggers using no more than 10^4 hand-corrected tagged words for training. A training text of this size can feasibly be corrected by hand. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. Precision rates of 96% are obtained when unknown words occur in the test set, and 98% when every word in the test set is known.
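
The abstract reports precision separately for test sets with and without unknown words. As a rough illustration of how such a split might be computed, here is a minimal sketch. It is not the authors' evaluation code. The function name, the toy lexicon, and the toy tags are all hypothetical, and it assumes each token receives exactly one predicted tag, so that precision coincides with per-token accuracy.

```python
from typing import Dict, Iterable, Set, Tuple


def precision_by_known_status(
    tokens: Iterable[Tuple[str, str, str]],  # (word, gold_tag, predicted_tag)
    lexicon: Set[str],                       # lowercased word forms in the lexicon
) -> Dict[str, float]:
    """Report tagging precision for known words, unknown words, and overall."""
    # counts[group] = [number correct, number seen]
    counts = {"known": [0, 0], "unknown": [0, 0]}
    for word, gold, predicted in tokens:
        group = "known" if word.lower() in lexicon else "unknown"
        counts[group][1] += 1
        if gold == predicted:
            counts[group][0] += 1

    def ratio(correct: int, total: int) -> float:
        # An empty group has no defined precision.
        return correct / total if total else float("nan")

    result = {group: ratio(c, t) for group, (c, t) in counts.items()}
    result["overall"] = ratio(
        sum(c for c, _ in counts.values()),
        sum(t for _, t in counts.values()),
    )
    return result


# Toy data, invented purely for illustration.
toy_lexicon = {"the", "dog", "barks"}
toy_tokens = [
    ("The", "AT", "AT"),
    ("dog", "NN", "NN"),
    ("barks", "VBZ", "NNS"),   # known word, wrong tag
    ("zorbly", "RB", "RB"),    # unknown word, correct tag
    ("quux", "NN", "JJ"),      # unknown word, wrong tag
]
scores = precision_by_known_status(toy_tokens, toy_lexicon)
print(scores)
```

Reporting the two groups separately makes it visible how much of the overall error comes from words missing from the lexicon.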

Cite

Marques, N. C., & Lopes, G. P. (2001). Tagging with small training corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2189, pp. 63–72). Springer Verlag. https://doi.org/10.1007/3-540-44816-0_7
