Tagging with small training corpora

Nuno C. Marques; Gabriel Pereira Lopes

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2001) 2189 63-72

DOI: 10.1007/3-540-44816-0_7

Abstract

The analysis of textual data may start by classifying words using a predefined tagset. However, assigning part-of-speech tags to words in unrestricted text (POS-tagging) remains a problem for natural language text understanding. Most current taggers require huge amounts of hand-tagged text for training (on the order of 10^5 pretagged words). Producing such text demands linguistically highly trained manpower for a highly repetitive and boring job, and the results obtained are not of optimal quality. Moreover, when one wants to move to another text genre, the same problem must be faced again. Our proposal goes in another direction. By carefully combining a large lexicon with an efficient neural-network-based generator of taggers, we can generate POS-taggers using no more than 10^4 hand-corrected tagged words for training. A training text of this size can feasibly be hand corrected. Experimental results are presented and discussed for the SUSANNE Corpus. Results on three additional Portuguese corpora are also discussed. Precision rates of 96% are obtained when unknown words occur in the test set, and 98% when every word in the test set is known.
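The idea of letting a lexicon narrow the tagger's choices can be illustrated with a toy sketch. This is not the authors' system: the neural network is swapped for a simple count of tag bigrams, and the lexicon, the tag names, the open-class fallback for unknown words, and all function names are hypothetical. It only shows why a lexicon reduces what has to be learned from a small hand-corrected corpus, and how a per-token precision figure could be computed.

```python
from collections import Counter

# Hypothetical toy lexicon: each known word maps to the tags it may take.
LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rust": {"NOUN", "VERB"},
    "old": {"ADJ"},
}
# Assumed fallback candidates for words missing from the lexicon.
OPEN_CLASS = {"NOUN", "VERB", "ADJ"}


def candidates(word):
    """Return the tags the lexicon allows, or open-class tags if unknown."""
    return LEXICON.get(word.lower(), OPEN_CLASS)


def train(tagged_sentences):
    """Count (previous tag, tag) pairs in a small hand-corrected corpus."""
    counts = Counter()
    for sentence in tagged_sentences:
        prev = "<s>"
        for _, tag in sentence:
            counts[(prev, tag)] += 1
            prev = tag
    return counts


def tag(words, counts):
    """Greedy tagging: among lexicon-allowed tags, pick the one seen most
    often after the previous tag (a stand-in for the neural scorer)."""
    prev, out = "<s>", []
    for word in words:
        best = max(sorted(candidates(word)), key=lambda t: counts[(prev, t)])
        out.append(best)
        prev = best
    return out


def precision(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    corpus = [
        [("the", "DET"), ("can", "NOUN"), ("rust", "VERB")],
        [("the", "DET"), ("old", "ADJ"), ("can", "NOUN")],
    ]
    model = train(corpus)
    words = ["the", "old", "can", "rust"]
    gold = ["DET", "ADJ", "NOUN", "VERB"]
    predicted = tag(words, model)
    print(predicted, precision(predicted, gold))
```

Because the lexicon already rules out most tags for each known word, the learned component only has to choose among a few candidates, which is one plausible reason a much smaller training corpus can suffice.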

Cite (APA)

Marques, N. C., & Lopes, G. P. (2001). Tagging with small training corpora. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2189, pp. 63–72). Springer Verlag. https://doi.org/10.1007/3-540-44816-0_7
