Improving a simple bigram HMM part-of-speech tagger by latent annotation and self-training

Abstract

In this paper, we describe and evaluate a bigram part-of-speech (POS) tagger that uses latent annotations, and we investigate self-training the tagger with additional genre-matched unlabeled data. The use of latent annotations substantially improves the performance of a baseline HMM bigram tagger, outperforming a trigram HMM tagger with sophisticated smoothing. The performance of the latent tagger is further enhanced by self-training with a large set of unlabeled data, even in settings where standard bigram and trigram taggers trained on larger amounts of labeled data do not benefit from self-training. Our best model obtains a state-of-the-art Chinese tagging accuracy of 94.78% when evaluated on a representative test set of the Penn Chinese Treebank 6.0.
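To make the two techniques in the abstract concrete, the sketch below implements a minimal bigram HMM tagger with Viterbi decoding, plus a single self-training round that tags unlabeled sentences with the current model and retrains on the union of gold and automatically tagged data. This is an illustrative toy under stated assumptions, not the authors' implementation: the tagset and corpora are invented, smoothing is simple add-one, and the paper's latent-annotation component (automatic splitting of tags into refined subcategories) is omitted.

    # Minimal sketch (NOT the paper's system): bigram HMM POS tagging with
    # Viterbi decoding and one self-training round. Tagset, corpora, and
    # add-one smoothing are illustrative assumptions.
    from collections import defaultdict
    import math

    def train(tagged_sents):
        """Collect bigram transition and emission counts from tagged sentences."""
        trans = defaultdict(lambda: defaultdict(int))   # counts for P(tag_i | tag_{i-1})
        emit = defaultdict(lambda: defaultdict(int))    # counts for P(word | tag)
        tags, vocab = set(), set()
        for sent in tagged_sents:
            prev = "<s>"
            for word, tag in sent:
                trans[prev][tag] += 1
                emit[tag][word] += 1
                tags.add(tag)
                vocab.add(word)
                prev = tag
            trans[prev]["</s>"] += 1
        return trans, emit, tags, vocab

    def log_prob(counts, context, outcome, support):
        """Add-one smoothed log P(outcome | context)."""
        total = sum(counts[context].values())
        return math.log((counts[context][outcome] + 1) / (total + support))

    def viterbi(words, trans, emit, tags, vocab):
        """Return the most likely tag sequence under the bigram HMM."""
        V = len(vocab) + 1          # emission support, +1 for unknown words
        T = len(tags) + 1           # transition support, +1 for </s>
        best = [{} for _ in words]  # best[i][t]: best log score ending in tag t
        back = [{} for _ in words]  # back[i][t]: predecessor tag on that path
        for i, w in enumerate(words):
            for t in tags:
                e = log_prob(emit, t, w, V)
                if i == 0:
                    best[0][t] = log_prob(trans, "<s>", t, T) + e
                    back[0][t] = None
                else:
                    score, prev = max(
                        (best[i - 1][p] + log_prob(trans, p, t, T), p)
                        for p in tags)
                    best[i][t] = score + e
                    back[i][t] = prev
        # Fold in the end-of-sentence transition, then follow backpointers.
        _, last = max((best[-1][t] + log_prob(trans, t, "</s>", T), t)
                      for t in tags)
        seq = [last]
        for i in range(len(words) - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))

    def self_train(tagged_sents, unlabeled_sents):
        """One self-training round: tag unlabeled data, retrain on the union."""
        model = train(tagged_sents)
        auto = [list(zip(s, viterbi(s, *model))) for s in unlabeled_sents]
        return train(tagged_sents + auto)

    if __name__ == "__main__":
        labeled = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")],
                   [("a", "DT"), ("cat", "NN"), ("sleeps", "VB")]]
        unlabeled = [["the", "cat", "barks"]]
        trans, emit, tags, vocab = self_train(labeled, unlabeled)
        print(viterbi(["a", "dog", "sleeps"], trans, emit, tags, vocab))

The toy loop shows only the data flow of self-training; in the paper's setup the retrained model is evaluated on held-out Penn Chinese Treebank 6.0 sentences, and the gains come from combining this loop with latent tag refinement, which the sketch does not model.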

Citation (APA)

Huang, Z., Eidelman, V., & Harper, M. (2009). Improving a simple bigram HMM part-of-speech tagger by latent annotation and self-training. In NAACL-HLT 2009 - Human Language Technologies: 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Short Papers (pp. 213–216). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1620853.1620911
