Abstract
In this paper, we describe and evaluate a bigram part-of-speech (POS) tagger that uses latent annotations, and then investigate using additional genre-matched unlabeled data for self-training the tagger. The use of latent annotations substantially improves the performance of a baseline HMM bigram tagger, outperforming a trigram HMM tagger with sophisticated smoothing. The performance of the latent tagger is further enhanced by self-training with a large set of unlabeled data, even in cases where standard bigram or trigram taggers trained on larger amounts of labeled data do not benefit from self-training. Our best model obtains a state-of-the-art Chinese tagging accuracy of 94.78% when evaluated on a representative test set of the Penn Chinese Treebank 6.0.
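The baseline model described above is a bigram HMM tagger, which assigns each word the tag sequence maximizing the product of tag-to-tag transition probabilities and tag-to-word emission probabilities, decoded with the Viterbi algorithm (the latent-annotation extension then splits each tag into automatically learned subtags). A minimal sketch of bigram Viterbi decoding, with hypothetical toy probabilities in place of treebank-estimated ones, might look like this:

```python
import math

# Toy transition and emission probabilities (hypothetical, for illustration
# only; a real tagger estimates these from labeled treebank counts).
TRANSITIONS = {
    ("<s>", "DT"): 0.6, ("<s>", "NN"): 0.4,
    ("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
    ("NN", "VB"): 0.7, ("NN", "NN"): 0.3,
    ("VB", "DT"): 0.5, ("VB", "NN"): 0.5,
}
EMISSIONS = {
    ("DT", "the"): 0.7, ("NN", "dog"): 0.4,
    ("VB", "barks"): 0.3, ("NN", "barks"): 0.05,
}
UNSEEN = 1e-6  # crude floor standing in for real smoothing


def viterbi(words, tags=("DT", "NN", "VB")):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # chart[i][t] = (log-prob of best path ending with tag t at position i,
    #                backpointer to the previous tag on that path)
    chart = [dict() for _ in words]
    for t in tags:
        p = (TRANSITIONS.get(("<s>", t), UNSEEN)
             * EMISSIONS.get((t, words[0]), UNSEEN))
        chart[0][t] = (math.log(p), None)
    for i in range(1, len(words)):
        for t in tags:
            chart[i][t] = max(
                (chart[i - 1][prev][0]
                 + math.log(TRANSITIONS.get((prev, t), UNSEEN))
                 + math.log(EMISSIONS.get((t, words[i]), UNSEEN)), prev)
                for prev in tags)
    # Pick the best final tag, then follow backpointers to recover the path.
    best_last = max(tags, key=lambda t: chart[-1][t][0])
    seq = [best_last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(chart[i][seq[-1]][1])
    return list(reversed(seq))
```

For example, `viterbi(["the", "dog", "barks"])` returns `["DT", "NN", "VB"]` under the toy probabilities above. The paper's latent-annotation variant would run the same dynamic program over split subtags, and self-training would re-estimate the probability tables from automatically tagged unlabeled text.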
Citation
Huang, Z., Eidelman, V., & Harper, M. (2009). Improving a simple bigram HMM part-of-speech tagger by latent annotation and self-training. In NAACL-HLT 2009 - Human Language Technologies: 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Short Papers (pp. 213–216). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1620853.1620911