We present a system for unsupervised tagging of words into classes produced by a distributional clustering technique called co-clustering. A hidden Markov model (HMM), trained on the high-frequency terms in the lexicon, is used to "tag" occurrences of low-frequency terms. In experiments using the Wall Street Journal portion of the Penn Treebank, we show that previously reported problems in using Baum-Welch estimation for part-of-speech tagging do not occur in this context. We also show how state-level term emission models can be augmented to account for morphological patterns using features automatically derived from the output of co-clustering. Finally, we consider an alternative means of extending the coverage of the lexicon, in which low-frequency terms are added to the lexicon as types, and compare this approach with the token-level assignments made by the HMM.
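As a rough illustration of the token-level tagging idea, the sketch below runs Viterbi decoding over cluster states. It does not reproduce the paper's system: there is no Baum-Welch training and no morphological features. The names `viterbi_tag`, `emission_logp`, and `lexicon`, along with the toy probabilities, are hypothetical. It assumes each high-frequency term emits only from its own cluster, and that a term outside the lexicon is equally likely under every state, so context alone decides its tag.

```python
import math

NEG_INF = float("-inf")

def emission_logp(term, state, lexicon):
    """Log emission score of `term` in `state` (simplifying assumption)."""
    cluster = lexicon.get(term)
    if cluster is None:
        return 0.0  # term not in lexicon: uninformative, same for all states
    return 0.0 if cluster == state else NEG_INF

def viterbi_tag(tokens, lexicon, log_start, log_trans):
    """Return the most likely state (cluster) sequence for `tokens`."""
    n = len(log_start)
    if not tokens:
        return []
    score = [log_start[s] + emission_logp(tokens[0], s, lexicon)
             for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for term in tokens[1:]:
        new_score, pointers = [], []
        for s in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + emission_logp(term, s, lexicon))
        score = new_score
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy setup (invented numbers): three clusters, three lexicon terms.
lexicon = {"the": 0, "dog": 1, "runs": 2}
log_start = [math.log(p) for p in (0.8, 0.1, 0.1)]
log_trans = [[math.log(p) for p in row] for row in (
    (0.1, 0.8, 0.1),
    (0.1, 0.2, 0.7),
    (0.6, 0.3, 0.1),
)]
tags = viterbi_tag(["the", "wug", "runs"], lexicon, log_start, log_trans)
print(tags)  # [0, 1, 2]
```

Here the out-of-lexicon token "wug" receives cluster 1 purely from its neighbours, which is the sense in which an HMM can assign classes to individual occurrences of low-frequency terms.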
Freitag, D. (2004). Toward unsupervised whole-corpus tagging. In COLING 2004 - Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics (ACL). https://doi.org/10.3115/1220355.1220407