Toward unsupervised whole-corpus tagging

Abstract

We present a system for unsupervised tagging of words into classes produced by a distributional clustering technique called co-clustering. A hidden Markov model (HMM), trained on the high-frequency terms in the lexicon, is used to "tag" occurrences of low-frequency terms. In experiments using the Wall Street Journal portion of the Penn Treebank, we show that previously reported problems in using Baum-Welch estimation for part-of-speech tagging do not occur in this context. We also show how state-level term emission models can be augmented to account for morphological patterns using features automatically derived from the output of co-clustering. Finally, we consider an alternative means of extending the coverage of the lexicon, in which low-frequency terms are added to the lexicon as types, and compare this approach with the token-level assignments made by the HMM.
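The token-level idea in the abstract, letting an HMM's context model decide the class of a term that is missing from the lexicon, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three class labels, the hand-set probabilities, the `emission` and `viterbi` helpers, and the choice to give out-of-lexicon terms a uniform emission score. The paper estimates its parameters with Baum-Welch and augments emissions with morphological features; this sketch covers only Viterbi decoding over fixed toy parameters.

```python
import math

# Hypothetical toy parameters (not from the paper, which estimates them
# with Baum-Welch). In an unsupervised setting the states would be
# anonymous cluster IDs; readable names are used here for clarity.
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
# Emission table covering only the "high-frequency" lexicon.
EMIT = {
    "DET": {"the": 0.7, "a": 0.3},
    "NOUN": {"dog": 0.5, "cat": 0.5},
    "VERB": {"sees": 0.6, "chases": 0.4},
}
LEXICON = {word for dist in EMIT.values() for word in dist}
SMOOTH = 1e-6                 # floor for in-lexicon words unseen in a state
UNIFORM = 1.0 / len(STATES)   # assumed score for out-of-lexicon words


def emission(state, word):
    """Return P(word | state); out-of-lexicon words get a uniform score."""
    if word not in LEXICON:
        # Assumption: no lexical evidence, so transitions decide the tag.
        return UNIFORM
    return EMIT[state].get(word, SMOOTH)


def viterbi(words):
    """Return the most probable state sequence for a list of words."""
    if not words:
        return []
    # Log-space scores for the first word.
    scores = [{s: math.log(START[s]) + math.log(emission(s, words[0]))
               for s in STATES}]
    backpointers = []
    for word in words[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, word)))
            ptr[s] = best
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    sentence = ["the", "wug", "sees", "a", "cat"]  # "wug" is out of lexicon
    print(list(zip(sentence, viterbi(sentence))))
```

Because the out-of-lexicon word contributes the same emission score in every state, its tag is determined entirely by the neighbouring words through the transition model, which is the effect the token-level approach relies on.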

Cite

Freitag, D. (2004). Toward unsupervised whole-corpus tagging. In COLING 2004 - Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics (ACL). https://doi.org/10.3115/1220355.1220407
