Similarity-based estimation of word cooccurrence probabilities

Ido Dagan; Fernando Pereira; Lillian Lee

Conference Proceedings

Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vol. 1994-June (1994), pp. 272–278

DOI: 10.3115/981732.981770

Abstract

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
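The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the abstract only says the method relies on distributional word similarity, so the cosine measure, the toy word pairs, and the names `bigram_counts`, `mle`, `cosine`, and `similarity_estimate` are all assumptions made for illustration. The sketch also leaves out the back-off step, in which a Katz-style model would redistribute discounted probability mass to unseen bigrams.

```python
from collections import Counter
import math

def bigram_counts(pairs):
    """Map each first word to a Counter of the second words seen with it."""
    counts = {}
    for w1, w2 in pairs:
        counts.setdefault(w1, Counter())[w2] += 1
    return counts

def mle(counts, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); zero for unseen bigrams."""
    following = counts.get(w1, Counter())
    total = sum(following.values())
    return following[w2] / total if total else 0.0

def cosine(counts, a, b):
    """Cosine similarity of two words' context-count vectors.
    Assumption: a stand-in for whatever similarity measure the paper uses."""
    ca, cb = counts.get(a, Counter()), counts.get(b, Counter())
    dot = sum(ca[w] * cb[w] for w in ca if w in cb)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_estimate(counts, w1, w2, k=2):
    """Estimate P(w2 | w1) as a similarity-weighted average of P(w2 | w1')
    over the k words w1' most similar to w1."""
    scored = sorted(
        ((cosine(counts, w1, other), other) for other in counts if other != w1),
        reverse=True,
    )[:k]
    norm = sum(score for score, _ in scored)
    if norm == 0:
        return 0.0
    return sum(score * mle(counts, other, w2) for score, other in scored) / norm

# Hypothetical toy data: (verb, object) pairs standing in for word bigrams.
PAIRS = [
    ("eat", "peach"), ("eat", "apple"),
    ("devour", "apple"), ("devour", "pear"),
    ("visit", "beach"), ("visit", "city"),
]
COUNTS = bigram_counts(PAIRS)
```

On this toy data neither "eat pear" nor "eat beach" occurs, so the frequency-based estimate is zero for both. Because "eat" shares contexts with "devour" but not with "visit", `similarity_estimate(COUNTS, "eat", "pear")` comes out higher than `similarity_estimate(COUNTS, "eat", "beach")`.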

Citation (APA)

Dagan, I., Pereira, F., & Lee, L. (1994). Similarity-based estimation of word cooccurrence probabilities. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1994-June, pp. 272–278). Association for Computational Linguistics (ACL). https://doi.org/10.3115/981732.981770
