A very very large corpus doesn't always yield reliable estimates

1Citations
Citations of this article
78Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Banko and Brill (2001) suggested that the development of very large training corpora may be more effective for progress in empirical Natural Language Processing than improving methods that use existing smaller training corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one billion word corpus. When using one billion words, as expected, we do find that many of our estimates do converge to their eventual value. However, we also find that for some words, no such convergence occurs. This leads us to conclude that simply relying upon large corpora is not in itself sufficient: we must pay attention to the statistical modelling as well.

Cite

CITATION STYLE

APA

Curran, J. R., & Osborne, M. (2002). A very very large corpus doesn’t always yield reliable estimates. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL). https://doi.org/10.3115/1118853.1118861

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free