Good-Turing Frequency Estimation Without Tears
Abstract
Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well-founded techniques appropriate to this domain. Some versions of the Good-Turing approach are very demanding computationally, but we define a version, the Simple Good-Turing estimator, which is straightforward to use. Tested on a variety of natural-language-related data sets, the Simple Good-Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
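The two simple approaches criticised in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's Simple Good-Turing estimator: the function names, the toy sample, the constant `k`, and the assumed number of types `vocab_size` are all hypothetical. For contrast, the last function computes the basic, unsmoothed Turing estimate of the total probability of unseen types, N1/N, where N1 is the number of types seen exactly once and N is the sample size.

```python
from collections import Counter


def relative_frequency(counts, item):
    """Observed count divided by sample size; zero for unseen types."""
    total = sum(counts.values())
    return counts.get(item, 0) / total


def additive_estimate(counts, item, vocab_size, k=1.0):
    """Add a small positive quantity k to the dividend and k * vocab_size
    to the divisor. vocab_size (the assumed number of possible types)
    and k are illustrative assumptions, not values from the paper."""
    total = sum(counts.values())
    return (counts.get(item, 0) + k) / (total + k * vocab_size)


def turing_unseen_mass(counts):
    """Unsmoothed Turing estimate of the total probability of all
    unseen types: N1 / N. This is not the Simple Good-Turing estimator."""
    total = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / total


# Toy sample (hypothetical): 8 tokens, 6 types, 5 of them seen once.
sample = "the cat sat on the mat the end".split()
counts = Counter(sample)

p_the = relative_frequency(counts, "the")        # 3 / 8
p_dog = relative_frequency(counts, "dog")        # 0 / 8: zero for an unseen type
p_dog_add = additive_estimate(counts, "dog", vocab_size=10)  # 1 / 18
p_unseen = turing_unseen_mass(counts)            # 5 / 8
```

On a sample this small the N1/N figure is very rough; it is shown only to make the contrast with the zero estimate for unseen types concrete.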


