Using sketches to estimate associations

Abstract

We should not have to look at the entire corpus (e.g., the Web) to know if two words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies) and the size of the collection. Not surprisingly, computational work and statistical accuracy (variance or errors) depend on sampling rate, as will be shown both theoretically and empirically. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10^-4 may suffice. © 2005 Association for Computational Linguistics.
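The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on these assumptions: document IDs are relabeled 1..D by a random permutation, a sketch is the k smallest permuted IDs of a word's postings list, and the likelihood is a simplified multinomial approximation maximized by brute force. The function names (build_sketch, sample_table, estimate_cooccurrence) are hypothetical.

```python
import math


def build_sketch(postings, perm, k):
    """Keep the k smallest permuted document IDs of a non-empty postings list."""
    return sorted(perm[d] for d in postings)[:k]


def sample_table(sketch1, sketch2):
    """Derive a sample contingency table from two sketches.

    Both sketches are complete for permuted IDs up to d_s, so the documents
    with permuted ID <= d_s behave like a random sample of d_s documents.
    """
    d_s = min(sketch1[-1], sketch2[-1])
    s1 = {x for x in sketch1 if x <= d_s}
    s2 = {x for x in sketch2 if x <= d_s}
    a_s = len(s1 & s2)                # sampled docs containing both words
    b_s = len(s1) - a_s               # word 1 only
    c_s = len(s2) - a_s               # word 2 only
    n_s = d_s - a_s - b_s - c_s       # neither word
    return a_s, b_s, c_s, n_s, d_s


def estimate_cooccurrence(table, f1, f2, D):
    """Pick the co-occurrence count a that best explains the sample.

    Assumption: a multinomial approximation in which, given the margins
    f1, f2 and the collection size D, the four cells are
    a, f1 - a, f2 - a and D - f1 - f2 + a.
    """
    a_s, b_s, c_s, n_s, _ = table
    lo, hi = max(0, f1 + f2 - D), min(f1, f2)

    def loglik(a):
        cells = (a, f1 - a, f2 - a, D - f1 - f2 + a)
        total = 0.0
        for count, cell in zip((a_s, b_s, c_s, n_s), cells):
            if count > 0:
                if cell <= 0:
                    return -math.inf  # this table cannot produce the sample
                total += count * math.log(cell)
        return total

    return max(range(lo, hi + 1), key=loglik)


# Toy usage: 100 documents, identity "permutation" for reproducibility.
D = 100
perm = {d: d for d in range(1, D + 1)}
word1 = range(2, D + 1, 2)            # word 1 occurs in even-numbered docs
word2 = range(3, D + 1, 3)            # word 2 occurs in multiples of 3
table = sample_table(build_sketch(word1, perm, 10),
                     build_sketch(word2, perm, 10))
print(table, estimate_cooccurrence(table, len(word1), len(word2), D))
```

In this toy collection the two words truly co-occur in 16 documents (the multiples of 6), while only the first 20 documents are inspected. A real run would use a random permutation shared by all words; the identity mapping here only keeps the example reproducible.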

Cite

Li, P., & Church, K. W. (2005). Using sketches to estimate associations. In HLT/EMNLP 2005 - Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 708–715). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1220575.1220664
