Substring statistics

Kyoji Umemura; Kenneth Church

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2009), vol. 5449 LNCS, pp. 53-71

DOI: 10.1007/978-3-642-00382-0_5

Abstract

The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper will show how to compute many statistics of interest for all substrings (ngrams) in a large corpus. The method not only computes standard corpus frequency, freq, and document frequency, df, but generalizes naturally to compute df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy and adaptation. © Springer-Verlag Berlin Heidelberg 2009.
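
To make the definitions in the abstract concrete, here is a minimal Python sketch of freq, df and df_k on a toy corpus. It is a brute-force illustration, not the paper's method: it enumerates every substring up to a fixed length, which would not scale to a large corpus. The function names (`substring_stats`, `df_k`, `count_distribution`), the `max_len` cutoff and the choice to count overlapping occurrences are all assumptions made for this example.

```python
from collections import Counter


def substring_stats(docs, max_len=3):
    """Brute-force counts for all substrings of length 1..max_len.

    Returns (freq, per_doc): freq is the corpus-wide count of each
    substring, per_doc holds one Counter of substring counts per document.
    Overlapping occurrences are counted (an assumption of this sketch).
    """
    freq = Counter()
    per_doc = []
    for doc in docs:
        counts = Counter(
            doc[i:j]
            for i in range(len(doc))
            for j in range(i + 1, min(i + max_len, len(doc)) + 1)
        )
        per_doc.append(counts)
        freq.update(counts)
    return freq, per_doc


def df_k(per_doc, s, k=1):
    """Number of documents that contain substring s at least k times."""
    return sum(1 for counts in per_doc if counts[s] >= k)


def count_distribution(per_doc, s, max_k):
    """Estimate P(count of s in a document == k) for k = 0..max_k.

    Uses the identity: #docs with exactly k occurrences = df_k - df_(k+1).
    """
    n = len(per_doc)
    dfs = [df_k(per_doc, s, k) for k in range(max_k + 2)]
    return [(dfs[k] - dfs[k + 1]) / n for k in range(max_k + 1)]


# Toy corpus (hypothetical data, for illustration only).
docs = ["abab", "ab", "ba"]
freq, per_doc = substring_stats(docs)
print(freq["ab"], df_k(per_doc, "ab", 1), df_k(per_doc, "ab", 2))
print(count_distribution(per_doc, "ab", max_k=2))
```

In this sketch, df is simply df_k with k = 1, and freq equals the sum of df_k over all k ≥ 1, so the df_k values alone determine how the occurrences of a substring are spread across documents.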

Citation (APA)

Umemura, K., & Church, K. (2009). Substring statistics. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5449 LNCS, pp. 53–71). https://doi.org/10.1007/978-3-642-00382-0_5
