Exact and efficient computation of the expected number of missing and common words in random texts

Sven Rahmann; Eric Rivals

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2000) 1848 375-387

DOI: 10.1007/3-540-45123-4_31

Abstract

The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q−1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
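To make two notions from the abstract concrete, the following Python sketch computes the autocorrelation of a single word and an independence-based estimate of the expected NMW. It is an illustration under stated assumptions, not the authors' algorithm: the function names `autocorrelation` and `approx_expected_missing` are hypothetical, the text is assumed to be uniformly random over an alphabet of size `sigma`, and the estimate `sigma**q * (1 - sigma**-q)**(n - q + 1)` is the natural formula obtained by treating the n − q + 1 windows as independent, which may differ in form from the formulas in the paper.

```python
def autocorrelation(word: str) -> list[int]:
    """Return the autocorrelation bit vector of `word`.

    Bit i is 1 if the word, shifted right by i positions,
    agrees with itself on the overlapping part.
    """
    q = len(word)
    return [1 if word[i:] == word[:q - i] else 0 for i in range(q)]


def approx_expected_missing(sigma: int, q: int, n: int) -> float:
    """Independence approximation of the expected number of missing words.

    Assumption (not taken from the paper): the text is uniformly random
    over an alphabet of size `sigma`, and its n - q + 1 windows of
    length q are treated as independent draws.
    """
    windows = n - q + 1
    if windows <= 0:
        # The text is too short to contain any word of length q.
        return float(sigma ** q)
    p_word = 1.0 / sigma ** q  # probability that one window equals a fixed word
    return sigma ** q * (1.0 - p_word) ** windows


print(autocorrelation("abab"))  # [1, 0, 1, 0]
print(approx_expected_missing(sigma=4, q=8, n=100_000))
```

The word "abab" overlaps itself at shifts 0 and 2, which is exactly the dependence between successive windows that the independence estimate ignores.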

Cite

Rahmann, S., & Rivals, E. (2000). Exact and efficient computation of the expected number of missing and common words in random texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1848, pp. 375–387). Springer Verlag. https://doi.org/10.1007/3-540-45123-4_31
