Scaling laws and fluctuations in the statistics of word frequencies

43 citations · 51 Mendeley readers

Abstract

In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps' law). Analyzing the fluctuations around this average in three large databases (Google n-grams, the English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor's law), in contrast to the prediction of decaying fluctuations obtained using simple sampling arguments. We explain both scaling laws (Heaps' and Taylor's) by modeling the usage of words as a Poisson process with a fat-tailed distribution of word frequencies (Zipf's law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimates of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of the lexical richness of texts with different lengths.
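The mechanism described in the abstract can be illustrated with a small simulation. The following Python sketch is not the paper's model or data; it simply draws texts from a Zipf-like word-frequency distribution (pure sampling) and, for comparison, from frequencies perturbed document by document as a stand-in for topic dependence. The vocabulary size grows sublinearly with text length in both cases (Heaps' law), while the topical perturbation keeps the standard deviation of the vocabulary proportional to its mean (Taylor's law) instead of decaying as in pure sampling. The vocabulary size W, the Zipf exponent gamma, and the gamma-distributed topical weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf-like distribution over W word types: p_r proportional to r**(-gamma).
# W and gamma are illustrative choices, not values taken from the paper.
W, gamma = 50_000, 1.0
ranks = np.arange(1, W + 1)
p_base = ranks ** (-gamma)
p_base /= p_base.sum()

def vocab_size(n_tokens, p):
    """Number of distinct word types in a text of n_tokens words drawn i.i.d. from p."""
    counts = rng.multinomial(n_tokens, p)
    return np.count_nonzero(counts)

def topic_modulated(p, concentration=0.5):
    """Perturb the base frequencies for each document (a toy stand-in for
    topic-dependent word frequencies); smaller concentration means stronger
    topical variation. This is not the paper's exact model."""
    weights = rng.gamma(concentration, size=p.size)
    q = p * weights
    return q / q.sum()

for n in (10**4, 10**5, 10**6):
    plain = [vocab_size(n, p_base) for _ in range(50)]
    topical = [vocab_size(n, topic_modulated(p_base)) for _ in range(50)]
    print(f"N={n:>8}  sampling: mean={np.mean(plain):8.0f} std={np.std(plain):7.1f}"
          f"   topical: mean={np.mean(topical):8.0f} std={np.std(topical):7.1f}")
```

Under pure sampling the relative fluctuations of the vocabulary shrink as texts get longer, whereas the topic-modulated frequencies keep them roughly constant, which is the qualitative contrast the paper quantifies.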

Citation (APA)
Gerlach, M., & Altmann, E. G. (2014). Scaling laws and fluctuations in the statistics of word frequencies. New Journal of Physics, 16. https://doi.org/10.1088/1367-2630/16/11/113010
