Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

  • David Guthrie
  • Mark Hepple

We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.
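
To make the core idea concrete, the sketch below (in Python, not taken from the paper) shows a toy fingerprint-based hash table for n-gram counts: an ordinary hash table with linear probing stands in for a true minimal perfect hash function, and a small per-slot fingerprint means an unseen n-gram is occasionally accepted as a rare seen one, which is the innocuous error class the abstract describes. The class name FingerprintedNgramTable and all parameters are hypothetical.

    import hashlib

    class FingerprintedNgramTable:
        """Toy open-addressing table keyed by hashed n-grams (illustrative sketch only)."""

        def __init__(self, ngram_counts, load_factor=0.7, fp_bits=8):
            # Extra slots plus linear probing stand in for a minimal perfect hash,
            # which would map the N known n-grams to exactly the integers 0..N-1.
            self.size = int(len(ngram_counts) / load_factor) + 1
            self.fp_mask = (1 << fp_bits) - 1
            self.fingerprints = [None] * self.size  # small per-slot check values
            self.values = [0] * self.size           # raw or quantized counts
            for ngram, count in ngram_counts:
                slot, fp = self._slot_and_fingerprint(ngram)
                while self.fingerprints[slot] is not None:
                    slot = (slot + 1) % self.size
                self.fingerprints[slot] = fp
                self.values[slot] = count

        def _slot_and_fingerprint(self, ngram):
            digest = hashlib.blake2b(" ".join(ngram).encode("utf-8")).digest()
            slot = int.from_bytes(digest[:8], "little") % self.size
            return slot, digest[8] & self.fp_mask

        def get(self, ngram, default=0):
            # Probe from the home slot; a matching fingerprint is treated as a hit,
            # so an unseen n-gram can occasionally be returned as a rare seen one.
            slot, fp = self._slot_and_fingerprint(ngram)
            while self.fingerprints[slot] is not None:
                if self.fingerprints[slot] == fp:
                    return self.values[slot]
                slot = (slot + 1) % self.size
            return default

    table = FingerprintedNgramTable([(("new", "york"), 51234), (("the", "cat"), 978)])
    print(table.get(("new", "york")))     # 51234
    print(table.get(("purple", "idea")))  # 0, except on a rare fingerprint collision

Replacing the probing with a genuine minimal perfect hash and packing the fingerprints and values into a few bits each is what would bring storage toward the per-n-gram byte costs the abstract reports; the details above are illustrative only.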

Find this document

  • PUI: 362643503
  • ISBN: 1932432868
  • SCOPUS: 2-s2.0-80053289945
  • SGR: 80053289945

