The Virtual Corpus Approach to Deriving Ngram Statistics from Large Scale Corpora

by Chunyu Kit
Sort (1998)

Available from citeseerx.ist.psu.edu
The Virtual Corpus Approach to Deriving
Ngram Statistics from Large Scale Corpora

Chunyu Kit†‡ and Yorick Wilks†
† Department of Computer Science, University of Sheffield
{cykit, yorick}@dcs.shef.ac.uk
‡ Department of Chinese, Translation and Linguistics, City University of Hong Kong
ctckit@cityu.edu.hk
Abstract
This paper reports our implementation of the Virtual Corpus (VC) approach to deriving
ngram statistics for ngrams of any length from large-scale corpora, based on the suffix
array data structure. To enable the VC to accommodate corpora with vocabularies
of different sizes, we first convert corpus tokens into integer codes. To accelerate
processing, we employ a bucket-radixsort for sorting the VC indices (or pointers, each of which
represents the sequence of corpus tokens from its position to the end of the corpus). The
time complexity of the sorting algorithm is $O(N \log_n N)$ in token code comparisons.
1 Introduction
It is known that the ngram model is the simplest, most durable and most successful statistical
model for many natural language processing (NLP) and speech processing applications, e.g., as in
[9, 2] and many others. Many high-performance part-of-speech tagging systems, e.g., [10, 4, 3]
and others, are based on ngram statistics. The multigram model [1, 6, 7] has become a practical
language modelling technique.
Hitherto, bigram and/or trigram models have been used rather than any higher order ngrams.
One of the reasons seems to be the unavailability of higher order ngrams, mainly because of
the computational difficulty of acquiring long ngrams and their counts. However, the suffix
array data structure [11], also known as PAT tree [8], provides a basis for an efficient approach to
acquiring ngrams from large-scale corpora. Nagao and Mori [13] practise a position indexing
technique similar to suffix tree, but the efficiency appears to need further enhancement.
In this paper, we report our implementation of the Virtual Corpus (VC) approach to deriving
ngram statistics for ngrams of any length from large-scale corpora, with plenty of technical
details. The system is called "virtual corpus" because it relies on a virtually existing sorted
corpus for ngram counting (see the following sections for details). This approach first performs a
preprocessing step that converts corpus tokens into integer codes before constructing the VC, by which
it gains the capacity to handle large-scale corpora with vocabularies of various sizes. The
VC employs a bucket-radixsort to sort the VC pointers (or indices), each
of which represents the sequence of corpus tokens from a token position in the corpus to the
end of the corpus. The time complexity of the sorting algorithm is $O(N \log_n N)$ in token
comparisons, where N is the corpus length and a token is an integer code. This algorithm
compares favourably with other sorting algorithms that are of $O(N \log N)$ complexity in string
comparisons. Experiments show that it takes 1.5 minutes to sort the VC for the 1.29 million
PTB-II WSJ POS tag corpus on a Sun Sparc 4 and 1.5 minutes to sort the VC for the 6-million-
character Brown corpus on a Sun Ultra 2.

The first author gratefully acknowledges a University of Sheffield Research Scholarship awarded to him
that enables him to undertake this work. We wish to thank Makoto Nagao and Shinsuke Mori for helpful
communications, and to thank Hamish Cunningham, Ted Dunning, Rob Gaizauskas, Steve Renals and many
other colleagues for various kinds of help and useful discussions.
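
The integer-coding preprocessing step mentioned in the introduction can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `encode_tokens` and the choice of assigning codes in sorted vocabulary order (so that integer order mirrors string order) are assumptions made for the example.

```python
def encode_tokens(tokens):
    """Replace each token by an integer code (hypothetical helper).

    Codes are assigned in sorted vocabulary order, so comparing two
    codes gives the same result as comparing the original strings.
    """
    vocabulary = sorted(set(tokens))
    code_of = {tok: code for code, tok in enumerate(vocabulary)}
    encoded = [code_of[tok] for tok in tokens]
    return encoded, code_of


tokens = "the cat saw the dog".split()
encoded, code_of = encode_tokens(tokens)
print(encoded)   # [3, 0, 2, 3, 1]
print(code_of)   # {'cat': 0, 'dog': 1, 'saw': 2, 'the': 3}
```

Once tokens are integers, every comparison made during sorting is a single integer comparison, whatever the length of the original word or tag.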
2 Virtual Corpus Based on Suffix Array
A virtual corpus is realised by a sequence of pointers (or indices), each of which points to a
corpus token (e.g., a word, a character, etc.) in the real corpus and stands, virtually, for the
sequence of tokens from that token to the end of the corpus. This is the essence of the suffix
array data structure. There are as many pointers as tokens in the corpus, i.e., the
corpus length N. With these pointers we have a VC consisting of N token sequences. After
sorting these pointers, the count of an ngram consisting of any n tokens in the corpus is
simply the number of adjacent pointers whose token sequences take those n tokens as a prefix.
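
The counting idea above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's method: it uses Python's built-in sort on token slices instead of the bucket-radixsort, which is simple but far too slow and memory-hungry for a large corpus, and the name `ngram_counts` is hypothetical.

```python
from itertools import groupby


def ngram_counts(tokens, n):
    """Count every ngram of length n from the sorted pointers (sketch)."""
    N = len(tokens)
    # One pointer per token position; pointer p stands for tokens[p:].
    pointers = sorted(range(N), key=lambda p: tokens[p:])
    # Pointers with fewer than n tokens left cannot start an ngram.
    full = [p for p in pointers if p + n <= N]
    counts = {}
    # After sorting, pointers sharing an n-token prefix are adjacent,
    # so each run of equal prefixes is one ngram with its count.
    for prefix, run in groupby(full, key=lambda p: tuple(tokens[p:p + n])):
        counts[prefix] = sum(1 for _ in run)
    return counts


counts = ngram_counts("b a b a a c b a a b".split(), 2)
print(counts[("b", "a")])  # 3
print(counts[("a", "a")])  # 2
```

Because the sort is done once, counts for any other n can be read off the same sorted pointers by changing only the prefix length.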
This VC approach based on the suffix array data structure can be exemplified by the
following 3 stages with a very simple artificial corpus:
Stage 1 Construct a virtual corpus with pointers to each token's position.
real corpus:  b  a  b  a  a  c  b  a  a  b   ....[]   ([]: end of corpus)
              ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
pointers:     p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 ....[]
virtual corpus:
p1  -> b a b a a c b a a b ....[]
p2  -> a b a a c b a a b ....[]
p3  -> b a a c b a a b ....[]
p4  -> a a c b a a b ....[]
p5  -> a c b a a b ....[]
p6  -> c b a a b ....[]
p7  -> b a a b ....[]
p8  -> a a b ....[]
p9  -> a b ....[]
p10 -> b ....[]
:
Stage 2 Sort the virtual corpus.
p8  -> a a b ....[]
p4  -> a a c b a a b ....[]
p2  -> a b a a c b a a b ....[]
p9  -> a b ....[]
p5  -> a c b a a b ....[]
p7  -> b a a b ....[]
p3  -> b a a c b a a b ....[]
p1  -> b a b a a c b a a b ....[]
p10 -> b ....[]
p6  -> c b a a b ....[]
:
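
Stages 1 and 2 can be reproduced on the ten tokens shown above with a short sketch. It assumes the corpus ends after the tenth token, whereas the corpus in the example continues ("...."), so the positions of p9 and p10 differ from the listing: a sequence that ends sorts before any longer sequence it is a prefix of. The built-in sort stands in for the bucket-radixsort, and the helper name `suffix` is hypothetical.

```python
corpus = "b a b a a c b a a b".split()

# Stage 1: one pointer per token position (1-based, to match p1..p10).
pointers = list(range(1, len(corpus) + 1))


def suffix(p):
    """Token sequence that pointer p stands for (hypothetical helper)."""
    return corpus[p - 1:]


# Stage 2: sort the pointers by the token sequences they stand for.
sorted_pointers = sorted(pointers, key=suffix)

for p in sorted_pointers:
    print(f"p{p:<2} -> {' '.join(suffix(p))}")

print(sorted_pointers)  # [8, 4, 9, 2, 5, 10, 7, 3, 1, 6]
```

Only the pointers are reordered; the corpus itself is never copied or moved, which is why the sorted corpus exists only virtually.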