The Virtual Corpus Approach to Deriving Ngram Statistics from Large Scale Corpora ∗
Available from citeseerx.ist.psu.edu
Chunyu Kit †‡    Yorick Wilks †
† Department of Computer Science, University of Sheffield
{cykit, yorick}@dcs.shef.ac.uk
‡ Department of Chinese, Translation and Linguistics, City University of Hong Kong
ctckit@cityu.edu.hk
Abstract
This paper reports our implementation of the Virtual Corpus (VC) approach to deriving
ngram statistics for ngrams of any length from large-scale corpora, based on the suffix
array data structure. To enable the VC to accommodate corpora with vocabularies
of different sizes, we first convert corpus tokens into integer codes. To accelerate
processing, we employ a bucket-radixsort for sorting the VC indices (or pointers, each of which
represents the sequence of corpus tokens from its position to the end of the corpus). The
time complexity of the sorting algorithm is O(N log_n N) in token code comparisons.
1 Introduction
The ngram model is known to be the simplest, most durable and successful statistical model
for many natural language processing (NLP) and speech processing applications, e.g., as in
[9, 2] and many others. Many high-performance part-of-speech tagging systems, e.g., [10, 4, 3]
and others, are based on ngram statistics. The multigram model [1, 6, 7] has become a practical
language modelling technique.
Hitherto, bigram and/or trigram models have been used rather than higher-order ngrams.
One of the reasons seems to be the unavailability of higher-order ngrams, mainly because of
the computational difficulty in acquiring long ngrams and their counts. However, the suffix
array data structure [11], also known as PAT tree [8], provides a basis for an efficient approach to
acquiring ngrams from large-scale corpora. Nagao and Mori [13] practise a position indexing
technique similar to the suffix tree, but its efficiency appears to need further enhancement.
In this paper, we report our implementation of the Virtual Corpus (VC) approach to deriving
ngram statistics for ngrams of any length from large-scale corpora, with plenty of technical
details. The system is called "virtual corpus" because it relies on a virtually existing sorted
corpus for ngram counting (see the following sections for details). This approach first performs a
preprocessing step that converts corpus tokens into integer codes before constructing the VC, by which
it gains the capacity to handle large-scale corpora with vocabularies of various sizes. The
VC employs a bucket-radixsort to facilitate the sorting of the VC pointers (or indices), each
of which represents a sequence of corpus tokens from a token position in the corpus to the
∗ The first author gratefully acknowledges a University of Sheffield Research Scholarship awarded to him
that enables him to undertake this work. We wish to thank Makoto Nagao and Shinsuke Mori for helpful
communications, and to thank Hamish Cunningham, Ted Dunning, Rob Gaizauskas, Steve Renals and many
other colleagues for various kinds of help and useful discussions.
end of the corpus. The time complexity of the sorting algorithm is O(N log_n N) in token
comparisons, where N is the corpus length and a token is an integer code. This algorithm
compares favourably with other sorting algorithms that are of O(N log N) complexity in string
comparisons. Experiments show that it takes 1.5 minutes to sort the VC for the 1.29 million
PTB-II WSJ POS tag corpus on a Sun Sparc 4 and 1.5 minutes to sort the VC for the 6-million-
character Brown corpus on a Sun Ultra 2.
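The preprocessing step mentioned above, converting corpus tokens into integer codes, can be illustrated with a minimal sketch. The function name encode_corpus and the policy of assigning codes in order of first appearance are assumptions made for illustration only; the paper does not specify its coding scheme at this point.

```python
def encode_corpus(tokens):
    """Map each distinct token to an integer code (hypothetical helper).

    Codes are assigned in order of first appearance. This is an assumption
    for illustration, not necessarily the scheme used by the authors.
    """
    vocab = {}   # token -> integer code
    codes = []   # the corpus as a sequence of integer codes
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
        codes.append(vocab[tok])
    return codes, vocab


tokens = "b a b a a c b a a b".split()
codes, vocab = encode_corpus(tokens)
print(codes)   # [0, 1, 0, 1, 1, 2, 0, 1, 1, 0]
print(vocab)   # {'b': 0, 'a': 1, 'c': 2}
```

Once the corpus is a sequence of integers, comparing two tokens costs one integer comparison whatever the length of the original strings, and the same machinery applies whether a token is a word, a character or a POS tag.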
2 Virtual Corpus Based on Suffix Array
A virtual corpus is realised by a sequence of pointers (or indices), each of which points to a
corpus token (e.g., a word, a character, etc.) in the real corpus and stands, virtually, for the
sequence of tokens from that token to the end of the corpus. This is the essence of the suffix
array data structure. There are as many pointers as tokens in the corpus, i.e., the
corpus length N. With these pointers we have a VC consisting of N token sequences. After
sorting these pointers, the count of an ngram consisting of any n tokens in the corpus is
simply the number of adjacent pointers whose token sequences take the n tokens as a prefix.
This VC approach based on the suffix array data structure can be exemplified by the
following 3 stages with a very simple artificial corpus:
Stage 1 Construct a virtual corpus with pointers to each token's position.
real corpus:  b  a  b  a  a  c  b  a  a  b   ....[]   ([]: end of corpus)
              ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
pointers:     p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 ....[]
virtual corpus:
p1 -> b a b a a c b a a b ....[]
p2 -> a b a a c b a a b ....[]
p3 -> b a a c b a a b ....[]
p4 -> a a c b a a b ....[]
p5 -> a c b a a b ....[]
p6 -> c b a a b ....[]
p7 -> b a a b ....[]
p8 -> a a b ....[]
p9 -> a b ....[]
p10 -> b ....[]
:
Stage 2 Sort the virtual corpus.
p8 -> a a b ....[]
p4 -> a a c b a a b ....[]
p2 -> a b a a c b a a b ....[]
p9 -> a b ....[]
p5 -> a c b a a b ....[]
p7 -> b a a b ....[]
p3 -> b a a c b a a b ....[]
p1 -> b a b a a c b a a b ....[]
p10 -> b ....[]
p6 -> c b a a b ....[]
:
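Stages 1 and 2, together with the counting rule stated at the start of this section, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names build_virtual_corpus and count_ngram are hypothetical, the sort is Python's built-in comparison sort rather than the bucket-radixsort used in the paper, and the corpus is cut off after the tenth token, so the order differs slightly from the listing above, where the corpus continues.

```python
def build_virtual_corpus(corpus):
    """Return 0-based token positions sorted by the suffix each one stands for.

    Assumption: a plain comparison sort keyed on suffix slices. This is NOT
    the paper's bucket-radixsort, and the slices copy O(N^2) tokens in total,
    so it only suits small demonstrations.
    """
    return sorted(range(len(corpus)), key=lambda p: corpus[p:])


def count_ngram(corpus, pointers, ngram):
    """Count an ngram as the length of the run of adjacent sorted pointers
    whose suffixes take the ngram as a prefix."""
    n = len(ngram)

    def prefix(p):
        return corpus[p:p + n]

    # Lower bound: first sorted pointer whose n-token prefix is >= ngram.
    lo, hi = 0, len(pointers)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(pointers[mid]) < ngram:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first sorted pointer whose n-token prefix is > ngram.
    hi = len(pointers)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(pointers[mid]) <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


corpus = "b a b a a c b a a b".split()
pointers = build_virtual_corpus(corpus)

# Report pointers as p1..p10 to match the numbering used in the stages.
print([p + 1 for p in pointers])   # [8, 4, 9, 2, 5, 10, 7, 3, 1, 6]
print(count_ngram(corpus, pointers, ["b", "a"]))        # 3
print(count_ngram(corpus, pointers, ["b", "a", "a"]))   # 2
```

Because the ten-token corpus ends at p10, the short suffixes p9 and p10 sort ahead of the longer suffixes that share their prefix. Since all suffixes beginning with a given ngram are adjacent after sorting, two binary searches are enough to delimit the run and read off the count.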


