Improved word-aligned binary compression for text indexing

by V N Anh, A Moffat
IEEE Transactions on Knowledge and Data Engineering (2006)

Abstract

We present an improved compression mechanism for handling the compressed inverted indexes used in text retrieval systems, extending the word-aligned binary coding carry method. Experiments using two typical document collections show that the new method obtains superior compression to previous static codes, without penalty in terms of execution speed.

Concise Papers
Vo Ngoc Anh and Alistair Moffat
Index Terms—Data compaction and compression, textual databases, indexing
methods, file organization, compression, inverted index, binary code, text retrieval
system, text searching, Web searching.

1 INTRODUCTION
UNCOMPRESSED, the inverted index for a document collection has
the potential to be large. For example, if full word-positions are
stored as 32-bit numbers, then each word in the source document
(accounting for 5 or 6 bytes on average) adds 4 bytes to the index,
and the total inverted file might be 70-80 percent of the space
occupied by the source collection. In practice, indexes are smaller
than this, for a range of reasons: not all of the content of the source
collection needs to be indexed (some is formatting markup, for
example), perhaps not all words have their positions recorded
(some stop words may be omitted, or only partially indexed), and
the index information can be stored compressed.
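The size estimate above is simple arithmetic, and can be checked with a short Python sketch. The function name and the sample word lengths are illustrative assumptions; only the 4-byte position and the 5 to 6 byte average word length come from the text.

```python
def index_to_text_ratio(avg_word_bytes: float, position_bytes: int = 4) -> float:
    """Uncompressed index size as a fraction of the source text size.

    Assumes one stored position of `position_bytes` bytes per source word.
    """
    return position_bytes / avg_word_bytes


for avg_word_bytes in (5.0, 5.5, 6.0):
    ratio = index_to_text_ratio(avg_word_bytes)
    print(f"{avg_word_bytes:.1f} bytes/word -> index is {ratio:.0%} of the text")
```

For average word lengths between 5 and 6 bytes the ratio falls between roughly 67 and 80 percent, in line with the approximate figure quoted above.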
To compress an inverted list, all ascending sequences of
values—such as the document numbers in a document-sorted list,
or the positional information for one term in one document in a
word-level index—are transformed to sequences of differences
between consecutive values. Those differences tend to be small
rather than large, and any infinite integer code that estimates small
values to have higher probabilities than large values can be used to
represent them. For example, both the Elias and Golomb codes
assume decreasing probability distributions, and for document
collections containing English text, result in an average of around
six to eight bits per document number in a complete inverted
index. Witten et al. [1] describe these and related methods, and
give compression effectiveness results for index data in connection
with typical document collections.
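As an illustration of a code that favors small values, the following Python sketch implements one common formulation of the Elias gamma code: a run of zeros giving the length of the binary representation, followed by the binary representation itself. The bit strings are used for readability only, the function names are hypothetical, and the unary convention described in [1] may differ from the one assumed here.

```python
def elias_gamma_encode(x: int) -> str:
    """Return the Elias gamma codeword for x >= 1 as a string of '0'/'1'."""
    if x < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(x)[2:]                      # binary digits, leading bit is 1
    return "0" * (len(binary) - 1) + binary  # length prefix, then the value


def elias_gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of Elias gamma codewords."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


# The gap sequence used as an example in Section 2.
gaps = [1, 4, 1, 1, 3, 14, 1, 3, 23, 3, 2, 41]
encoded = "".join(elias_gamma_encode(g) for g in gaps)
assert elias_gamma_decode(encoded) == gaps
print(len(encoded), "bits for", len(gaps), "gaps")
```

A gap of 1 costs a single bit, whereas a gap of 41 costs 11 bits, which is the sense in which the code assumes a decreasing probability distribution. The short list used here is a toy input and its average cost says nothing about the six to eight bits per document number reported for real collections.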
There has also been considerable interest in other coding
methods. For example, byte-aligned codes have received attention
[2], [3], [4], [5]. In a byte code, each integer is represented by a
codeword that is a multiple of eight bits long, which means that to
interpret a stream of codewords, only byte operations are required.
The elimination of bit operations has obvious benefits in terms of
decoding speed. Byte codes also have a useful “skipping” property
that allows the nth codeword in a message to be quickly sought
without every intervening codeword needing to be fully decoded.
In practical settings, the speed gain may compensate for the
compression loss compared to the more precise Golomb codes.
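A minimal Python sketch of one byte-aligned code is given below. The layout is an assumption chosen for illustration (seven data bits per byte, least significant group first, with the top bit set on the final byte of each codeword); the schemes in [2], [3], [4], [5] differ in such details. The function names are hypothetical.

```python
def vbyte_encode(x: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte."""
    if x < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while x >= 128:
        out.append(x & 0x7F)   # continuation byte: top bit clear
        x >>= 7
    out.append(x | 0x80)       # final byte of the codeword: top bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> list[int]:
    """Decode a concatenation of codewords produced by vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:        # end of codeword reached
            values.append(current)
            current, shift = 0, 0
        else:
            shift += 7
    return values


def vbyte_skip(data: bytes, n: int) -> int:
    """Return the byte offset at which codeword number n (0-based) starts."""
    pos = 0
    for _ in range(n):
        while not data[pos] & 0x80:   # step over continuation bytes
            pos += 1
        pos += 1                      # step over the final byte
    return pos


values = [1, 300, 5, 70000, 2]
stream = b"".join(vbyte_encode(v) for v in values)
assert vbyte_decode(stream) == values
assert vbyte_decode(stream[vbyte_skip(stream, 3):]) == [70000, 2]
```

Only whole-byte operations are needed to locate codeword boundaries, so `vbyte_skip` can advance past codewords by testing a single flag bit per byte without reconstructing the intervening values.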
Other recent compression techniques make use of simple binary
codes and offer the prospect of both effective compression and fast
decoding. For example, Anh and Moffat [6] describe the carry
scheme, in which each 32-bit word in the compressed message
stores a set of binary codes, all of the same bit-length. While bit
operations are required to unpack each word, there are no single-
bit accesses, and straight-line decompression remains fast.
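The idea of storing several equal-width binary codes in one 32-bit word can be sketched in Python as follows. This is a deliberately simplified illustration and not the carry layout of [6]: how the decoder learns the width and the number of codes in each word is not described in this section, so both are passed explicitly here, and the function names are hypothetical.

```python
WORD_BITS = 32


def pack_word(values: list[int], width: int) -> int:
    """Pack values, each stored in exactly `width` bits, into one 32-bit word."""
    if width * len(values) > WORD_BITS:
        raise ValueError("values do not fit in a single 32-bit word")
    word = 0
    for i, v in enumerate(values):
        if v >> width:                       # true if v needs more than `width` bits (or is negative)
            raise ValueError(f"{v} does not fit in {width} bits")
        word |= v << (i * width)
    return word


def unpack_word(word: int, width: int, count: int) -> list[int]:
    """Extract `count` codes of `width` bits each using shifts and masks."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]


values = [1, 4, 1, 1, 3, 6, 1, 3]            # all fit in 3 bits
word = pack_word(values, width=3)
assert word < 2 ** WORD_BITS
assert unpack_word(word, width=3, count=len(values)) == values
```

Each value is recovered with one shift and one mask, so no bit-by-bit scanning is needed. In this example eight 3-bit codes occupy 24 bits, leaving 8 bits of the word unused; unused trailing bits of this kind are the waste that the slide method described in this paper aims to recover.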
This paper describes an extension of carry, called the slide
method, that allows better compression effectiveness than carry,
with only a slight increase in decoding complexity. Both schemes
rely on minimum-width binary codes being packed into machine
words; slide has the additional advantage of making full use of
the trailing bits at the end of partially full words that are wasted in
carry. Our paper includes an experimental evaluation of the new
mechanism, comparing it to the previous carry method, to byte-
aligned codes, and to Golomb codes.
2 TEXT INDEXES
A document-level inverted index stores a set of inverted lists, one for
each term t that appears in the document collection. Each list
consists of a length indicator, followed by a set of that
many ordered pairs [1]. Each pair consists of the number of a document
in which that term appears, followed by the frequency of the term
within the document. That is, a document-level inverted index can
be described by the expression
$$\langle t, f_t, \langle d, f_{d,t} \rangle^{+} \rangle^{+},$$
where $t$ is a term description, usually stored separately in a
vocabulary; $f_t$ is the term's collection frequency, which describes how
many $\langle d, f_{d,t} \rangle$ elements follow and is possibly also stored in the
vocabulary; $d$ is a document number; and $f_{d,t}$ is a within-document
frequency. For example, if a word appears in three documents of a
collection (twice in document 17, once in document 23, and three
times in document 156), then its inverted list would be stored as
$\langle \text{word}, 3, \langle 17, 2 \rangle, \langle 23, 1 \rangle, \langle 156, 3 \rangle \rangle$.
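The document-level structure can be illustrated with a small Python sketch that builds inverted lists from a toy collection. The whitespace tokenizer, the dictionary representation, and the function name are assumptions made for the example; a real system would store the lists in compressed form.

```python
from collections import Counter, defaultdict


def build_document_level_index(docs: dict[int, str]) -> dict[str, tuple[int, list[tuple[int, int]]]]:
    """Map each term t to (f_t, [(d, f_dt), ...]) with d in increasing order."""
    postings = defaultdict(list)
    for d in sorted(docs):                       # visit documents in numeric order
        counts = Counter(docs[d].lower().split())
        for term, f_dt in counts.items():
            postings[term].append((d, f_dt))
    return {term: (len(pairs), pairs) for term, pairs in postings.items()}


# A toy collection reproducing the example from the text.
docs = {
    17: "word word",
    23: "word other",
    156: "word word word",
}
index = build_document_level_index(docs)
assert index["word"] == (3, [(17, 2), (23, 1), (156, 3)])
assert index["other"] == (1, [(23, 1)])
```

Here the length indicator is the number of pairs in the list, that is, the number of documents containing the term, and not the total number of occurrences of the term.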
If word positions are also included in the index, a further level of
structure is required, and the index is described by
$$\langle t, f_t, \langle d, f_{d,t}, \langle p \rangle^{+} \rangle^{+} \rangle^{+},$$
where each $f_{d,t}$ value indicates the number of $p$ values that
immediately follow, and each $p$ value is a word-count offset at
which term $t$ appears within the document $d$.
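A word-level index adds the list of positions to each entry. The sketch below is a minimal Python illustration under stated assumptions: whitespace tokenization, positions counted from 1, and a hypothetical function name.

```python
from collections import defaultdict


def build_word_level_index(docs: dict[int, str]) -> dict[str, tuple[int, list[tuple[int, int, list[int]]]]]:
    """Map each term t to (f_t, [(d, f_dt, [p, ...]), ...])."""
    postings = defaultdict(dict)
    for d in sorted(docs):                       # increasing document number
        for p, term in enumerate(docs[d].lower().split(), start=1):
            postings[term].setdefault(d, []).append(p)   # positions arrive in increasing order
    return {
        term: (len(by_doc), [(d, len(ps), ps) for d, ps in by_doc.items()])
        for term, by_doc in postings.items()
    }


docs = {17: "word other word", 23: "other word"}
index = build_word_level_index(docs)
assert index["word"] == (2, [(17, 2, [1, 3]), (23, 1, [2])])
assert index["other"] == (2, [(17, 1, [2]), (23, 1, [1])])
```

In each triple the middle value equals the length of the position list that follows it, mirroring the role of the within-document frequency in the expression above.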
The usual assumption is that within each term’s inverted list,
the tuples are ordered by increasing document number and
that, in a word-level index, the positions are similarly stored in
increasing order. We refer to such an arrangement as being
document-sorted. Alternatives, not considered in this paper, include
the frequency-sorted indexes of Persin et al. [7], and the impact-
sorted indexes of Anh et al. [8].
In document-sorted indexes, the next step is to transform
each list of document numbers (and word positions, if they
appear) into a set of gaps or differences. For example, consider
the set of $n = 12$ numbers:
$$\langle 1, 5, 6, 7, 10, 24, 25, 28, 51, 54, 56, 97 \rangle.$$
Storing the same set as gaps allows the more concise form:
$$\langle 1, 4, 1, 1, 3, 14, 1, 3, 23, 3, 2, 41 \rangle.$$
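The transformation to gaps and its inverse can be written in a few lines of Python. This is a sketch with hypothetical function names; it assumes the first value is stored as its difference from zero, which matches the example above.

```python
from itertools import accumulate


def to_gaps(values: list[int]) -> list[int]:
    """Convert a strictly increasing list of positive integers to gaps."""
    gaps = []
    previous = 0
    for v in values:
        if v <= previous:
            raise ValueError("values must be positive and strictly increasing")
        gaps.append(v - previous)
        previous = v
    return gaps


def from_gaps(gaps: list[int]) -> list[int]:
    """Recover the original values as the running sums of the gaps."""
    return list(accumulate(gaps))


numbers = [1, 5, 6, 7, 10, 24, 25, 28, 51, 54, 56, 97]
gaps = to_gaps(numbers)
assert gaps == [1, 4, 1, 1, 3, 14, 1, 3, 23, 3, 2, 41]
assert from_gaps(gaps) == numbers
```

Because the input is strictly increasing, every gap is at least 1, and decoding only requires a running sum.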
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 6, JUNE 2006, p. 857
The authors are with the Department of Computer Science and Software
Engineering, The University of Melbourne, Australia 3010.
E-mail: {vo, alistair}@csse.unimelb.edu.au.
Manuscript received 10 Aug. 2005; revised 5 Dec. 2005; accepted 22 Feb.
2006; published online 20 Apr. 2006.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-0306-0805.
1041-4347/06/$20.00 © 2006 IEEE. Published by the IEEE Computer Society.