An efficient compression code for text databases

Nieves R. Brisaboa; Eva L. Iglesias; Gonzalo Navarro; José R. Paramá

Journal Article

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2003) 2633 468-481

DOI: 10.1007/3-540-36618-0_33

65 Citations

14 Readers

Abstract

We present a new compression format for natural language texts, allowing both exact and approximate search without decompression. This new code, called End-Tagged Dense Code, has several advantages over other compression techniques with similar features, such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a simpler vocabulary representation, and (iii) a simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged Huffman Code, i.e., exact search for words and phrases directly on the compressed text using any known sequential pattern matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy of d-ary Huffman codes. © Springer-Verlag Berlin Heidelberg 2003.
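The abstract does not spell out the byte layout, so the following is only a minimal sketch of how a byte-oriented end-tagged dense code is commonly described, not the authors' implementation. It assumes that words are ranked by decreasing frequency, that the high bit is set only on the last byte of each codeword, and that the remaining 7 bits per byte enumerate ranks densely. The function names (`etdc_encode_rank`, `etdc_decode_rank`, `compress`, `find_word`) are hypothetical and chosen for illustration.

```python
from collections import Counter


def etdc_encode_rank(rank: int) -> bytes:
    """Return the codeword for a 0-based frequency rank (assumed layout)."""
    out = [128 + rank % 128]      # final byte: high bit set marks the end
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense numbering: no codeword value is skipped
        out.append(rank % 128)    # earlier bytes: high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode_rank(code: bytes) -> int:
    """Invert etdc_encode_rank for a single complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)


def compress(words):
    """Rank words by frequency and concatenate their codewords."""
    ranked = [w for w, _ in Counter(words).most_common()]
    rank_of = {w: r for r, w in enumerate(ranked)}
    data = b"".join(etdc_encode_rank(rank_of[w]) for w in words)
    return ranked, data


def find_word(data: bytes, code: bytes):
    """Find a codeword in compressed data with a plain byte search.

    A hit counts only if it starts at position 0 or right after a byte
    whose high bit is set, i.e. on a codeword boundary.
    """
    positions = []
    start = data.find(code)
    while start != -1:
        if start == 0 or data[start - 1] >= 128:
            positions.append(start)
        start = data.find(code, start + 1)
    return positions


words = "to be or not to be".split()
vocabulary, data = compress(words)
hits = find_word(data, etdc_encode_rank(vocabulary.index("be")))
```

Under these assumptions a codeword depends only on the word's rank, so the vocabulary can be stored as a frequency-sorted word list, and the boundary check in `find_word` is what lets an ordinary byte-string search run over the compressed text without decoding it.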

Cite

Brisaboa, N. R., Iglesias, E. L., Navarro, G., & Paramá, J. R. (2003). An efficient compression code for text databases. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2633, 468–481. https://doi.org/10.1007/3-540-36618-0_33
