«Texts written in a natural language are essentially made of words of this language». We use this obvious fact, together with an extensive lexicon, to define a good model of the statistical behavior of letters in texts. This model is used with the arithmetic coding scheme to build an efficient universal data compression method. Our method was initially specialized for the compression of French texts, but it can easily be adapted to other languages. Tests show that the compression ratio obtained by our method is 30% on average on French texts. On the same texts, Ziv & Lempel’s method yields an average ratio of 40%. On other kinds of test files (English text, executable files, source code), the use of an order-1 Markov chain leads to results of the same order as Ziv & Lempel’s. We present a new approach to dynamic dictionary construction for natural language compression. The fact, well known to linguists, that the number of different words is small makes a dynamic construction possible.
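As a rough illustration of how a lexicon can drive a letter model, the sketch below estimates the probability of the next letter from the lexicon words that share the current prefix, then sums the ideal cost of about -log2(p) bits per symbol that an arithmetic coder would spend. It is a minimal sketch under stated assumptions, not the model described in the paper: the toy lexicon, the end-of-word marker `END`, the uniform weighting of lexicon words, and the function names are all hypothetical, and words outside the lexicon (which the paper's order-1 Markov chain may be meant to cover) are simply rejected here.

```python
import math
from collections import defaultdict

END = "$"  # hypothetical end-of-word marker, not taken from the paper


def build_prefix_counts(lexicon):
    """For each word prefix, count which symbol follows it in the lexicon."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in lexicon:
        symbols = word + END
        for i, sym in enumerate(symbols):
            counts[symbols[:i]][sym] += 1
    return counts


def next_symbol_prob(counts, prefix, sym):
    """Estimate P(sym | prefix), or None if the lexicon never saw it."""
    followers = counts.get(prefix)
    if not followers or sym not in followers:
        return None  # a real coder would need an escape/fallback model here
    return followers[sym] / sum(followers.values())


def ideal_code_length(counts, word):
    """Ideal arithmetic-coding cost of a lexicon word, in bits."""
    symbols = word + END
    bits = 0.0
    for i, sym in enumerate(symbols):
        p = next_symbol_prob(counts, symbols[:i], sym)
        if p is None:
            raise ValueError(f"{word!r} is not covered by the lexicon")
        bits += -math.log2(p)
    return bits


if __name__ == "__main__":
    toy_lexicon = ["chat", "chien", "cheval", "table"]  # assumed toy data
    counts = build_prefix_counts(toy_lexicon)
    print(ideal_code_length(counts, "chat"))
```

With four equally weighted words, each word costs about 2 bits in this toy model, because letters that are forced by the lexicon (such as the "h" after "c") cost nothing. A practical model would weight words by frequency and handle unknown words.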
CITATION STYLE
Revuz, D., & Zipstein, M. (1992). DZ a text compression algorithm for natural languages. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 644 LNCS, pp. 193–204). Springer Verlag. https://doi.org/10.1007/3-540-56024-6_16