A unifying theory and improvements for existing approaches to text compression
Thesis DisciplineComputer Science
Degree GrantorUniversity of Canterbury
Degree NameDoctor of Philosophy
More than 40 different schemes for performing text compression have been proposed in the literature. Many of these schemes appear to use quite different approaches, such as Huffman coding, dictionary substitution, predictive modelling, and modelling with Finite State Automata (FSA). From the many schemes in the literature, a representative sample has been selected to include all schemes of current interest (i.e. schemes which are in popular use, or those which have been proposed recently). The main result given in the thesis is that each of these schemes disguises some form of variable-order Markov model (VOMM), which is a relatively inexact model for text. In a variable-order Markov model, each symbol is predicted using a finite number of directly preceding symbols as a context. An important class of FSAs, called Finite Context Automata (FCAs) is defined, and is shown that FCAs implement a form of variable-order Markov modelling. Informally, an FCA is an FSA where the current state is determined by some finite number of immediately preceding input symbols. Three types of proof are used to show that text compression schemes use variable-order Markov modelling: (1) some schemes, such as Cleary and Witten's "Prediction by Partial Matching", use a VOMM by definition, (2) Cormack and Horspool's "Dynamic Markov Compression" scheme uses an FSA for prediction, and it is shown that the FSAs generated will always be FCAs, (3) a class of compression schemes called Greedy Macro (GM) schemes is defined, and a wide range of compression schemes, including Ziv-Lempel (LZ) coding, are shown to belong to that class. A construction is then given to generate an FSA equivalent to any GM scheme, and the FSA is shown to implement a form of variable-order Markov modelling. Because variable-order Markov models are only a crude model for text, the main conclusion of the thesis is that more powerful models, such as Pushdown Automata, combined with arithmetic coding, offer better compression than any existing schemes, and should be explored further. However, there is room for improvement in the compression and speed of some existing schemes, and this is explored as follows. The LZ schemes are currently regarded as the most practical, in that they achieve good compression, are usually very fast, and require relatively little memory to perform well. To study these schemes more closely, an explicit probabalistic symbol-wise model is given, which is equivalent to one of the LZ schemes, LZ77. This model is suitable for providing probabilities for character-by-character Huffman or arithmetic coding. Using the insight gained by examining the symbol-wise model, improvements have been found which can be reflected in LZ schemes, resulting in a scheme called LZB, which offers improved compression, and for which the choice of parameters is less critical. Experiments verify that LZB gives better compression than competing LZ schemes for a large number of texts. Although the time complexity for encoding using LZB and similar schemes is O(n) for a text of n characters, straightforward implementations are very slow. The time consuming step of these algorithms is a search for the longest string match. An algorithm is given which uses a binary search tree to find the longest string match, and experiments show that this results in a dramatic increase in encoding speed.