A unifying theory and improvements for existing approaches to text compression (1986)
Type of Content: Theses / Dissertations
Thesis Discipline: Computer Science
Degree Name: Doctor of Philosophy
Publisher: University of Canterbury. Computer Science
Abstract
More than 40 different schemes for performing text compression have been proposed in the literature. Many of these schemes appear to use quite different approaches, such as Huffman coding, dictionary substitution, predictive modelling, and modelling with Finite State Automata (FSA). From the many schemes in the literature, a representative sample has been selected to include all schemes of current interest (i.e. schemes which are in popular use, or those which have been proposed recently). The main result given in the thesis is that each of these schemes disguises some form of variable-order Markov model (VOMM), which is a relatively inexact model for text. In a variable-order Markov model, each symbol is predicted using a finite number of directly preceding symbols as a context. An important class of FSAs, called Finite Context Automata (FCAs), is defined, and it is shown that FCAs implement a form of variable-order Markov modelling. Informally, an FCA is an FSA where the current state is determined by some finite number of immediately preceding input symbols. Three types of proof are used to show that text compression schemes use variable-order Markov modelling: (1) some schemes, such as Cleary and Witten's "Prediction by Partial Matching", use a VOMM by definition; (2) Cormack and Horspool's "Dynamic Markov Compression" scheme uses an FSA for prediction, and it is shown that the FSAs generated will always be FCAs; (3) a class of compression schemes called Greedy Macro (GM) schemes is defined, and a wide range of compression schemes, including Ziv-Lempel (LZ) coding, are shown to belong to that class. A construction is then given to generate an FSA equivalent to any GM scheme, and the FSA is shown to implement a form of variable-order Markov modelling.
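The central idea of a variable-order Markov model — predicting each symbol from up to k immediately preceding symbols, falling back to shorter contexts when a longer one has not been seen — can be illustrated with a minimal sketch. This is not code from the thesis; the class name and fallback rule are illustrative choices (the fallback is in the spirit of Prediction by Partial Matching).

```python
from collections import defaultdict

class VariableOrderMarkov:
    """Illustrative variable-order Markov model: each symbol is
    predicted from up to max_order immediately preceding symbols,
    falling back to shorter contexts when the longer one is unseen."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        # context string -> {symbol -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, text):
        """Record symbol counts for every context length 0..max_order."""
        for i, sym in enumerate(text):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                self.counts[text[i - k:i]][sym] += 1

    def predict(self, context):
        """Return a symbol -> probability map using the longest
        previously observed suffix of `context` (possibly empty)."""
        for k in range(min(self.max_order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            if ctx in self.counts:
                seen = self.counts[ctx]
                total = sum(seen.values())
                return {s: c / total for s, c in seen.items()}
        return {}
```

Fed to an arithmetic coder, the probabilities returned by `predict` would drive the actual compression; the model above only shows the context-fallback mechanism.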
Because variable-order Markov models are only a crude model for text, the main conclusion of the thesis is that more powerful models, such as Pushdown Automata, combined with arithmetic coding, offer better compression than any existing schemes, and should be explored further. However, there is room for improvement in the compression and speed of some existing schemes, and this is explored as follows. The LZ schemes are currently regarded as the most practical, in that they achieve good compression, are usually very fast, and require relatively little memory to perform well. To study these schemes more closely, an explicit probabilistic symbol-wise model is given, which is equivalent to one of the LZ schemes, LZ77. This model is suitable for providing probabilities for character-by-character Huffman or arithmetic coding. Using the insight gained by examining the symbol-wise model, improvements have been found which can be reflected in LZ schemes, resulting in a scheme called LZB, which offers improved compression, and for which the choice of parameters is less critical. Experiments verify that LZB gives better compression than competing LZ schemes for a large number of texts. Although the time complexity for encoding using LZB and similar schemes is O(n) for a text of n characters, straightforward implementations are very slow. The time-consuming step of these algorithms is a search for the longest string match. An algorithm is given which uses a binary search tree to find the longest string match, and experiments show that this results in a dramatic increase in encoding speed.
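The longest-string-match step that dominates LZ encoding time can be sketched as follows. This is the straightforward linear scan of the window — the slow baseline — not the binary-search-tree structure the thesis proposes to speed it up; the function name and interface are illustrative.

```python
def longest_match(window, lookahead):
    """Find the longest prefix of `lookahead` that occurs in `window`.
    Returns (offset, length) of the best match, with length 0 if no
    character of the lookahead matches. Naive O(|window| * |lookahead|)
    scan; an LZ77-style coder would emit (offset, length) pairs from it."""
    best_off, best_len = 0, 0
    for start in range(len(window)):
        length = 0
        while (length < len(lookahead)
               and start + length < len(window)
               and window[start + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_off, best_len = start, length
    return best_off, best_len
```

Replacing this scan with a binary search tree keyed on window substrings, as the thesis describes, is what yields the reported dramatic increase in encoding speed, since each search then costs roughly the depth of the tree rather than a pass over the whole window.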
Rights
Copyright Timothy Bell
Related items
Showing items related by title, author, creator and subject.

Animation of text compression algorithms
Wilson, Timothy David (University of Canterbury. Computer Science, 1992)
It has been said that, there is no particular mystery in animation ... it's very simple and like anything that is simple, it is about the hardest thing in the world to do. Text compression is about finding ways of representing ...
Pattern matching in compressed texts and images
Bell, T.; Adjeroh, D.; Mukherjee, A. (University of Canterbury. Computer Science and Software Engineering, 2001)
This paper provides a survey of techniques for pattern matching in compressed text and images. Normally compressed data needs to be decompressed before it is processed, but if the compression has been done in the right ...
Pattern Matching in Compressed Text and Images
Bell, Tim; Adjeroh, Don; Mukherjee, Amar (University of Canterbury, 2001)