A unifying theory and improvements for existing approaches to text compression (1986)
View/Open
Type of Content
Theses / DissertationsThesis Discipline
Computer ScienceDegree Name
Doctor of PhilosophyPublisher
University of Canterbury. Computer ScienceCollections
Abstract
More than 40 different schemes for performing text compression have been proposed in the literature. Many of these schemes appear to use quite different approaches, such as Huffman coding, dictionary substitution, predictive modelling, and modelling with Finite State Automata (FSA). From the many schemes in the literature, a representative sample has been selected to include all schemes of current interest (i.e. schemes which are in popular use, or those which have been proposed recently). The main result given in the thesis is that each of these schemes disguises some form of variableorder Markov model (VOMM), which is a relatively inexact model for text. In a variableorder Markov model, each symbol is predicted using a finite number of directly preceding symbols as a context. An important class of FSAs, called Finite Context Automata (FCAs) is defined, and is shown that FCAs implement a form of variableorder Markov modelling. Informally, an FCA is an FSA where the current state is determined by some finite number of immediately preceding input symbols. Three types of proof are used to show that text compression schemes use variableorder Markov modelling: (1) some schemes, such as Cleary and Witten's "Prediction by Partial Matching", use a VOMM by definition, (2) Cormack and Horspool's "Dynamic Markov Compression" scheme uses an FSA for prediction, and it is shown that the FSAs generated will always be FCAs, (3) a class of compression schemes called Greedy Macro (GM) schemes is defined, and a wide range of compression schemes, including ZivLempel (LZ) coding, are shown to belong to that class. A construction is then given to generate an FSA equivalent to any GM scheme, and the FSA is shown to implement a form of variableorder Markov modelling. Because variableorder Markov models are only a crude model for text, the main conclusion of the thesis is that more powerful models, such as Pushdown Automata, combined with arithmetic coding, offer better compression than any existing schemes, and should be explored further. However, there is room for improvement in the compression and speed of some existing schemes, and this is explored as follows. The LZ schemes are currently regarded as the most practical, in that they achieve good compression, are usually very fast, and require relatively little memory to perform well. To study these schemes more closely, an explicit probabalistic symbolwise model is given, which is equivalent to one of the LZ schemes, LZ77. This model is suitable for providing probabilities for characterbycharacter Huffman or arithmetic coding. Using the insight gained by examining the symbolwise model, improvements have been found which can be reflected in LZ schemes, resulting in a scheme called LZB, which offers improved compression, and for which the choice of parameters is less critical. Experiments verify that LZB gives better compression than competing LZ schemes for a large number of texts. Although the time complexity for encoding using LZB and similar schemes is O(n) for a text of n characters, straightforward implementations are very slow. The time consuming step of these algorithms is a search for the longest string match. An algorithm is given which uses a binary search tree to find the longest string match, and experiments show that this results in a dramatic increase in encoding speed.
Rights
Copyright Timothy BellRelated items
Showing items related by title, author, creator and subject.

An improved approach in applying compressed sensing in parallel MR imaging
Wu, B.; Watts, R.; Millane, R.; Bones, P. (University of Canterbury. Electrical and Computer EngineeringUniversity of Canterbury. Physics and Astronomy, 2009)Both parallel MRI (pMRI, [1]) and compressed sensing (CS, [2]) allow image reconstruction from an undersampled data set. The former exploits data redundancy in a sparse transform domain representation whereas the latter ... 
A Unified Approach to Listbased Multiuser Detection in Overloaded Receivers
Krause, M.; Taylor, D.P.; Martin, P.A. (University of Canterbury. Electrical and Computer Engineering, 2008)A wireless communication system is overloaded when the number of transmitted signals exceeds the number of receive antennas. The presence of the resulting cochannel interference (CCI) under overload causes linear detection ... 
Improving Visualisation of Large MultiVariate Datasets: New HardwareBased Compression Algorithms and Rendering Techniques
Chernoglazov, Alexander Igorevich (University of Canterbury. Computer Science and Software Engineering, 2012)Spectral computed tomography (CT) is a novel medical imaging technique that involves simultaneously counting photons at several energy levels of the xray spectrum to obtain a single multivariate dataset. Visualisation ...