Text Compression Evaluation
Degree Grantor: University of Canterbury
Degree Name: Bachelor of Science with Honours
Data compression is a set of techniques aiming to represent information with less data. It is used in the areas of information storage and transmission. Text compression is the subset of data compression dealing with lossless compression. The term lossless indicates that the original data can be reconstructed exactly rather than approximately. This report is confined to the evaluation of text compression techniques only.

A large number of compression algorithms have been proposed and implemented. It is possible to evaluate and compare these based on a number of criteria. The principal three are the amount of compression, the amount of memory required, and the speed. These factors can be studied analytically or empirically; this report deals exclusively with empirical studies, which can complement and confirm analytical results.

Comparison techniques are likely to be useful to people in a number of different areas. Researchers in compression need to determine whether proposed refinements are of practical use. Developers of new algorithms need to compare their performance with that of previously developed algorithms. Users of compression programs would like to compare them in order to select the one most suited to their requirements. Likewise, developers of systems incorporating compression techniques would like to choose the technique that best suits their system. For example, real-time compression applications are likely to prefer symmetric algorithms, which have similar compression and decompression speeds (Bryan, 1995).

In all of the above applications it is important to have reasonable confidence in evaluation figures and performance comparisons. Here confidence indicates that published results from one experiment should be useful in some wider context, and should in principle remain valid for a variety of kinds of data that could be encountered. In practice, it is impossible to devise an evaluation method that meets this requirement exactly.
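The three criteria above can be illustrated with a minimal empirical evaluation harness. This is only a sketch of the idea, not a method from the report: it uses Python's standard `zlib` as a stand-in compressor, measures the amount of compression as a ratio, times compression and decompression separately (relevant to the symmetric-algorithm point), and verifies the lossless property by exact reconstruction.

```python
import time
import zlib

def evaluate(data: bytes, level: int = 6) -> dict:
    """Measure compression ratio and speed for one input (sketch only)."""
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_time = time.perf_counter() - t0

    # Lossless: the original must be reconstructed exactly, not approximately.
    assert restored == data

    return {
        "ratio": len(compressed) / len(data),  # < 1.0 means the data shrank
        "compress_s": compress_time,
        "decompress_s": decompress_time,
    }

# Repetitive text has obvious pattern, so it compresses well.
sample = b"the quick brown fox jumps over the lazy dog " * 100
result = evaluate(sample)
```

A real study would run such a harness over a whole corpus of files and report aggregate figures; memory usage, the third criterion, would need a separate measurement mechanism.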
The number of possible inputs is effectively unlimited. For example, even if all files were restricted to a length of 10 bytes with each byte drawn from the 128-symbol ASCII alphabet, there would be 128¹⁰, or about 10²¹, possible files. In practice there is no such restriction on input size. If all possible files are considered, files cannot be decreased in size on average; they will be expanded on average. However, there is a significant semantic restriction: we tend to compress useful data, and useful data tends to have some pattern. This excludes most of the possible files from consideration, because they are essentially random and thus unlikely to be used as input to compression algorithms. Nevertheless, the number of possible inputs remains very large. The problem is how to produce a result which reflects performance on all possible future inputs. One solution would be to take a random sample; the technique of representing a large population by a relatively small random sample is statistically well understood. However, this encounters obvious practical difficulties. The population is constantly changing as files are created, modified, and deleted, and it is widely distributed across many different computer systems. Even if these difficulties could be overcome, data and information are essentially dynamic: there is no way of knowing whether (for example) a particular file type which may be widely used in the future is represented in the sample. The solution used here is to select a representative sample from some of the available files. The Canterbury Corpus is developed in this way. The popular Calgary corpus, which has been widely used for text compression evaluation but was selected arbitrarily, is compared with the new corpus; this comparison attempts to determine the possible effects of such an arbitrary selection.
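The counting claims above can be checked directly. This short illustration (an addition, not part of the report) computes the 128¹⁰ figure and states the pigeonhole reason why no lossless scheme can shrink every input: there are strictly fewer bit strings shorter than n bits than there are n-bit inputs, so some inputs must map to outputs at least as long.

```python
# Number of 10-byte files over a 128-symbol alphabet: 128**10 = 2**70,
# which is about 1.18e21, i.e. on the order of 10**21 as stated.
n_files = 128 ** 10

def shorter_outputs(n_bits: int) -> int:
    """Count distinct bit strings strictly shorter than n_bits bits:
    sum of 2**k for k = 0..n_bits-1, which equals 2**n_bits - 1."""
    return 2 ** n_bits - 1

n = 80  # a 10-byte file is 80 bits
# Fewer short outputs than inputs, so a lossless code that shrinks
# some inputs must expand others; on average files cannot shrink.
assert shorter_outputs(n) < 2 ** n
```

The same argument scales to any file length, which is why compression only pays off on the patterned, non-random files that people actually store.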
The remainder of this report surveys current comparison techniques, and goes on to discuss various issues related to the evaluation of text compression techniques. These issues inform the development of the Canterbury Corpus. The method used to develop the Canterbury Corpus, and the resulting corpus, are then presented.
Subjects: Field of Research::01 - Mathematical Sciences
- Engineering: Reports