Compression of parallel texts
The world-wide use of digital storage and communications devices is increasing the need to make texts available in multiple languages. To minimise the cost of storing and transmitting multiple translations of a text, one could store the text in just one language, from which other translations can be created. Unfortunately, the quality of machine translation techniques is not good enough for this to be feasible. An alternative is to store a compressed form of translated versions of a text, taking advantage of the availability of the original text. The original text provides some of the semantic content of the text that is to be compressed, and therefore makes it possible for compression to be more efficient than if that information were not available. This paper reports investigations into the use of a parallel text to represent its translated version compactly. We begin with an experiment to evaluate the information content of a text when a parallel translation is available. This is achieved by having human subjects guess texts letter by letter, with and without a parallel translation. The perceived information content of a text can be determined from the way subjects make their guesses. The design and results of this experiment are described. The main conclusion is that while the text is considerably more predictable with the aid of a parallel translation, there is a surprising amount of information introduced by the translation. Insights obtained from this experiment are then applied in the design of a mechanical system for compressing parallel texts. The system stores one translation of a text intact, and then compresses further translations of the text with the aid of the original. The method described is able to compress texts significantly better than is possible without the aid of a parallel text. Aspects of the design are also applicable to future compressors that might take advantage of the semantic content of a text to obtain better compression.
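As an illustration of how guess data can be turned into an information estimate, the sketch below applies Shannon's classic guessing-game bounds to a sequence of guess counts (how many attempts a subject needed for each letter). This is an assumption for illustration only: the abstract does not state which estimator the report uses, and the function name `guess_entropy_bounds` and the sample data are hypothetical.

```python
import math
from collections import Counter


def guess_entropy_bounds(guess_counts):
    """Estimate (lower, upper) bounds on entropy in bits per letter.

    guess_counts[k] is the number of attempts needed to guess letter k.
    Uses Shannon's guessing-game bounds (an assumed estimator, not
    necessarily the one used in the report).
    """
    n = len(guess_counts)
    freq = Counter(guess_counts)
    max_rank = max(freq)
    # q[i - 1] = fraction of letters guessed correctly on attempt i.
    # One extra trailing entry (always 0) stands in for q_{max_rank + 1}.
    q = [freq.get(i, 0) / n for i in range(1, max_rank + 2)]
    # Upper bound: entropy of the distribution of guess counts.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over i of i * (q_i - q_{i+1}) * log2(i).
    # Shannon's derivation assumes q is non-increasing in i; small
    # samples may violate this, so treat the value as indicative only.
    lower = sum(
        i * (q[i - 1] - q[i]) * math.log2(i) for i in range(1, max_rank + 1)
    )
    return lower, upper


# Hypothetical guess counts for the same passage, for illustration only.
without_parallel = [1, 3, 2, 1, 5, 1, 2, 4]
with_parallel = [1, 1, 2, 1, 1, 1, 1, 2]

for label, counts in [("without", without_parallel), ("with", with_parallel)]:
    lo, hi = guess_entropy_bounds(counts)
    print(f"{label} parallel text: {lo:.2f} to {hi:.2f} bits per letter")
```

Comparing the bounds for the two conditions gives a rough measure of how much of a text's information a parallel translation already supplies.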
Subjects: Fields of Research::280000 Information, Computing and Communication Sciences::280200 Artificial Intelligence and Signal and Image Processing::280205 Text processing
- Engineering: Reports