Parallel text compression
Parallel texts come in two forms: the two texts are either in the same language or in two different languages. The most common origin of a pair of parallel texts is some translation process. A text may be available in different languages, for example an English translation of a French novel, or the translation of a technical paper from Russian. Also, several translations of a significant text may have been made into the same language, for example Classical texts or the Bible translated into English. The two cases differ in how often they occur. Same-language pairs (or parallel translations) are few, occurring mostly in translations of ancient documents such as the Bible or Classical writings, but they are significant and widely used. Two-language pairs appear wherever translations are held together with the source document, or where documents are maintained in several different languages. This occurs especially in multi-lingual environments such as multi-national organisations and the governments of multi-lingual countries (for example, Canada). Although the two cases are different, a two-language pair of texts can be transformed into a same-language pair by automatically translating one text into the language of the other. For the purposes of compression (see next section) this translation does not have to be stylistically perfect, so existing machine translation techniques could be used to convert a two-language pair into a same-language pair. This reduces the problem addressed by this report to the treatment of same-language texts. An informal definition of parallel texts with relevance to compression is "a pair of texts which say approximately the same thing using different words", or more formally, "a pair of natural-language texts with equivalent semantic content".
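To make the relevance to compression concrete, the following is a minimal sketch, not the method of this report: it assumes that one text of a same-language pair is already available to both compressor and decompressor, and uses it as a preset dictionary for Python's standard zlib module. The function names and the two sample sentences are hypothetical, chosen only for illustration.

```python
import zlib


def compress_with_reference(target: bytes, reference: bytes) -> bytes:
    """Compress `target`, letting DEFLATE reuse phrases found in `reference`."""
    # zdict primes the compressor with the reference text (an assumption of
    # this sketch: the reference is known on both sides).
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def decompress_with_reference(blob: bytes, reference: bytes) -> bytes:
    """Invert compress_with_reference; needs the same `reference` text."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical same-language pair: two renderings of the same statement.
text_a = b"The committee approved the budget for the coming year after a long debate."
text_b = b"After a long debate, the committee approved the budget for the next year."

alone = zlib.compress(text_b, 9)
with_reference = compress_with_reference(text_b, text_a)

print(len(text_b), len(alone), len(with_reference))
```

The second text compresses to fewer bytes when the first is supplied as a reference, because the wording the two texts share is encoded as back-references rather than literal characters. A dictionary of this kind only captures literal overlap; texts that say the same thing in entirely different words would gain nothing from it.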
Subjects: Field of Research::08 - Information and Computing Sciences::0801 - Artificial Intelligence and Image Processing::080107 - Natural Language Processing
- Engineering: Reports