Parallel text compression

Nevill, C.

Files

nevill_1989_report.pdf (2.73 MB)

Type of content

Discussion / Working Papers

UC permalink

http://hdl.handle.net/10092/12660

Publisher

University of Canterbury

Date

1989

Authors

Nevill, C.

Abstract

Parallel texts come in two forms: the two texts are either in the same language or in two different languages. The most common origin of a pair of parallel texts is some translation process. A text may be available in different languages, for example an English translation of a French novel, or the translation of a technical paper from Russian. Also, several translations of significant texts may have been made into the same language, for example Classical texts or the Bible translated into English. The two cases are quite different: same-language pairs (or parallel translations) are few, occurring mostly in translations of ancient documents such as the Bible or Classical writings, but they are significant and widely used. Two-language pairs appear wherever translations are held together with the source document, or where documents are maintained in several different languages. This may occur especially in multi-lingual environments such as multi-national organisations and governments of multi-lingual countries (for example, Canada). While the two cases are different, a two-language pair of texts can be transformed into a same-language pair by the automatic translation of one text into the language of the other. For the purposes of compression (see next section) this translation does not have to be stylistically perfect, so existing techniques of machine translation could be used to convert a two-language pair into a same-language pair. This reduces the problem addressed by this report to the treatment of same-language texts. An informal definition of parallel texts with relevance to compression is "a pair of texts which say approximately the same thing using different words", or more formally, "a pair of natural-language texts with equivalent semantic content".

ANZSRC fields of research

Field of Research::08 - Information and Computing Sciences::0801 - Artificial Intelligence and Image Processing::080107 - Natural Language Processing

Rights

https://canterbury.libguides.com/rights/theses

Collections

Engineering: Reports
