Light morphology and Arabic information retrieval.

Algarni, Mohammed

Files

Algarni_ PhD_2016_thesis.pdf (2.88 MB)

Type of content

Theses / Dissertations

UC permalink

http://hdl.handle.net/10092/12879
http://dx.doi.org/10.26021/1341

Thesis discipline

Computer Science

Degree name

Doctor of Philosophy

Publisher

University of Canterbury

Language

English

Date

2016

Authors

Algarni, Mohammed

Abstract

The chief purpose of this study is to investigate the impact of morphology on Arabic Information Retrieval (AIR). In doing so, different forms of the surface word have to be examined as indexing terms in order to learn which is the most effective in performance. Experiments are needed starting with the root all the way to the surface form so that we can evaluate the difference each selection makes. This has resulted in the development of two experimental stemmers for Modern Standard Arabic (MSA), one light and the other root-based, which will be referred to hereafter as the Simple Arabic Stemmer (SAS). The stemmers were based on the Quran morphology and constructed according to its rules. They conform to the Quran guidelines in terms of segmenting a word into its correct morphological combination (prefix-pattern-suffix). The reason for leveraging the Quran as a morphological knowledge base was that the Arabic morphological rules were documented according to the Quran relatively soon after it became known. Using the Text REtrieval Conference (TREC) 2002 Arabic corpus, which contains 383,872 documents, 75 topics, and 10,031 manually-judged documents, we tested our approach against two widely-used root stemmers, Khoja and Sebawai. In the experiments, our root algorithm achieved a higher Mean Average Precision (MAP), a 13% relative gain over the other stemmers. The Simple Arabic Stemmer outperformed both stemmers in producing more accurate roots for the TREC corpus. We demonstrated that, by placing a restriction on what prefix-pattern-suffix combinations are permissible on the surface, the stemming process is improved and fewer stemming errors are produced. Another experiment was conducted to measure the difference between the stem and the root as indexing terms. Because a root conflates so many stems under one form, precision degraded when the root was used as an indexing term. The results obtained favoured choosing the stem as an indexing term.

Rights

https://canterbury.libguides.com/rights/theses

Collections

Engineering: Theses and Dissertations
