Light morphology and Arabic information retrieval.
Thesis Discipline: Computer Science
Degree Grantor: University of Canterbury
Degree Name: Doctor of Philosophy
The chief purpose of this study is to investigate the impact of morphology on Arabic Information Retrieval (AIR). To do so, different forms of the surface word must be examined as indexing terms to learn which performs most effectively. Experiments are needed at every level, from the root through to the surface form, so that the difference each choice makes can be evaluated. This has resulted in the development of two experimental stemmers for Modern Standard Arabic (MSA), one light and the other root-based, which will be referred to hereafter as the Simple Arabic Stemmer (SAS). The stemmers were based on the morphology of the Quran and constructed according to its rules. They conform to the Quran guidelines in segmenting a word into its correct morphological combination (prefix-pattern-suffix). The Quran was used as a morphological knowledge base because the rules of Arabic morphology were documented according to the Quran relatively soon after it became known. Using the Text REtrieval Conference (TREC) 2002 Arabic corpus, which contains 383,872 documents, 75 topics, and 10,031 manually judged documents, we tested our approach against two widely used root stemmers, Khoja and Sebawai. In the experiments, our root algorithm produced a better Mean Average Precision (MAP), giving a 13% relative gain over the other stemmers. The Simple Arabic Stemmer outperformed both stemmers in producing more accurate roots for the TREC corpus. We demonstrated that restricting which prefix-pattern-suffix combinations are permissible on the surface improves the stemming process and produces fewer stemming errors. Another experiment was conducted to measure the difference between the stem and the root as indexing terms. Because a root conflates so many stems under one form, precision degraded when the root was used as an indexing term.
The results obtained favoured choosing the stem as an indexing term.
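For readers unfamiliar with the evaluation measure named above, the following is a minimal Python sketch of how Mean Average Precision and a relative gain are conventionally computed. It is not the evaluation code used in the thesis. The function names, the input layout (`runs` as ranked document IDs per topic, `qrels` as the set of relevant document IDs per topic), and the toy document IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one topic.

    Sums precision@k at every rank k that holds a relevant document,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP: the mean of per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    per_topic = [
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ]
    return sum(per_topic) / len(per_topic)


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score (0.13 means 13%)."""
    return (new_score - baseline_score) / baseline_score


# Toy data (hypothetical document IDs, not drawn from the TREC corpus).
toy_runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d2", "d1"]}
toy_qrels = {"t1": {"d1", "d3"}, "t2": {"d1"}}
toy_map = mean_average_precision(toy_runs, toy_qrels)
```

Under this definition, a "13% relative gain" means the ratio computed by `relative_gain` is 0.13, as opposed to an absolute difference of 0.13 in MAP.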