Enhancing retrieval effectiveness of diacritisized Arabic passages using stemmer and thesaurus

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN; Conference contribution/Paper; peer-review



In this paper we discuss the enhancement of Arabic passage retrieval for both diacritisized and non-diacritisized text. Most previous work suggested that retrieval start by pre-processing the Arabic text to remove the diacritical marks (short vowels) in order to unify the text. In most cases, this process causes considerable ambiguity at the word level in the absence of context. However, searching for a word in diacritisized text requires typing and matching all of its diacritical marks, which is cumbersome and prevents users from finding, and hence retrieving, a valuable amount of text. The alternative is to ignore these marks and accept the resulting ambiguity. In this paper, we propose a passage retrieval approach that searches both diacritic and diacritic-less text by expanding the user's query. We applied a rule-based stemmer and compiled a large thesaurus for this purpose. We tested our approach on the text of the Quran, as an open-domain source of diacritisized text, using a set of 40 non-diacritical words obtained from testers. We present the results and discuss the future directions that the approach suggests for search engines.
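
The idea of matching a diacritic-less query against text that carries diacritics can be illustrated with a minimal Python sketch. This is not the stemmer or thesaurus described in the paper: it is a simplified illustration that expands a bare query into every marked variant seen in the corpus. The function names, the sample passages, and the choice of which Unicode marks to treat as diacritics (U+064B to U+0652 plus U+0670) are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumption: treat tanween, short vowels, shadda and sukun (U+064B-U+0652)
# plus the superscript alef (U+0670) as the diacritics to ignore.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the diacritical marks above removed."""
    return DIACRITICS.sub("", text)


def build_variant_index(passages):
    """Map each bare (mark-free) word to the marked forms seen in the corpus."""
    index = defaultdict(set)
    for passage in passages:
        for word in passage.split():
            index[strip_diacritics(word)].add(word)
    return index


def expand_query(query: str, index) -> set:
    """Expand a query word into its bare form plus all known marked variants."""
    bare = strip_diacritics(query)
    return index.get(bare, set()) | {bare}


def search(query: str, passages, index):
    """Return the passages containing any variant of the query word."""
    variants = expand_query(query, index)
    return [p for p in passages if any(w in variants for w in p.split())]


if __name__ == "__main__":
    # Hypothetical sample passages, not taken from the paper's test set.
    passages = [
        "كَتَبَ الوَلَدُ الدَّرْسَ",
        "هَذِهِ كُتُبٌ كَثِيرَةٌ",
        "ذَهَبَ الوَلَدُ",
    ]
    index = build_variant_index(passages)
    for hit in search("كتب", passages, index):
        print(hit)
```

In this sample the bare query matches both a verb form and a noun form that share the same consonants, which is the word-level ambiguity the abstract mentions. Narrowing such matches is where a stemmer and a thesaurus, as used in the paper, would come in.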