Accepted author manuscript, 133 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License
Final published version
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
}
TY - GEN
T1 - Arabic Dialect Identification in the Context of Bivalency and Code-Switching
AU - El Haj, Mahmoud
AU - Rayson, Paul Edward
AU - Aboelezz, Mariam
PY - 2018/5/9
Y1 - 2018/5/9
N2 - In this paper we use a novel approach towards Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans due to the fact that words are used interchangeably between dialects. The task of automatically identifying dialect is harder and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data which is costly and time consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, thus we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new methods classification accuracy can reach more than 76% and score well (66%) when tested on completely unseen data.
AB - In this paper we use a novel approach towards Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans due to the fact that words are used interchangeably between dialects. The task of automatically identifying dialect is harder and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data which is costly and time consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, thus we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new methods classification accuracy can reach more than 76% and score well (66%) when tested on completely unseen data.
KW - Arabic
KW - bivalency
KW - language identification
KW - dialects
KW - machine learning
KW - NLP
M3 - Conference contribution/Paper
SN - 9791095546009
SP - 3622
EP - 3627
BT - LREC 2018, Eleventh International Conference on Language Resources and Evaluation
A2 - Calzolari, Nicoletti
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Hasida, Koiti
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Moreno, Asuncion
A2 - Odijk, Jan
A2 - Piperidis, Stelios
A2 - Tokunaga, Takenobu
T2 - The 11th Edition of the Language Resources and Evaluation Conference
Y2 - 7 May 2018 through 12 May 2018
ER -