Arabic Dialect Identification in the Context of Bivalency and Code-Switching

Electronic data

237_Paper
Accepted author manuscript, 133 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Keywords

Arabic, bivalency, language identification, dialects, machine learning, NLP

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Arabic Dialect Identification in the Context of Bivalency and Code-Switching. / El Haj, Mahmoud; Rayson, Paul Edward; Aboelezz, Mariam.
LREC 2018, Eleventh International Conference on Language Resources and Evaluation. ed. / Nicoletta Calzolari; Khalid Choukri; Christopher Cieri; Thierry Declerck; Sara Goggi; Koiti Hasida; Hitoshi Isahara; Bente Maegaard; Joseph Mariani; Helene Mazo; Asuncion Moreno; Jan Odijk; Stelios Piperidis; Takenobu Tokunaga. 2018. p. 3622-3627.

Harvard

El Haj, M, Rayson, PE & Aboelezz, M 2018, Arabic Dialect Identification in the Context of Bivalency and Code-Switching. in N Calzolari, K Choukri, C Cieri, T Declerck, S Goggi, K Hasida, H Isahara, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk, S Piperidis & T Tokunaga (eds), LREC 2018, Eleventh International Conference on Language Resources and Evaluation. pp. 3622-3627, The 11th Edition of the Language Resources and Evaluation Conference, Miyazaki, Japan, 7/05/18. <http://www.lrec-conf.org/proceedings/lrec2018/pdf/237.pdf>

APA

El Haj, M., Rayson, P. E., & Aboelezz, M. (2018). Arabic Dialect Identification in the Context of Bivalency and Code-Switching. In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, & T. Tokunaga (Eds.), LREC 2018, Eleventh International Conference on Language Resources and Evaluation (pp. 3622-3627). http://www.lrec-conf.org/proceedings/lrec2018/pdf/237.pdf

Vancouver

El Haj M, Rayson PE, Aboelezz M. Arabic Dialect Identification in the Context of Bivalency and Code-Switching. In Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors, LREC 2018, Eleventh International Conference on Language Resources and Evaluation. 2018. p. 3622-3627.

Author

El Haj, Mahmoud ; Rayson, Paul Edward ; Aboelezz, Mariam. / Arabic Dialect Identification in the Context of Bivalency and Code-Switching. LREC 2018, Eleventh International Conference on Language Resources and Evaluation. editor / Nicoletta Calzolari ; Khalid Choukri ; Christopher Cieri ; Thierry Declerck ; Sara Goggi ; Koiti Hasida ; Hitoshi Isahara ; Bente Maegaard ; Joseph Mariani ; Helene Mazo ; Asuncion Moreno ; Jan Odijk ; Stelios Piperidis ; Takenobu Tokunaga. 2018. pp. 3622-3627

Bibtex

@inproceedings{e0637beabc1a47e193ebccb9e3ccfc51,

title = "Arabic Dialect Identification in the Context of Bivalency and Code-Switching",

abstract = "In this paper we use a novel approach towards Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans due to the fact that words are used interchangeably between dialects. The task of automatically identifying dialect is harder and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, thus we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new method's classification accuracy can reach more than 76% and that it scores well (66%) when tested on completely unseen data.",

keywords = "Arabic, bivalency, language identification, dialects, machine learning, NLP",

author = "{El Haj}, Mahmoud and Rayson, {Paul Edward} and Mariam Aboelezz",

year = "2018",

month = may,

day = "9",

language = "English",

isbn = "9791095546009",

pages = "3622--3627",

editor = "Nicoletta Calzolari and Khalid Choukri and Cieri, {Christopher} and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Odijk, {Jan} and Stelios Piperidis and Takenobu Tokunaga",

booktitle = "LREC 2018, Eleventh International Conference on Language Resources and Evaluation",

note = "The 11th Edition of the Language Resources and Evaluation Conference, LREC2018 ; Conference date: 07-05-2018 Through 12-05-2018",

url = "http://lrec2018.lrec-conf.org/en/",

}

RIS

TY - GEN

T1 - Arabic Dialect Identification in the Context of Bivalency and Code-Switching

AU - El Haj, Mahmoud

AU - Rayson, Paul Edward

AU - Aboelezz, Mariam

PY - 2018/5/9

Y1 - 2018/5/9

AB - In this paper we use a novel approach towards Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans due to the fact that words are used interchangeably between dialects. The task of automatically identifying dialect is harder and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, thus we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new method's classification accuracy can reach more than 76% and that it scores well (66%) when tested on completely unseen data.

KW - Arabic

KW - bivalency

KW - language identification

KW - dialects

KW - machine learning

KW - NLP

M3 - Conference contribution/Paper

SN - 9791095546009

SP - 3622

EP - 3627

BT - LREC 2018, Eleventh International Conference on Language Resources and Evaluation

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Hasida, Koiti

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Helene

A2 - Moreno, Asuncion

A2 - Odijk, Jan

A2 - Piperidis, Stelios

A2 - Tokunaga, Takenobu

T2 - The 11th Edition of the Language Resources and Evaluation Conference

Y2 - 7 May 2018 through 12 May 2018

ER -
