Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus

Computing and Communications

Associated organisational unit

Digital Health Group

Electronic data

habibi
Accepted author manuscript, 1.06 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus. / El-Haj, Mahmoud.
LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. European Language Resources Association (ELRA), 2020.

Harvard

El-Haj, M 2020, Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus. in LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. European Language Resources Association (ELRA), The 12th Edition of the Language Resources and Evaluation Conference (LREC2020), Marseille, France, 11/05/20.

APA

El-Haj, M. (2020). Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus. In LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. European Language Resources Association (ELRA).

Vancouver

El-Haj M. Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus. In LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. European Language Resources Association (ELRA). 2020

Author

El-Haj, Mahmoud. / Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus. LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. European Language Resources Association (ELRA), 2020.

Bibtex

@inproceedings{906da15a5dc04988b0c9fb18f78c50a9,

title = "Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus",

abstract = "This paper introduces Habibi the first Arabic Song Lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects for singers from 18 different Arabic countries. The lyrics are segmented into more than 500,000 sentences (song verses) with more than 3.5 million words. I provide the corpus in both comma separated value (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats. To experiment with the corpus I run extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks include the use of several classical machine learning and deep learning models utilising different word embeddings. For the binary dialect identification task the best performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) utilising a Continuous Bag of Words (CBOW) word embeddings model. The results overall show all classical and deep learning models to outperform our baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.",

author = "Mahmoud El-Haj",

year = "2020",

month = may,

day = "11",

language = "English",

booktitle = "LREC 2020, Twelfth International Conference on Language Resources and Evaluation",

publisher = "European Language Resources Association (ELRA)",

note = "The 12th Edition of the Language Resources and Evaluation Conference (LREC2020), LREC'20 ; Conference date: 11-05-2020 Through 16-05-2020",

url = "https://lrec2020.lrec-conf.org/en/",

}

RIS

TY - GEN

T1 - Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus

AU - El-Haj, Mahmoud

PY - 2020/5/11

Y1 - 2020/5/11

N2 - This paper introduces Habibi the first Arabic Song Lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects for singers from 18 different Arabic countries. The lyrics are segmented into more than 500,000 sentences (song verses) with more than 3.5 million words. I provide the corpus in both comma separated value (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats. To experiment with the corpus I run extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks include the use of several classical machine learning and deep learning models utilising different word embeddings. For the binary dialect identification task the best performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) utilising a Continuous Bag of Words (CBOW) word embeddings model. The results overall show all classical and deep learning models to outperform our baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.

AB - This paper introduces Habibi the first Arabic Song Lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects for singers from 18 different Arabic countries. The lyrics are segmented into more than 500,000 sentences (song verses) with more than 3.5 million words. I provide the corpus in both comma separated value (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats. To experiment with the corpus I run extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks include the use of several classical machine learning and deep learning models utilising different word embeddings. For the binary dialect identification task the best performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) utilising a Continuous Bag of Words (CBOW) word embeddings model. The results overall show all classical and deep learning models to outperform our baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.

M3 - Conference contribution/Paper

BT - LREC 2020, Twelfth International Conference on Language Resources and Evaluation

PB - European Language Resources Association (ELRA)

T2 - The 12th Edition of the Language Resources and Evaluation Conference (LREC2020)

Y2 - 11 May 2020 through 16 May 2020

ER -
