Final published version
Licence: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution to Journal/Magazine › Journal article › peer-review
TY - JOUR
T1 - MasakhaNER 2.0
T2 - Africa-centric Transfer Learning for Named Entity Recognition
AU - Adelani, David Ifeoluwa
AU - Neubig, Graham
AU - Ruder, Sebastian
AU - Rijhwani, Shruti
AU - Beukman, Michael
AU - Palen-Michel, Chester
AU - Lignos, Constantine
AU - Alabi, Jesujoba O.
AU - Muhammad, Shamsuddeen Hassan
AU - Nabende, Peter
AU - Dione, Cheikh M. Bamba
AU - Bukula, Andiswa
AU - Mabuya, Rooweither
AU - Dossou, Bonaventure F. P.
AU - Sibanda, Blessing
AU - Buzaaba, Happy
AU - Mukiibi, Jonathan
AU - Kalipe, Godson
AU - Mbaye, Derguene
AU - Taylor, Amelia
AU - Kabore, Fatoumata Ouoba
AU - Emezue, Chris Chinenye
AU - Anuoluwapo, Aremu
AU - Ogayo, Perez
AU - Gitau, Catherine
AU - Munkoh-Buabeng, Edwin
AU - Koagne, Victoire Memdjokam
AU - Tapo, Allahsera Auguste
AU - Macucwa, Tebogo
AU - Marivate, Vukosi
AU - Mboning, Elvis
AU - Gwadabe, Tajuddeen
AU - Adewumi, Tosin P.
AU - Ahia, Orevaoghene
AU - Nakatumba-Nabende, Joyce
AU - Mokono, Neo L.
AU - Ezeani, Ignatius
AU - Chukwuneke, Chiamaka
AU - Adeyemi, Mofetoluwa
AU - Hacheme, Gilles
AU - Abdulmumin, Idris
AU - Ogundepo, Odunayo
AU - Yousuf, Oreen
AU - Ngoli, Tatiana Moteu
AU - Klakow, Dietrich
PY - 2022/11/15
Y1 - 2022/11/15
N2 - African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.
AB - African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.
U2 - 10.48550/arXiv.2210.12391
DO - 10.48550/arXiv.2210.12391
M3 - Journal article
VL - abs/2210.12391
JO - arXiv
JF - arXiv
SN - 2331-8422
ER -