Submitted manuscript, 475 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License
Research output: Working paper › Preprint
TY - UNPB
T1 - MasakhaNEWS
T2 - News Topic Classification for African languages
AU - Adelani, David Ifeoluwa
AU - Chukwuneke, Chiamaka I.
AU - Masiak, Marek
AU - Azime, Israel Abebe
AU - Alabi, Jesujoba Oluwadara
AU - Tonja, Atnafu Lambebo
AU - Mwase, Christine
AU - Ogundepo, Odunayo
AU - Dossou, Bonaventure F. P.
AU - Oladipo, Akintunde
AU - Nixdorf, Doreen
AU - Emezue, Chris Chinenye
AU - al-azzawi, Sana Sabah
AU - Sibanda, Blessing K.
AU - David, Davis
AU - Ndolela, Lolwethu
AU - Mukiibi, Jonathan
AU - Ajayi, Tunde Oluwaseyi
AU - Ngoli, Tatiana Moteu
AU - Odhiambo, Brian
AU - Mbonu, Chinedu E.
AU - Owodunni, Abraham Toluwase
AU - Obiefuna, Nnaemeka C.
AU - Muhammad, Shamsuddeen Hassan
AU - Abdullahi, Saheed Salahudeen
AU - Yigezu, Mesay Gemeda
AU - Gwadabe, Tajuddeen
AU - Abdulmumin, Idris
AU - Bame, Mahlet Taye
AU - Awoyomi, Oluwabusayo Olufunke
AU - Shode, Iyanuoluwa
AU - Adelani, Tolulope Anu
AU - Kailani, Habiba Abdulganiy
AU - Omotayo, Abdul-Hakeem
AU - Adeeko, Adetola
AU - Abeeb, Afolabi
AU - Aremu, Anuoluwapo
AU - Samuel, Olanrewaju
AU - Siro, Clemencia
AU - Kimotho, Wangari
AU - Ogbu, Onyekachi Raphael
AU - Fanijo, Samuel
AU - Ojo, Jessica
AU - Awosan, Oyinkansola F.
AU - Guge, Tadesse Kebede
AU - Sari, Sakayo Toadoum
AU - Nyatsine, Pamela
AU - Sidume, Freedmore
AU - Yousuf, Oreen
AU - Oduwole, Mardiyyah
AU - Kimanuka, Ussen
AU - Tshinu, Kanda Patrick
AU - Diko, Thina
AU - Nxakama, Siyanda
AU - Johar, Abdulmejid Tuni
AU - Gebre, Sinodos
AU - Mohamed, Muhidin
AU - Mohamed, Shafie Abdi
AU - Hassan, Fuad Mire
AU - Mehamed, Moges Ahmed
AU - Ngabire, Evrard
AU - Stenetorp, Pontus
N1 - Accepted to AfricaNLP Workshop @ICLR 2023 (non-archival)
PY - 2023/4/19
Y1 - 2023/4/19
N2 - African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern-exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
KW - cs.CL
M3 - Preprint
BT - MasakhaNEWS
ER -
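
Note: the abstract above highlights zero-shot prompting of ChatGPT for news topic classification. As a rough illustration only, the minimal Python sketch below shows what such a setup can look like. It is not the authors' code: the prompt wording, model name, example headline, and label set are assumptions made for this example (consult the MasakhaNEWS dataset card for the actual topic labels), and it assumes the openai Python client (v1+) with an OPENAI_API_KEY set in the environment.

    # Minimal sketch of zero-shot news topic classification by prompting a chat model.
    # NOT the MasakhaNEWS authors' setup: prompt, model, headline, and labels below
    # are illustrative assumptions only.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Assumed label set for illustration; see the dataset card for the real one.
    LABELS = ["business", "entertainment", "health", "politics",
              "religion", "sports", "technology"]

    def classify_headline(headline: str, model: str = "gpt-3.5-turbo") -> str:
        """Ask the model to pick exactly one topic label for a news headline."""
        prompt = (
            "Classify the following news headline into exactly one of these topics: "
            + ", ".join(LABELS)
            + ".\nAnswer with the topic word only.\n\nHeadline: "
            + headline
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = response.choices[0].message.content.strip().lower()
        # Fall back to a sentinel if the model answers outside the label set.
        return answer if answer in LABELS else "unknown"

    if __name__ == "__main__":
        # Hypothetical Swahili headline, for illustration only.
        print(classify_headline("Simba walishinda mechi ya ligi kuu jana jioni"))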