Electronic data

  • 2304.09972v1

    Submitted manuscript, 475 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

MasakhaNEWS: News Topic Classification for African languages

Research output: Working paper (Preprint)

Published

Standard

MasakhaNEWS: News Topic Classification for African languages. / Adelani, David Ifeoluwa; Chukwuneke, Chiamaka I.; Masiak, Marek et al.
2023.

Harvard

Adelani, DI, Chukwuneke, CI, Masiak, M, Azime, IA, Alabi, JO, Tonja, AL, Mwase, C, Ogundepo, O, Dossou, BFP, Oladipo, A, Nixdorf, D, Emezue, CC, al-azzawi, SS, Sibanda, BK, David, D, Ndolela, L, Mukiibi, J, Ajayi, TO, Ngoli, TM, Odhiambo, B, Mbonu, CE, Owodunni, AT, Obiefuna, NC, Muhammad, SH, Abdullahi, SS, Yigezu, MG, Gwadabe, T, Abdulmumin, I, Bame, MT, Awoyomi, OO, Shode, I, Adelani, TA, Kailani, HA, Omotayo, A-H, Adeeko, A, Abeeb, A, Aremu, A, Samuel, O, Siro, C, Kimotho, W, Ogbu, OR, Fanijo, S, Ojo, J, Awosan, OF, Guge, TK, Sari, ST, Nyatsine, P, Sidume, F, Yousuf, O, Oduwole, M, Kimanuka, U, Tshinu, KP, Diko, T, Nxakama, S, Johar, AT, Gebre, S, Mohamed, M, Mohamed, SA, Hassan, FM, Mehamed, MA, Ngabire, E & Stenetorp, P 2023 'MasakhaNEWS: News Topic Classification for African languages'. <https://arxiv.org/abs/2304.09972v1>

APA

Adelani, D. I., Chukwuneke, C. I., Masiak, M., Azime, I. A., Alabi, J. O., Tonja, A. L., Mwase, C., Ogundepo, O., Dossou, B. F. P., Oladipo, A., Nixdorf, D., Emezue, C. C., al-azzawi, S. S., Sibanda, B. K., David, D., Ndolela, L., Mukiibi, J., Ajayi, T. O., Ngoli, T. M., ... Stenetorp, P. (2023). MasakhaNEWS: News Topic Classification for African languages. https://arxiv.org/abs/2304.09972v1

Vancouver

Adelani DI, Chukwuneke CI, Masiak M, Azime IA, Alabi JO, Tonja AL et al. MasakhaNEWS: News Topic Classification for African languages. 2023 Apr 19.

BibTeX

@techreport{7ca7c4797ff14370a6284a3dd1b38a25,
title = "MasakhaNEWS: News Topic Classification for African languages",
abstract = "African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieved more than 90\% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.",
keywords = "cs.CL",
author = "Adelani, {David Ifeoluwa} and Chukwuneke, {Chiamaka I.} and Marek Masiak and Azime, {Israel Abebe} and Alabi, {Jesujoba Oluwadara} and Tonja, {Atnafu Lambebo} and Christine Mwase and Odunayo Ogundepo and Dossou, {Bonaventure F. P.} and Akintunde Oladipo and Doreen Nixdorf and Emezue, {Chris Chinenye} and al-azzawi, {Sana Sabah} and Sibanda, {Blessing K.} and Davis David and Lolwethu Ndolela and Jonathan Mukiibi and Ajayi, {Tunde Oluwaseyi} and Ngoli, {Tatiana Moteu} and Brian Odhiambo and Mbonu, {Chinedu E.} and Owodunni, {Abraham Toluwase} and Obiefuna, {Nnaemeka C.} and Muhammad, {Shamsuddeen Hassan} and Abdullahi, {Saheed Salahudeen} and Yigezu, {Mesay Gemeda} and Tajuddeen Gwadabe and Idris Abdulmumin and Bame, {Mahlet Taye} and Awoyomi, {Oluwabusayo Olufunke} and Iyanuoluwa Shode and Adelani, {Tolulope Anu} and Kailani, {Habiba Abdulganiy} and Abdul-Hakeem Omotayo and Adetola Adeeko and Afolabi Abeeb and Anuoluwapo Aremu and Olanrewaju Samuel and Clemencia Siro and Wangari Kimotho and Ogbu, {Onyekachi Raphael} and Samuel Fanijo and Jessica Ojo and Awosan, {Oyinkansola F.} and Guge, {Tadesse Kebede} and Sari, {Sakayo Toadoum} and Pamela Nyatsine and Freedmore Sidume and Oreen Yousuf and Mardiyyah Oduwole and Ussen Kimanuka and Tshinu, {Kanda Patrick} and Thina Diko and Siyanda Nxakama and Johar, {Abdulmejid Tuni} and Sinodos Gebre and Muhidin Mohamed and Mohamed, {Shafie Abdi} and Hassan, {Fuad Mire} and Mehamed, {Moges Ahmed} and Evrard Ngabire and Pontus Stenetorp",
note = "Accepted to AfricaNLP Workshop @ICLR 2023 (non-archival)",
year = "2023",
month = apr,
day = "19",
language = "English",
type = "WorkingPaper",

}

RIS

TY - UNPB

T1 - MasakhaNEWS

T2 - News Topic Classification for African languages

AU - Adelani, David Ifeoluwa

AU - Chukwuneke, Chiamaka I.

AU - Masiak, Marek

AU - Azime, Israel Abebe

AU - Alabi, Jesujoba Oluwadara

AU - Tonja, Atnafu Lambebo

AU - Mwase, Christine

AU - Ogundepo, Odunayo

AU - Dossou, Bonaventure F. P.

AU - Oladipo, Akintunde

AU - Nixdorf, Doreen

AU - Emezue, Chris Chinenye

AU - al-azzawi, Sana Sabah

AU - Sibanda, Blessing K.

AU - David, Davis

AU - Ndolela, Lolwethu

AU - Mukiibi, Jonathan

AU - Ajayi, Tunde Oluwaseyi

AU - Ngoli, Tatiana Moteu

AU - Odhiambo, Brian

AU - Mbonu, Chinedu E.

AU - Owodunni, Abraham Toluwase

AU - Obiefuna, Nnaemeka C.

AU - Muhammad, Shamsuddeen Hassan

AU - Abdullahi, Saheed Salahudeen

AU - Yigezu, Mesay Gemeda

AU - Gwadabe, Tajuddeen

AU - Abdulmumin, Idris

AU - Bame, Mahlet Taye

AU - Awoyomi, Oluwabusayo Olufunke

AU - Shode, Iyanuoluwa

AU - Adelani, Tolulope Anu

AU - Kailani, Habiba Abdulganiy

AU - Omotayo, Abdul-Hakeem

AU - Adeeko, Adetola

AU - Abeeb, Afolabi

AU - Aremu, Anuoluwapo

AU - Samuel, Olanrewaju

AU - Siro, Clemencia

AU - Kimotho, Wangari

AU - Ogbu, Onyekachi Raphael

AU - Fanijo, Samuel

AU - Ojo, Jessica

AU - Awosan, Oyinkansola F.

AU - Guge, Tadesse Kebede

AU - Sari, Sakayo Toadoum

AU - Nyatsine, Pamela

AU - Sidume, Freedmore

AU - Yousuf, Oreen

AU - Oduwole, Mardiyyah

AU - Kimanuka, Ussen

AU - Tshinu, Kanda Patrick

AU - Diko, Thina

AU - Nxakama, Siyanda

AU - Johar, Abdulmejid Tuni

AU - Gebre, Sinodos

AU - Mohamed, Muhidin

AU - Mohamed, Shafie Abdi

AU - Hassan, Fuad Mire

AU - Mehamed, Moges Ahmed

AU - Ngabire, Evrard

AU - Stenetorp, Pontus

N1 - Accepted to AfricaNLP Workshop @ICLR 2023 (non-archival)

PY - 2023/4/19

Y1 - 2023/4/19

N2 - African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieved more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.

KW - cs.CL

M3 - Preprint

BT - MasakhaNEWS

ER -