Electronic data

  • 2304.09972v1

    Submitted manuscript, 475 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

MasakhaNEWS: News Topic Classification for African languages

Research output: Working paper (Preprint)

Published
  • David Ifeoluwa Adelani
  • Marek Masiak
  • Israel Abebe Azime
  • Jesujoba Oluwadara Alabi
  • Atnafu Lambebo Tonja
  • Christine Mwase
  • Odunayo Ogundepo
  • Bonaventure F. P. Dossou
  • Akintunde Oladipo
  • Doreen Nixdorf
  • Chris Chinenye Emezue
  • Sana Sabah al-azzawi
  • Blessing K. Sibanda
  • Davis David
  • Lolwethu Ndolela
  • Jonathan Mukiibi
  • Tunde Oluwaseyi Ajayi
  • Tatiana Moteu Ngoli
  • Brian Odhiambo
  • Chinedu E. Mbonu
  • Abraham Toluwase Owodunni
  • Nnaemeka C. Obiefuna
  • Shamsuddeen Hassan Muhammad
  • Saheed Salahudeen Abdullahi
  • Mesay Gemeda Yigezu
  • Tajuddeen Gwadabe
  • Idris Abdulmumin
  • Mahlet Taye Bame
  • Oluwabusayo Olufunke Awoyomi
  • Iyanuoluwa Shode
  • Tolulope Anu Adelani
  • Habiba Abdulganiy Kailani
  • Abdul-Hakeem Omotayo
  • Adetola Adeeko
  • Afolabi Abeeb
  • Anuoluwapo Aremu
  • Olanrewaju Samuel
  • Clemencia Siro
  • Wangari Kimotho
  • Onyekachi Raphael Ogbu
  • Samuel Fanijo
  • Jessica Ojo
  • Oyinkansola F. Awosan
  • Tadesse Kebede Guge
  • Sakayo Toadoum Sari
  • Pamela Nyatsine
  • Freedmore Sidume
  • Oreen Yousuf
  • Mardiyyah Oduwole
  • Ussen Kimanuka
  • Kanda Patrick Tshinu
  • Thina Diko
  • Siyanda Nxakama
  • Abdulmejid Tuni Johar
  • Sinodos Gebre
  • Muhidin Mohamed
  • Shafie Abdi Mohamed
  • Fuad Mire Hassan
  • Moges Ahmed Mehamed
  • Evrard Ngabire
  • Pontus Stenetorp
Publication date: 19/04/2023
Original language: English

Abstract

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern-exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
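
The "more than 90%" figure follows directly from the two F1 scores quoted in the abstract. As a minimal sketch of that arithmetic (the function and constant names are illustrative and do not come from the paper):

```python
# F1 scores as quoted in the abstract.
FEW_SHOT_PET_F1 = 86.0      # PET with 10 examples per label
FULL_SUPERVISED_F1 = 92.6   # full supervised training


def relative_performance(few_shot: float, full: float) -> float:
    """Return the few-shot score as a percentage of the fully supervised score."""
    return 100.0 * few_shot / full


ratio = relative_performance(FEW_SHOT_PET_F1, FULL_SUPERVISED_F1)
print(f"{ratio:.1f}% of full supervised performance")  # prints "92.9% of full supervised performance"
```

In other words, the few-shot PET result recovers roughly 92.9% of the fully supervised score, which is consistent with the abstract's "more than 90%" claim.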

Bibliographic note

Accepted to AfricaNLP Workshop @ICLR 2023 (non-archival)