Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of existing research reveals that around 95% of NLP efforts are concentrated on English and a few other languages such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages.
The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, remains largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as an IgboNER dataset, have been unavailable, limiting the language's integration into broader computational applications. Prior to this work, Igbo lacked both dedicated NER resources and a specialised language model, tools essential for accurate information extraction and analysis, and this has kept the language on the periphery of digital advancements in NLP.
To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, supported by a semi-automatically created English-Igbo mapping dictionary that facilitated the tag transfer. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
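To make the projection step concrete, the sketch below illustrates the general idea under simplifying assumptions: spaCy's off-the-shelf English model tags an English sentence, and the resulting entity labels are copied onto the Igbo words that a bilingual mapping dictionary links them to. The toy sentence, the three-entry dictionary, and the project_tags helper are hypothetical illustrations of the approach, not the thesis pipeline or its actual dictionary.

```python
# Minimal sketch of annotation projection via a mapping dictionary (illustrative only).
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # existing high-resource English NER model

# Hypothetical fragment of a semi-automatically built English->Igbo mapping dictionary
mapping = {"Chinua": "Chinua", "Achebe": "Achebe", "Nigeria": "Naịjịrịa"}

def project_tags(english_sentence, igbo_tokens):
    """Tag the English sentence with spaCy and transfer BIO entity labels
    to the Igbo tokens that the mapping dictionary links them to."""
    doc = nlp(english_sentence)
    igbo_labels = {tok: "O" for tok in igbo_tokens}
    for ent in doc.ents:
        for i, word in enumerate(ent.text.split()):
            target = mapping.get(word)
            if target in igbo_labels:
                prefix = "B-" if i == 0 else "I-"
                igbo_labels[target] = prefix + ent.label_
    return [(tok, igbo_labels[tok]) for tok in igbo_tokens]

print(project_tags("Chinua Achebe was born in Nigeria.",
                   ["Chinua", "Achebe", "ka", "a", "mụrụ", "na", "Naịjịrịa", "."]))
```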
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across a range of training-data sizes indicated that while the large pre-trained transformer models benefited the IgboNER task significantly, fine-tuning a transformer model pre-trained from scratch on relatively little Igbo text also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.
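As an illustration of the fine-tuning setup described above, the following sketch runs a single token-classification training step on mBERT with the Hugging Face Transformers library; the label set, the toy Igbo sentence, and the hyperparameters are assumptions for the example rather than the configuration used in the thesis. Swapping model_name for XLM-R, DistilBERT, or a locally trained IgboBERT checkpoint follows the same pattern.

```python
# Illustrative sketch (not the exact thesis code) of fine-tuning a pre-trained
# transformer for the IgboNER token-classification task.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumed tag set
model_name = "bert-base-multilingual-cased"  # mBERT; other checkpoints work the same way

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# One toy IgboNER training example (word-level BIO tags), purely illustrative.
words = ["Chinua", "Achebe", "bi", "na", "Naịjịrịa"]
word_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to sub-word tokens; special tokens get -100 (ignored by the loss).
tag_ids = [-100 if i is None else labels.index(word_tags[i]) for i in enc.word_ids()]
enc["labels"] = torch.tensor([tag_ids])

# A single gradient step stands in for the full fine-tuning loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc).loss
loss.backward()
optimizer.step()
print(f"toy training loss: {loss.item():.4f}")
```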