
Electronic data

  • 2025Chiamakaphd

    Final published version, 4.25 MB, PDF document

    Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License


Named entity recognition for African languages: a focus on the Igbo language

Research output: Thesis › Doctoral Thesis

Published

Standard

Named entity recognition for African languages: a focus on the Igbo language. / Chukwuneke, CI.
Lancaster University, 2025. 175 p.

Vancouver

Chukwuneke CI. Named entity recognition for African languages: a focus on the Igbo language. Lancaster University, 2025. 175 p. doi: 10.17635/lancaster/thesis/2789

Author

Bibtex

@phdthesis{2385efaa91a54880b9f44792d47161e6,
title = "Named entity recognition for African languages: a focus on the Igbo language",
abstract = "Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of research reveals that 95% of NLP efforts are concentrated on English and a few other languages, such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages. The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, has remained largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as IgboNER, have been unavailable, limiting the language{\textquoteright}s integration into broader computational applications. Prior to this work, the Igbo language lacked dedicated NER resources and a specialised language model essential for accurate information extraction and analysis, which has kept Igbo on the periphery of digital advancements in NLP. To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, aided by our semi-automatically created mapping dictionary. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across various data sizes indicated that while large transformer models significantly benefited the IgboNER task, fine-tuning a transformer model built from scratch with relatively little Igbo text data also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.",
keywords = "Igbo, named entity recognition, mapping dictionary, dataset, BERT models",
author = "CI Chukwuneke",
year = "2025",
doi = "10.17635/lancaster/thesis/2789",
language = "English",
publisher = "Lancaster University",
school = "Lancaster University",
}

RIS

TY - THES

T1 - Named entity recognition for African languages

T2 - a focus on the Igbo language

AU - Chukwuneke, CI

PY - 2025

Y1 - 2025

N2 - Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of research reveals that 95% of NLP efforts are concentrated on English and a few other languages, such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages. The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, has remained largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as IgboNER, have been unavailable, limiting the language’s integration into broader computational applications. Prior to this work, the Igbo language lacked dedicated NER resources and a specialised language model essential for accurate information extraction and analysis, which has kept Igbo on the periphery of digital advancements in NLP. To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, aided by our semi-automatically created mapping dictionary. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across various data sizes indicated that while large transformer models significantly benefited the IgboNER task, fine-tuning a transformer model built from scratch with relatively little Igbo text data also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.

AB - Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of research reveals that 95% of NLP efforts are concentrated on English and a few other languages, such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages. The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, has remained largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as IgboNER, have been unavailable, limiting the language’s integration into broader computational applications. Prior to this work, the Igbo language lacked dedicated NER resources and a specialised language model essential for accurate information extraction and analysis, which has kept Igbo on the periphery of digital advancements in NLP. To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, aided by our semi-automatically created mapping dictionary. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across various data sizes indicated that while large transformer models significantly benefited the IgboNER task, fine-tuning a transformer model built from scratch with relatively little Igbo text data also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.

KW - Igbo

KW - named entity recognition

KW - mapping dictionary

KW - dataset

KW - BERT models

U2 - 10.17635/lancaster/thesis/2789

DO - 10.17635/lancaster/thesis/2789

M3 - Doctoral Thesis

PB - Lancaster University

ER -