
Electronic data

  • 2025Chiamakaphd

    Final published version, 4.25 MB, PDF document

    Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License


Named entity recognition for African languages: a focus on the Igbo language

Research output: Thesis › Doctoral Thesis

Published

Standard

Named entity recognition for African languages: a focus on the Igbo language. / Chukwuneke, CI.
Lancaster University, 2025. 175 p.

Vancouver

Chukwuneke CI. Named entity recognition for African languages: a focus on the Igbo language. Lancaster University, 2025. 175 p. doi: 10.17635/lancaster/thesis/2789

Author

Bibtex

@phdthesis{2385efaa91a54880b9f44792d47161e6,
title = "Named entity recognition for African languages: a focus on the Igbo language",
abstract = "Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of research reveals that 95% of NLP efforts are concentrated on English and a few other languages, such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages. The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, has remained largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as IgboNER, have been unavailable, limiting the language{\textquoteright}s integration into broader computational applications. Prior to this work, the Igbo language lacked dedicated NER resources and a specialised language model essential for accurate information extraction and analysis, which has kept Igbo on the periphery of digital advancements in NLP. To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, aided by our semi-automatically created mapping dictionary. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across various data sizes indicated that while large transformer models significantly benefited the IgboNER task, fine-tuning a transformer model built from scratch with relatively little Igbo text data also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.",
keywords = "Igbo, named entity recognition, mapping dictionary, dataset, BERT models",
author = "CI Chukwuneke",
year = "2025",
doi = "10.17635/lancaster/thesis/2789",
language = "English",
publisher = "Lancaster University",
school = "Lancaster University",
}

RIS

TY - THES

T1 - Named entity recognition for African languages

T2 - a focus on the Igbo language

AU - Chukwuneke, CI

PY - 2025

Y1 - 2025

N2 - Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of research reveals that 95% of NLP efforts are concentrated on English and a few other languages, such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages. The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, has remained largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as IgboNER, have been unavailable, limiting the language’s integration into broader computational applications. Prior to this work, the Igbo language lacked dedicated NER resources and a specialised language model essential for accurate information extraction and analysis, which has kept Igbo on the periphery of digital advancements in NLP. To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, aided by our semi-automatically created mapping dictionary. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across various data sizes indicated that while large transformer models significantly benefited the IgboNER task, fine-tuning a transformer model built from scratch with relatively little Igbo text data also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.

AB - Named Entity Recognition (NER) is a crucial task for many downstream NLP applications, including text summarization, document indexing, question answering, classification, and machine translation. Analysis of research reveals that 95% of NLP efforts are concentrated on English and a few other languages, such as Japanese, German, and French, even though there are over 7,000 languages globally. Around 90% of African languages are considered under-resourced in NLP, highlighting the gap in resources for African languages. The work presented in this thesis significantly advances NER for low-resource languages, particularly African languages such as Igbo, which, despite having millions of speakers, has remained largely underrepresented in NLP research. Focusing on Igbo, this research addresses a critical gap: foundational tools and resources, such as IgboNER, have been unavailable, limiting the language’s integration into broader computational applications. Prior to this work, the Igbo language lacked dedicated NER resources and a specialised language model essential for accurate information extraction and analysis, which has kept Igbo on the periphery of digital advancements in NLP. To address this gap, we developed IgboBERT, the first transformer-based language model pre-trained from scratch on the Igbo language, to serve as a baseline model. We created a parallel English-Igbo corpus and used spaCy, an existing NER tool for the high-resource English language, to tag the English sentences. These tags were then transferred to Igbo using a projection method, aided by our semi-automatically created mapping dictionary. Additionally, we designed a framework for the creation of the IgboNER dataset, which can be extended to other low-resource languages.
We fine-tuned IgboBERT and several state-of-the-art models, including mBERT, XLM-R, and DistilBERT, for the downstream IgboNER task using transfer learning. Our evaluation across various data sizes indicated that while large transformer models significantly benefited the IgboNER task, fine-tuning a transformer model built from scratch with relatively little Igbo text data also produced commendable results. This work substantially contributes to IgboNLP and the broader African and low-resource NLP landscape.

KW - Igbo

KW - named entity recognition

KW - mapping dictionary

KW - dataset

KW - BERT models

U2 - 10.17635/lancaster/thesis/2789

DO - 10.17635/lancaster/thesis/2789

M3 - Doctoral Thesis

PB - Lancaster University

ER -