Final published version
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
}
TY - GEN
T1 - Automatic Restoration of Diacritics for Igbo Language
AU - Ezeani, Ignatius
AU - Hepple, Mark
AU - Onyenwe, Ikechukwu
A2 - Sojka, Petr
A2 - Horák, Aleš
A2 - Kopeček, Ivan
A2 - Pala, Karel
PY - 2016/1/1
Y1 - 2016/1/1
N2 - Igbo is a low-resource African language with orthographic and tonal diacritics, which capture distinctions between words that are important for both meaning and pronunciation, and hence of potential value for a range of language processing tasks. Such diacritics, however, are often largely absent from the electronic texts we might want to process, or assemble into corpora, and so the need arises for effective methods for automatic diacritic restoration for Igbo. In this paper, we experiment using an Igbo bible corpus, which is extensively marked for vowel distinctions, and partially for tonal distinctions, and attempt the task of reinstating these diacritics when they have been deleted. We investigate a number of word-level diacritic restoration methods, based on n-grams, under a closed-world assumption, achieving an accuracy of 98.83 % with our most effective method.
AB - Igbo is a low-resource African language with orthographic and tonal diacritics, which capture distinctions between words that are important for both meaning and pronunciation, and hence of potential value for a range of language processing tasks. Such diacritics, however, are often largely absent from the electronic texts we might want to process, or assemble into corpora, and so the need arises for effective methods for automatic diacritic restoration for Igbo. In this paper, we experiment using an Igbo bible corpus, which is extensively marked for vowel distinctions, and partially for tonal distinctions, and attempt the task of reinstating these diacritics when they have been deleted. We investigate a number of word-level diacritic restoration methods, based on n-grams, under a closed-world assumption, achieving an accuracy of 98.83 % with our most effective method.
U2 - 10.1007/978-3-319-45510-5
DO - 10.1007/978-3-319-45510-5
M3 - Conference contribution/Paper
SN - 9783319455099
SP - 198
EP - 205
BT - International Conference on Text, Speech, and Dialogue
PB - Springer
ER -