Final published version
Licence: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review
}
TY - CONF
T1 - Open-Source Thesaurus Development for Under-Resourced Languages
T2 - Language, Data and Knowledge
AU - Khallaf, Nouran
AU - Arfon, Elin
AU - El-Haj, Mahmoud
AU - Morris, Jonathan
AU - Knight, Dawn
AU - Rayson, Paul
AU - Hammouda, Tymaa
AU - Jarrar, Mustafa
N1 - Conference code: 4
PY - 2023/9/14
Y1 - 2023/9/14
N2 - This paper introduces an open-access, user-friendly online thesaurus for the Welsh language, aimed at enriching digital resources for Welsh speakers and learners. Utilising advances in Natural Language Processing (NLP), our approach combines pre-existing word embeddings, a Welsh semantic tagger, and human evaluation to establish related terms. In this case, an initial list of 250 words was expanded by adding 6,953 synonyms provided by linguists, creating a more extensive foundation for building the gold-standards. With this expanded list, when a user queries a particular word, the thesaurus presents all of its synonyms, allowing them to choose from a wider range of options. This is especially helpful when a user is unsure of the exact word they want to use or wants to explore different ways to express a concept. The resulting thesaurus offers a comprehensive, reliable resource for Welsh language users, fostering enhanced communication and expression. Our work promotes Welsh NLP and showcases NLP’s potential to support under-resourced languages. The thesaurus will be accessible via a bilingual website, and the accompanying Python code will be available in a bilingual, public GitHub repository, and it will be available as a web service. Our approach presents a more efficient, cost-effective method for thesaurus creation, with potential applicability to other under-resourced languages.
AB - This paper introduces an open-access, user-friendly online thesaurus for the Welsh language, aimed at enriching digital resources for Welsh speakers and learners. Utilising advances in Natural Language Processing (NLP), our approach combines pre-existing word embeddings, a Welsh semantic tagger, and human evaluation to establish related terms. In this case, an initial list of 250 words was expanded by adding 6,953 synonyms provided by linguists, creating a more extensive foundation for building the gold-standards. With this expanded list, when a user queries a particular word, the thesaurus presents all of its synonyms, allowing them to choose from a wider range of options. This is especially helpful when a user is unsure of the exact word they want to use or wants to explore different ways to express a concept. The resulting thesaurus offers a comprehensive, reliable resource for Welsh language users, fostering enhanced communication and expression. Our work promotes Welsh NLP and showcases NLP’s potential to support under-resourced languages. The thesaurus will be accessible via a bilingual website, and the accompanying Python code will be available in a bilingual, public GitHub repository, and it will be available as a web service. Our approach presents a more efficient, cost-effective method for thesaurus creation, with potential applicability to other under-resourced languages.
M3 - Conference paper
Y2 - 12 September 2023 through 15 September 2023
ER -