Standard
Sinhala Encoder-only Language Models and Evaluation. /
Ranasinghe, Tharindu; Hettiarachchi, Hansi; Pathirana, Nadeesha Chathurangi Naradde Vidana et al.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ed. / Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar. Vienna, Austria: Association for Computational Linguistics, 2025. p. 8623-8636.
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
Harvard
Ranasinghe, T, Hettiarachchi, H, Pathirana, NCNV, Premasiri, D, Uyangodage, L, Nanomi Arachchige, I, Plum, A, Rayson, P & Mitkov, R 2025, Sinhala Encoder-only Language Models and Evaluation. in W Che, J Nabende, E Shutova & MT Pilehvar (eds), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, pp. 8623-8636. <https://aclanthology.org/2025.acl-long.422/>
APA
Ranasinghe, T., Hettiarachchi, H., Pathirana, N. C. N. V., Premasiri, D., Uyangodage, L., Nanomi Arachchige, I., Plum, A., Rayson, P., & Mitkov, R. (2025). Sinhala Encoder-only Language Models and Evaluation. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 8623-8636). Association for Computational Linguistics. https://aclanthology.org/2025.acl-long.422/
Vancouver
Ranasinghe T, Hettiarachchi H, Pathirana NCNV, Premasiri D, Uyangodage L, Nanomi Arachchige I et al. Sinhala Encoder-only Language Models and Evaluation. In Che W, Nabende J, Shutova E, Pilehvar MT, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics. 2025. p. 8623-8636
BibTeX
@inproceedings{f7dbe4232485424e96ad3f2f30f5048c,
title = "Sinhala Encoder-only Language Models and Evaluation",
abstract = "Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show the Sinhala LMs trained in this paper outperform the popular multilingual LMs, such as XLM-R and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE publicly available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.",
author = "Tharindu Ranasinghe and Hansi Hettiarachchi and Pathirana, {Nadeesha Chathurangi Naradde Vidana} and Damith Premasiri and Lasitha Uyangodage and {Nanomi Arachchige}, Isuri and Alistair Plum and Paul Rayson and Ruslan Mitkov",
year = "2025",
month = jul,
day = "1",
language = "English",
isbn = "9798891762510",
pages = "8623--8636",
editor = "Wanxiang Che and Joyce Nabende and Ekaterina Shutova and Pilehvar, {Mohammad Taher}",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
publisher = "Association for Computational Linguistics",
}
RIS
TY - GEN
T1 - Sinhala Encoder-only Language Models and Evaluation
AU - Ranasinghe, Tharindu
AU - Hettiarachchi, Hansi
AU - Pathirana, Nadeesha Chathurangi Naradde Vidana
AU - Premasiri, Damith
AU - Uyangodage, Lasitha
AU - Nanomi Arachchige, Isuri
AU - Plum, Alistair
AU - Rayson, Paul
AU - Mitkov, Ruslan
PY - 2025/7/1
Y1 - 2025/7/1
N2 - Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show the Sinhala LMs trained in this paper outperform the popular multilingual LMs, such as XLM-R and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE publicly available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.
AB - Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show the Sinhala LMs trained in this paper outperform the popular multilingual LMs, such as XLM-R and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE publicly available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.
M3 - Conference contribution/Paper
SN - 9798891762510
SP - 8623
EP - 8636
BT - Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics
CY - Vienna, Austria
ER -