
Sinhala Encoder-only Language Models and Evaluation

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Sinhala Encoder-only Language Models and Evaluation. / Ranasinghe, Tharindu; Hettiarachchi, Hansi; Pathirana, Nadeesha Chathurangi Naradde Vidana et al.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ed. / Wanxiang Che; Joyce Nabende; Ekaterina Shutova; Mohammad Taher Pilehvar. Vienna, Austria: Association for Computational Linguistics, 2025. p. 8623-8636.


Harvard

Ranasinghe, T, Hettiarachchi, H, Pathirana, NCNV, Premasiri, D, Uyangodage, L, Nanomi Arachchige, I, Plum, A, Rayson, P & Mitkov, R 2025, Sinhala Encoder-only Language Models and Evaluation. in W Che, J Nabende, E Shutova & MT Pilehvar (eds), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, pp. 8623-8636. <https://aclanthology.org/2025.acl-long.422/>

APA

Ranasinghe, T., Hettiarachchi, H., Pathirana, N. C. N. V., Premasiri, D., Uyangodage, L., Nanomi Arachchige, I., Plum, A., Rayson, P., & Mitkov, R. (2025). Sinhala Encoder-only Language Models and Evaluation. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 8623-8636). Association for Computational Linguistics. https://aclanthology.org/2025.acl-long.422/

Vancouver

Ranasinghe T, Hettiarachchi H, Pathirana NCNV, Premasiri D, Uyangodage L, Nanomi Arachchige I et al. Sinhala Encoder-only Language Models and Evaluation. In Che W, Nabende J, Shutova E, Pilehvar MT, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics. 2025. p. 8623-8636.

Author

Ranasinghe, Tharindu ; Hettiarachchi, Hansi ; Pathirana, Nadeesha Chathurangi Naradde Vidana et al. / Sinhala Encoder-only Language Models and Evaluation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). editor / Wanxiang Che ; Joyce Nabende ; Ekaterina Shutova ; Mohammad Taher Pilehvar. Vienna, Austria : Association for Computational Linguistics, 2025. pp. 8623-8636

Bibtex

@inproceedings{f7dbe4232485424e96ad3f2f30f5048c,
title = "Sinhala Encoder-only Language Models and Evaluation",
abstract = "Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks for evaluating LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform popular multilingual LMs, such as XLM-R, and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE available with a public leaderboard, and we hope it will enable further advancements in developing and evaluating LMs for Sinhala.",
author = "Tharindu Ranasinghe and Hansi Hettiarachchi and Pathirana, {Nadeesha Chathurangi Naradde Vidana} and Damith Premasiri and Lasitha Uyangodage and {Nanomi Arachchige}, Isuri and Alistair Plum and Paul Rayson and Ruslan Mitkov",
year = "2025",
month = jul,
day = "1",
language = "English",
isbn = "9798891762510",
pages = "8623--8636",
editor = "Wanxiang Che and Joyce Nabende and Ekaterina Shutova and Pilehvar, {Mohammad Taher}",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
publisher = "Association for Computational Linguistics",
address = "Vienna, Austria",
url = "https://aclanthology.org/2025.acl-long.422/",
}

RIS

TY - GEN

T1 - Sinhala Encoder-only Language Models and Evaluation

AU - Ranasinghe, Tharindu

AU - Hettiarachchi, Hansi

AU - Pathirana, Nadeesha Chathurangi Naradde Vidana

AU - Premasiri, Damith

AU - Uyangodage, Lasitha

AU - Nanomi Arachchige, Isuri

AU - Plum, Alistair

AU - Rayson, Paul

AU - Mitkov, Ruslan

PY - 2025/7/1

Y1 - 2025/7/1

N2 - Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks for evaluating LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform popular multilingual LMs, such as XLM-R, and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE available with a public leaderboard, and we hope it will enable further advancements in developing and evaluating LMs for Sinhala.

AB - Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks for evaluating LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform popular multilingual LMs, such as XLM-R, and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE available with a public leaderboard, and we hope it will enable further advancements in developing and evaluating LMs for Sinhala.

M3 - Conference contribution/Paper

SN - 9798891762510

SP - 8623

EP - 8636

BT - Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A2 - Che, Wanxiang

A2 - Nabende, Joyce

A2 - Shutova, Ekaterina

A2 - Pilehvar, Mohammad Taher

PB - Association for Computational Linguistics

CY - Vienna, Austria

UR - https://aclanthology.org/2025.acl-long.422/

ER -