Electronic data

  • 2024.lrec-main.1076

    Final published version, 229 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

NSina: A News Corpus for Sinhala

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN, Conference contribution/Paper, peer-review

Published
Publication date: 20/05/2024
Host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: ELRA and ICCL
Pages: 12307-12312
Number of pages: 6
ISBN (print): 9782493814104
Original language: English
Event: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20/05/2024 to 25/05/2024
https://lrec-coling-2024.org/

Conference

Conference: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Abbreviated title: LREC-COLING 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 to 25/05/24
Internet address: https://lrec-coling-2024.org/

Abstract

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala available to date.