Electronic data

  • 2024.lrec-main.1076

    Final published version, 229 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

NSina: A News Corpus for Sinhala

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

NSina: A News Corpus for Sinhala. / Hettiarachchi, Hansi; Dola Mullage, Damith; Uyangodage, Lasitha et al.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ed. / Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. ELRA and ICCL, 2024. p. 12307-12312.

Harvard

Hettiarachchi, H, Dola Mullage, D, Uyangodage, L & Ranasinghe, T 2024, NSina: A News Corpus for Sinhala. in N Calzolari, M-Y Kan, V Hoste, A Lenci, S Sakti & N Xue (eds), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, pp. 12307-12312, The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 20/05/24. <https://aclanthology.org/2024.lrec-main.1076/>

APA

Hettiarachchi, H., Dola Mullage, D., Uyangodage, L., & Ranasinghe, T. (2024). NSina: A News Corpus for Sinhala. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 12307-12312). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1076/

Vancouver

Hettiarachchi H, Dola Mullage D, Uyangodage L, Ranasinghe T. NSina: A News Corpus for Sinhala. In Calzolari N, Kan MY, Hoste V, Lenci A, Sakti S, Xue N, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL. 2024. p. 12307-12312

Author

Hettiarachchi, Hansi ; Dola Mullage, Damith ; Uyangodage, Lasitha et al. / NSina : A News Corpus for Sinhala. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). editor / Nicoletta Calzolari ; Min-Yen Kan ; Veronique Hoste ; Alessandro Lenci ; Sakriani Sakti ; Nianwen Xue. ELRA and ICCL, 2024. pp. 12307-12312

Bibtex

@inproceedings{391bbe5d4c804c66a746c3b83eff7e85,
title = "NSina: A News Corpus for Sinhala",
abstract = "The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala, available up to date.",
author = "Hansi Hettiarachchi and {Dola Mullage}, Damith and Lasitha Uyangodage and Tharindu Ranasinghe",
year = "2024",
month = may,
day = "20",
language = "English",
isbn = "9782493814104",
pages = "12307--12312",
editor = "Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
publisher = "ELRA and ICCL",
note = "The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 ; Conference date: 20-05-2024 Through 25-05-2024",
url = "https://lrec-coling-2024.org/",
}

RIS

TY  - GEN
T1  - NSina
T2  - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
AU  - Hettiarachchi, Hansi
AU  - Dola Mullage, Damith
AU  - Uyangodage, Lasitha
AU  - Ranasinghe, Tharindu
PY  - 2024/5/20
Y1  - 2024/5/20
N2  - The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala, available up to date.
AB  - The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSina, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSina aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSina is the largest news corpus for Sinhala, available up to date.
M3  - Conference contribution/Paper
SN  - 9782493814104
SP  - 12307
EP  - 12312
BT  - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
A2  - Calzolari, Nicoletta
A2  - Kan, Min-Yen
A2  - Hoste, Veronique
A2  - Lenci, Alessandro
A2  - Sakti, Sakriani
A2  - Xue, Nianwen
PB  - ELRA and ICCL
Y2  - 20 May 2024 through 25 May 2024
ER  -