SOLD: Sinhala offensive language dataset

Text available via DOI:

https://doi.org/10.1007/s10579-024-09723-1
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

SOLD: Sinhala offensive language dataset. / Ranasinghe, Tharindu ; Nanomi Arachchige, Isuri ; Dola Mullage, Damith et al.
In: Language Resources and Evaluation, Vol. 59, 06.03.2025, p. 297-337.

Harvard

Ranasinghe, T, Nanomi Arachchige, I, Dola Mullage, D, Silva, K, Hettiarachchi, H, Uyangodage, L & Zampieri, M 2025, 'SOLD: Sinhala offensive language dataset', Language Resources and Evaluation, vol. 59, pp. 297-337. https://doi.org/10.1007/s10579-024-09723-1

APA

Ranasinghe, T., Nanomi Arachchige, I., Dola Mullage, D., Silva, K., Hettiarachchi, H., Uyangodage, L., & Zampieri, M. (2025). SOLD: Sinhala offensive language dataset. Language Resources and Evaluation, 59, 297-337. https://doi.org/10.1007/s10579-024-09723-1

Vancouver

Ranasinghe T, Nanomi Arachchige I, Dola Mullage D, Silva K, Hettiarachchi H, Uyangodage L et al. SOLD: Sinhala offensive language dataset. Language Resources and Evaluation. 2025 Mar 6;59:297-337. Epub 2024 Mar 6. doi: 10.1007/s10579-024-09723-1

Author

Ranasinghe, Tharindu ; Nanomi Arachchige, Isuri ; Dola Mullage, Damith et al. / SOLD: Sinhala offensive language dataset. In: Language Resources and Evaluation. 2025 ; Vol. 59. pp. 297-337.

Bibtex

@article{35b98fa9e13f4c05a93aa00efbf3051d,

title = "SOLD: Sinhala offensive language dataset",

abstract = "The widespread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, the research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.",

author = "Tharindu Ranasinghe and {Nanomi Arachchige}, Isuri and {Dola Mullage}, Damith and Kanishka Silva and Hansi Hettiarachchi and Lasitha Uyangodage and Marcos Zampieri",

year = "2025",

month = mar,

day = "6",

doi = "10.1007/s10579-024-09723-1",

language = "English",

volume = "59",

pages = "297--337",

journal = "Language Resources and Evaluation",

issn = "1574-020X",

publisher = "Springer Netherlands",

}

RIS

TY - JOUR

T1 - SOLD: Sinhala offensive language dataset

AU - Ranasinghe, Tharindu

AU - Nanomi Arachchige, Isuri

AU - Dola Mullage, Damith

AU - Silva, Kanishka

AU - Hettiarachchi, Hansi

AU - Uyangodage, Lasitha

AU - Zampieri, Marcos

PY - 2025/3/6

Y1 - 2025/3/6

N2 - The widespread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, the research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.

AB - The widespread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, the research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter annotated as offensive and not offensive at both sentence-level and token-level, improving the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.

U2 - 10.1007/s10579-024-09723-1

DO - 10.1007/s10579-024-09723-1

M3 - Journal article

VL - 59

SP - 297

EP - 337

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

ER -
