
Electronic data

  • 2023.banglalp-1.1

    Final published version, 104 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License


Offensive Language Identification in Transliterated and Code-Mixed Bangla

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published
  • Md Nishat Raihan
  • Umma Tanmoy
  • Anika Binte Islam
  • Kai North
  • Tharindu Ranasinghe
  • Antonios Anastasopoulos
  • Marcos Zampieri
Publication date: 7/12/2023
Host publication: Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
Editors: Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Farig Sadeque, Ruhul Amin
Publisher: Association for Computational Linguistics
Pages: 1-6
Number of pages: 6
ISBN (print): 9798891760585
Original language: English
Event: The First Workshop on Bangla Language Processing (BLP-2023) - Singapore
Duration: 7/12/2023 → …
https://blp-workshop.github.io/

Workshop

Workshop: The First Workshop on Bangla Language Processing (BLP-2023)
Country/Territory: Singapore
Period: 7/12/23 → …
Internet address: https://blp-workshop.github.io/

Abstract

Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliteration and code-mixing, linguistic phenomena common in multilingual societies that are a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID and evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT, achieve the best performance on this dataset.
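
The fine-tuning setup summarised in the abstract can be illustrated with a short, hedged sketch. This is not the authors' released code: the checkpoint name (GroNLP/hateBERT), the file names, and the column layout ("text" and "label") are assumptions for illustration, using the Hugging Face Transformers Trainer API for binary offensive/non-offensive classification.

```python
# A minimal sketch (assumptions noted above), not the authors' released code.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "GroNLP/hateBERT"  # assumed checkpoint; fBERT could be swapped in the same way

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Hypothetical local CSV splits with a "text" column (the comment)
# and a "label" column (0 = not offensive, 1 = offensive).
data = load_dataset(
    "csv",
    data_files={"train": "tb_olid_train.csv", "test": "tb_olid_test.csv"},
)

def tokenize(batch):
    # Pad/truncate the transliterated and code-mixed comments to a fixed length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="tb-olid-hatebert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
)

trainer.train()
print(trainer.evaluate())  # evaluation metrics on the held-out split
```

Fine-tuned in this way, English pre-trained offensive-language models such as fBERT and HateBERT are the configurations the abstract reports as performing best on TB-OLID.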