TY - JOUR
T1 - On the effects of machine translation on offensive language detection
AU - Dmonte, Alphaeus
AU - Satapara, Shrey
AU - Alsudais, Rehab
AU - Ranasinghe, Tharindu
AU - Zampieri, Marcos
PY - 2025/1/9
Y1 - 2025/1/9
N2 - Machine translation (MT) is widely used to translate content on social media platforms with the aim of improving accessibility. Much of the content circulated on social media is user-generated and often contains non-standard spelling, hashtags, and emojis that pose challenges to MT systems. This leads to many mistranslated instances being presented to users of these platforms, hindering their understanding of content written in other languages. In this paper, we investigate the impact of MT on offensive language identification. We posit that MT and potential mistranslations have an important and largely under-explored impact on social media tasks such as sentiment analysis and offensive language identification. We create MT-Offense, a novel dataset containing English originals and translations into Arabic, Hindi, Marathi, Sinhala, and Spanish produced by multiple open-access neural machine translation systems. We evaluate the performance of various offensive language identification models on both original and MT content in different training and test set combinations, reporting F1 scores. Our results show that (1) offensive language identification models perform better on original data than on MT data, and (2) the use of MT data in training helps models better identify offensive language in MT content compared to models trained exclusively on original data.
AB - Machine translation (MT) is widely used to translate content on social media platforms with the aim of improving accessibility. Much of the content circulated on social media is user-generated and often contains non-standard spelling, hashtags, and emojis that pose challenges to MT systems. This leads to many mistranslated instances being presented to users of these platforms, hindering their understanding of content written in other languages. In this paper, we investigate the impact of MT on offensive language identification. We posit that MT and potential mistranslations have an important and largely under-explored impact on social media tasks such as sentiment analysis and offensive language identification. We create MT-Offense, a novel dataset containing English originals and translations into Arabic, Hindi, Marathi, Sinhala, and Spanish produced by multiple open-access neural machine translation systems. We evaluate the performance of various offensive language identification models on both original and MT content in different training and test set combinations, reporting F1 scores. Our results show that (1) offensive language identification models perform better on original data than on MT data, and (2) the use of MT data in training helps models better identify offensive language in MT content compared to models trained exclusively on original data.
KW - Machine translation
KW - Offensive language identification
KW - Multilinguality
U2 - 10.1007/s13278-024-01398-4
DO - 10.1007/s13278-024-01398-4
M3 - Journal article
VL - 14
JO - Social Network Analysis and Mining
JF - Social Network Analysis and Mining
SN - 1869-5469
IS - 1
M1 - 242
ER -