Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Electronic data

3592761
Accepted author manuscript, 691 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1145/3592761
Final published version

Keywords

General Computer Science

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair. / Sharjeel, Muhammad; Muneer, Iqra; Nosheen, Sumaira et al.
In: ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Vol. 22, No. 6, 30.06.2023, p. 1-22.

Harvard

Sharjeel, M, Muneer, I, Nosheen, S, Nawab, RMA & Rayson, P 2023, 'Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair', ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 22, no. 6, pp. 1-22. https://doi.org/10.1145/3592761

APA

Sharjeel, M., Muneer, I., Nosheen, S., Nawab, R. M. A., & Rayson, P. (2023). Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 22(6), 1-22. https://doi.org/10.1145/3592761

Vancouver

Sharjeel M, Muneer I, Nosheen S, Nawab RMA, Rayson P. Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). 2023 Jun 30;22(6):1-22. Epub 2023 May 1. doi: 10.1145/3592761

Author

Sharjeel, Muhammad ; Muneer, Iqra ; Nosheen, Sumaira et al. / Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair. In: ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). 2023 ; Vol. 22, No. 6. pp. 1-22.

Bibtex

@article{3aae1301da4447b39160cfd9a2271336,

title = "Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair",

abstract = "In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages but hard to detect. In previous studies, different corpora and methods have been developed for CLTRD at the sentence/passage level for the English-Urdu language pair. However, there is a lack of large standard corpora and methods for CLTRD for the English-Urdu language pair at the document level. To overcome this limitation, the significant contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains English to Urdu real cases of text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697) with the source text in English and the derived text in the Urdu language. Another contribution of this study is the evaluation of the TREU corpus using a diversified range of methods to show its usefulness and how it can be utilized in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e. Urdu.",

keywords = "General Computer Science",

author = "Muhammad Sharjeel and Iqra Muneer and Sumaira Nosheen and Nawab, {Rao Muhammad Adeel} and Paul Rayson",

year = "2023",

month = jun,

day = "30",

doi = "10.1145/3592761",

language = "English",

volume = "22",

pages = "1--22",

journal = "ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)",

issn = "2375-4699",

publisher = "Association for Computing Machinery (ACM)",

number = "6",

}

RIS

TY - JOUR

T1 - Cross-Lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

AU - Sharjeel, Muhammad

AU - Muneer, Iqra

AU - Nosheen, Sumaira

AU - Nawab, Rao Muhammad Adeel

AU - Rayson, Paul

PY - 2023/6/30

Y1 - 2023/6/30

N2 - In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages but hard to detect. In previous studies, different corpora and methods have been developed for CLTRD at the sentence/passage level for the English-Urdu language pair. However, there is a lack of large standard corpora and methods for CLTRD for the English-Urdu language pair at the document level. To overcome this limitation, the significant contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains English to Urdu real cases of text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697) with the source text in English and the derived text in the Urdu language. Another contribution of this study is the evaluation of the TREU corpus using a diversified range of methods to show its usefulness and how it can be utilized in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e. Urdu.

AB - In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages but hard to detect. In previous studies, different corpora and methods have been developed for CLTRD at the sentence/passage level for the English-Urdu language pair. However, there is a lack of large standard corpora and methods for CLTRD for the English-Urdu language pair at the document level. To overcome this limitation, the significant contribution of this study is the development of a large benchmark cross-lingual (English-Urdu) text reuse corpus, called the TREU (Text Reuse for English-Urdu) corpus. It contains English to Urdu real cases of text reuse at the document level. The corpus is manually labelled into three categories (Wholly Derived = 672, Partially Derived = 888, and Non Derived = 697) with the source text in English and the derived text in the Urdu language. Another contribution of this study is the evaluation of the TREU corpus using a diversified range of methods to show its usefulness and how it can be utilized in the development of automatic methods for measuring cross-lingual (English-Urdu) text reuse at the document level. The best evaluation results, for both binary (F1 = 0.78) and ternary (F1 = 0.66) classification tasks, are obtained using a combination of all Translation plus Mono-lingual Analysis (T+MA) based methods. The TREU corpus is publicly available to promote CLTRD research in an under-resourced language, i.e. Urdu.

KW - General Computer Science

U2 - 10.1145/3592761

DO - 10.1145/3592761

M3 - Journal article

VL - 22

SP - 1

EP - 22

JO - ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

JF - ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

SN - 2375-4699

IS - 6

ER -
