CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments

Text available via DOI:

https://doi.org/10.1002/asi.24074
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments. / Muneer, Iqra; Muhammad, Sharjeel; Iqbal, Muntaha et al.
In: Journal of the Association for Information Science and Technology, Vol. 70, No. 7, 01.07.2019, p. 729-741.

Harvard

Muneer, I, Muhammad, S, Iqbal, M, Nawab, RMA & Rayson, PE 2019, 'CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments', Journal of the Association for Information Science and Technology, vol. 70, no. 7, pp. 729-741. https://doi.org/10.1002/asi.24074

APA

Muneer, I., Muhammad, S., Iqbal, M., Nawab, R. M. A., & Rayson, P. E. (2019). CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments. Journal of the Association for Information Science and Technology, 70(7), 729-741. https://doi.org/10.1002/asi.24074

Vancouver

Muneer I, Muhammad S, Iqbal M, Nawab RMA, Rayson PE. CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments. Journal of the Association for Information Science and Technology. 2019 Jul 1;70(7):729-741. Epub 2018 Nov 19. doi: 10.1002/asi.24074

Author

Muneer, Iqra ; Muhammad, Sharjeel ; Iqbal, Muntaha et al. / CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments. In: Journal of the Association for Information Science and Technology. 2019 ; Vol. 70, No. 7. pp. 729-741.

Bibtex

@article{781bee121d9b47639732717362adf325,

title = "CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments",

abstract = "Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi‐lingual content on the Web has increased cross‐language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large‐scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross‐language sentence/passage level text reuse corpus for the English‐Urdu language pair. The Cross‐Language English‐Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories that is near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono‐lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross‐language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross‐language text reuse detection systems for the English‐Urdu language pair.",

author = "Iqra Muneer and Sharjeel Muhammad and Muntaha Iqbal and Nawab, {Rao Muhammad Adeel} and Rayson, {Paul Edward}",

year = "2019",

month = jul,

day = "1",

doi = "10.1002/asi.24074",

language = "English",

volume = "70",

pages = "729--741",

journal = "Journal of the Association for Information Science and Technology",

issn = "0002-8231",

publisher = "John Wiley and Sons Inc.",

number = "7",

}

RIS

TY - JOUR

T1 - CLEU- A Cross-Language-Urdu Corpus and Benchmark For Text Reuse Experiments

AU - Muneer, Iqra

AU - Muhammad, Sharjeel

AU - Iqbal, Muntaha

AU - Nawab, Rao Muhammad Adeel

AU - Rayson, Paul Edward

PY - 2019/7/1

Y1 - 2019/7/1

N2 - Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi‐lingual content on the Web has increased cross‐language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large‐scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross‐language sentence/passage level text reuse corpus for the English‐Urdu language pair. The Cross‐Language English‐Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories that is near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono‐lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross‐language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross‐language text reuse detection systems for the English‐Urdu language pair.

AB - Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi‐lingual content on the Web has increased cross‐language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large‐scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross‐language sentence/passage level text reuse corpus for the English‐Urdu language pair. The Cross‐Language English‐Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories that is near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono‐lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1=0.732 binary, f1=0.552 ternary classification) indicate that it is harder to detect cross‐language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross‐language text reuse detection systems for the English‐Urdu language pair.

U2 - 10.1002/asi.24074

DO - 10.1002/asi.24074

M3 - Journal article

VL - 70

SP - 729

EP - 741

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 0002-8231

IS - 7

ER -
