Measuring Short Text Reuse For The Urdu Language

Text available via DOI:

https://doi.org/10.1109/ACCESS.2017.2776842
Final published version
Available under license: CC BY

Keywords

Urdu Text Reuse Detection, Urdu Corpus, Natural Language Processing

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Measuring Short Text Reuse For The Urdu Language. / Sameen, Sara; Muhammad, Sharjeel; Nawab, Rao Muhammad Adeel et al.
In: IEEE Access, Vol. 6, 09.03.2018, p. 7412-7421.

Harvard

Sameen, S, Muhammad, S, Nawab, RMA, Rayson, PE & Muneer, I 2018, 'Measuring Short Text Reuse For The Urdu Language', IEEE Access, vol. 6, pp. 7412-7421. https://doi.org/10.1109/ACCESS.2017.2776842

APA

Sameen, S., Muhammad, S., Nawab, R. M. A., Rayson, P. E., & Muneer, I. (2018). Measuring Short Text Reuse For The Urdu Language. IEEE Access, 6, 7412-7421. https://doi.org/10.1109/ACCESS.2017.2776842

Vancouver

Sameen S, Muhammad S, Nawab RMA, Rayson PE, Muneer I. Measuring Short Text Reuse For The Urdu Language. IEEE Access. 2018 Mar 9;6:7412-7421. Epub 2017 Nov 22. doi: 10.1109/ACCESS.2017.2776842

Author

Sameen, Sara ; Muhammad, Sharjeel ; Nawab, Rao Muhammad Adeel et al. / Measuring Short Text Reuse For The Urdu Language. In: IEEE Access. 2018 ; Vol. 6. pp. 7412-7421.

Bibtex

@article{8bb0dfaaf88649b1927de5f5c3f4ab57,

title = "Measuring Short Text Reuse For The Urdu Language",

abstract = "Text reuse occurs when one borrows the text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.",

keywords = "Urdu Text Reuse Detection, Urdu Corpus, Natural Language Processing",

author = "Sara Sameen and Sharjeel Muhammad and Nawab, {Rao Muhammad Adeel} and Rayson, {Paul Edward} and Iqra Muneer",

year = "2018",

month = mar,

day = "9",

doi = "10.1109/ACCESS.2017.2776842",

language = "English",

volume = "6",

pages = "7412--7421",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

RIS

TY - JOUR

T1 - Measuring Short Text Reuse For The Urdu Language

AU - Sameen, Sara

AU - Muhammad, Sharjeel

AU - Nawab, Rao Muhammad Adeel

AU - Rayson, Paul Edward

AU - Muneer, Iqra

PY - 2018/3/9

Y1 - 2018/3/9

N2 - Text reuse occurs when one borrows the text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.

AB - Text reuse occurs when one borrows the text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this work, we propose one such resource for a significantly under-resourced language - Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu Short Text Reuse Corpus contains 2,684 short Urdu text pairs, manually labelled as verbatim (496), paraphrased (1,329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods with binary and multi-classification settings and a set of four classifiers. Output results show that Character n-gram Overlap using the J48 classifier outperforms other methods for the Urdu short text reuse detection task.

KW - Urdu Text Reuse Detection

KW - Urdu Corpus

KW - Natural Language Processing

U2 - 10.1109/ACCESS.2017.2776842

DO - 10.1109/ACCESS.2017.2776842

M3 - Journal article

VL - 6

SP - 7412

EP - 7421

JO - IEEE Access

JF - IEEE Access

SN - 2169-3536

ER -
