
Electronic data

  • counter-lre-v4

    Rights statement: The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-016-9367-2

    Accepted author manuscript, 730 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License


COUNTER - COrpus of Urdu News TExt Reuse

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

COUNTER - COrpus of Urdu News TExt Reuse. / Muhammad, Sharjeel; Nawab, Rao Muhammad Adeel; Rayson, Paul Edward.
In: Language Resources and Evaluation, Vol. 51, No. 3, 09.2017, p. 777-803.

Harvard

Muhammad, S, Nawab, RMA & Rayson, PE 2017, 'COUNTER - COrpus of Urdu News TExt Reuse', Language Resources and Evaluation, vol. 51, no. 3, pp. 777-803. https://doi.org/10.1007/s10579-016-9367-2

APA

Muhammad, S., Nawab, R. M. A., & Rayson, P. E. (2017). COUNTER - COrpus of Urdu News TExt Reuse. Language Resources and Evaluation, 51(3), 777-803. https://doi.org/10.1007/s10579-016-9367-2

Vancouver

Muhammad S, Nawab RMA, Rayson PE. COUNTER - COrpus of Urdu News TExt Reuse. Language Resources and Evaluation. 2017 Sept;51(3):777-803. Epub 2016 Sept 10. doi: 10.1007/s10579-016-9367-2

Author

Muhammad, Sharjeel; Nawab, Rao Muhammad Adeel; Rayson, Paul Edward. / COUNTER - COrpus of Urdu News TExt Reuse. In: Language Resources and Evaluation. 2017; Vol. 51, No. 3. pp. 777-803.

Bibtex

@article{dce3d37efbaa4ec094ef476ddd4f2ab8,
title = "COUNTER - COrpus of Urdu News TExt Reuse",
abstract = "Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under resourced languages i.e. Urdu. The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. We also apply a number of similarity estimation methods on our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for Urdu language.",
keywords = "mono-lingual text reuse, Urdu news corpus, Urdu text reuse detection, corpus generation",
author = "Sharjeel Muhammad and Nawab, {Rao Muhammad Adeel} and Rayson, {Paul Edward}",
note = "The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-016-9367-2",
year = "2017",
month = sep,
doi = "10.1007/s10579-016-9367-2",
language = "English",
volume = "51",
pages = "777--803",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "3",
}

RIS

TY - JOUR

T1 - COUNTER - COrpus of Urdu News TExt Reuse

AU - Muhammad, Sharjeel

AU - Nawab, Rao Muhammad Adeel

AU - Rayson, Paul Edward

N1 - The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-016-9367-2

PY - 2017/9

Y1 - 2017/9

N2 - Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under resourced languages i.e. Urdu. The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. We also apply a number of similarity estimation methods on our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for Urdu language.

AB - Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under resourced languages i.e. Urdu. The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. We also apply a number of similarity estimation methods on our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for Urdu language.

KW - mono-lingual text reuse

KW - Urdu news corpus

KW - Urdu text reuse detection

KW - corpus generation

U2 - 10.1007/s10579-016-9367-2

DO - 10.1007/s10579-016-9367-2

M3 - Journal article

VL - 51

SP - 777

EP - 803

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 3

ER -