COUNTER - COrpus of Urdu News TExt Reuse - Research Portal

Electronic data

counter-lre-v4
Rights statement: The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-016-9367-2
Accepted author manuscript, 730 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1007/s10579-016-9367-2
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Keywords

mono-lingual text reuse, Urdu news corpus, Urdu text reuse detection, corpus generation

COUNTER - COrpus of Urdu News TExt Reuse

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Sharjeel Muhammad
Rao Muhammad Adeel Nawab
Paul Edward Rayson

Journal publication date	09/2017
Journal	Language Resources and Evaluation
Issue number	3
Volume	51
Number of pages	27
Pages (from-to)	777-803
Publication Status	Published
Early online date	10/09/16
Original language	English

Abstract

Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are making the reuse of text not only more common in society but also harder to detect. A major hindrance in the development and evaluation of existing and new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under-resourced languages, i.e., Urdu. The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non-derived. We also apply a number of similarity estimation methods to our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and for the Urdu language specifically.

Bibliographic note

The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-016-9367-2

Related datasets

COrpus of Urdu News TExt Reuse (COUNTER)
