Electronic data

  • WSD

    Rights statement: The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-018-9438-7

    Accepted author manuscript, 603 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

A word sense disambiguation corpus for Urdu

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

A word sense disambiguation corpus for Urdu. / Saeed, Ali; Nawab, Rao Muhammad Adeel; Stevenson, Mark et al.
In: Language Resources and Evaluation, Vol. 53, No. 3, 01.09.2019, p. 397–418.

Harvard

Saeed, A, Nawab, RMA, Stevenson, M & Rayson, P 2019, 'A word sense disambiguation corpus for Urdu', Language Resources and Evaluation, vol. 53, no. 3, pp. 397–418. https://doi.org/10.1007/s10579-018-9438-7

APA

Saeed, A., Nawab, R. M. A., Stevenson, M., & Rayson, P. (2019). A word sense disambiguation corpus for Urdu. Language Resources and Evaluation, 53(3), 397–418. https://doi.org/10.1007/s10579-018-9438-7

Vancouver

Saeed A, Nawab RMA, Stevenson M, Rayson P. A word sense disambiguation corpus for Urdu. Language Resources and Evaluation. 2019 Sep 1;53(3):397–418. Epub 2018 Nov 24. doi: 10.1007/s10579-018-9438-7

Author

Saeed, Ali ; Nawab, Rao Muhammad Adeel ; Stevenson, Mark et al. / A word sense disambiguation corpus for Urdu. In: Language Resources and Evaluation. 2019 ; Vol. 53, No. 3. pp. 397–418.

Bibtex

@article{32f14d036bec40e4a1cb4d6b3e2f02f1,
title = "A word sense disambiguation corpus for Urdu",
abstract = "The aim of word sense disambiguation (WSD) is to correctly identify the meaning of a word in context. All natural languages exhibit word sense ambiguities and these are often hard to resolve automatically. Consequently, WSD is considered an important problem in natural language processing (NLP). Standard evaluation resources are needed to develop, evaluate and compare WSD methods. A range of initiatives have led to the development of benchmark WSD corpora for a wide range of languages from various language families. However, there is a lack of benchmark WSD corpora for South Asian languages including Urdu, despite there being over 300 million Urdu speakers and a large amount of Urdu digital text available online. To address that gap, this study describes a novel benchmark corpus for the Urdu Lexical Sample WSD task. This corpus contains 50 target words (30 nouns, 11 adjectives, and 9 verbs). A standard, manually crafted dictionary called Urdu Lughat is used as a sense inventory. Four baseline WSD approaches were applied to the corpus. The results show that the best performance was obtained using a simple Bag of Words approach. To encourage NLP research on the Urdu language, the corpus is freely available to the research community.",
keywords = "Lexical sample task, Sense tagged Urdu corpus, Word sense disambiguation",
author = "Ali Saeed and Nawab, {Rao Muhammad Adeel} and Mark Stevenson and Paul Rayson",
note = "The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-018-9438-7",
year = "2019",
month = sep,
day = "1",
doi = "10.1007/s10579-018-9438-7",
language = "English",
volume = "53",
pages = "397--418",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "3",

}

RIS

TY - JOUR

T1 - A word sense disambiguation corpus for Urdu

AU - Saeed, Ali

AU - Nawab, Rao Muhammad Adeel

AU - Stevenson, Mark

AU - Rayson, Paul

N1 - The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-018-9438-7

PY - 2019/9/1

Y1 - 2019/9/1

AB - The aim of word sense disambiguation (WSD) is to correctly identify the meaning of a word in context. All natural languages exhibit word sense ambiguities and these are often hard to resolve automatically. Consequently, WSD is considered an important problem in natural language processing (NLP). Standard evaluation resources are needed to develop, evaluate and compare WSD methods. A range of initiatives have led to the development of benchmark WSD corpora for a wide range of languages from various language families. However, there is a lack of benchmark WSD corpora for South Asian languages including Urdu, despite there being over 300 million Urdu speakers and a large amount of Urdu digital text available online. To address that gap, this study describes a novel benchmark corpus for the Urdu Lexical Sample WSD task. This corpus contains 50 target words (30 nouns, 11 adjectives, and 9 verbs). A standard, manually crafted dictionary called Urdu Lughat is used as a sense inventory. Four baseline WSD approaches were applied to the corpus. The results show that the best performance was obtained using a simple Bag of Words approach. To encourage NLP research on the Urdu language, the corpus is freely available to the research community.

KW - Lexical sample task

KW - Sense tagged Urdu corpus

KW - Word sense disambiguation

U2 - 10.1007/s10579-018-9438-7

DO - 10.1007/s10579-018-9438-7

M3 - Journal article

AN - SCOPUS:85057560973

VL - 53

SP - 397

EP - 418

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 3

ER -