A word sense disambiguation corpus for Urdu

Computing and Communications

Associated organisational units

Electronic data

WSD
Rights statement: The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-018-9438-7
Accepted author manuscript, 603 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Text available via DOI:

https://doi.org/10.1007/s10579-018-9438-7
Final published version

Keywords

Lexical sample task, Sense tagged Urdu corpus, Word sense disambiguation

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Ali Saeed
Rao Muhammad Adeel Nawab
Mark Stevenson
Paul Rayson


Journal publication date	01/09/2019
Journal	Language Resources and Evaluation
Issue number	3
Volume	53
Number of pages	22
Pages (from-to)	397–418
Publication Status	Published
Early online date	24/11/2018
Original language	English

Abstract

The aim of word sense disambiguation (WSD) is to correctly identify the meaning of a word in context. All natural languages exhibit word sense ambiguities, and these are often hard to resolve automatically. Consequently, WSD is considered an important problem in natural language processing (NLP). Standard evaluation resources are needed to develop, evaluate and compare WSD methods. A range of initiatives has led to the development of benchmark WSD corpora for a wide range of languages from various language families. However, there is a lack of benchmark WSD corpora for South Asian languages, including Urdu, despite there being over 300 million Urdu speakers and a large amount of Urdu digital text available online. To address that gap, this study describes a novel benchmark corpus for the Urdu Lexical Sample WSD task. The corpus contains 50 target words (30 nouns, 11 adjectives, and 9 verbs). A standard, manually crafted dictionary called Urdu Lughat is used as the sense inventory. Four baseline WSD approaches were applied to the corpus. The results show that the best performance was obtained using a simple Bag of Words approach. To encourage NLP research on the Urdu language, the corpus is freely available to the research community.
