Developing a tagset for automated part-of-speech tagging in Urdu.

Linguistics and English Language

Electronic data

cl03_urdu.pdf
186 KB, PDF document

Keywords

part-of-speech tagging, Urdu, tagset, EAGLES guidelines

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

Standard

Developing a tagset for automated part-of-speech tagging in Urdu. / Hardie, A.
2003. Paper presented at Corpus Linguistics 2003, Lancaster.

Bibtex

@conference{a129d6e96ce54020b1ddc791f0c1a9e5,

title = "Developing a tagset for automated part-of-speech tagging in Urdu.",

abstract = "While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has hitherto been done in the area of tagset creation for Urdu. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. This paper will look at the process of creating one of the necessary resources for the development of a POS tagging system for Urdu, that of a suitable tagset, considering some of the problems encountered along the way.",

keywords = "part-of-speech tagging, Urdu, tagset, EAGLES guidelines",

author = "A Hardie",

year = "2003",

language = "English",

note = "Corpus Linguistics 2003 ; Conference date: 01-03-2003",

}

RIS

TY - CONF

T1 - Developing a tagset for automated part-of-speech tagging in Urdu.

AU - Hardie, A

PY - 2003

Y1 - 2003

N2 - While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has hitherto been done in the area of tagset creation for Urdu. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. This paper will look at the process of creating one of the necessary resources for the development of a POS tagging system for Urdu, that of a suitable tagset, considering some of the problems encountered along the way.

AB - While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has hitherto been done in the area of tagset creation for Urdu. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. This paper will look at the process of creating one of the necessary resources for the development of a POS tagging system for Urdu, that of a suitable tagset, considering some of the problems encountered along the way.

KW - part-of-speech tagging

KW - Urdu

KW - tagset

KW - EAGLES guidelines

M3 - Conference paper

T2 - Corpus Linguistics 2003

Y2 - 1 March 2003

ER -
