Construction and annotation of a corpus of contemporary Nepali

Associated organisational unit

UCREL - University Centre for Computer Corpus Research on Language

Text available via DOI:

https://doi.org/10.3366/E1749503208000166
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Construction and annotation of a corpus of contemporary Nepali. / Yadava, Yogendra; Hardie, Andrew; Lohani, Ram et al.
In: Corpora, Vol. 3, No. 2, 2008, p. 213-225.

Harvard

Yadava, Y, Hardie, A, Lohani, R, Regmi, BN, Gurung, S, Gurung, A, McEnery, T, Allwood, J & Hall, P 2008, 'Construction and annotation of a corpus of contemporary Nepali', Corpora, vol. 3, no. 2, pp. 213-225. https://doi.org/10.3366/E1749503208000166

APA

Yadava, Y., Hardie, A., Lohani, R., Regmi, B. N., Gurung, S., Gurung, A., McEnery, T., Allwood, J., & Hall, P. (2008). Construction and annotation of a corpus of contemporary Nepali. Corpora, 3(2), 213-225. https://doi.org/10.3366/E1749503208000166

Vancouver

Yadava Y, Hardie A, Lohani R, Regmi BN, Gurung S, Gurung A et al. Construction and annotation of a corpus of contemporary Nepali. Corpora. 2008;3(2):213-225. doi: 10.3366/E1749503208000166

Author

Yadava, Yogendra ; Hardie, Andrew ; Lohani, Ram et al. / Construction and annotation of a corpus of contemporary Nepali. In: Corpora. 2008 ; Vol. 3, No. 2. pp. 213-225.

Bibtex

@article{cbec3659664c4b498272cd950f29aeaf,

title = "Construction and annotation of a corpus of contemporary Nepali",

abstract = "In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.",

author = "Yogendra Yadava and Andrew Hardie and Ram Lohani and Regmi, {Bhim N.} and Srishtee Gurung and Amar Gurung and Tony McEnery and Jens Allwood and Pat Hall",

year = "2008",

doi = "10.3366/E1749503208000166",

language = "English",

volume = "3",

pages = "213--225",

journal = "Corpora",

issn = "1749-5032",

publisher = "Edinburgh University Press",

number = "2",

}

RIS

TY - JOUR

T1 - Construction and annotation of a corpus of contemporary Nepali

AU - Yadava, Yogendra

AU - Hardie, Andrew

AU - Lohani, Ram

AU - Regmi, Bhim N.

AU - Gurung, Srishtee

AU - Gurung, Amar

AU - McEnery, Tony

AU - Allwood, Jens

AU - Hall, Pat

PY - 2008

Y1 - 2008

N2 - In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.

AB - In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.

U2 - 10.3366/E1749503208000166

DO - 10.3366/E1749503208000166

M3 - Journal article

VL - 3

SP - 213

EP - 225

JO - Corpora

JF - Corpora

SN - 1749-5032

IS - 2

ER -
