Research output: Contribution to Journal/Magazine › Journal article › peer-review
TY - JOUR
T1 - Construction and annotation of a corpus of contemporary Nepali
AU - Yadava, Yogendra
AU - Hardie, Andrew
AU - Lohani, Ram
AU - Regmi, Bhim N.
AU - Gurung, Srishtee
AU - Gurung, Amar
AU - McEnery, Tony
AU - Allwood, Jens
AU - Hall, Pat
PY - 2008
Y1 - 2008
N2 - In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
AB - In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
U2 - 10.3366/E1749503208000166
DO - 10.3366/E1749503208000166
M3 - Journal article
VL - 3
SP - 213
EP - 225
JO - Corpora
JF - Corpora
SN - 1749-5032
IS - 2
ER -