Construction and annotation of a corpus of contemporary Nepali

Associated organisational unit

UCREL - University Centre for Computer Corpus Research on Language

Text available via DOI:

https://doi.org/10.3366/E1749503208000166
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Yogendra Yadava
Andrew Hardie
Ram Lohani
Bhim N. Regmi
Srishtee Gurung
Amar Gurung
Tony McEnery
Jens Allwood
Pat Hall

Journal publication date	2008
Journal	Corpora
Issue number	2
Volume	3
Number of pages	13
Pages (from-to)	213-225
Publication Status	Published
Original language	English

Abstract

In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
