Extending corpus annotation of Nepali

Linguistics and English Language

Associated organisational unit

UCREL - University Centre for Computer Corpus Research on Language

Electronic data

HLJ1001G
Final published version, 477 KB, PDF document
Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Text available via DOI:

https://doi.org/10.5070/H910123572
Final published version
Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Keywords

Nepali, corpus, tagging, lemmatisation, tokenisation, morphology

Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Andrew Hardie
Ram Lohani
Yogendra Yadava

Journal publication date	2011
Journal	Himalayan Linguistics
Issue number	1
Volume	10
Number of pages	14
Pages (from-to)	151–165
Publication Status	Published
Original language	English

Abstract

The Nepali National Corpus (NNC) was, in the process of its creation, annotated with part-of-speech (POS) tags. This paper describes the extension of automated text and corpus annotation in Nepali from POS tags to lemmatisation, enabling a more complex set of corpus-based searches and analyses. This work also addresses certain practical compromises embodied in the initial tagging of the NNC. First, some particular aspects of Nepali morphology – in particular the complexity of the agglutinative verbal inflection system – necessitated improvements to the underlying tokenisation of the text before lemmatisation could be satisfactorily implemented. In practical terms, both the tokenisation and lemmatisation procedures require linguistic knowledge resources to operate successfully: a set of rules describing the default case, and a lexicon containing a list of individual exceptions: words whose form suggests a particular rule should apply to them, but where that rule in fact does not apply. These resources, particularly the lexicons of irregularities, were created by a strongly data-driven process working from analyses of the NNC itself. This approach to tokenisation and lemmatisation, and associated linguistic knowledge resources, may be illustrative and of use to researchers looking at other languages of the Himalayan region, most especially those that have similar morphological behaviour to Nepali.
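
The "default rules plus exception lexicon" design described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the suffix rules, the exception entries, and the names `SUFFIX_RULES`, `EXCEPTIONS` and `lemmatise` are all hypothetical, and English word forms stand in for Nepali morphology, which the abstract does not detail.

```python
# Toy illustration only: English suffixes stand in for Nepali morphology.
# The rules and exception entries are hypothetical, not taken from the paper.

# Default-case rules: (suffix to strip, replacement), tried in order.
SUFFIX_RULES = [
    ("ies", "y"),
    ("s", ""),
]

# Exception lexicon: forms that look as if a rule applies, but it does not.
EXCEPTIONS = {
    "bus": "bus",
    "series": "series",
}


def lemmatise(token: str) -> str:
    """Return a lemma: consult the exception lexicon first, then the rules."""
    if token in EXCEPTIONS:
        return EXCEPTIONS[token]
    for suffix, replacement in SUFFIX_RULES:
        # Require a non-empty stem so a bare suffix is never reduced to nothing.
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: len(token) - len(suffix)] + replacement
    # No rule matched: the token is treated as its own lemma.
    return token
```

Checking the lexicon before the rules is what lets a short list of irregular forms override the default behaviour; in a data-driven workflow such as the one the abstract describes, that list would presumably be grown by inspecting rule output on the corpus itself.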