Electronic data

  • HLJ1001G

    Final published version, 477 KB, PDF document

    Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Text available via DOI: https://doi.org/10.5070/H910123572

Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation

Research output: Contribution to Journal/Magazine > Journal article > peer-review

Published

Standard

Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation. / Hardie, Andrew; Lohani, Ram; Yadava, Yogendra.
In: Himalayan Linguistics, Vol. 10, No. 1, 2011, p. 151–165.

Harvard

Hardie, A, Lohani, R & Yadava, Y 2011, 'Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation', Himalayan Linguistics, vol. 10, no. 1, pp. 151–165. https://doi.org/10.5070/H910123572

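APA

Hardie, A., Lohani, R., & Yadava, Y. (2011). Extending corpus annotation of Nepali: Advances in tokenisation and lemmatisation. Himalayan Linguistics, 10(1), 151–165. https://doi.org/10.5070/H910123572
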
Vancouver

Hardie A, Lohani R, Yadava Y. Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation. Himalayan Linguistics. 2011;10(1):151–165. doi: 10.5070/H910123572

Author

Hardie, Andrew ; Lohani, Ram ; Yadava, Yogendra. / Extending corpus annotation of Nepali : advances in tokenisation and lemmatisation. In: Himalayan Linguistics. 2011 ; Vol. 10, No. 1. pp. 151–165.

Bibtex

@article{67b4b0c40d1c496cb82c4961e1782592,
title = "Extending corpus annotation of Nepali: advances in tokenisation and lemmatisation",
abstract = "The Nepali National Corpus (NNC) was, in the process of its creation, annotated with part-of-speech (POS) tags. This paper describes the extension of automated text and corpus annotation in Nepali from POS tags to lemmatisation, enabling a more complex set of corpus-based searches and analyses. This work also addresses certain practical compromises embodied in the initial tagging of the NNC. First, some particular aspects of Nepali morphology – in particular the complexity of the agglutinative verbal inflection system – necessitated improvements to the underlying tokenisation of the text before lemmatisation could be satisfactorily implemented. In practical terms, both the tokenisation and lemmatisation procedures require linguistic knowledge resources to operate successfully: a set of rules describing the default case, and a lexicon containing a list of individual exceptions: words whose form suggests a particular rule should apply to them, but where that rule in fact does not apply. These resources, particularly the lexicons of irregularities, were created by a strongly data-driven process working from analyses of the NNC itself. This approach to tokenisation and lemmatisation, and associated linguistic knowledge resources, may be illustrative and of use to researchers looking at other languages of the Himalayan region, most especially those that have similar morphological behaviour to Nepali.",
keywords = "Nepali, corpus, tagging, lemmatisation, tokenisation, morphology",
author = "Andrew Hardie and Ram Lohani and Yogendra Yadava",
year = "2011",
doi = "10.5070/H910123572",
language = "English",
volume = "10",
pages = "151–165",
journal = "Himalayan Linguistics",
issn = "1544-7502",
number = "1",

}

RIS

TY - JOUR

T1 - Extending corpus annotation of Nepali

T2 - advances in tokenisation and lemmatisation

AU - Hardie, Andrew

AU - Lohani, Ram

AU - Yadava, Yogendra

PY - 2011

Y1 - 2011

AB - The Nepali National Corpus (NNC) was, in the process of its creation, annotated with part-of-speech (POS) tags. This paper describes the extension of automated text and corpus annotation in Nepali from POS tags to lemmatisation, enabling a more complex set of corpus-based searches and analyses. This work also addresses certain practical compromises embodied in the initial tagging of the NNC. First, some particular aspects of Nepali morphology – in particular the complexity of the agglutinative verbal inflection system – necessitated improvements to the underlying tokenisation of the text before lemmatisation could be satisfactorily implemented. In practical terms, both the tokenisation and lemmatisation procedures require linguistic knowledge resources to operate successfully: a set of rules describing the default case, and a lexicon containing a list of individual exceptions: words whose form suggests a particular rule should apply to them, but where that rule in fact does not apply. These resources, particularly the lexicons of irregularities, were created by a strongly data-driven process working from analyses of the NNC itself. This approach to tokenisation and lemmatisation, and associated linguistic knowledge resources, may be illustrative and of use to researchers looking at other languages of the Himalayan region, most especially those that have similar morphological behaviour to Nepali.

KW - Nepali

KW - corpus

KW - tagging

KW - lemmatisation

KW - tokenisation

KW - morphology

U2 - 10.5070/H910123572

DO - 10.5070/H910123572

M3 - Journal article

VL - 10

SP - 151

EP - 165

JO - Himalayan Linguistics

JF - Himalayan Linguistics

SN - 1544-7502

IS - 1

ER -
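
Illustrative sketch

The abstract describes both the tokenisation and lemmatisation procedures as relying on two resources: a set of rules for the default case, and a lexicon of exceptions, that is, words whose form suggests a rule should apply when it in fact does not. A minimal sketch of that general pattern might look as follows. It is not the authors' implementation: the function name `lemmatise`, the suffix rules, and the lexicon entries are all hypothetical, and English placeholder forms are used because this record does not give the paper's actual Nepali rules or lexicon.

```python
# Illustrative sketch only: the rules and lexicon entries below are invented
# placeholders, not the paper's actual resources and not real Nepali data.
from typing import Dict, List, Tuple

# Default rules: (suffix to strip, replacement), tried in order; first match wins.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("es", ""),
    ("s", ""),
]

# Exception lexicon: forms that look as if a rule applies, but where it does not.
EXCEPTIONS: Dict[str, str] = {
    "series": "series",
    "news": "news",
}


def lemmatise(token: str) -> str:
    """Return a lemma: check the exception lexicon first, then the default rules."""
    if token in EXCEPTIONS:
        return EXCEPTIONS[token]
    for suffix, replacement in SUFFIX_RULES:
        # Require some stem to remain after stripping the suffix.
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: len(token) - len(suffix)] + replacement
    # No rule matched: the token is treated as its own lemma.
    return token
```

Checking the exception lexicon before the rules is the design choice that matters here: it keeps the rules simple and general, while irregular forms are handled by adding entries to the lexicon, which the abstract says was built in a data-driven way from analyses of the corpus itself.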