Corpus linguistics and South Asian languages : corpus creation and tool development.

Text available via DOI:

https://doi.org/10.1093/llc/19.4.509
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Paul Baker
Andrew Hardie
Tony McEnery
Richard Z. Xiao
Kalina Bontcheva
Hamish Cunningham
Robert Gaizauskas
Oana Hamza
Diana Maynard
Valentin Tablan
Cristian Ursu
B. D. Jayaram
Mark Leisher

Journal publication date	1/11/2004
Journal	Literary and Linguistic Computing
Issue number	4
Volume	19
Number of pages	16
Pages (from-to)	509-524
Publication Status	Published
Original language	English

Abstract

This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.
