Constructing corpora of South Asian languages.

Linguistics and English Language

Associated organisational unit

UCREL - University Centre for Computer Corpus Research on Language

Electronic data

McEnery.pdf
344 KB, PDF document

Keywords

corpus, South Asian languages, EMILLE, encoding, Unicode, annotation, corpus building

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper

Published

Publication date	2003
Original language	English
Event	Corpus Linguistics 2003 - Lancaster Duration: 1/03/2003 → …

Conference

Conference	Corpus Linguistics 2003
City	Lancaster
Period	1/03/03 → …

Abstract

The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67-million-word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so, a number of issues for South Asian corpus building will be highlighted.
