Constructing corpora of South Asian languages.

Research output: Contribution to conference - Without ISBN/ISSN Conference paper

Publication date: 2003
Original language: English
Event: Corpus Linguistics 2003 - Lancaster
Duration: 1/03/2003 → …


Conference: Corpus Linguistics 2003
Period: 1/03/03 → …


The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67-million-word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper focuses on the corpus construction undertaken on the project and outlines the rationale behind data collection. In doing so, it highlights a number of issues for South Asian corpus building.