EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation

Research output: Contribution to conference - Without ISBN/ISSN Conference paper

Published
Publication date: 1/01/2002
Number of pages: 7
Pages: 819-825
Original language: English
Event: 3rd International Conference on Language Resources and Evaluation, LREC 2002 - Las Palmas, Canary Islands, Spain
Duration: 29/05/2002 - 31/05/2002

Conference

Conference: 3rd International Conference on Language Resources and Evaluation, LREC 2002
Country: Spain
City: Las Palmas, Canary Islands
Period: 29/05/02 - 31/05/02

Abstract

The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering), which is being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67-million-word corpus of South Asian languages. Beyond corpus construction, the project has had to address a number of related issues in establishing a language engineering (LE) environment for South Asian language processing, such as converting 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the ongoing development of the LE architecture GATE.