EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization.

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Publication date: 2002
Host publication: Proceedings of LREC 2002
Number of pages: 7
Original language: English


The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.