EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization.

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Paul Baker
Andrew Hardie
Tony McEnery
Hamish Cunningham
Robert Gaizauskas

Publication date	2002
Host publication	Proceedings of LREC 2002
Pages	819-825
Number of pages	7
Original language	English

Abstract

The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.
