EMILLE, A 67-million word corpus of indic languages - Research Portal

EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

Paul Baker
Andrew Hardie
Tony McEnery
Hamish Cunningham
Rob Gaizauskas

Publication date	1/01/2002
Number of pages	7
Pages	819-825
Original language	English
Event	3rd International Conference on Language Resources and Evaluation, LREC 2002 - Las Palmas, Canary Islands, Spain (29/05/2002 → 31/05/2002)

Conference

Conference	3rd International Conference on Language Resources and Evaluation, LREC 2002
Country/Territory	Spain
City	Las Palmas, Canary Islands
Period	29/05/02 → 31/05/02

Abstract

The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering), which is being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67-million-word corpus of South Asian languages. Alongside corpus construction, the project has had to address a number of related issues in establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the ongoing development of the LE architecture GATE.
