EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization.

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published

Standard

EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization. / Baker, Paul; Hardie, Andrew; McEnery, Tony et al.
Proceedings of LREC 2002. 2002. p. 819-825.

Bibtex

@inproceedings{0d3fe94f210247899fd8b0fa43243be3,
title = "EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization.",
abstract = "The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.",
author = "Paul Baker and Andrew Hardie and Tony McEnery and Hamish Cunningham and Robert Gaizauskas",
year = "2002",
language = "English",
pages = "819--825",
booktitle = "Proceedings of LREC 2002",

}

RIS

TY - GEN

T1 - EMILLE, a 67-million word corpus of Indic languages

T2 - data collection, mark-up and harmonization.

AU - Baker, Paul

AU - Hardie, Andrew

AU - McEnery, Tony

AU - Cunningham, Hamish

AU - Gaizauskas, Robert

PY - 2002

Y1 - 2002

AB - The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.

M3 - Conference contribution/Paper

SP - 819

EP - 825

BT - Proceedings of LREC 2002

ER -