EMILLE, a 67-million word corpus of Indic languages - Research Portal

EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization.

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization. / Baker, Paul ; Hardie, Andrew ; McEnery, Tony et al.
Proceedings of LREC 2002. 2002. p. 819-825.

Bibtex

@inproceedings{0d3fe94f210247899fd8b0fa43243be3,

title = "EMILLE, a 67-million word corpus of Indic languages: data collection, mark-up and harmonization.",

abstract = "The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.",

author = "Paul Baker and Andrew Hardie and Tony McEnery and Hamish Cunningham and Robert Gaizauskas",

year = "2002",

language = "English",

pages = "819--825",

booktitle = "Proceedings of LREC 2002",

}

RIS

TY - GEN

T1 - EMILLE, a 67-million word corpus of Indic languages

T2 - data collection, mark-up and harmonization.

AU - Baker, Paul

AU - Hardie, Andrew

AU - McEnery, Tony

AU - Cunningham, Hamish

AU - Gaizauskas, Robert

PY - 2002

Y1 - 2002

N2 - The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.

AB - The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.

M3 - Conference contribution/Paper

SP - 819

EP - 825

BT - Proceedings of LREC 2002

ER -
