EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation

Research output: Contribution to conference - Without ISBN/ISSN Conference paper

Published

Standard

EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation. / Baker, Paul; Hardie, Andrew; McEnery, Tony; Cunningham, Hamish; Gaizauskas, Rob.

2002. 819-825 Paper presented at 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Canary Islands, Spain.

Harvard

Baker, P, Hardie, A, McEnery, T, Cunningham, H & Gaizauskas, R 2002, 'EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation' Paper presented at 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Canary Islands, Spain, 29/05/02 - 31/05/02, pp. 819-825.

APA

Baker, P., Hardie, A., McEnery, T., Cunningham, H., & Gaizauskas, R. (2002). EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation. 819-825. Paper presented at 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Canary Islands, Spain.

Vancouver

Baker P, Hardie A, McEnery T, Cunningham H, Gaizauskas R. EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation. 2002. Paper presented at 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Canary Islands, Spain.

Author

Baker, Paul; Hardie, Andrew; McEnery, Tony; Cunningham, Hamish; Gaizauskas, Rob. / EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation. Paper presented at 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Canary Islands, Spain. 7 p.

Bibtex

@conference{2bcecb22af334b6181bf4a6e389fdf82,
title = "EMILLE, A 67-million word corpus of Indic languages: Data collection, mark-up and harmonisation",
abstract = "The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.",
author = "Paul Baker and Andrew Hardie and Tony McEnery and Hamish Cunningham and Rob Gaizauskas",
year = "2002",
month = "1",
day = "1",
language = "English",
pages = "819--825",
note = "3rd International Conference on Language Resources and Evaluation, LREC 2002 ; Conference date: 29-05-2002 Through 31-05-2002",

}

RIS

TY - CONF

T1 - EMILLE, A 67-million word corpus of Indic languages

T2 - Data collection, mark-up and harmonisation

AU - Baker, Paul

AU - Hardie, Andrew

AU - McEnery, Tony

AU - Cunningham, Hamish

AU - Gaizauskas, Rob

PY - 2002/1/1

Y1 - 2002/1/1

N2 - The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.

AB - The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.

M3 - Conference paper

SP - 819

EP - 825

ER -