EMILLE Corpora

Research output: Other contribution

Published

Standard

EMILLE Corpora. / McEnery, Anthony; Baker, John Paul; Hardie, Andrew John.
Paris: European Language Resources Association (ELRA). 2004.

Author

McEnery, Anthony; Baker, John Paul; Hardie, Andrew John. / EMILLE Corpora. 2004. Paris: European Language Resources Association (ELRA).

Bibtex

@misc{0ebf3c1087ee487cafcb033cfdf347bf,
title = "EMILLE Corpora",
abstract = "The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.",
keywords = "corpus, spoken, written, South Asian languages",
author = "McEnery, Anthony and Baker, {John Paul} and Hardie, {Andrew John}",
year = "2004",
month = sep,
day = "15",
language = "English",
publisher = "European Language Resources Association (ELRA)",
address = "Paris",
type = "Other",
}

RIS

TY - GEN

T1 - EMILLE Corpora

AU - McEnery, Anthony

AU - Baker, John Paul

AU - Hardie, Andrew John

PY - 2004/9/15

Y1 - 2004/9/15

N2 - The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu. The EMILLE monolingual corpora contain approximately 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.

KW - corpus

KW - spoken

KW - written

KW - South Asian languages

M3 - Other contribution

PB - European Language Resources Association (ELRA)

CY - Paris

ER -