Constructing corpora of South Asian languages.

Research output: Contribution to conference - Without ISBN/ISSN Conference paper

Published

Standard

Constructing corpora of South Asian languages. / Baker, Paul; Hardie, Andrew; McEnery, Tony; Jayaram, BD.

2003. Paper presented at Corpus Linguistics 2003, Lancaster.

Harvard

Baker, P, Hardie, A, McEnery, T & Jayaram, BD 2003, 'Constructing corpora of South Asian languages.', Paper presented at Corpus Linguistics 2003, Lancaster, 1/03/03.

APA

Baker, P., Hardie, A., McEnery, T., & Jayaram, B. D. (2003). Constructing corpora of South Asian languages. Paper presented at Corpus Linguistics 2003, Lancaster.

Vancouver

Baker P, Hardie A, McEnery T, Jayaram BD. Constructing corpora of South Asian languages. 2003. Paper presented at Corpus Linguistics 2003, Lancaster.

Author

Baker, Paul ; Hardie, Andrew ; McEnery, Tony ; Jayaram, BD. / Constructing corpora of South Asian languages. Paper presented at Corpus Linguistics 2003, Lancaster.

Bibtex

@conference{c62531d02d314968810d71d280c98762,
title = "Constructing corpora of South Asian languages.",
abstract = "The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.",
keywords = "corpus, South Asian languages, EMILLE, encoding, Unicode, annotation, corpus building",
author = "Paul Baker and Andrew Hardie and Tony McEnery and BD Jayaram",
year = "2003",
language = "English",
note = "Corpus Linguistics 2003 ; Conference date: 01-03-2003",

}

RIS

TY - CONF

T1 - Constructing corpora of South Asian languages.

AU - Baker, Paul

AU - Hardie, Andrew

AU - McEnery, Tony

AU - Jayaram, BD

PY - 2003

Y1 - 2003

N2 - The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.

KW - corpus

KW - South Asian languages

KW - EMILLE

KW - encoding

KW - Unicode

KW - annotation

KW - corpus building

M3 - Conference paper

T2 - Corpus Linguistics 2003

Y2 - 1 March 2003

ER -