Constructing corpora of South Asian languages.

Research output: Contribution to conference - Without ISBN/ISSN Conference paper

Published

Standard

Constructing corpora of South Asian languages. / Baker, Paul; Hardie, Andrew; McEnery, Tony et al.
2003. Paper presented at Corpus Linguistics 2003, Lancaster.

Harvard

Baker, P, Hardie, A, McEnery, T & Jayaram, BD 2003, 'Constructing corpora of South Asian languages', Paper presented at Corpus Linguistics 2003, Lancaster, 1 March 2003.

APA

Baker, P., Hardie, A., McEnery, T., & Jayaram, B. D. (2003). Constructing corpora of South Asian languages. Paper presented at Corpus Linguistics 2003, Lancaster.

Vancouver

Baker P, Hardie A, McEnery T, Jayaram BD. Constructing corpora of South Asian languages. 2003. Paper presented at Corpus Linguistics 2003, Lancaster.

Author

Baker, Paul ; Hardie, Andrew ; McEnery, Tony et al. / Constructing corpora of South Asian languages. Paper presented at Corpus Linguistics 2003, Lancaster.

Bibtex

@conference{c62531d02d314968810d71d280c98762,
title = "Constructing corpora of South Asian languages.",
abstract = "The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.",
keywords = "corpus, South Asian languages, EMILLE, encoding, Unicode, annotation, corpus building",
author = "Paul Baker and Andrew Hardie and Tony McEnery and BD Jayaram",
year = "2003",
language = "English",
note = "Corpus Linguistics 2003 ; Conference date: 01-03-2003",

}

RIS

TY - CONF

T1 - Constructing corpora of South Asian languages.

AU - Baker, Paul

AU - Hardie, Andrew

AU - McEnery, Tony

AU - Jayaram, BD

PY - 2003

Y1 - 2003

N2 - The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.

KW - corpus

KW - South Asian languages

KW - EMILLE

KW - encoding

KW - Unicode

KW - annotation

KW - corpus building

M3 - Conference paper

T2 - Corpus Linguistics 2003

Y2 - 1 March 2003

ER -