Research output: Contribution to conference - Without ISBN/ISSN › Conference paper
}
TY - CONF
T1 - Constructing corpora of South Asian languages.
AU - Baker, Paul
AU - Hardie, Andrew
AU - McEnery, Tony
AU - Jayaram, BD
PY - 2003
Y1 - 2003
AB - The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.
KW - corpus
KW - South Asian languages
KW - EMILLE
KW - encoding
KW - Unicode
KW - annotation
KW - corpus building
M3 - Conference paper
T2 - Corpus Linguistics 2003
Y2 - 1 March 2003
ER -