Corpus linguistics and South Asian languages : corpus creation and tool development.

Text available via DOI:

https://doi.org/10.1093/llc/19.4.509
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Corpus linguistics and South Asian languages : corpus creation and tool development. / Baker, Paul ; Hardie, Andrew ; McEnery, Tony et al.
In: Literary and Linguistic Computing, Vol. 19, No. 4, 01.11.2004, p. 509-524.

Harvard

Baker, P, Hardie, A, McEnery, T, Xiao, RZ, Bontcheva, K, Cunningham, H, Gaizauskas, R, Hamza, O, Maynard, D, Tablan, V, Ursu, C, Jayaram, BD & Leisher, M 2004, 'Corpus linguistics and South Asian languages : corpus creation and tool development.', Literary and Linguistic Computing, vol. 19, no. 4, pp. 509-524. https://doi.org/10.1093/llc/19.4.509

APA

Baker, P., Hardie, A., McEnery, T., Xiao, R. Z., Bontcheva, K., Cunningham, H., Gaizauskas, R., Hamza, O., Maynard, D., Tablan, V., Ursu, C., Jayaram, B. D., & Leisher, M. (2004). Corpus linguistics and South Asian languages : corpus creation and tool development. Literary and Linguistic Computing, 19(4), 509-524. https://doi.org/10.1093/llc/19.4.509

Vancouver

Baker P, Hardie A, McEnery T, Xiao RZ, Bontcheva K, Cunningham H et al. Corpus linguistics and South Asian languages : corpus creation and tool development. Literary and Linguistic Computing. 2004 Nov 1;19(4):509-524. doi: 10.1093/llc/19.4.509

Author

Baker, Paul ; Hardie, Andrew ; McEnery, Tony et al. / Corpus linguistics and South Asian languages : corpus creation and tool development. In: Literary and Linguistic Computing. 2004 ; Vol. 19, No. 4. pp. 509-524.

Bibtex

@article{e5f2c4be1554454aa71bc3719914594f,

title = "Corpus linguistics and South Asian languages : corpus creation and tool development.",

abstract = "This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R\&D necessary for the exploitation of the EMILLE corpora.",

author = "Paul Baker and Andrew Hardie and Tony McEnery and Xiao, {Richard Z.} and Kalina Bontcheva and Hamish Cunningham and Robert Gaizauskas and Oana Hamza and Diana Maynard and Valentin Tablan and Cristian Ursu and Jayaram, {B. D.} and Mark Leisher",

year = "2004",

month = nov,

day = "1",

doi = "10.1093/llc/19.4.509",

language = "English",

volume = "19",

pages = "509--524",

journal = "Literary and Linguistic Computing",

issn = "0268-1145",

publisher = "Oxford University Press",

number = "4",

}

RIS

TY - JOUR

T1 - Corpus linguistics and South Asian languages : corpus creation and tool development.

AU - Baker, Paul

AU - Hardie, Andrew

AU - McEnery, Tony

AU - Xiao, Richard Z.

AU - Bontcheva, Kalina

AU - Cunningham, Hamish

AU - Gaizauskas, Robert

AU - Hamza, Oana

AU - Maynard, Diana

AU - Tablan, Valentin

AU - Ursu, Cristian

AU - Jayaram, B. D.

AU - Leisher, Mark

PY - 2004/11/1

Y1 - 2004/11/1

N2 - This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.

AB - This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.

U2 - 10.1093/llc/19.4.509

DO - 10.1093/llc/19.4.509

M3 - Journal article

VL - 19

SP - 509

EP - 524

JO - Literary and Linguistic Computing

JF - Literary and Linguistic Computing

SN - 0268-1145

IS - 4

ER -
