Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Chapter
TY - CHAP
T1 - The Lancaster Corpus of Mandarin Chinese
T2 - LREC 2004
AU - McEnery, A. M.
AU - Xiao, R. Z.
PY - 2004/5
Y1 - 2004/5
N2 - This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese. The corpus contains five hundred 2,000-word samples of written Chinese texts sampled from fifteen text categories published in Mainland China around 1991, totalling one million words. LCMC is XML-compliant and conforms to CES, with each document containing a corpus header giving general information about the corpus and a body of text. The corpus is segmented and POS tagged with a tagging precision rate of over 98%. The corpus is a useful resource for research into modern Chinese as well as the cross-linguistic contrast between English and Chinese.
AB - This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese. The corpus contains five hundred 2,000-word samples of written Chinese texts sampled from fifteen text categories published in Mainland China around 1991, totalling one million words. LCMC is XML-compliant and conforms to CES, with each document containing a corpus header giving general information about the corpus and a body of text. The corpus is segmented and POS tagged with a tagging precision rate of over 98%. The corpus is a useful resource for research into modern Chinese as well as the cross-linguistic contrast between English and Chinese.
KW - corpus
KW - Chinese
KW - contrastive study
M3 - Chapter
BT - LREC
Y2 - 24 May 2004 through 30 May 2004
ER -