Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
}
TY - GEN
T1 - Sublanguage corpus analysis toolkit
T2 - a tool for assessing the representativeness and sublanguage characteristics of corpora
AU - Temnikova, Irina
AU - Baumgartner, William
AU - Hailu, Negacy
AU - Nikolova, Ivelina
AU - McEnery, Tony
AU - Kilgarriff, Adam
AU - Angelova, Galia
AU - Cohen, Bretonnel
PY - 2014/5/1
Y1 - 2014/5/1
N2 - Sublanguages are varieties of language that form “subsets” of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed―English (Germanic), and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
AB - Sublanguages are varieties of language that form “subsets” of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed―English (Germanic), and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
M3 - Conference contribution/Paper
SN - 9782951740884
VL - 2014
BT - Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
PB - European Language Resources Association (ELRA)
CY - Reykjavik
ER -