For linguists working with corpora, a common difficulty after quantitative analysis is deciding which texts to select for follow-up close analysis without arousing suspicion of ‘cherry-picking’. The ProtAnt tool (Anthony & Baker 2015) provides a major boost in this respect. Building on a now well-established tradition of corpus keywords analysis (since Scott 1997), and the association between prototypes and frequency of instantiation (e.g., Rosch 1975, Gries 2003), ProtAnt ranks the texts in a target corpus from most to least prototypical according to the number of keywords they contain. ProtAnt’s capabilities have been increasingly exploited in text/discourse analysis (e.g., Levon 2016, Bednarek & Caple 2017, Price 2022), but to date, all such studies have been confined to traditional lexically based keywords, rather than key items generated at other linguistic levels such as parts of speech (POS), semantic domains, and speech acts. The current paper seeks to address this research gap, posing the question: How successfully can ProtAnt identify prototypical and outlier texts in corpora at various non-lexical linguistic levels? We address this question through a series of experiments. Results show that ProtAnt is able to use key POS-tags to identify stylistically prototypical texts in registers of the AmE06 corpus of American English, and also to flag outlier texts that have been artificially included from another register. With semantic tags, outliers are identified with still greater success, whereas speech act tags (in SPICE-Ireland) yield more mixed results. On the whole, non-lexical key items are able to complement those at the lexical level in profiling texts, with success seemingly affected by the granularity of the tags, the accuracy of the linguistic annotations, and the degree of specialization of the register.
We discuss the theoretical and practical implications of our work in areas such as grammar, stylistics, discourse analysis, and data-driven learning (DDL).
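To make the ranking idea concrete, the following is a minimal sketch of keyword-based prototypicality ranking applied to tag sequences rather than words. It is not ProtAnt's actual implementation: the function names, the use of log-likelihood with a 3.84 cut-off, and the scoring of each text by its proportion of key tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for an item seen a times in a target corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in target
    e2 = d * (a + b) / (c + d)  # expected frequency in reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_items(target_texts, reference_texts, threshold=3.84):
    """Return items (words or tags) overused in the target corpus.
    The threshold is an assumed cut-off (p < 0.05, 1 d.f.)."""
    target = Counter(tok for text in target_texts for tok in text)
    reference = Counter(tok for text in reference_texts for tok in text)
    c, d = sum(target.values()), sum(reference.values())
    keys = set()
    for item, a in target.items():
        b = reference.get(item, 0)
        overused = a / c > b / d
        if overused and log_likelihood(a, b, c, d) >= threshold:
            keys.add(item)
    return keys


def rank_by_prototypicality(target_texts, keys):
    """Rank texts from most to least prototypical. The score used here,
    the proportion of tokens that are key items, is an assumption."""
    scored = []
    for idx, text in enumerate(target_texts):
        hits = sum(1 for tok in text if tok in keys)
        score = hits / len(text) if text else 0.0
        scored.append((idx, score))
    # Highest score first; ties broken by original position.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Each text is passed as a list of tokens, so the same functions accept words, POS tags, semantic tags, or speech act labels. Under this scoring, a text inserted from another register would be expected to fall towards the bottom of the ranking.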
References
Anthony, L. & Baker, P. (2015). ProtAnt: A tool for analysing the prototypicality of texts. International Journal of Corpus Linguistics 20(3): 273-292.
Bednarek, M. & Caple, H. (2017). The Discourse of News Values: How News Organizations Create Newsworthiness. Oxford: Oxford University Press.
Gries, S. (2003). Towards a corpus-based identification of prototypical instances of constructions. Annual Review of Cognitive Linguistics 1: 1-27. DOI: 10.1075/arcl.1.02gri
Levon, E. (2016). Qualitative analysis of stance. In Baker, P. & Egbert, J. (eds.) Triangulating Methodological Approaches in Corpus Linguistic Research. London/New York: Routledge.
Price, H. (2022). The Language of Mental Illness: Corpus Linguistics and the Construction of Mental Illness in the Press. Cambridge: Cambridge University Press.
Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology: General 104(3): 192–233. DOI: 10.1037/0096-3445.104.3.192
Scott, M. (1997). PC analysis of key words - and key key words. System 25(2): 233-245.