Electronic data

  • icame_44_anthony_et_al

    Final published version, 2.66 MB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Understanding corpus text prototypicality: A multifaceted problem

Research output: Contribution to conference - Without ISBN/ISSN, Conference paper, peer-review

Published
  • Laurence Anthony
  • Nicholas Smith
  • Sebastian Hoffmann
  • Paul Rayson
Publication date: 18/05/2023
Original language: English
Event: International Computer Archive of Modern and Medieval English (ICAME 44) - North-West University and Emerald Resort, Vanderbijlpark, South Africa
Duration: 17/05/2023 to 21/05/2023
https://icame.info/icame-44/

Conference

Conference: International Computer Archive of Modern and Medieval English (ICAME 44)
Abbreviated title: ICAME44
Country/Territory: South Africa
City: Vanderbijlpark
Period: 17/05/23 to 21/05/23

Abstract

Prototypicality is a complex, multifaceted concept relating to the centrality and typicality of examples in a category. While prominent in cognitive psychology and linguistics, it is often overlooked in corpus studies. Corpora are ideally built to be representative of a target domain or language variety. To achieve this goal, corpus builders need to identify an accurate sampling frame and collect relevant texts that capture the diversity of language in and across the sampling categories. In practice, however, corpora are built within the limitations of text availability, time, and human resources, leading to questions about the suitability/prototypicality of individual texts in a corpus and their effect on the representativeness of the corpus as a whole. Prototypicality also comes into play at the analysis stage. Most corpus analysis approaches, including concordance and keyword analysis, use the corpus as a whole as the unit of analysis. To validate findings, a necessary but often omitted step is the close reading of individual texts. Here, a significant challenge is identifying which texts to read. A researcher may decide to choose texts randomly, but it is an open question whether such texts are representative/prototypical of the corpus. Prototypicality also comes into play when corpora are used for pedagogic purposes, such as Data-Driven Learning (DDL). In these situations, there is often an implicit conflation of two facets of prototypicality, namely frequency of use and closeness to an ideal, particularly in the case of expert writing.

In this paper, we first outline the multifaceted character of corpus text prototypicality. Next, we describe experiments that attempt to rank the prototypicality of individual corpus texts at different linguistic levels as a guide to choosing texts for close reading or excluding texts from a corpus at the data collection stage. Results using a modified version of the ProtAnt tool (Anthony and Baker, 2015) show that prototypicality rankings can be dramatically affected by the linguistic level of analysis applied. Standard keywords effectively rank the prototypicality of texts in terms of topic, but the results can be enhanced using key semantic tags. On the other hand, key part-of-speech (POS) tags allow for a more nuanced view of text prototypicality centered on stylistics. The results also reveal the limitations of current corpus software tools and offer suggestions for how new tools might be developed to increase our understanding of prototypicality at the textual level.
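
As a rough illustration of keyword-based prototypicality ranking, the Python sketch below finds items that are overused in a target corpus relative to a reference corpus and then orders the target texts by the share of their tokens that are key items. It is not the ProtAnt implementation and not the modified version used in the paper. The function names, the log-likelihood keyness statistic, the 3.84 threshold, and the length-normalized score are all assumptions made for illustration, and both corpora are assumed to be non-empty lists of pre-tokenized texts.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for an item seen a times in a target corpus
    of c tokens and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_target)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2 * ll


def key_items(target_texts, reference_texts, threshold=3.84):
    """Return items overused in the target corpus relative to the reference.

    The threshold of 3.84 is an assumed cut-off, not a value from the paper.
    """
    target = Counter(tok for text in target_texts for tok in text)
    reference = Counter(tok for text in reference_texts for tok in text)
    c, d = sum(target.values()), sum(reference.values())
    keys = set()
    for item, a in target.items():
        b = reference.get(item, 0)
        overused = a / c > b / d
        if overused and log_likelihood(a, b, c, d) >= threshold:
            keys.add(item)
    return keys


def rank_by_prototypicality(target_texts, reference_texts, threshold=3.84):
    """Rank target texts by the proportion of their tokens that are key items.

    Returns (text_index, score) pairs, highest score first. The scoring rule
    is a hypothetical stand-in for whatever a real tool applies.
    """
    keys = key_items(target_texts, reference_texts, threshold)
    scores = []
    for index, text in enumerate(target_texts):
        hits = sum(1 for tok in text if tok in keys)
        score = hits / len(text) if text else 0.0
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because the functions only see sequences of items, the same sketch could in principle be run on sequences of semantic tags or POS tags instead of word forms, which is one way to picture how the choice of linguistic level might change the resulting ranking.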