Tagging the Bard: Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN; Conference contribution/Paper; peer-review

Published

Standard

Tagging the Bard: Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora. / Rayson, Paul; Archer, Dawn; Baron, Alistair et al.
Proceedings of the Corpus Linguistics Conference: CL2007. ed. / Matthew Davies; Paul Rayson; Susan Hunston; Pernilla Danielsson. UCREL, 2007.

Harvard

Rayson, P, Archer, D, Baron, A, Culpeper, J & Smith, N 2007, Tagging the Bard: Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora. in M Davies, P Rayson, S Hunston & P Danielsson (eds), Proceedings of the Corpus Linguistics Conference: CL2007. UCREL, Corpus Linguistics Conference (CL2007), University of Birmingham, UK, 27/07/07. <http://ucrel.lancs.ac.uk/publications/CL2007/paper/192_Paper.pdf>

APA

Rayson, P., Archer, D., Baron, A., Culpeper, J., & Smith, N. (2007). Tagging the Bard: Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora. In M. Davies, P. Rayson, S. Hunston, & P. Danielsson (Eds.), Proceedings of the Corpus Linguistics Conference: CL2007. UCREL. http://ucrel.lancs.ac.uk/publications/CL2007/paper/192_Paper.pdf

Vancouver

Rayson P, Archer D, Baron A, Culpeper J, Smith N. Tagging the Bard: Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora. In Davies M, Rayson P, Hunston S, Danielsson P, editors, Proceedings of the Corpus Linguistics Conference: CL2007. UCREL. 2007

Author

Rayson, Paul ; Archer, Dawn ; Baron, Alistair et al. / Tagging the Bard : Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora. Proceedings of the Corpus Linguistics Conference: CL2007. editor / Matthew Davies ; Paul Rayson ; Susan Hunston ; Pernilla Danielsson. UCREL, 2007.

Bibtex

@inproceedings{de1bbb68e351415fac28c417d7e26111,
title = "Tagging the Bard: Evaluating the Accuracy of a Modern POS Tagger on Early Modern English Corpora",
abstract = "In this paper we focus on automatic part-of-speech (POS) annotation, in the context of historical English texts. Techniques that were originally developed for modern English have been applied to numerous other languages over recent years. Despite this diversification, it is still almost invariably the case that the texts being analysed are from contemporary rather than historical sources. Although there is some recognition among historical linguists of the advantages of annotation for the retrieval of lexical, grammatical and other linguistic phenomena, the implementation of such forms of annotation by automatic methods is problematic. For example, changes in grammar over time will lead to a mismatch between probabilistic language models derived from, say, Present-day English and Middle English. Similarly, variability and changes in spelling can cause problems for POS taggers with fixed lexicons and rulebases. To determine the extent of the problem, and develop possible solutions, we decided to evaluate the accuracy of existing POS taggers, trained on modern English, when they are applied to Early Modern English (EModE) datasets. We focus here on the CLAWS POS tagger, a hybrid rule-based and statistical tool for English, and use as experimental data the Shakespeare First Folio and the Lampeter Corpus. First, using a manually post-edited test set, we evaluate the accuracy of CLAWS when no modifications are made either to its grammatical model or to its lexicon. We then compare this output with CLAWS' performance when using a pre-processor that detects spelling variants and matches them to modern equivalents. This experiment highlights (i) the extent to which the handling of orthographic variants is sufficient for the tagging accuracy of EModE data to approximate to the levels attained on modern-day text(s), and (ii) in turn, whether revisions to the lexical resources and language models of POS taggers need to be made.",
author = "Paul Rayson and Dawn Archer and Alistair Baron and Jonathan Culpeper and Nicholas Smith",
year = "2007",
language = "English",
editor = "Matthew Davies and Paul Rayson and Susan Hunston and Pernilla Danielsson",
booktitle = "Proceedings of the Corpus Linguistics Conference",
publisher = "UCREL",
note = "Corpus Linguistics Conference (CL2007) ; Conference date: 27-07-2007 Through 30-07-2007"
}

RIS

TY - GEN

T1 - Tagging the Bard

T2 - Corpus Linguistics Conference (CL2007)

AU - Rayson, Paul

AU - Archer, Dawn

AU - Baron, Alistair

AU - Culpeper, Jonathan

AU - Smith, Nicholas

PY - 2007

Y1 - 2007

N2 - In this paper we focus on automatic part-of-speech (POS) annotation, in the context of historical English texts. Techniques that were originally developed for modern English have been applied to numerous other languages over recent years. Despite this diversification, it is still almost invariably the case that the texts being analysed are from contemporary rather than historical sources. Although there is some recognition among historical linguists of the advantages of annotation for the retrieval of lexical, grammatical and other linguistic phenomena, the implementation of such forms of annotation by automatic methods is problematic. For example, changes in grammar over time will lead to a mismatch between probabilistic language models derived from, say, Present-day English and Middle English. Similarly, variability and changes in spelling can cause problems for POS taggers with fixed lexicons and rulebases. To determine the extent of the problem, and develop possible solutions, we decided to evaluate the accuracy of existing POS taggers, trained on modern English, when they are applied to Early Modern English (EModE) datasets. We focus here on the CLAWS POS tagger, a hybrid rule-based and statistical tool for English, and use as experimental data the Shakespeare First Folio and the Lampeter Corpus. First, using a manually post-edited test set, we evaluate the accuracy of CLAWS when no modifications are made either to its grammatical model or to its lexicon. We then compare this output with CLAWS' performance when using a pre-processor that detects spelling variants and matches them to modern equivalents. This experiment highlights (i) the extent to which the handling of orthographic variants is sufficient for the tagging accuracy of EModE data to approximate to the levels attained on modern-day text(s), and (ii) in turn, whether revisions to the lexical resources and language models of POS taggers need to be made.

M3 - Conference contribution/Paper

BT - Proceedings of the Corpus Linguistics Conference

A2 - Davies, Matthew

A2 - Rayson, Paul

A2 - Hunston, Susan

A2 - Danielsson, Pernilla

PB - UCREL

Y2 - 27 July 2007 through 30 July 2007

ER -