Automatic standardisation of texts containing spelling variation: How much training data do you need?

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published

Standard

Automatic standardisation of texts containing spelling variation: How much training data do you need? / Baron, Alistair; Rayson, Paul.
Proceedings of the Corpus Linguistics Conference: CL2009. ed. / Michaela Mahlberg; Victorina González-Díaz; Catherine Smith. Lancaster: Lancaster University, 2009.

Harvard

Baron, A & Rayson, P 2009, Automatic standardisation of texts containing spelling variation: How much training data do you need? in M Mahlberg, V González-Díaz & C Smith (eds), Proceedings of the Corpus Linguistics Conference: CL2009. Lancaster University, Lancaster, Corpus Linguistics 2009, Liverpool, UK, 20/07/09. <http://ucrel.lancs.ac.uk/publications/cl2009/>

APA

Baron, A., & Rayson, P. (2009). Automatic standardisation of texts containing spelling variation: How much training data do you need? In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference: CL2009. Lancaster University. http://ucrel.lancs.ac.uk/publications/cl2009/

Vancouver

Baron A, Rayson P. Automatic standardisation of texts containing spelling variation: How much training data do you need? In Mahlberg M, González-Díaz V, Smith C, editors, Proceedings of the Corpus Linguistics Conference: CL2009. Lancaster: Lancaster University. 2009

Author

Baron, Alistair; Rayson, Paul. / Automatic standardisation of texts containing spelling variation: How much training data do you need? Proceedings of the Corpus Linguistics Conference: CL2009. editor / Michaela Mahlberg; Victorina González-Díaz; Catherine Smith. Lancaster: Lancaster University, 2009.

Bibtex

@inproceedings{917042800daf47248038c327c8e15731,
title = "Automatic standardisation of texts containing spelling variation: How much training data do you need?",
abstract = "Large quantities of spelling variation in corpora, such as that found in Early Modern English, can cause significant problems for corpus linguistic tools and methods. Having texts with standardised spelling is key to making such tools and methods accurate and meaningful in their analysis. Gaining access to such versions of texts can be problematic however, and manual standardisation of the texts is often too time-consuming to be feasible. Our solution is a piece of software named VARD 2 which can be used to manually and automatically standardise spelling variation in individual texts, or corpora of any size. This paper evaluates VARD 2{\textquoteright}s performance on a corpus of Early Modern English letters and a corpus of children{\textquoteright}s written English. The software{\textquoteright}s ability to learn from manual standardisation is put under particular scrutiny as we examine what effect different levels of training have on its performance.",
author = "Alistair Baron and Paul Rayson",
year = "2009",
language = "English",
editor = "Mahlberg, Michaela and Gonz{\'a}lez-D{\'i}az, Victorina and Smith, Catherine",
booktitle = "Proceedings of the Corpus Linguistics Conference",
publisher = "Lancaster University",
note = "Corpus Linguistics 2009 ; Conference date: 20-07-2009 Through 23-07-2009",

}

RIS

TY - GEN

T1 - Automatic standardisation of texts containing spelling variation: How much training data do you need?

AU - Baron, Alistair

AU - Rayson, Paul

PY - 2009

Y1 - 2009

N2 - Large quantities of spelling variation in corpora, such as that found in Early Modern English, can cause significant problems for corpus linguistic tools and methods. Having texts with standardised spelling is key to making such tools and methods accurate and meaningful in their analysis. Gaining access to such versions of texts can be problematic however, and manual standardisation of the texts is often too time-consuming to be feasible. Our solution is a piece of software named VARD 2 which can be used to manually and automatically standardise spelling variation in individual texts, or corpora of any size. This paper evaluates VARD 2’s performance on a corpus of Early Modern English letters and a corpus of children’s written English. The software’s ability to learn from manual standardisation is put under particular scrutiny as we examine what effect different levels of training have on its performance.

AB - Large quantities of spelling variation in corpora, such as that found in Early Modern English, can cause significant problems for corpus linguistic tools and methods. Having texts with standardised spelling is key to making such tools and methods accurate and meaningful in their analysis. Gaining access to such versions of texts can be problematic however, and manual standardisation of the texts is often too time-consuming to be feasible. Our solution is a piece of software named VARD 2 which can be used to manually and automatically standardise spelling variation in individual texts, or corpora of any size. This paper evaluates VARD 2’s performance on a corpus of Early Modern English letters and a corpus of children’s written English. The software’s ability to learn from manual standardisation is put under particular scrutiny as we examine what effect different levels of training have on its performance.

M3 - Conference contribution/Paper

BT - Proceedings of the Corpus Linguistics Conference

A2 - Mahlberg, Michaela

A2 - González-Díaz, Victorina

A2 - Smith, Catherine

PB - Lancaster University

CY - Lancaster

T2 - Corpus Linguistics 2009

Y2 - 20 July 2009 through 23 July 2009

ER -