Automatic standardisation of texts containing spelling variation: How much training data do you need?

Research output: Contribution in Book/Report/Proceedings, Paper

Published

Publication date: 2009
Host publication: Proceedings of the Corpus Linguistics Conference: CL2009
Editors: Michaela Mahlberg, Victorina González-Díaz, Catherine Smith
Place of publication: Lancaster
Publisher: Lancaster University
Number of pages: 25
Original language: English

Conference

Conference: Corpus Linguistics 2009
City: Liverpool, UK
Period: 20/07/09 to 23/07/09

Abstract

Large quantities of spelling variation in corpora, such as that found in Early Modern English, can cause significant problems for corpus linguistic tools and methods. Having texts with standardised spelling is key to making such tools and methods accurate and meaningful in their analysis. Gaining access to such versions of texts can be problematic, however, and manual standardisation of the texts is often too time-consuming to be feasible. Our solution is a piece of software named VARD 2, which can be used to manually and automatically standardise spelling variation in individual texts or in corpora of any size. This paper evaluates VARD 2’s performance on a corpus of Early Modern English letters and a corpus of children’s written English. The software’s ability to learn from manual standardisation is put under particular scrutiny as we examine what effect different levels of training have on its performance.