
Normalising the corpus of English dialogues (1560-1760) using VARD2: decisions and justifications

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Publication date: 4/05/2014
Number of pages: 1
Original language: English
Event: 35th ICAME conference - University of Nottingham, Nottingham, United Kingdom
Duration: 30/04/2014 → 4/05/2014


Conference: 35th ICAME conference
Country: United Kingdom


The development of (semi-)automatic tools such as the VARD (Baron and Rayson, 2008) has afforded compilers of historical corpora the opportunity to normalise variant spellings relatively quickly – following, that is, a dedicated period of manual training using relevant corpus samples (see, e.g., Lehto et al. 2010).
In the case of VARD2, this period of manual training involves the user:

(i) reading a given text, via the VARD interface,
(ii) identifying variants within the text, either via the tool’s ranked list of recommended candidate replacements or personally, by highlighting variant forms manually,
(iii) choosing the most appropriate normalised form for each variant found – where relevant, being guided by the VARD’s known variant list or f-score calculation (derived from, e.g., letter replacement rules, edit distance measures and/or phonetic matching algorithms),
(iv) replacing the variant with the normalised form – but in such a way that the original spelling is retained in an XML tag (Baron and Rayson, 2008).
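
Steps (ii)–(iv) can be illustrated with a minimal sketch. It is not VARD2's implementation: the small lexicon, the known-variant table, the use of `difflib` string similarity as a stand-in for the tool's own scoring, and the `normalised` tag and `orig` attribute names are all assumptions made for illustration.

```python
import difflib
from xml.sax.saxutils import escape, quoteattr

# Hypothetical modern lexicon and known-variant table (illustrative only).
LEXICON = {"said", "have", "very", "love", "you", "i"}
KNOWN_VARIANTS = {"sayd": "said", "haue": "have"}


def rank_candidates(token, lexicon=LEXICON, limit=3):
    """Return candidate replacements for a variant, best first."""
    word = token.lower()
    if word in KNOWN_VARIANTS:
        return [KNOWN_VARIANTS[word]]
    # difflib's similarity ratio stands in for VARD2's own scoring (assumption).
    return difflib.get_close_matches(word, sorted(lexicon), n=limit, cutoff=0.6)


def normalise_token(token):
    """Replace a variant with its top candidate, keeping the original in a tag."""
    if token.lower() in LEXICON:
        return escape(token)  # already a modern form
    candidates = rank_candidates(token)
    if not candidates:
        return escape(token)  # nothing suitable: leave for manual review
    return "<normalised orig={}>{}</normalised>".format(
        quoteattr(token), escape(candidates[0])
    )


def normalise_text(text):
    """Normalise a whitespace-tokenised string."""
    return " ".join(normalise_token(t) for t in text.split())
```

Because the original spelling travels with each replacement, a later pass can in principle restore the source text or search on either form.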

The corpus-linguistic argument for normalisation is that it helps improve automated techniques such as part-of-speech and keyword analysis, thereby allowing existing linguistic tools to be used unmodified (see, e.g., Archer et al. 2003; Rayson et al. 2007a/b; Rayson et al. 2009; Hiltunen and Tyrkkö 2013). But such normalisation needs to be handled sensitively, so that, for example, we can maintain within the text itself the original spelling of those forms which convey important morphosyntactic or orthographic information (as opposed to retaining these original spellings only as part of the XML tag – see (iv)). Hence the inclusion of an IGNORE VARIANT facility within VARD.
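
The effect of leaving selected forms untouched can be sketched as follows. This is a hedged illustration rather than VARD's IGNORE VARIANT facility itself: the function name, the decision labels, and the example forms in both tables are hypothetical and do not report decisions actually taken for the CED.

```python
# Hypothetical ignore list: forms an editor has chosen to keep as written.
IGNORED_FORMS = {"thou", "hath"}
# Hypothetical replacements for variants that should be normalised.
REPLACEMENTS = {"haue": "have", "loue": "love"}


def apply_decisions(tokens, ignored=IGNORED_FORMS, replacements=REPLACEMENTS):
    """Return (output_token, decision) pairs for each input token."""
    results = []
    for token in tokens:
        key = token.lower()
        if key in ignored:
            # Keep the original spelling in the running text.
            results.append((token, "ignored"))
        elif key in replacements:
            results.append((replacements[key], "normalised"))
        else:
            results.append((token, "unchanged"))
    return results
```

Checking the ignore list before the replacement table means that a form flagged for retention is never overwritten, even if a replacement happens to exist for it.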

In this paper, we outline some of the decisions we have made, with respect to the Corpus of English Dialogues (CED), when determining which features required normalisation and which should be left as they were originally (and why). Compiled by Merja Kytö and Jonathan Culpeper, the CED covers a 200-year period (1560-1760) and contains speech-related texts representative of five genres – the courtroom, witness proceedings, comedy dramas, prose fiction and handbooks – as well as a group of texts subsumed under a miscellaneous category.

In particular, we will discuss our treatment of: names; the genitive construction; auxiliaries and verbs; (open-hyphenated-closed) compounds; abbreviations; graphemes such as the tilde; terms which are now archaic, obsolete or rare; foreign terms; dialect terms; and personal pronouns. This work, although focussed on the CED, also has a wider aim: determining the feasibility of developing normalisation guidelines that are generalisable to other historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) and EEBO (Early English Books Online). Hence, as part of the presentation, we will compare the normalisation decisions made with respect to the CED with those made with respect to the Early Modern English Medical Texts (see Lehto et al. 2010).

Archer, D., McEnery, A. M., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson and A. M. McEnery (eds.) Proceedings of the Corpus Linguistics Conference 2003. Lancaster: University of Lancaster. 22–31.
Baron, A. and Rayson, P. (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference, CL2009, University of Liverpool, UK, 20–23 July 2009. http://ucrel.lancs.ac.uk/publications/cl2009/314_FullPaper.pdf
Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1). 41–67.
Baron, A. and Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, UK, 22 May 2008. http://eprints.lancs.ac.uk/41666/1/BaronRaysonAston2008.pdf
A Corpus of English Dialogues 1560-1760. (2006). Compiled under the supervision of Merja Kytö (Uppsala University) and Jonathan Culpeper (Lancaster University).
Hiltunen, T. and Tyrkkö, J. (2013). Tagging Early Modern English Medical Texts. Corpus Analysis with Noise in the Signal (CANS) 2013. Lancaster University. http://ucrel.lancs.ac.uk/cans2013/
Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In I. Taavitsainen and P. Pahta (eds.) Early Modern English Medical Texts: Corpus Description and Studies. Amsterdam: John Benjamins. 279–290.
Rayson, P., Archer, D., Baron, A. and Smith, N. (2007a). Tagging historical corpora – the problem of spelling variation. In Proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, International Conference and Research Center for Computer Science, Schloss Dagstuhl, Wadern, Germany, 3rd-8th December 2006. ISSN 1862-4405. http://www.comp.lancs.ac.uk/~paul/publications/rabs_extAbs_dagstuhl06.pdf
Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007b). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of the Corpus Linguistics Conference 2007. Birmingham: University of Birmingham. http://comp.eprints.lancs.ac.uk/1528/1/192_Paper.pdf