Normalising the Corpus of English Dialogues (1560-1760) using VARD2: decisions and justifications

Publication date: 4/05/2014
35th ICAME conference - University of Nottingham, Nottingham, United Kingdom
Duration: 30/04/2014 – 4/05/2014


Conference: 35th ICAME conference
Country: United Kingdom


The development of (semi-)automatic tools such as the VARD (Baron and Rayson, 2008) has afforded compilers of historical corpora the opportunity to normalise variant spellings relatively quickly – following, that is, a dedicated period of manual training using relevant corpus samples (see, e.g., Lehto et al. 2010).
In the case of VARD2, this period of manual training involves the user:

(i) reading a given text, via the VARD interface,
(ii) identifying variants within the text – either via the tool’s recommended list of (ranked) candidate replacements or by highlighting variant forms manually,
(iii) choosing the most appropriate normalised form for each variant found – where relevant, being guided by the VARD’s known variant list or f-score calculation (derived from, e.g., letter replacement rules, edit distance measures and/or phonetic matching algorithms),
(iv) replacing the variant with the normalised form – but in such a way that the original spelling is retained in an XML tag (Baron and Rayson, 2008).
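Steps (ii) to (iv) can be illustrated with a minimal Python sketch. It is not VARD2's implementation: candidate ranking here uses the generic string-similarity ratio from Python's `difflib` as a stand-in for the tool's combined scoring, and the tag name `normalised`, the attribute name `orig`, the word list and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from xml.sax.saxutils import escape, quoteattr


def rank_candidates(variant, lexicon):
    """Rank modern forms by string similarity to the variant, best first.

    SequenceMatcher.ratio() is a stand-in for VARD2's own scoring.
    """
    scored = [
        (SequenceMatcher(None, variant.lower(), word.lower()).ratio(), word)
        for word in lexicon
    ]
    # Sort by descending score, then alphabetically to break ties.
    return [word for score, word in sorted(scored, key=lambda p: (-p[0], p[1]))]


def normalise(variant, replacement):
    """Replace the variant but keep the original spelling in an XML attribute.

    The tag and attribute names are assumptions, not VARD2's documented output.
    """
    return f"<normalised orig={quoteattr(variant)}>{escape(replacement)}</normalised>"


# Hypothetical modern word list and historical variant.
lexicon = ["virtue", "venture", "value"]
candidates = rank_candidates("vertue", lexicon)
tagged = normalise("vertue", candidates[0])
print(candidates)
print(tagged)
```

In practice the user, not the ranking, makes the final choice in step (iii); the sketch simply takes the top candidate to keep the example short.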

The corpus-linguistic argument for normalisation is that it helps improve automated techniques such as part-of-speech tagging and keyword analysis, thereby allowing existing linguistic tools to be used unmodified (see, e.g., Archer et al. 2003; Rayson et al. 2007a/b; Rayson et al. 2009; Hiltunen and Tyrkkö 2013). But such normalisation needs to be handled sensitively, so that, for example, we can maintain – within the text – the original spelling of those forms which convey important morphosyntactic or orthographic information (as opposed to retaining these original spellings only as part of the XML tag – see (iv)). Hence the inclusion of an IGNORE VARIANT facility within VARD.
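The effect on frequency-based analysis, and the role of an ignore list, can be shown with a small Python sketch. The variant mapping, the ignored form and the token sequence below are invented for illustration and do not reflect the decisions actually made for the CED; the choice of "doth" as a form worth preserving is likewise only an assumption.

```python
from collections import Counter

# Hypothetical variant-to-modern mapping (not taken from the CED).
VARIANTS = {"vertue": "virtue", "vertu": "virtue", "onely": "only"}

# Hypothetical ignore list: forms kept as they are in the text because
# their spelling is assumed to carry morphosyntactic information.
IGNORE = {"doth"}


def normalise_tokens(tokens):
    """Map known variants to modern forms, leaving ignored forms untouched."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in IGNORE:
            result.append(token)
        else:
            result.append(VARIANTS.get(key, token))
    return result


tokens = "vertue doth onely vertu virtue".split()
raw_counts = Counter(tokens)
norm_counts = Counter(normalise_tokens(tokens))

# Before normalisation the three spellings of "virtue" are counted separately;
# afterwards they are pooled, while "doth" is left unchanged.
print(raw_counts["virtue"], norm_counts["virtue"], norm_counts["doth"])
```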

In this paper, we outline some of the decisions we have made, with respect to the Corpus of English Dialogues (CED), when determining which features required normalisation and which should be left as they were originally (and why). Compiled by Merja Kytö and Jonathan Culpeper, the CED covers a 200-year period (1560-1760) and contains speech-related texts representative of five genres – the courtroom, witness proceedings, comedy dramas, prose fiction and handbooks – as well as a group of texts subsumed under a miscellaneous category.

In particular, we will discuss our treatment of: names; the genitive construction; auxiliaries and verbs; (open-hyphenated-closed) compounds; abbreviations; graphemes such as the tilde; terms which are now archaic, obsolete or rare; foreign terms; dialect terms; and personal pronouns. This work, although focussed on the CED, also has a wider aim: determining the feasibility of developing normalisation guidelines that are generalisable to other historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) and EEBO (Early English Books Online). Hence, as part of the presentation, we will compare the normalisation decisions made with respect to the CED with those made with respect to the Early Modern English Medical Texts (see Lehto et al. 2010).

