Normalising the corpus of English dialogues (1560-1760) using VARD2: decisions and justifications

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

Standard

Normalising the corpus of English dialogues (1560-1760) using VARD2: decisions and justifications. / Archer, Dawn; Kytö, Merja; Baron, Alistair et al.
2014. p. 23. Paper presented at 35th ICAME conference, Nottingham, United Kingdom.

Vancouver

Archer D, Kytö M, Baron A, Rayson P. Normalising the corpus of English dialogues (1560-1760) using VARD2: decisions and justifications. 2014. Paper presented at 35th ICAME conference, Nottingham, United Kingdom.

Author

Archer, Dawn; Kytö, Merja; Baron, Alistair et al. / Normalising the corpus of English dialogues (1560-1760) using VARD2: decisions and justifications. Paper presented at 35th ICAME conference, Nottingham, United Kingdom. 1 p.

Bibtex

@conference{3d66ca79dafb444d91a1fe4eac1acb92,
title = "Normalising the corpus of English dialogues (1560-1760) using VARD2: decisions and justifications",
abstract = "The development of (semi-)automatic tools such as the VARD (Baron and Rayson, 2008) has afforded compilers of historical corpora the opportunity to normalise variant spellings relatively quickly – following, that is, a dedicated period of manual training using relevant corpus samples (see, e.g., Lehto et al. 2010). In the case of VARD2, this period of manual training involves the user: (i) reading a given text, via the VARD interface, (ii) distinguishing variants within the text – via the tool{\textquoteright}s recommended list of (ranked) candidate replacements – or personally – by highlighting variant forms manually, (iii) choosing the most appropriate normalised form for each variant found – where relevant, being guided by the VARD{\textquoteright}s known variant list or f-score calculation (derived from, e.g., letter replacement rules, edit distance measures and/or phonetic matching algorithms), (iv) replacing the variant with the normalised form – but in such a way that the original spelling is retained in an XML tag (Baron and Rayson, 2008).
The corpus-linguistic argument for normalisation is that it helps improve automated techniques such as part-of-speech and keyword analysis, thereby allowing existing linguistic tools to be used unmodified (see, e.g., Archer et al. 2003; Rayson et al. 2007a/b; Rayson et al. 2009; Hiltunen and Tyrkk{\"o} 2013). But such normalisation needs to be handled sensitively: so that, for example, we can maintain – within the text – the original spelling of those forms which convey important morphosyntactic or orthographic information (as opposed to retaining these original spellings as part of the XML tag – see (iv)). Hence the inclusion of an IGNORE VARIANT facility within VARD.
In this paper, we outline some of the decisions we have made, in respect to the Corpus of English Dialogues (CED), when determining which features required normalisation and which should be left as they were originally (and why). Compiled by Merja Kyt{\"o} and Jonathan Culpeper, the CED covers a 200-year period (1560-1760) and contains speech-related texts representative of five genres – the courtroom, witness proceedings, comedy dramas, prose fiction and handbooks – as well as a group of texts subsumed under a miscellaneous category.
In particular, we will discuss our treatment of: names; the genitive construction; auxiliaries and verbs; (open-hyphenated-closed) compounds; abbreviations; graphemes such as the tilde; terms which are now archaic, obsolete or rare; foreign terms; dialect terms; and personal pronouns. This work, although focussed on the CED, also has a wider aim: determining the feasibility of developing normalisation guidelines that are generalisable to other historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) and EEBO (Early English Books Online). Hence, as part of the presentation, we will compare the normalisation decisions made in respect to the CED with those made in respect to the Early Modern English Medical Texts (see Lehto et al. 2010).
REFERENCES
Archer, D., McEnery, A. M., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson and A. M. McEnery (eds.) Proceedings of the Corpus Linguistics Conference 2003. Lancaster: University of Lancaster. 22–31.
Baron, A. and Rayson, P. (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. Gonz{\'a}lez-D{\'i}az and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference, CL2009, University of Liverpool, UK, 20-23 July 2009. See http://ucrel.lancs.ac.uk/publications/cl2009/314_FullPaper.pdf
Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.
Baron, A. and Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, UK, 22 May 2008. http://eprints.lancs.ac.uk/41666/1/BaronRaysonAston2008.pdf
A Corpus of English Dialogues 1560-1760. (2006). Compiled under the supervision of Merja Kyt{\"o} (Uppsala University) and Jonathan Culpeper (Lancaster University).
Hiltunen, T. and Tyrkk{\"o}, J. (2013). Tagging Early Modern English Medical Texts. Corpus Analysis with Noise in the Signal (CANS) 2013. Lancaster University. See http://ucrel.lancs.ac.uk/cans2013/
Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In I. Taavitsainen and P. Pahta (eds.) Early Modern English Medical Texts: Corpus Description and Studies. Amsterdam: John Benjamins. 279-290.
Rayson, P., Archer, D., Baron, A. and Smith, N. (2007a). Tagging historical corpora – the problem of spelling variation. In Proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, International Conference and Research Center for Computer Science, Schloss Dagstuhl, Wadern, Germany, 3rd-8th December 2006. ISSN 1862-4405. http://www.comp.lancs.ac.uk/~paul/publications/rabs_extAbs_dagstuhl06.pdf
Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007b). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In: Proceedings of the Corpus Linguistics Conference 2007. Birmingham: University of Birmingham. http://comp.eprints.lancs.ac.uk/1528/1/192_Paper.pdf",
author = "Dawn Archer and Merja Kyt{\"o} and Alistair Baron and Paul Rayson",
year = "2014",
month = may,
day = "4",
language = "English",
pages = "23",
note = "35th ICAME conference ; Conference date: 30-04-2014 Through 04-05-2014",

}

RIS

TY - CONF

T1 - Normalising the corpus of English dialogues (1560-1760) using VARD2

T2 - 35th ICAME conference

AU - Archer, Dawn

AU - Kytö, Merja

AU - Baron, Alistair

AU - Rayson, Paul

PY - 2014/5/4

Y1 - 2014/5/4

N2 - The development of (semi-)automatic tools such as the VARD (Baron and Rayson, 2008) has afforded compilers of historical corpora the opportunity to normalise variant spellings relatively quickly – following, that is, a dedicated period of manual training using relevant corpus samples (see, e.g., Lehto et al. 2010). In the case of VARD2, this period of manual training involves the user: (i) reading a given text, via the VARD interface, (ii) distinguishing variants within the text – via the tool’s recommended list of (ranked) candidate replacements – or personally – by highlighting variant forms manually, (iii) choosing the most appropriate normalised form for each variant found – where relevant, being guided by the VARD’s known variant list or f-score calculation (derived from, e.g., letter replacement rules, edit distance measures and/or phonetic matching algorithms), (iv) replacing the variant with the normalised form – but in such a way that the original spelling is retained in an XML tag (Baron and Rayson, 2008).
The corpus-linguistic argument for normalisation is that it helps improve automated techniques such as part-of-speech and keyword analysis, thereby allowing existing linguistic tools to be used unmodified (see, e.g., Archer et al. 2003; Rayson et al. 2007a/b; Rayson et al. 2009; Hiltunen and Tyrkkö 2013). But such normalisation needs to be handled sensitively: so that, for example, we can maintain – within the text – the original spelling of those forms which convey important morphosyntactic or orthographic information (as opposed to retaining these original spellings as part of the XML tag – see (iv)). Hence the inclusion of an IGNORE VARIANT facility within VARD.
In this paper, we outline some of the decisions we have made, in respect to the Corpus of English Dialogues (CED), when determining which features required normalisation and which should be left as they were originally (and why). Compiled by Merja Kytö and Jonathan Culpeper, the CED covers a 200-year period (1560-1760) and contains speech-related texts representative of five genres – the courtroom, witness proceedings, comedy dramas, prose fiction and handbooks – as well as a group of texts subsumed under a miscellaneous category.
In particular, we will discuss our treatment of: names; the genitive construction; auxiliaries and verbs; (open-hyphenated-closed) compounds; abbreviations; graphemes such as the tilde; terms which are now archaic, obsolete or rare; foreign terms; dialect terms; and personal pronouns. This work, although focussed on the CED, also has a wider aim: determining the feasibility of developing normalisation guidelines that are generalisable to other historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) and EEBO (Early English Books Online). Hence, as part of the presentation, we will compare the normalisation decisions made in respect to the CED with those made in respect to the Early Modern English Medical Texts (see Lehto et al. 2010).
REFERENCES
Archer, D., McEnery, A. M., Rayson, P. and Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In D. Archer, P. Rayson, A. Wilson and A. M. McEnery (eds.) Proceedings of the Corpus Linguistics Conference 2003. Lancaster: University of Lancaster. 22–31.
Baron, A. and Rayson, P. (2009). Automatic standardization of texts containing spelling variation, how much training data do you need? In M. Mahlberg, V. González-Díaz and C. Smith (eds.) Proceedings of the Corpus Linguistics Conference, CL2009, University of Liverpool, UK, 20-23 July 2009. See http://ucrel.lancs.ac.uk/publications/cl2009/314_FullPaper.pdf
Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and key word statistics in historical corpus linguistics. Anglistik: International Journal of English Studies, 20 (1), pp. 41–67.
Baron, A. and Rayson, P. (2008). VARD 2: A tool for dealing with spelling variation in historical corpora. Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, UK, 22 May 2008. http://eprints.lancs.ac.uk/41666/1/BaronRaysonAston2008.pdf
A Corpus of English Dialogues 1560-1760. (2006). Compiled under the supervision of Merja Kytö (Uppsala University) and Jonathan Culpeper (Lancaster University).
Hiltunen, T. and Tyrkkö, J. (2013). Tagging Early Modern English Medical Texts. Corpus Analysis with Noise in the Signal (CANS) 2013. Lancaster University. See http://ucrel.lancs.ac.uk/cans2013/
Lehto, A., Baron, A., Ratia, M. and Rayson, P. (2010). Improving the precision of corpus methods: The standardized version of Early Modern English Medical Texts. In I. Taavitsainen and P. Pahta (eds.) Early Modern English Medical Texts: Corpus Description and Studies. Amsterdam: John Benjamins. 279-290.
Rayson, P., Archer, D., Baron, A. and Smith, N. (2007a). Tagging historical corpora – the problem of spelling variation. In Proceedings of Digital Historical Corpora, Dagstuhl-Seminar 06491, International Conference and Research Center for Computer Science, Schloss Dagstuhl, Wadern, Germany, 3rd-8th December 2006. ISSN 1862-4405. http://www.comp.lancs.ac.uk/~paul/publications/rabs_extAbs_dagstuhl06.pdf
Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007b). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In: Proceedings of the Corpus Linguistics Conference 2007. Birmingham: University of Birmingham. http://comp.eprints.lancs.ac.uk/1528/1/192_Paper.pdf

M3 - Conference paper

SP - 23

Y2 - 30 April 2014 through 4 May 2014

ER -