
Electronic data

  • 2017Joulain-JayPhD

    Final published version, 15.4 MB, PDF document

    Available under license: CC BY-ND: Creative Commons Attribution-NoDerivatives 4.0 International License



Corpus linguistics for History: the methodology of investigating place-name discourses in digitised nineteenth-century newspapers

Research output: Thesis › Doctoral Thesis

Published
Publication date: 2017
Number of pages: 394
Qualification: PhD
Awarding Institution
Supervisors/Advisors
Publisher
  • Lancaster University
Original language: English

Abstract

The increasing availability of historical sources in digital form has led to calls for new forms of reading in history. This thesis responds to these calls by exploring how approaches from the field of corpus linguistics can be useful to historical research. Specifically, it considers two sets of methodological issues that arise when corpus linguistic methods are used on digitised historical sources.
The first set of issues surrounds optical character recognition (OCR), computerised text transcription based on image reproduction of the original printed source. This process is error-prone, which leads to potentially unreliable word counts. I find that OCR errors are very varied, and that they differ more from their corrections than natural spelling variants differ from a standard form. As a result of OCR errors, the test OCR corpus examined has a slightly inflated overall token count (as compared to a hand-corrected gold standard), and a vastly inflated type count. Not all spurious types are infrequent: around 7% of types occurring at least 10 times in my test OCR corpus are spurious. I also find evidence that real-word errors occur.
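As a rough illustration of the comparison described above, the sketch below counts types and tokens and computes the share of frequent OCR types that look spurious. It is not the thesis's procedure: the function names are hypothetical, and "spurious" is assumed here to mean "absent from the gold-standard vocabulary", which is a simplification (it cannot detect real-word errors, for instance).

```python
from collections import Counter


def type_token_counts(tokens):
    """Return (number of distinct types, number of tokens)."""
    counts = Counter(tokens)
    return len(counts), sum(counts.values())


def spurious_type_rate(ocr_tokens, gold_tokens, min_freq=10):
    """Share of OCR types with frequency >= min_freq that never occur in the gold text.

    Assumption: a type is treated as spurious if it is missing from the
    hand-corrected gold standard. The thesis may define this differently.
    """
    ocr_counts = Counter(ocr_tokens)
    gold_types = set(gold_tokens)
    frequent = [t for t, c in ocr_counts.items() if c >= min_freq]
    if not frequent:
        return 0.0
    spurious = [t for t in frequent if t not in gold_types]
    return len(spurious) / len(frequent)
```

Applied to a pair of aligned OCR and gold-standard token lists, `type_token_counts` shows how far the type count is inflated, and `spurious_type_rate` with `min_freq=10` gives a figure comparable in kind to the 7% reported above.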
Assessing the impact of OCR errors on two common collocation statistics, Mutual Information (MI) and Log-Likelihood (LL), I find that both are affected by OCR errors. This analysis also provides evidence that OCR errors are not homogeneously distributed throughout the corpus. Nevertheless, for small collocation spans, MI rankings are broadly reliable in OCR data, especially when used in combination with an LL threshold. Large spans are best avoided, as both statistics become less reliable in OCR data as the span widens. Both statistics attract non-negligible rates of false positives. Using a frequency floor will eliminate many OCR errors, but does not reduce the rates of MI and LL false positives.
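For readers unfamiliar with the two statistics, the sketch below computes them from a 2x2 contingency table in their common textbook forms: MI as log2(observed / expected) and LL as the log-likelihood ratio 2 Σ O ln(O / E). The function names and the way expected frequencies are derived are assumptions for illustration; the thesis may use a variant (for example, one that adjusts the expected frequency for span size).

```python
import math


def contingency(o11, f_node, f_coll, n):
    """Observed and expected cell counts for a node/collocate pair.

    o11    : co-occurrences of node and collocate
    f_node : frequency of the node
    f_coll : frequency of the collocate
    n      : corpus size in tokens
    """
    observed = [o11, f_node - o11, f_coll - o11, n - f_node - f_coll + o11]
    rows = [f_node, f_node, n - f_node, n - f_node]
    cols = [f_coll, n - f_coll, f_coll, n - f_coll]
    expected = [r * c / n for r, c in zip(rows, cols)]
    return observed, expected


def mutual_information(o11, f_node, f_coll, n):
    """MI = log2(observed / expected) for the co-occurrence cell."""
    observed, expected = contingency(o11, f_node, f_coll, n)
    return math.log2(observed[0] / expected[0])


def log_likelihood(o11, f_node, f_coll, n):
    """LL = 2 * sum(O * ln(O / E)) over the four cells; empty cells contribute 0."""
    observed, expected = contingency(o11, f_node, f_coll, n)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

Because MI depends only on the ratio of observed to expected co-occurrence, rare items such as low-frequency OCR errors can score very highly; combining an MI ranking with an LL threshold, as the abstract recommends, filters out pairs for which the evidence is weak.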
Assessing the potential of two post-OCR correction methods, I find that VARD, a program designed to standardise natural spelling variation, proves unpromising for dealing with OCR errors. By contrast, Overproof, a commercial system designed for OCR errors, is effective, and its application leads to substantial improvements in the reliability of MI and LL, particularly for large spans.
The second set of issues relates to the effectiveness of approaches to analysing the discourses surrounding place-names in digitised nineteenth-century newspapers. I single out three approaches to identifying place-names mentioned in large amounts of text without the need for a geo-parser system. The first involves relying on USAS, a semantic tagger, which has a 'Z2' tag for geographic names. This approach cannot identify multi-word place-names, but is scalable. A difficulty is that frequency counts of place-names do not account for their possible polysemy; I suggest a procedure involving reading a random sample of concordance lines for each place-name, in order to obtain an estimate of the actual number of mentions of that place-name in reference to a specific place. This method is best used to identify the most frequent place-names. A second, related, approach is to automatically compare a list of words tagged 'Z2' with a gazetteer, a reference list of place-names. This method, however, suffers from the same difficulties as the previous one, and is best used when accurate frequency counts are not required. A third approach involves starting from a principled, text-external list of place-names, such as a population table, then attempting to locate each place in the set of texts. The scalability of this method depends on the length of the list of place-names, but it can accommodate any quantity of text. Its advantage over the two other methods is that it helps to contextualise the findings and can help identify place-names which are not mentioned in the texts.
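The sampling procedure suggested above for dealing with polysemy can be sketched as follows. This is an illustrative reading of the abstract, not the thesis's own code: the function names are hypothetical, and the manual judgements (whether each sampled line refers to the place of interest) are assumed to be supplied by a human reader as a list of booleans.

```python
import random


def sample_concordance_lines(lines, k, seed=0):
    """Draw up to k concordance lines at random (seeded so the sample is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))


def estimate_true_mentions(total_hits, judgements):
    """Scale the raw frequency by the share of sampled lines judged relevant.

    total_hits : raw frequency of the place-name string in the corpus
    judgements : booleans from manual reading of the sample,
                 True when the line refers to the place of interest
    """
    if not judgements:
        raise ValueError("at least one judged concordance line is required")
    return total_hits * sum(judgements) / len(judgements)
```

For example, if a place-name string occurs 2,000 times and 80 of 100 sampled lines refer to the intended place, the estimate is 1,600 mentions. The estimate carries sampling error, which is one reason the method suits the most frequent place-names best.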
Finally, I consider two approaches to investigating the discourses surrounding place-names in large quantities of text. Both are scalable operationalisations of proximity-based collocation. The first approach starts with the whole corpus, searching for the place-name of interest and generating a list of statistical collocates of the place-name; these collocates can then be further categorised and analysed via concordance analysis. The second approach starts with small samples of concordance lines for the place-name of interest, and involves analysing these concordance lines to develop a framework for describing the phraseologies within which place-names are mentioned. Both methods are useful and scalable; the findings they yield overlap to some extent, but are also complementary. This suggests that both methods may be fruitfully used together, although neither is ideally suited to comparing results across corpora. Both approaches are well suited to exploratory research.
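The first step of the corpus-wide approach, collecting the words that occur near a place-name, can be sketched as below. This is a minimal illustration under stated assumptions: the corpus is taken to be a flat list of lower-cased tokens, the place-name is a single token, and `window_collocates` is a hypothetical helper rather than a function from any tool named in the thesis.

```python
from collections import Counter


def window_collocates(tokens, node, span=5):
    """Count the words occurring within `span` tokens either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = max(0, i - span)
        # Left context plus right context, excluding the node occurrence itself.
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts
```

The resulting co-occurrence counts are the raw input to a collocation statistic such as MI or LL; the statistically ranked collocates are then read in context through concordance lines.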