New approaches to analysing large newspaper corpora

History

Keywords

Texts, corpora, British Library, GIS, nineteenth-century, disease

Activity: Talk or presentation types › Invited talk

Dr Catherine Porter - Speaker

4/05/2016

Recent years have seen the digitisation of some very large corpora of newspapers and other material, which include hundreds of millions or even billions of words of text. This presents both opportunities and challenges: how can this information best be exploited to increase knowledge in the Humanities? This paper presents work funded by the European Research Council that will contribute to an ESRC CASE award between Lancaster University and the British Library. Seeking to go beyond simple searching and browsing, it focuses on how approaches from corpus linguistics and geographical information systems (GIS) can be used to better understand the geographies within corpora. The work centres on the British Library’s Nineteenth Century Newspapers Collection, which includes 49 series of newspapers, mainly running continuously for most or all of the century, amounting to around 2 million pages of material and around 30 billion words. It contains a vast amount of information on a wide range of topics relevant to nineteenth-century history. The first part of the paper looks at OCR errors and their impact on corpus linguistic techniques. The second part looks at how diseases were represented geographically, both internationally and within Britain, and how this compared with the patterns of mortality from these diseases.

External organisation (External collaborations)

Name: The British Library
