Electronic data

  • 2022deardenphd

    Final published version, 10.5 MB, PDF document

    Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Exploring Novel Datasets and Methods for the Study of False Information

Research output: Thesis › Doctoral Thesis

Publication date: 1/04/2022
Number of pages: 298
Awarding Institution:
  • Lancaster University
Original language: English


False information has become a subject of growing discussion. Recently, disinformation has been linked to massive social harm, to the decline of democracy, and to the hindering of global efforts during an international health crisis. In computing, and specifically in Natural Language Processing (NLP), much effort has been put into tackling this problem, leading to an increase in research on automated fact-checking and the language of disinformation. However, current research draws on a limited variety of sources. Much focus has, understandably, been given to platforms such as Twitter, Facebook and WhatsApp, as well as to traditional online news articles. Few works in NLP have looked at the specific communities where false information ferments. There has also been something of a topical constraint, with most examples of “Fake News” relating to current political issues.

This thesis contributes to this rapidly growing research area by looking more widely for new sources of data and developing methods to analyse them. Specifically, it introduces two new datasets to the field and performs analyses on both. The first of these, a corpus of April Fools hoaxes, is analysed with a feature-driven approach to examine the generalisability of different features in the classification of false information. This is the first corpus of April Fools news articles, and it is publicly available for researchers. The second dataset, a corpus of online Flat Earth communities, is also the first of its kind. In addition to the first NLP analysis of the language of Flat Earth fora, the thesis explores whether sub-groups exist within these communities and analyses how their language changes. To support this analysis, language change methods are surveyed, and a new method for comparing the language change of groups over time is developed. The methods used, brought together from both NLP and Corpus Linguistics, provide new insight into the language of false information and the way communities discuss it.