Exploring Novel Datasets and Methods for the Study of False Information

Computing and Communications

Associated organisational units

Electronic data

2022deardenphd
Final published version, 10.5 MB, PDF document
Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/1591
Final published version

Keywords

NLP, False Information, computer science, data science, natural language processing, social media, conspiracy theories, flat earth, april fools

Research output: Thesis › Doctoral Thesis

Published

Edward Dearden

Publication date	1/04/2022
Number of pages	298
Qualification	PhD
Awarding Institution	Lancaster University
Supervisors/Advisors	Baron, Alistair (Supervisor); Rayson, Paul (Supervisor)
Publisher	Lancaster University
Original language	English

Abstract

False information has increasingly become a subject of much discussion. Recently, disinformation has been linked to massive social harm, to the decline of democracy, and to the hindering of global efforts in an international health crisis. In computing, and specifically Natural Language Processing (NLP), much effort has been put into tackling this problem. This has led to an increase in research on automated fact-checking and the language of disinformation. However, current research suffers from looking at a limited variety of sources. Much focus has, understandably, been given to platforms such as Twitter, Facebook and WhatsApp, as well as to traditional online news articles. Few works in NLP have looked at the specific communities where false information ferments. There has also been something of a topical constraint, with most examples of “Fake News” relating to current political issues.

This thesis contributes to this rapidly growing research area by looking more widely for new sources of data and developing methods to analyse them. Specifically, it introduces two new datasets to the field and performs analyses on both. The first of these, a corpus of April Fools hoaxes, is analysed with a feature-driven approach to examine the generalisability of different features in the classification of false information. This is the first corpus of April Fools news articles, and is publicly available for researchers. The second dataset, a corpus of online Flat Earth communities, is also the first of its kind. In addition to the first NLP analysis of the language of Flat Earth fora, the thesis explores whether sub-groups exist within these communities and analyses how their language changes. To support this analysis, language change methods are surveyed, and a new method for comparing the language change of groups over time is developed. The methods used, brought together from both NLP and Corpus Linguistics, provide new insight into the language of false information, and the way communities discuss it.
