The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice

Research output: Contribution to Journal/Magazine › Journal article › peer-review

E-pub ahead of print

Standard

The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice. / Gillings, Mathew; Hardie, Andrew.
In: Digital Scholarship in the Humanities, 22.12.2022.

Vancouver

Gillings M, Hardie A. The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice. Digital Scholarship in the Humanities. 2022 Dec 22. Epub 2022 Dec 22. doi: 10.1093/llc/fqac075

BibTeX

@article{92ad125743384568aa7d6998a23c1647,
title = "The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice",
abstract = "Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how {\textquoteleft}topics{\textquoteright} (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by {\textquoteleft}eyeballing{\textquoteright}; that is, cursory and largely unsystematic examination of the {\textquoteleft}top{\textquoteright} words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the {\textquoteleft}eyeballing{\textquoteright} approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts{\textquoteright} specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.",
keywords = "Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems",
author = "Mathew Gillings and Andrew Hardie",
year = "2022",
month = dec,
day = "22",
doi = "10.1093/llc/fqac075",
language = "English",
journal = "Digital Scholarship in the Humanities",
issn = "2055-7671",
publisher = "Oxford University Press",

}

RIS

TY - JOUR

T1 - The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice

AU - Gillings, Mathew

AU - Hardie, Andrew

PY - 2022/12/22

Y1 - 2022/12/22

AB - Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how ‘topics’ (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by ‘eyeballing’; that is, cursory and largely unsystematic examination of the ‘top’ words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the ‘eyeballing’ approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts’ specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.

KW - Computer Science Applications

KW - Linguistics and Language

KW - Language and Linguistics

KW - Information Systems

U2 - 10.1093/llc/fqac075

DO - 10.1093/llc/fqac075

M3 - Journal article

JO - Digital Scholarship in the Humanities

JF - Digital Scholarship in the Humanities

SN - 2055-7671

ER -