The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice

Linguistics and English Language

Text available via DOI:

https://doi.org/10.1093/llc/fqac075
Final published version

Keywords

Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems

Research output: Contribution to Journal/Magazine › Journal article › peer-review

E-pub ahead of print

Standard

The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice. / Gillings, Mathew ; Hardie, Andrew.
In: Digital Scholarship in the Humanities, 22.12.2022.

Bibtex

@article{92ad125743384568aa7d6998a23c1647,

title = "The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice",

abstract = "Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how {\textquoteleft}topics{\textquoteright} (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by {\textquoteleft}eyeballing{\textquoteright}; that is, cursory and largely unsystematic examination of the {\textquoteleft}top{\textquoteright} words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the {\textquoteleft}eyeballing{\textquoteright} approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts{\textquoteright} specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.",

keywords = "Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems",

author = "Mathew Gillings and Andrew Hardie",

year = "2022",

month = dec,

day = "22",

doi = "10.1093/llc/fqac075",

language = "English",

journal = "Digital Scholarship in the Humanities",

issn = "2055-7671",

publisher = "Oxford University Press",

}

RIS

TY - JOUR

T1 - The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice

AU - Gillings, Mathew

AU - Hardie, Andrew

PY - 2022/12/22

Y1 - 2022/12/22

N2 - Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how ‘topics’ (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by ‘eyeballing’; that is, cursory and largely unsystematic examination of the ‘top’ words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the ‘eyeballing’ approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts’ specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.

AB - Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how ‘topics’ (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by ‘eyeballing’; that is, cursory and largely unsystematic examination of the ‘top’ words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the ‘eyeballing’ approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts’ specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.

KW - Computer Science Applications

KW - Linguistics and Language

KW - Language and Linguistics

KW - Information Systems

U2 - 10.1093/llc/fqac075

DO - 10.1093/llc/fqac075

M3 - Journal article

JO - Digital Scholarship in the Humanities

JF - Digital Scholarship in the Humanities

SN - 2055-7671

ER -
