Linguistics and English Language

Text available via DOI:

https://doi.org/10.1093/llc/fqm002
Final published version

Processing Internet-derived text - creating a corpus of Usenet messages.

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

S. Hoffmann

Journal publication date	1/06/2007
Journal	Literary and Linguistic Computing
Issue number	2
Volume	22
Number of pages	21
Pages (from-to)	35-55
Publication Status	Published
Original language	English

Abstract

In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required to enable researchers to carry out a reliable quantitative investigation of the patterns observed with the help of standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups, and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.
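
The abstract names two processing steps without giving their details: arranging messages into threads and assigning authorship to quoted text. The sketch below illustrates one simple way such steps could work. It is not the article's algorithm. It assumes each message carries a Message-ID and a References list (oldest first), as is conventional for Usenet, and that quoted lines are prefixed with `>` markers. The function names and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def build_threads(messages):
    """Group messages into a reply tree.

    Assumed input: a list of dicts with keys 'id' (Message-ID) and
    'references' (list of ancestor IDs, oldest first). Illustrative only.
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        # Parent = the most recent referenced ID that is in the collection;
        # messages whose ancestors were never downloaded become roots.
        parent = next((r for r in reversed(m["references"]) if r in known), None)
        if parent is None:
            roots.append(m["id"])
        else:
            children[parent].append(m["id"])
    return roots, dict(children)


def quote_depth(line):
    """Count leading '>' markers, allowing whitespace between them."""
    depth = 0
    for ch in line:
        if ch == ">":
            depth += 1
        elif ch not in " \t":
            break
    return depth


def attribute_lines(body, ancestors):
    """Guess the author of each line from its quotation depth.

    Assumed input: 'ancestors' lists authors from the poster outward,
    e.g. [poster, parent_author, grandparent_author]. Lines quoted more
    deeply than the known chain are left unattributed (None).
    """
    result = []
    for line in body.splitlines():
        depth = quote_depth(line)
        author = ancestors[depth] if depth < len(ancestors) else None
        result.append((author, line))
    return result
```

Real newsgroup data breaks these assumptions in many ways (missing headers, non-standard quote markers, re-wrapped quotations), which is presumably why the article devotes considerable space to authorship assignment.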

Bibliographic note

RAE_import_type: Journal article
RAE_uoa_type: Linguistics
