Processing internet-derived text - creating a corpus of usenet messages.

Linguistics and English Language

Text available via DOI:

https://doi.org/10.1093/llc/fqm002
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Processing internet-derived text - creating a corpus of usenet messages. / Hoffmann, S.
In: Literary and Linguistic Computing, Vol. 22, No. 2, 01.06.2007, p. 35-55.

Harvard

Hoffmann, S 2007, 'Processing internet-derived text - creating a corpus of usenet messages.', Literary and Linguistic Computing, vol. 22, no. 2, pp. 35-55. https://doi.org/10.1093/llc/fqm002

APA

Hoffmann, S. (2007). Processing internet-derived text - creating a corpus of usenet messages. Literary and Linguistic Computing, 22(2), 35-55. https://doi.org/10.1093/llc/fqm002

Vancouver

Hoffmann S. Processing internet-derived text - creating a corpus of usenet messages. Literary and Linguistic Computing. 2007 Jun 1;22(2):35-55. doi: 10.1093/llc/fqm002

Author

Hoffmann, S. / Processing internet-derived text - creating a corpus of usenet messages. In: Literary and Linguistic Computing. 2007 ; Vol. 22, No. 2. pp. 35-55.

Bibtex

@article{bf089c2655b14f9daaf6bf866bbb1f3e,

title = "Processing internet-derived text - creating a corpus of usenet messages.",

abstract = "In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required in order to enable researchers to carry out a reliable quantitative investigation of the patterns observed with the help of standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.",

author = "S. Hoffmann",

note = "RAE_import_type : Journal article RAE_uoa_type : Linguistics",

year = "2007",

month = jun,

day = "1",

doi = "10.1093/llc/fqm002",

language = "English",

volume = "22",

pages = "35--55",

journal = "Literary and Linguistic Computing",

issn = "1477-4615",

publisher = "Oxford University Press",

number = "2",

}

RIS

TY - JOUR

T1 - Processing internet-derived text - creating a corpus of usenet messages.

AU - Hoffmann, S.

N1 - RAE_import_type : Journal article RAE_uoa_type : Linguistics

PY - 2007/6/1

Y1 - 2007/6/1

N2 - In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required in order to enable researchers to carry out a reliable quantitative investigation of the patterns observed with the help of standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.

AB - In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required in order to enable researchers to carry out a reliable quantitative investigation of the patterns observed with the help of standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.

U2 - 10.1093/llc/fqm002

DO - 10.1093/llc/fqm002

M3 - Journal article

VL - 22

SP - 35

EP - 55

JO - Literary and Linguistic Computing

JF - Literary and Linguistic Computing

SN - 1477-4615

IS - 2

ER -
