VARD2: a tool for dealing with spelling variation in historical corpora

Research output: Contribution to conference - Without ISBN/ISSN Conference paper

Published

Standard

VARD2: a tool for dealing with spelling variation in historical corpora. / Baron, Alistair; Rayson, Paul.
2008. Paper presented at Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham.

Harvard

Baron, A & Rayson, P 2008, 'VARD2: a tool for dealing with spelling variation in historical corpora', Paper presented at Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, 22/05/08. <http://acorn.aston.ac.uk/conf_proceedings.html>

APA

Baron, A., & Rayson, P. (2008). VARD2: a tool for dealing with spelling variation in historical corpora. Paper presented at Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham. http://acorn.aston.ac.uk/conf_proceedings.html

Vancouver

Baron A, Rayson P. VARD2: a tool for dealing with spelling variation in historical corpora. 2008. Paper presented at Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham.

Author

Baron, Alistair; Rayson, Paul. / VARD2: a tool for dealing with spelling variation in historical corpora. Paper presented at Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham. 15 p.

BibTeX

@conference{73773c9bbd1b4cbb9febf2ab1d6f5743,
title = "VARD2: a tool for dealing with spelling variation in historical corpora",
abstract = "When applying corpus linguistic techniques to historical corpora, the corpus researcher should be cautious about the results obtained. Corpus annotation techniques such as part of speech tagging, trained for modern languages, are particularly vulnerable to inaccuracy due to vocabulary and grammatical shifts in language over time. Basic corpus retrieval techniques such as frequency profiling and concordancing will also be affected, in addition to the more sophisticated techniques such as keywords, n-grams, clusters and lexical bundles which rely on word frequencies for their calculations. In this paper, we highlight these problems with particular focus on Early Modern English corpora. We also present an overview of the VARD tool, our proposed solution to this problem, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants. Recent improvements to the VARD tool include the incorporation of techniques used in modern spell checking software.",
author = "Alistair Baron and Paul Rayson",
year = "2008",
month = may,
language = "English",
note = "Postgraduate Conference in Corpus Linguistics ; Conference date: 22-05-2008",
}

RIS

TY - CONF

T1 - VARD2: a tool for dealing with spelling variation in historical corpora

T2 - Postgraduate Conference in Corpus Linguistics

AU - Baron, Alistair

AU - Rayson, Paul

PY - 2008/5

Y1 - 2008/5

N2 - When applying corpus linguistic techniques to historical corpora, the corpus researcher should be cautious about the results obtained. Corpus annotation techniques such as part of speech tagging, trained for modern languages, are particularly vulnerable to inaccuracy due to vocabulary and grammatical shifts in language over time. Basic corpus retrieval techniques such as frequency profiling and concordancing will also be affected, in addition to the more sophisticated techniques such as keywords, n-grams, clusters and lexical bundles which rely on word frequencies for their calculations. In this paper, we highlight these problems with particular focus on Early Modern English corpora. We also present an overview of the VARD tool, our proposed solution to this problem, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants. Recent improvements to the VARD tool include the incorporation of techniques used in modern spell checking software.

M3 - Conference paper

Y2 - 22 May 2008

ER -