Document attrition in web corpora: An exploration

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Document attrition in web corpora: An exploration. / Wattam, Stephen; Rayson, Paul; Berridge, Damon.

Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012. European Language Resources Association (ELRA), 2012. p. 1486-1489.

Harvard

Wattam, S, Rayson, P & Berridge, D 2012, Document attrition in web corpora: An exploration. in Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012. European Language Resources Association (ELRA), pp. 1486-1489, 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, 21/05/12.

APA

Wattam, S., Rayson, P., & Berridge, D. (2012). Document attrition in web corpora: An exploration. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012 (pp. 1486-1489). European Language Resources Association (ELRA).

Vancouver

Wattam S, Rayson P, Berridge D. Document attrition in web corpora: An exploration. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012. European Language Resources Association (ELRA). 2012. p. 1486-1489

Author

Wattam, Stephen; Rayson, Paul; Berridge, Damon. / Document attrition in web corpora: An exploration. Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012. European Language Resources Association (ELRA), 2012. pp. 1486-1489

Bibtex

@inproceedings{52290300f94e4ce2a396dcd37dfeb04e,
title = "Document attrition in web corpora: An exploration",
abstract = "Increases in the use of web data for corpus-building, coupled with the use of specialist, single-use corpora, make for an increasing reliance on language that changes quickly, affecting the long-term validity of studies based on these methods. This 'drift' through time affects both users of open-source corpora and those attempting to interpret the results of studies based on web data. The attrition of documents online, also called link rot or document half-life, has been studied many times for the purposes of optimising search engine web crawlers, producing robust and reliable archival systems, and ensuring the integrity of distributed information stores; however, the effect that attrition has upon corpora of varying construction remains largely unknown. This paper presents a preliminary investigation into the differences in attrition rate between corpora selected using different corpus construction methods. It represents the first step in a larger longitudinal analysis, and as such presents URI-based content clues, chosen to relate to studies from other areas. The ultimate goal of this larger study is to produce a detailed enumeration of the primary biases online, and identify sampling strategies which control and minimise unwanted effects of document attrition.",
keywords = "Corpora, Sampling, Web-as-Corpus",
author = "Stephen Wattam and Paul Rayson and Damon Berridge",
year = "2012",
month = jan,
day = "1",
language = "English",
pages = "1486--1489",
booktitle = "Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012",
publisher = "European Language Resources Association (ELRA)",
note = "8th International Conference on Language Resources and Evaluation, LREC 2012 ; Conference date: 21-05-2012 Through 27-05-2012",

}

RIS

TY - GEN

T1 - Document attrition in web corpora

T2 - 8th International Conference on Language Resources and Evaluation, LREC 2012

AU - Wattam, Stephen

AU - Rayson, Paul

AU - Berridge, Damon

PY - 2012/1/1

Y1 - 2012/1/1

N2 - Increases in the use of web data for corpus-building, coupled with the use of specialist, single-use corpora, make for an increasing reliance on language that changes quickly, affecting the long-term validity of studies based on these methods. This 'drift' through time affects both users of open-source corpora and those attempting to interpret the results of studies based on web data. The attrition of documents online, also called link rot or document half-life, has been studied many times for the purposes of optimising search engine web crawlers, producing robust and reliable archival systems, and ensuring the integrity of distributed information stores; however, the effect that attrition has upon corpora of varying construction remains largely unknown. This paper presents a preliminary investigation into the differences in attrition rate between corpora selected using different corpus construction methods. It represents the first step in a larger longitudinal analysis, and as such presents URI-based content clues, chosen to relate to studies from other areas. The ultimate goal of this larger study is to produce a detailed enumeration of the primary biases online, and identify sampling strategies which control and minimise unwanted effects of document attrition.

AB - Increases in the use of web data for corpus-building, coupled with the use of specialist, single-use corpora, make for an increasing reliance on language that changes quickly, affecting the long-term validity of studies based on these methods. This 'drift' through time affects both users of open-source corpora and those attempting to interpret the results of studies based on web data. The attrition of documents online, also called link rot or document half-life, has been studied many times for the purposes of optimising search engine web crawlers, producing robust and reliable archival systems, and ensuring the integrity of distributed information stores; however, the effect that attrition has upon corpora of varying construction remains largely unknown. This paper presents a preliminary investigation into the differences in attrition rate between corpora selected using different corpus construction methods. It represents the first step in a larger longitudinal analysis, and as such presents URI-based content clues, chosen to relate to studies from other areas. The ultimate goal of this larger study is to produce a detailed enumeration of the primary biases online, and identify sampling strategies which control and minimise unwanted effects of document attrition.

KW - Corpora

KW - Sampling

KW - Web-as-Corpus

M3 - Conference contribution/Paper

AN - SCOPUS:85037350114

SP - 1486

EP - 1489

BT - Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012

PB - European Language Resources Association (ELRA)

Y2 - 21 May 2012 through 27 May 2012

ER -