A deeply annotated testbed for geographical text analysis

Text available via DOI:

https://doi.org/10.1145/3149858.3149865
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing. / Rayson, Paul Edward; Reinhold, Alexander; Butler, James et al.
GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities. New York: Association for Computing Machinery (ACM), 2017. p. 9-15.

Harvard

Rayson, PE, Reinhold, A, Butler, J, Donaldson, CE, Gregory, IN & Taylor, JE 2017, A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing. in GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities. Association for Computing Machinery (ACM), New York, pp. 9-15. https://doi.org/10.1145/3149858.3149865

APA

Rayson, P. E., Reinhold, A., Butler, J., Donaldson, C. E., Gregory, I. N., & Taylor, J. E. (2017). A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing. In GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities (pp. 9-15). Association for Computing Machinery (ACM). https://doi.org/10.1145/3149858.3149865

Vancouver

Rayson PE, Reinhold A, Butler J, Donaldson CE, Gregory IN, Taylor JE. A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing. In GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities. New York: Association for Computing Machinery (ACM). 2017. p. 9-15. doi: 10.1145/3149858.3149865

Author

Rayson, Paul Edward; Reinhold, Alexander; Butler, James et al. / A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing. GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities. New York: Association for Computing Machinery (ACM), 2017. pp. 9-15

Bibtex

@inproceedings{76fdf4fd3dd64cad9701bc7fd55f5b2c,
  title = "A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing",
  abstract = "This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District during the early seventeenth century and the early twentieth century. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition, Disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and to differentiate geographical features such as waterfalls, woodlands, farms or inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the results of three different NLP libraries and geoparsers on its contents. In the evaluation, the standard NER processing of the text by the different NLP libraries produces many false positive and false negative results, showing the strength of the gold standard.",
  author = "Rayson, {Paul Edward} and Alexander Reinhold and James Butler and Donaldson, {Christopher Elliott} and Gregory, {Ian Norman} and Taylor, {Joanna Elizabeth}",
  year = "2017",
  month = nov,
  day = "7",
  doi = "10.1145/3149858.3149865",
  language = "English",
  isbn = "9781450354967",
  pages = "9--15",
  booktitle = "GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities",
  publisher = "Association for Computing Machinery (ACM)",
}

RIS

TY  - GEN
T1  - A deeply annotated testbed for geographical text analysis
T2  - The Corpus of Lake District Writing
AU  - Rayson, Paul Edward
AU  - Reinhold, Alexander
AU  - Butler, James
AU  - Donaldson, Christopher Elliott
AU  - Gregory, Ian Norman
AU  - Taylor, Joanna Elizabeth
PY  - 2017/11/7
Y1  - 2017/11/7
N2  - This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District during the early seventeenth century and the early twentieth century. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition, Disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and to differentiate geographical features such as waterfalls, woodlands, farms or inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the results of three different NLP libraries and geoparsers on its contents. In the evaluation, the standard NER processing of the text by the different NLP libraries produces many false positive and false negative results, showing the strength of the gold standard.
AB  - This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District during the early seventeenth century and the early twentieth century. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition, Disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and to differentiate geographical features such as waterfalls, woodlands, farms or inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the results of three different NLP libraries and geoparsers on its contents. In the evaluation, the standard NER processing of the text by the different NLP libraries produces many false positive and false negative results, showing the strength of the gold standard.
U2  - 10.1145/3149858.3149865
DO  - 10.1145/3149858.3149865
M3  - Conference contribution/Paper
SN  - 9781450354967
SP  - 9
EP  - 15
BT  - GeoHumanities'17 Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities
PB  - Association for Computing Machinery (ACM)
CY  - New York
ER  - 
