Accepted author manuscript, 468 KB, PDF document
Final published version, 525 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
TY - GEN
T1 - Unfinished Business
T2 - Construction and Maintenance of a Semantically Tagged Historical Parliamentary Corpus, UK Hansard from 1803 to the present day
AU - Coole, Matthew
AU - Rayson, Paul
AU - Mariani, John
PY - 2020/5/11
Y1 - 2020/5/11
N2 - Creating, curating and maintaining modern political corpora is becoming an ever more involved task. As interest from various social bodies and the general public in political discourse grows so too does the need to enrich such datasets with metadata and linguistic annotations. Beyond this, such corpora must be easy to browse and search for linguists, social scientists, digital humanists and the general public. We present our efforts to compile a linguistically annotated and semantically tagged version of the Hansard corpus from 1803 right up to the present day. This involves combining multiple sources of documents and transcripts. We describe our toolchain for tagging; using several existing tools that provide tokenisation, part-of-speech tagging and semantic annotations. We also provide an overview of our bespoke web-based search interface built on LexiDB. In conclusion, we examine the completed corpus by looking at four case studies making use of semantic categories made available by our toolchain.
AB - Creating, curating and maintaining modern political corpora is becoming an ever more involved task. As interest from various social bodies and the general public in political discourse grows so too does the need to enrich such datasets with metadata and linguistic annotations. Beyond this, such corpora must be easy to browse and search for linguists, social scientists, digital humanists and the general public. We present our efforts to compile a linguistically annotated and semantically tagged version of the Hansard corpus from 1803 right up to the present day. This involves combining multiple sources of documents and transcripts. We describe our toolchain for tagging; using several existing tools that provide tokenisation, part-of-speech tagging and semantic annotations. We also provide an overview of our bespoke web-based search interface built on LexiDB. In conclusion, we examine the completed corpus by looking at four case studies making use of semantic categories made available by our toolchain.
M3 - Conference contribution/Paper
SN - 9791095546474
SP - 23
EP - 27
BT - Proceedings of the Second ParlaCLARIN Workshop
A2 - Fišer, Darja
A2 - Eskevich, Maria
A2 - de Jong, Franciska
PB - European Language Resources Association (ELRA)
CY - Paris
ER -