Experiences with parallelisation of an existing NLP pipeline: tagging Hansard

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published

Standard

Experiences with parallelisation of an existing NLP pipeline: tagging Hansard. / Wattam, Stephen; Rayson, Paul; Alexander, Marc et al.
LREC 2014, Ninth International Conference on Language Resources and Evaluation. ed. / Nicoletta Calzolari; Khalid Choukri; Thierry Declerck; Hrafn Loftsson; Bente Maegaard; Joseph Mariani; Asuncion Moreno; Jan Odijk; Stelios Piperidis. Paris: EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA, 2014. p. 4093-4096.

Harvard

Wattam, S, Rayson, P, Alexander, M & Anderson, J 2014, Experiences with parallelisation of an existing NLP pipeline: tagging Hansard. in N Calzolari, K Choukri, T Declerck, H Loftsson, B Maegaard, J Mariani, A Moreno, J Odijk & S Piperidis (eds), LREC 2014, Ninth International Conference on Language Resources and Evaluation. EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA, Paris, pp. 4093-4096, 9th International Conference on Language Resources and Evaluation (LREC), Iceland, 26/05/14. <http://www.lrec-conf.org/proceedings/lrec2014/pdf/687_Paper.pdf>

APA

Wattam, S., Rayson, P., Alexander, M., & Anderson, J. (2014). Experiences with parallelisation of an existing NLP pipeline: tagging Hansard. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), LREC 2014, Ninth International Conference on Language Resources and Evaluation (pp. 4093-4096). EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA. http://www.lrec-conf.org/proceedings/lrec2014/pdf/687_Paper.pdf

Vancouver

Wattam S, Rayson P, Alexander M, Anderson J. Experiences with parallelisation of an existing NLP pipeline: tagging Hansard. In Calzolari N, Choukri K, Declerck T, Loftsson H, Maegaard B, Mariani J, Moreno A, Odijk J, Piperidis S, editors, LREC 2014, Ninth International Conference on Language Resources and Evaluation. Paris: EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA. 2014. p. 4093-4096

Author

Wattam, Stephen ; Rayson, Paul ; Alexander, Marc et al. / Experiences with parallelisation of an existing NLP pipeline : tagging Hansard. LREC 2014, Ninth International Conference on Language Resources and Evaluation. editor / Nicoletta Calzolari ; Khalid Choukri ; Thierry Declerck ; Hrafn Loftsson ; Bente Maegaard ; Joseph Mariani ; Asuncion Moreno ; Jan Odijk ; Stelios Piperidis. Paris : EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA, 2014. pp. 4093-4096

Bibtex

@inproceedings{fe434b57ae2e489e9361734971775ffb,
title = "Experiences with parallelisation of an existing NLP pipeline: tagging Hansard",
abstract = "This poster describes experiences processing the two-billion-word Hansard corpus using a fairly standard NLP pipeline on a high-performance cluster. Herein we report how we were able to parallelise and apply a {"}traditional{"} single-threaded batch-oriented application to a platform that differs greatly from that for which it was originally designed. We start by discussing the tagging toolchain, its specific requirements and properties, and its performance characteristics. This is contrasted with a description of the cluster on which it was to run, and specific limitations are discussed such as the overhead of using SAN-based storage. We then go on to discuss the nature of the Hansard corpus, and describe which properties of this corpus in particular prove challenging for use on the system architecture used. The solution for tagging the corpus is then described, along with performance comparisons against a naive run on commodity hardware. We discuss the gains and benefits of using high-performance machinery rather than relatively cheap commodity hardware. Our poster provides a valuable scenario for large-scale NLP pipelines and lessons learnt from the experience.",
keywords = "High-performance Computing, Parallelisation, Tagging",
author = "Stephen Wattam and Paul Rayson and Marc Alexander and Jean Anderson",
year = "2014",
language = "English",
isbn = "9782951740884",
pages = "4093--4096",
editor = "Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis",
booktitle = "LREC 2014, Ninth International Conference on Language Resources and Evaluation",
publisher = "EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA",
note = "9th International Conference on Language Resources and Evaluation (LREC) ; Conference date: 26-05-2014 Through 31-05-2014",

}

RIS

TY - GEN

T1 - Experiences with parallelisation of an existing NLP pipeline: tagging Hansard

T2 - 9th International Conference on Language Resources and Evaluation (LREC)

AU - Wattam, Stephen

AU - Rayson, Paul

AU - Alexander, Marc

AU - Anderson, Jean

PY - 2014

Y1 - 2014

N2 - This poster describes experiences processing the two-billion-word Hansard corpus using a fairly standard NLP pipeline on a high-performance cluster. Herein we report how we were able to parallelise and apply a "traditional" single-threaded batch-oriented application to a platform that differs greatly from that for which it was originally designed. We start by discussing the tagging toolchain, its specific requirements and properties, and its performance characteristics. This is contrasted with a description of the cluster on which it was to run, and specific limitations are discussed such as the overhead of using SAN-based storage. We then go on to discuss the nature of the Hansard corpus, and describe which properties of this corpus in particular prove challenging for use on the system architecture used. The solution for tagging the corpus is then described, along with performance comparisons against a naive run on commodity hardware. We discuss the gains and benefits of using high-performance machinery rather than relatively cheap commodity hardware. Our poster provides a valuable scenario for large-scale NLP pipelines and lessons learnt from the experience.

AB - This poster describes experiences processing the two-billion-word Hansard corpus using a fairly standard NLP pipeline on a high-performance cluster. Herein we report how we were able to parallelise and apply a "traditional" single-threaded batch-oriented application to a platform that differs greatly from that for which it was originally designed. We start by discussing the tagging toolchain, its specific requirements and properties, and its performance characteristics. This is contrasted with a description of the cluster on which it was to run, and specific limitations are discussed such as the overhead of using SAN-based storage. We then go on to discuss the nature of the Hansard corpus, and describe which properties of this corpus in particular prove challenging for use on the system architecture used. The solution for tagging the corpus is then described, along with performance comparisons against a naive run on commodity hardware. We discuss the gains and benefits of using high-performance machinery rather than relatively cheap commodity hardware. Our poster provides a valuable scenario for large-scale NLP pipelines and lessons learnt from the experience.

KW - High-performance Computing

KW - Parallelisation

KW - Tagging

M3 - Conference contribution/Paper

SN - 9782951740884

SP - 4093

EP - 4096

BT - LREC 2014, Ninth International Conference on Language Resources and Evaluation

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Declerck, Thierry

A2 - Loftsson, Hrafn

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Moreno, Asuncion

A2 - Odijk, Jan

A2 - Piperidis, Stelios

PB - EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA

CY - Paris

Y2 - 26 May 2014 through 31 May 2014

ER -