Keywords

High-performance Computing, Parallelisation, Tagging

Experiences with parallelisation of an existing NLP pipeline: tagging Hansard

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Stephen Wattam
Paul Rayson
Marc Alexander
Jean Anderson

Publication date	2014
Host publication	LREC 2014, Ninth International Conference on Language Resources and Evaluation
Editors	Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Place of Publication	Paris
Publisher	EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA
Pages	4093-4096
Number of pages	4
ISBN (print)	9782951740884
Original language	English
Event	9th International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland. Duration: 26/05/2014 → 31/05/2014

Conference

Conference	9th International Conference on Language Resources and Evaluation (LREC)
Country/Territory	Iceland
Period	26/05/14 → 31/05/14

Abstract

This poster describes experiences processing the two-billion-word Hansard corpus using a fairly standard NLP pipeline on a high-performance cluster. We report how we were able to parallelise a "traditional" single-threaded, batch-oriented application and apply it to a platform that differs greatly from the one for which it was originally designed. We start by discussing the tagging toolchain, its specific requirements and properties, and its performance characteristics. This is contrasted with a description of the cluster on which it was to run, and specific limitations are discussed, such as the overhead of using SAN-based storage. We then go on to discuss the nature of the Hansard corpus, and describe which properties of this corpus in particular prove challenging on the system architecture used. The solution for tagging the corpus is then described, along with performance comparisons against a naive run on commodity hardware. We discuss the gains and benefits of using high-performance machinery rather than relatively cheap commodity hardware. Our poster provides a valuable scenario for large-scale NLP pipelines and lessons learnt from the experience.
