Electronic data

  • wmatrix-interoperability-cmlc

    Accepted author manuscript, 228 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published
Publication date: 7/05/2018
Host publication: 6th Workshop on the Challenges in the Management of Large Corpora: Proceedings of the 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan
Editors: Piotr Banski, Marc Kupietz, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Andreas Witt
Pages: 33-36
Number of pages: 4
Original language: English
Event: 6th Workshop on the Challenges in the Management of Large Corpora - Miyazaki, Japan
Duration: 7/05/2018 → 7/05/2018
http://corpora.ids-mannheim.de/cmlc-2018.html

Workshop

Workshop: 6th Workshop on the Challenges in the Management of Large Corpora
Abbreviated title: CMLC-2018
Country/Territory: Japan
City: Miyazaki
Period: 7/05/18 → 7/05/18
Internet address

Abstract

Computational tools and methods employed in corpus linguistics are split into three main types: compilation, annotation and retrieval. These mirror and support the usual corpus linguistics methodology of corpus collection, manual and/or automatic tagging, followed by query and analysis. Typically, corpus software to support retrieval implements some or all of the five major methods in corpus linguistics only at the word level: frequency list, concordance, keyword, collocation and n-gram, and such software may or may not provide support for text which has already been tagged, for example at the part-of-speech (POS) level. Wmatrix is currently one of the few retrieval tools which have annotation tools built in. However, annotation in Wmatrix is currently limited to the UCREL English POS and semantic tagging pipeline. In this paper, we describe an approach to extend support for embedding other tagging pipelines and tools in Wmatrix via the use of APIs, and describe how such an approach is also applicable to other retrieval tools, potentially enabling support for tagged data.
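
As a rough illustration of what support for tagged data could mean for a retrieval tool, the sketch below treats a tagging pipeline as an interchangeable function and runs two of the methods named above (frequency list and n-gram) at either the word level or the tag level. It is a minimal sketch under stated assumptions, not the approach described in the paper: `toy_tagger`, `frequency_list`, `ngram_list`, the `Tagger` type and the tiny tag set are all hypothetical names invented for this example, and a real adapter would call an external tagging API where the stand-in uses a lookup table.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A tagged token is a (word, tag) pair; the tag set depends on the pipeline.
TaggedToken = Tuple[str, str]
# Hypothetical adapter type: any callable that turns raw text into tagged tokens.
Tagger = Callable[[str], List[TaggedToken]]


def toy_tagger(text: str) -> List[TaggedToken]:
    """Stand-in for a tagging pipeline (assumption: a real one would call an API)."""
    lexicon = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN", "sees": "VERB"}
    return [(word, lexicon.get(word, "UNK")) for word in text.lower().split()]


def _project(tokens: Sequence[TaggedToken], level: str) -> List[str]:
    """Select either the word or the tag from each token."""
    if level not in ("word", "tag"):
        raise ValueError("level must be 'word' or 'tag'")
    index = 0 if level == "word" else 1
    return [token[index] for token in tokens]


def frequency_list(tokens: Sequence[TaggedToken], level: str = "word") -> Counter:
    """Count items at the chosen level."""
    return Counter(_project(tokens, level))


def ngram_list(tokens: Sequence[TaggedToken], n: int, level: str = "word") -> Counter:
    """Count contiguous n-grams at the chosen level."""
    items = _project(tokens, level)
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


if __name__ == "__main__":
    tagger: Tagger = toy_tagger  # swap in another pipeline without touching retrieval code
    tokens = tagger("The cat sees a dog")
    print(frequency_list(tokens, level="tag"))
    print(ngram_list(tokens, 2, level="tag"))
```

Because the retrieval functions depend only on the `(word, tag)` token shape, replacing `toy_tagger` with a different pipeline would not require changes to the frequency or n-gram code, which is the kind of decoupling an API-based design aims for.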