Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools

Electronic data

wmatrix-interoperability-cmlc
Accepted author manuscript, 228 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. / Rayson, Paul Edward.
6th Workshop on the Challenges in the Management of Large Corpora: Proceedings of the 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan. ed. / Piotr Banski; Marc Kupietz; Hanno Biber; Evelyn Breiteneder; Simon Clematide; Andreas Witt. 2018. p. 33-36.

Harvard

Rayson, PE 2018, Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. in P Banski, M Kupietz, H Biber, E Breiteneder, S Clematide & A Witt (eds), 6th Workshop on the Challenges in the Management of Large Corpora: Proceedings of the 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan. pp. 33-36, 6th Workshop on the Challenges in the Management of Large Corpora, Miyazaki, Japan, 7/05/18. <http://lrec-conf.org/workshops/lrec2018/W17/index.html>

APA

Rayson, P. E. (2018). Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. In P. Banski, M. Kupietz, H. Biber, E. Breiteneder, S. Clematide, & A. Witt (Eds.), 6th Workshop on the Challenges in the Management of Large Corpora: Proceedings of the 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan (pp. 33-36). http://lrec-conf.org/workshops/lrec2018/W17/index.html

Vancouver

Rayson PE. Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. In Banski P, Kupietz M, Biber H, Breiteneder E, Clematide S, Witt A, editors, 6th Workshop on the Challenges in the Management of Large Corpora: Proceedings of the 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan. 2018. p. 33-36

Author

Rayson, Paul Edward. / Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools. 6th Workshop on the Challenges in the Management of Large Corpora: Proceedings of the 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan. editor / Piotr Banski ; Marc Kupietz ; Hanno Biber ; Evelyn Breiteneder ; Simon Clematide ; Andreas Witt. 2018. pp. 33-36

Bibtex

@inproceedings{a77a6af89d1f4764a8ba09cd98e53532,

title = "Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools",

abstract = "Computational tools and methods employed in corpus linguistics are split into three main types: compilation, annotation and retrieval. These mirror and support the usual corpus linguistics methodology of corpus collection, manual and/or automatic tagging, followed by query and analysis. Typically, corpus software to support retrieval implements some or all of the five major methods in corpus linguistics only at the word level: frequency list, concordance, keyword, collocation and n-gram, and such software may or may not provide support for text which has already been tagged, for example at the part-of-speech (POS) level. Wmatrix is currently one of the few retrieval tools which have annotation tools built in. However, annotation in Wmatrix is currently limited to the UCREL English POS and semantic tagging pipeline. In this paper, we describe an approach to extend support for embedding other tagging pipelines and tools in Wmatrix via the use of APIs, and describe how such an approach is also applicable to other retrieval tools, potentially enabling support for tagged data.",

author = "Rayson, {Paul Edward}",

year = "2018",

month = may,

day = "7",

language = "English",

isbn = "9791095546146",

pages = "33--36",

editor = "Piotr Banski and Marc Kupietz and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Andreas Witt",

booktitle = "6th Workshop on the Challenges in the Management of Large Corpora",

note = "6th Workshop on the Challenges in the Management of Large Corpora, CMLC-2018 ; Conference date: 07-05-2018 Through 07-05-2018",

url = "http://corpora.ids-mannheim.de/cmlc-2018.html",

}

RIS

TY - GEN

T1 - Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools

AU - Rayson, Paul Edward

PY - 2018/5/7

Y1 - 2018/5/7

AB - Computational tools and methods employed in corpus linguistics are split into three main types: compilation, annotation and retrieval. These mirror and support the usual corpus linguistics methodology of corpus collection, manual and/or automatic tagging, followed by query and analysis. Typically, corpus software to support retrieval implements some or all of the five major methods in corpus linguistics only at the word level: frequency list, concordance, keyword, collocation and n-gram, and such software may or may not provide support for text which has already been tagged, for example at the part-of-speech (POS) level. Wmatrix is currently one of the few retrieval tools which have annotation tools built in. However, annotation in Wmatrix is currently limited to the UCREL English POS and semantic tagging pipeline. In this paper, we describe an approach to extend support for embedding other tagging pipelines and tools in Wmatrix via the use of APIs, and describe how such an approach is also applicable to other retrieval tools, potentially enabling support for tagged data.

M3 - Conference contribution/Paper

SN - 9791095546146

SP - 33

EP - 36

BT - 6th Workshop on the Challenges in the Management of Large Corpora

A2 - Banski, Piotr

A2 - Kupietz, Marc

A2 - Biber, Hanno

A2 - Breiteneder, Evelyn

A2 - Clematide, Simon

A2 - Witt, Andreas

T2 - 6th Workshop on the Challenges in the Management of Large Corpora

Y2 - 7 May 2018 through 7 May 2018

ER -
