
Electronic data

  • LREC2020 (1)

    Accepted author manuscript, 202 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

  • 2020.lrec-1.383

    Final published version, 263 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

LexiDB: Patterns & Methods for Corpus Linguistic Database Management

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN; Conference contribution/Paper; peer-review

Published

Abstract

LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets.