Novel database design for extreme scale corpus analysis

Electronic data

2021CoolePhD
Final published version, 1.67 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/1236
Final published version

Research output: Thesis › Doctoral Thesis

Published

Standard

Novel database design for extreme scale corpus analysis. / Coole, Matthew.
Lancaster University, 2021. 172 p.

Harvard

Coole, M 2021, 'Novel database design for extreme scale corpus analysis', PhD, Lancaster University. https://doi.org/10.17635/lancaster/thesis/1236

APA

Coole, M. (2021). Novel database design for extreme scale corpus analysis. [Doctoral Thesis, Lancaster University]. Lancaster University. https://doi.org/10.17635/lancaster/thesis/1236

Vancouver

Coole M. Novel database design for extreme scale corpus analysis. Lancaster University, 2021. 172 p. doi: 10.17635/lancaster/thesis/1236

Author

Coole, Matthew. / Novel database design for extreme scale corpus analysis. Lancaster University, 2021. 172 p.

Bibtex

@phdthesis{bf2f474d239f4bf9a80193ce6e91fb38,

title = "Novel database design for extreme scale corpus analysis",

abstract = "This thesis presents the patterns and methods uncovered in the development of LexiDB, a new scalable corpus database management system that can handle the ever-growing size of modern corpus datasets. It begins with a survey of existing corpus data systems, examining their usage in corpus linguistics and their underlying architectures. The survey finds that existing systems are designed primarily to be vertically scalable (i.e. scalable through bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and of the information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to querying extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are then evaluated. The evaluation demonstrates that the architecture is comparably scalable to two modern NoSQL database management systems and that it outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.",

author = "Matthew Coole",

year = "2021",

doi = "10.17635/lancaster/thesis/1236",

language = "English",

publisher = "Lancaster University",

school = "Lancaster University",

}

RIS

TY - THES

T1 - Novel database design for extreme scale corpus analysis

AU - Coole, Matthew

PY - 2021

Y1 - 2021

N2 - This thesis presents the patterns and methods uncovered in the development of LexiDB, a new scalable corpus database management system that can handle the ever-growing size of modern corpus datasets. It begins with a survey of existing corpus data systems, examining their usage in corpus linguistics and their underlying architectures. The survey finds that existing systems are designed primarily to be vertically scalable (i.e. scalable through bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and of the information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture, several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to querying extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture are then evaluated. The evaluation demonstrates that the architecture is comparably scalable to two modern NoSQL database management systems and that it outperforms existing corpus data systems in token-level pattern querying whilst still supporting character-level pattern matching.

U2 - 10.17635/lancaster/thesis/1236

DO - 10.17635/lancaster/thesis/1236

M3 - Doctoral Thesis

PB - Lancaster University

ER -
