Infrastructure for Semantic Annotation in the Genomics Domain

Electronic data

genomics
Accepted author manuscript, 1.06 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License
2020.lrec-1.855
Final published version, 1.12 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Infrastructure for Semantic Annotation in the Genomics Domain. / El-Haj, Mahmoud; Rutherford, Nathan; Coole, Matthew et al.
LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. Paris: European Language Resources Association (ELRA), 2020. p. 6921-6929.

Harvard

El-Haj, M, Rutherford, N, Coole, M, Ezeani, I, Prentice, S, Ide, N, Knight, J, Piao, S, Mariani, J, Rayson, P & Suderman, K 2020, Infrastructure for Semantic Annotation in the Genomics Domain. in LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. European Language Resources Association (ELRA), Paris, pp. 6921-6929. <https://www.aclweb.org/anthology/2020.lrec-1.855>

APA

El-Haj, M., Rutherford, N., Coole, M., Ezeani, I., Prentice, S., Ide, N., Knight, J., Piao, S., Mariani, J., Rayson, P., & Suderman, K. (2020). Infrastructure for Semantic Annotation in the Genomics Domain. In LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20 (pp. 6921-6929). European Language Resources Association (ELRA). https://www.aclweb.org/anthology/2020.lrec-1.855

Vancouver

El-Haj M, Rutherford N, Coole M, Ezeani I, Prentice S, Ide N et al. Infrastructure for Semantic Annotation in the Genomics Domain. In LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. Paris: European Language Resources Association (ELRA). 2020. p. 6921-6929

Author

El-Haj, Mahmoud ; Rutherford, Nathan ; Coole, Matthew et al. / Infrastructure for Semantic Annotation in the Genomics Domain. LREC 2020, Twelfth International Conference on Language Resources and Evaluation: LREC'20. Paris : European Language Resources Association (ELRA), 2020. pp. 6921-6929

Bibtex

@inproceedings{0eee330987564bd8b9448f8c121ffb06,

title = "Infrastructure for Semantic Annotation in the Genomics Domain",

abstract = "We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.",

author = "Mahmoud El-Haj and Nathan Rutherford and Matthew Coole and Ignatius Ezeani and Sheryl Prentice and Nancy Ide and Jo Knight and Scott Piao and John Mariani and Paul Rayson and Keith Suderman",

year = "2020",

month = may,

day = "11",

language = "English",

isbn = "9791095546344",

pages = "6921--6929",

booktitle = "LREC 2020, Twelfth International Conference on Language Resources and Evaluation",

publisher = "European Language Resources Association (ELRA)",

}

RIS

TY - GEN

T1 - Infrastructure for Semantic Annotation in the Genomics Domain

AU - El-Haj, Mahmoud

AU - Rutherford, Nathan

AU - Coole, Matthew

AU - Ezeani, Ignatius

AU - Prentice, Sheryl

AU - Ide, Nancy

AU - Knight, Jo

AU - Piao, Scott

AU - Mariani, John

AU - Rayson, Paul

AU - Suderman, Keith

PY - 2020/5/11

Y1 - 2020/5/11

N2 - We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

AB - We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

M3 - Conference contribution/Paper

SN - 9791095546344

SP - 6921

EP - 6929

BT - LREC 2020, Twelfth International Conference on Language Resources and Evaluation

PB - European Language Resources Association (ELRA)

CY - Paris

ER -
