Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science

Text available via DOI:

https://doi.org/10.1016/j.heliyon.2022.e10710
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Keywords

Multidisciplinary

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science. / Nundloll, Vatsala ; Smail, Robert ; Stevens, Carly et al.
In: Heliyon, Vol. 8, No. 10, e10710, 10.10.2022.

Bibtex

@article{e89a4e6e14b546c48c6586e803aac3fc,

title = "Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science",

abstract = "Data heterogeneity is a pressing issue and is further compounded if we have to deal with data from textual documents. The unstructured nature of such documents implies that collating, comparing and analysing the information contained therein can be a challenging task. Automating these processes can help to unleash insightful knowledge that otherwise remains buried in them. Moreover, integrating the extracted information from the documents with other related information can help to make more information-rich queries. In this context, the paper presents a comprehensive review of text extraction and data integration techniques to enable this automation process in an ecological context. The paper investigates into extracting valuable floristic information from a historical Botany journal. The purpose behind this extraction is to bring to light relevant pieces of information contained within the document. In addition, the paper also explores the need to integrate the extracted information together with other related information from disparate sources. All the information is then rendered into a query-able form in order to make unified queries. Hence, the paper makes use of a combination of Machine Learning, Natural Language Processing and Semantic Web techniques to achieve this. The proposed approach is demonstrated through the information extracted from the journal and the information-rich queries made through the integration process. The paper shows that the approach has a merit in extracting relevant information from the journal, discusses how the machine learning models have been designed to classify complex information and also gives a measure of their performance. The paper also shows that the approach has a merit in query time in regard to querying floristic information from a multi-source linked data model.",

keywords = "Multidisciplinary",

author = "Vatsala Nundloll and Robert Smail and Carly Stevens and Gordon Blair",

year = "2022",

month = oct,

day = "10",

doi = "10.1016/j.heliyon.2022.e10710",

language = "English",

volume = "8",

journal = "Heliyon",

issn = "2405-8440",

publisher = "Elsevier",

number = "10",

}

RIS

TY - JOUR

T1 - Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science

AU - Nundloll, Vatsala

AU - Smail, Robert

AU - Stevens, Carly

AU - Blair, Gordon

PY - 2022/10/10

Y1 - 2022/10/10

AB - Data heterogeneity is a pressing issue and is further compounded if we have to deal with data from textual documents. The unstructured nature of such documents implies that collating, comparing and analysing the information contained therein can be a challenging task. Automating these processes can help to unleash insightful knowledge that otherwise remains buried in them. Moreover, integrating the extracted information from the documents with other related information can help to make more information-rich queries. In this context, the paper presents a comprehensive review of text extraction and data integration techniques to enable this automation process in an ecological context. The paper investigates into extracting valuable floristic information from a historical Botany journal. The purpose behind this extraction is to bring to light relevant pieces of information contained within the document. In addition, the paper also explores the need to integrate the extracted information together with other related information from disparate sources. All the information is then rendered into a query-able form in order to make unified queries. Hence, the paper makes use of a combination of Machine Learning, Natural Language Processing and Semantic Web techniques to achieve this. The proposed approach is demonstrated through the information extracted from the journal and the information-rich queries made through the integration process. The paper shows that the approach has a merit in extracting relevant information from the journal, discusses how the machine learning models have been designed to classify complex information and also gives a measure of their performance. The paper also shows that the approach has a merit in query time in regard to querying floristic information from a multi-source linked data model.

KW - Multidisciplinary

U2 - 10.1016/j.heliyon.2022.e10710

DO - 10.1016/j.heliyon.2022.e10710

M3 - Journal article

VL - 8

JO - Heliyon

JF - Heliyon

SN - 2405-8440

IS - 10

M1 - e10710

ER -
