A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation

Electronic data

1-s2.0-S0885230816302121-main
Final published version, 2.46 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.1016/j.csl.2017.04.010
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Keywords

Semantic Annotation, Natural Language Processing, Historical Thesaurus, Semantic Lexicon, Corpus Annotation, Language Technology

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation. / Piao, Scott Songlin; Dallachy, Fraser; Baron, Alistair et al.
In: Computer Speech and Language, Vol. 46, 11.2017, p. 113-135.

Harvard

Piao, SS, Dallachy, F, Baron, A, Demmen, JE, Wattam, S, Durkin, P, McCracken, J, Rayson, PE & Alexander, M 2017, 'A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation', Computer Speech and Language, vol. 46, pp. 113-135. https://doi.org/10.1016/j.csl.2017.04.010

APA

Piao, S. S., Dallachy, F., Baron, A., Demmen, J. E., Wattam, S., Durkin, P., McCracken, J., Rayson, P. E., & Alexander, M. (2017). A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation. Computer Speech and Language, 46, 113-135. https://doi.org/10.1016/j.csl.2017.04.010

Vancouver

Piao SS, Dallachy F, Baron A, Demmen JE, Wattam S, Durkin P et al. A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation. Computer Speech and Language. 2017 Nov;46:113-135. Epub 2017 May 17. doi: 10.1016/j.csl.2017.04.010

Author

Piao, Scott Songlin ; Dallachy, Fraser ; Baron, Alistair et al. / A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation. In: Computer Speech and Language. 2017 ; Vol. 46. pp. 113-135.

Bibtex

@article{7e0cf9d6d3d94734aac4faaad8e15564,

title = "A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation",

abstract = "Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only for modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically-valid highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories. In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.",

keywords = "Semantic Annotation, Natural Language Processing, Historical Thesaurus, Semantic Lexicon, Corpus Annotation, Language Technology",

author = "Piao, {Scott Songlin} and Fraser Dallachy and Alistair Baron and Demmen, {Jane Elizabeth} and Steve Wattam and Philip Durkin and James McCracken and Rayson, {Paul Edward} and Marc Alexander",

year = "2017",

month = nov,

doi = "10.1016/j.csl.2017.04.010",

language = "English",

volume = "46",

pages = "113--135",

journal = "Computer Speech and Language",

issn = "0885-2308",

publisher = "Academic Press Inc.",

}

RIS

TY - JOUR

T1 - A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation

AU - Piao, Scott Songlin

AU - Dallachy, Fraser

AU - Baron, Alistair

AU - Demmen, Jane Elizabeth

AU - Wattam, Steve

AU - Durkin, Philip

AU - McCracken, James

AU - Rayson, Paul Edward

AU - Alexander, Marc

PY - 2017/11

Y1 - 2017/11

AB - Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only for modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically-valid highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories. In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.

KW - Semantic Annotation

KW - Natural Language Processing

KW - Historical Thesaurus

KW - Semantic Lexicon

KW - Corpus Annotation

KW - Language Technology

U2 - 10.1016/j.csl.2017.04.010

DO - 10.1016/j.csl.2017.04.010

M3 - Journal article

VL - 46

SP - 113

EP - 135

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

ER -
