Biographical Semi-Supervised Relation Extraction Dataset

Computing and Communications

Text available via DOI:

https://doi.org/10.1145/3477495.3531742
Final published version

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Biographical Semi-Supervised Relation Extraction Dataset. / Plum, Alistair; Ranasinghe, Tharindu; Jones, Spencer et al.
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: Association for Computing Machinery (ACM), 2022. p. 3121-3130.

Harvard

Plum, A, Ranasinghe, T, Jones, S, Orasan, C & Mitkov, R 2022, Biographical Semi-Supervised Relation Extraction Dataset. in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery (ACM), New York, pp. 3121-3130, 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11/07/22. https://doi.org/10.1145/3477495.3531742

APA

Plum, A., Ranasinghe, T., Jones, S., Orasan, C., & Mitkov, R. (2022). Biographical Semi-Supervised Relation Extraction Dataset. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3121-3130). Association for Computing Machinery (ACM). https://doi.org/10.1145/3477495.3531742

Vancouver

Plum A, Ranasinghe T, Jones S, Orasan C, Mitkov R. Biographical Semi-Supervised Relation Extraction Dataset. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: Association for Computing Machinery (ACM). 2022. p. 3121-3130 doi: 10.1145/3477495.3531742

Author

Plum, Alistair ; Ranasinghe, Tharindu ; Jones, Spencer et al. / Biographical Semi-Supervised Relation Extraction Dataset. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York : Association for Computing Machinery (ACM), 2022. pp. 3121-3130

Bibtex

@inproceedings{eb2c66159c77405395fa380f538609b1,

title = "Biographical Semi-Supervised Relation Extraction Dataset",

abstract = "Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.",

author = "Alistair Plum and Tharindu Ranasinghe and Spencer Jones and Constantin Orasan and Ruslan Mitkov",

year = "2022",

month = jul,

day = "7",

doi = "10.1145/3477495.3531742",

language = "English",

pages = "3121--3130",

booktitle = "Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval",

publisher = "Association for Computing Machinery (ACM)",

address = "United States",

note = "45th International ACM SIGIR Conference on Research and Development in Information Retrieval ; Conference date: 11-07-2022 Through 15-07-2022",

}

RIS

TY - GEN

T1 - Biographical Semi-Supervised Relation Extraction Dataset

AU - Plum, Alistair

AU - Ranasinghe, Tharindu

AU - Jones, Spencer

AU - Orasan, Constantin

AU - Mitkov, Ruslan

PY - 2022/7/7

Y1 - 2022/7/7

N2 - Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.

AB - Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.

U2 - 10.1145/3477495.3531742

DO - 10.1145/3477495.3531742

M3 - Conference contribution/Paper

SP - 3121

EP - 3130

BT - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - Association for Computing Machinery (ACM)

CY - New York

T2 - 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Y2 - 11 July 2022 through 15 July 2022

ER -
