Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain

Computing and Communications

Text available via DOI:

https://doi.org/10.1007/s10579-024-09738-8
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

B.S. Lancheros
G. Corpas Pastor
R. Mitkov

Journal publication date	30/06/2025
Journal	Language Resources and Evaluation
Volume	59
Number of pages	20
Pages (from-to)	665-684
Publication Status	Published
Early online date	10/05/24
Original language	English

Abstract

Given the growing volume of data produced in the biomedical field and the continued growth of the internet, the need for Information Extraction (IE) techniques has risen sharply. Named Entity Recognition (NER) is one such IE task, useful to professionals in different areas. Biomedical NER is needed in several settings, for instance the extraction and analysis of biomedical literature, relation extraction, the organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain faces a number of challenges, including the high cost of annotation, ambiguity, and the lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the Colorado Richly Annotated Full-Text (CRAFT) dataset. Additionally, a new augmented CRAFT dataset is constructed by replacing 20% of the entities in the original dataset. We evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best-performing NER system on the development set achieved an F1 score of 86.39%. The novel methodology proposed in this paper presents the first bilingual NER system, and it has the potential to improve applications across under-resourced languages.
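The entity-replacement augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how replacement entities are chosen, so everything below is an assumption rather than the authors' method: the data is taken to be BIO-tagged token sequences, replacements are drawn from a lexicon of same-type mentions, and the 20% figure is treated as a per-mention probability. The function names (`extract_spans`, `build_lexicon`, `replace_entities`) and the entity labels in the usage note are hypothetical.

```python
import random
from collections import defaultdict


def extract_spans(tags):
    """Return (start, end, type) entity spans from a BIO tag list; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def build_lexicon(dataset):
    """Collect the distinct mentions of each entity type from (tokens, tags) pairs."""
    mentions = defaultdict(set)
    for tokens, tags in dataset:
        for start, end, etype in extract_spans(tags):
            mentions[etype].add(tuple(tokens[start:end]))
    # Sort so that sampling is reproducible for a fixed random seed.
    return {etype: sorted(found) for etype, found in mentions.items()}


def replace_entities(tokens, tags, lexicon, rate=0.2, rng=random):
    """Swap each entity mention, with probability `rate`, for a same-type mention.

    Assumption: `rate` is applied per mention, so roughly (not exactly)
    that fraction of entities is replaced across a large dataset.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_spans(tags):
        # Copy the tokens between the previous entity and this one unchanged.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        mention = tokens[start:end]
        if lexicon.get(etype) and rng.random() < rate:
            mention = list(rng.choice(lexicon[etype]))
        # Re-emit BIO tags, since the new mention may differ in length.
        new_tokens.extend(mention)
        new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags
```

A call such as `replace_entities(tokens, tags, build_lexicon(corpus), rate=0.2, rng=random.Random(13))` would return a new sentence whose tags stay aligned with its tokens even when a replacement mention is longer or shorter than the original. Under the same assumptions, the augmented sentences could either be appended to the original training set (concatenation) or used in a separate training stage (continuous training), the two setups the abstract compares.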
