Electronic data

  • 2024.lrec-main.473

    Final published version, 248 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

DORE: A Dataset for Portuguese Definition Generation

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published
Publication date: 20/05/2024
Host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: ELRA and ICCL
Pages: 5315-5322
Number of pages: 8
ISBN (print): 9782493814104
Original language: English
Event: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20/05/2024 to 25/05/2024
https://lrec-coling-2024.org/

Conference

Conference: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Abbreviated title: LREC-COLING 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 to 25/05/24

Abstract

Definition modelling (DM) is the task of automatically generating a dictionary definition of a specific word. Computational systems that are capable of DM can have numerous applications benefiting a wide range of audiences. As DM is considered a supervised natural language generation problem, these systems require large annotated datasets to train the machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. While Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, there is no DM dataset available for Portuguese. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep learning-based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.