
GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie

Dataset

Description

This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

The dataset is available in the following formats:
JSONL format provided by Prodigy
binary spaCy format (ready to use with the spaCy train pipeline)

The gold standard dataset is composed of 2,200 paragraphs drawn from 2,001 randomly selected Encyclopédie entries. All paragraphs were written in 18th-century French. The spans/entities were labeled by the project team, with pre-labelling by early machine learning models used to speed up the labelling process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain.

The datasets have the following breakdown of tokens and spans/entities.

Tagset

NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity), including natural features, e.g. ville, la rivière, royaume.
NP-Spatial: a proper noun identifying the name of a place (spatial named entity), e.g. France, Paris, la Chine.
ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.
Relation: spatial relation, e.g. dans, sur, à 10 lieues de.
Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.
NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.
NP-Person: a proper noun identifying the name of a person (person named entity), e.g. Louis XIV, Pline.
ENE-Person: nested person entity, e.g. le czar Pierre, roi de Macédoine.
NP-Misc: a proper noun identifying an entity not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.
ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.
Head: entry name.
Domain-Mark: words indicating the knowledge domain (usually after the head and in parentheses), e.g. Géographie, Geog., en Anatomie.

HuggingFace

The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.

Acknowledgement

The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
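As an illustration of how the JSONL format and the tagset described above fit together, the following minimal Python sketch reads span annotations and counts them per label. It assumes each JSONL line is an object with a "text" field and a "spans" list whose items carry character offsets "start" and "end" plus a "label"; this layout is an assumption based on the usual Prodigy export and should be checked against the actual files. The sample record is invented for illustration and is not taken from the dataset, and the helper names iter_spans and count_labels are hypothetical.

```python
import json
from collections import Counter

# Hypothetical record, NOT taken from the dataset. The field names
# ("text", "spans", "start", "end", "label") are assumed to follow
# the usual Prodigy JSONL layout.
SAMPLE_JSONL = json.dumps(
    {
        "text": "Paris, ville de France sur la Seine.",
        "spans": [
            {"start": 0, "end": 5, "label": "NP-Spatial"},
            {"start": 7, "end": 12, "label": "NC-Spatial"},
            # Nested entity: overlaps the NC-Spatial and NP-Spatial spans.
            {"start": 7, "end": 22, "label": "ENE-Spatial"},
            {"start": 16, "end": 22, "label": "NP-Spatial"},
            {"start": 23, "end": 26, "label": "Relation"},
            {"start": 27, "end": 35, "label": "NP-Spatial"},
        ],
    },
    ensure_ascii=False,
)


def iter_spans(jsonl_text):
    """Yield (label, surface text) pairs for every span in every record."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        for span in record.get("spans", []):
            yield span["label"], text[span["start"]:span["end"]]


def count_labels(jsonl_text):
    """Count how many spans carry each label."""
    return Counter(label for label, _ in iter_spans(jsonl_text))


spans = list(iter_spans(SAMPLE_JSONL))
label_counts = count_labels(SAMPLE_JSONL)

for label, surface in spans:
    print(f"{label}: {surface}")
```

To run the same functions on a real file, pass them the contents of one of the JSONL files instead of SAMPLE_JSONL. Because nested entities (the ENE-* labels) overlap their component spans, a span categorization model is a better fit for these annotations than a flat NER tagger that allows only one label per token.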
Date made available: 8/03/2024
Publisher: Zenodo

Contact person