Final published version
Licence: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution to Journal/Magazine › Journal article › peer-review
}
TY - JOUR
T1 - ParlaMint II
T2 - advancing comparable parliamentary corpora across Europe
AU - Erjavec, Tomaž
AU - Kopp, Matyáš
AU - Ljubešić, Nikola
AU - Kuzman, Taja
AU - Rayson, Paul
AU - Osenova, Petya
AU - Ogrodniczuk, Maciej
AU - Çöltekin, Çağrı
AU - Koržinek, Danijel
AU - Meden, Katja
AU - Skubic, Jure
AU - Rupnik, Peter
AU - Agnoloni, Tommaso
AU - Aires, José
AU - Barkarson, Starkaður
AU - Bartolini, Roberto
AU - Bel, Núria
AU - Calzada Pérez, María
AU - Darģis, Roberts
AU - Diwersy, Sascha
AU - Gavriilidou, Maria
AU - van Heusden, Ruben
AU - Iruskieta, Mikel
AU - Kahusk, Neeme
AU - Kryvenko, Anna
AU - Ligeti-Nagy, Noémi
AU - Magariños, Carmen
AU - Mölder, Martin
AU - Navarretta, Costanza
AU - Simov, Kiril
AU - Tungland, Lars Magne
AU - Tuominen, Jouni
AU - Vidler, John
AU - Vladu, Adina Ioana
AU - Wissik, Tanja
AU - Yrjänäinen, Väinö
AU - Fišer, Darja
PY - 2024/12/28
Y1 - 2024/12/28
N2 - The paper presents the results of the ParlaMint II project, which comprise comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities. The paper focuses on the enhancements made since the ParlaMint I project and presents the compilation of the corpora, including the encoding infrastructure, use of GitHub, the production of individual corpora, the common pipeline for producing their distribution, and use of CLARIN services for dissemination. It then gives a quantitative overview of the produced corpora, followed by the qualitative additions made within the ParlaMint II project, namely metadata localisation, the addition of new metadata, such as the political orientation of political parties, the machine translation of the corpora to English and its tagging with semantic classes, and the production of pilot speech corpora. Finally, outreach activities and further work are discussed.
AB - The paper presents the results of the ParlaMint II project, which comprise comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and are linguistically annotated up to the level of Universal Dependencies syntax and named entities. The paper focuses on the enhancements made since the ParlaMint I project and presents the compilation of the corpora, including the encoding infrastructure, use of GitHub, the production of individual corpora, the common pipeline for producing their distribution, and use of CLARIN services for dissemination. It then gives a quantitative overview of the produced corpora, followed by the qualitative additions made within the ParlaMint II project, namely metadata localisation, the addition of new metadata, such as the political orientation of political parties, the machine translation of the corpora to English and its tagging with semantic classes, and the production of pilot speech corpora. Finally, outreach activities and further work are discussed.
KW - Comparable corpora
KW - Parliamentary proceedings
KW - TEI
U2 - 10.1007/s10579-024-09798-w
DO - 10.1007/s10579-024-09798-w
M3 - Journal article
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
SN - 1574-020X
ER -