Electronic data

  • mcwc-elhaj

    Accepted author manuscript, 738 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

The Multilingual Corpus of World’s Constitutions (MCWC)

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN > Conference contribution/Paper > peer-review

Published

Standard

The Multilingual Corpus of World’s Constitutions (MCWC). / El-Haj, Mo; Ezzini, Saad.
6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings. ed. / Hend Al-Khalifa; Kareem Darwish; Hamdy Mubarak; Mona Ali; Tamer Elsayed. Turin: European Language Resources Association (ELRA), 2024. p. 57-66 (6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings).

Harvard

El-Haj, M & Ezzini, S 2024, The Multilingual Corpus of World’s Constitutions (MCWC). in H Al-Khalifa, K Darwish, H Mubarak, M Ali & T Elsayed (eds), 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings. 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings, European Language Resources Association (ELRA), Turin, pp. 57-66, 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation, Torino, Italy, 25/05/24. <https://aclanthology.org/2024.osact-1.7/>

APA

El-Haj, M., & Ezzini, S. (2024). The Multilingual Corpus of World’s Constitutions (MCWC). In H. Al-Khalifa, K. Darwish, H. Mubarak, M. Ali, & T. Elsayed (Eds.), 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings (pp. 57-66). (6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings). European Language Resources Association (ELRA). https://aclanthology.org/2024.osact-1.7/

Vancouver

El-Haj M, Ezzini S. The Multilingual Corpus of World’s Constitutions (MCWC). In Al-Khalifa H, Darwish K, Mubarak H, Ali M, Elsayed T, editors, 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings. Turin: European Language Resources Association (ELRA). 2024. p. 57-66. (6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings).

Author

El-Haj, Mo ; Ezzini, Saad. / The Multilingual Corpus of World’s Constitutions (MCWC). 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings. editor / Hend Al-Khalifa ; Kareem Darwish ; Hamdy Mubarak ; Mona Ali ; Tamer Elsayed. Turin : European Language Resources Association (ELRA), 2024. pp. 57-66 (6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings).

Bibtex

@inproceedings{dce8098a46e74e97bf0ae22b61c61fd7,
title = "The Multilingual Corpus of World{\textquoteright}s Constitutions (MCWC)",
abstract = "The “Multilingual Corpus of World{\textquoteright}s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC{\textquoteright}s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.",
keywords = "Constitutions, Corpus, Fine-tuning, Legal Documents, Machine Translation",
author = "Mo El-Haj and Saad Ezzini",
note = "Publisher Copyright: {\textcopyright} 2024 ELRA Language Resource Association.; 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation ; Conference date: 25-05-2024",
year = "2024",
month = may,
day = "25",
language = "English",
series = "6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings",
publisher = "European Language Resources Association (ELRA)",
pages = "57--66",
editor = "Hend Al-Khalifa and Kareem Darwish and Hamdy Mubarak and Mona Ali and Tamer Elsayed",
booktitle = "6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings",

}

RIS

TY - GEN

T1 - The Multilingual Corpus of World’s Constitutions (MCWC)

AU - El-Haj, Mo

AU - Ezzini, Saad

N1 - Publisher Copyright: © 2024 ELRA Language Resource Association.

PY - 2024/5/25

Y1 - 2024/5/25

N2 - The “Multilingual Corpus of World’s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC’s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.

KW - Constitutions

KW - Corpus

KW - Fine-tuning

KW - Legal Documents

KW - Machine Translation

M3 - Conference contribution/Paper

AN - SCOPUS:85195418248

T3 - 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings

SP - 57

EP - 66

BT - 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation at LREC-COLING 2024 - Workshop Proceedings

A2 - Al-Khalifa, Hend

A2 - Darwish, Kareem

A2 - Mubarak, Hamdy

A2 - Ali, Mona

A2 - Elsayed, Tamer

PB - European Language Resources Association (ELRA)

CY - Turin

T2 - 6th Workshop on Open-Source Arabic Corpora and Processing Tools, OSACT 2024 with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation

Y2 - 25 May 2024

ER -