The Multilingual Corpus of World’s Constitutions (MCWC) - Research Portal

Electronic data

mcwc-elhaj
Accepted author manuscript, 738 KB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

Keywords

Constitutions, Corpus, Fine-tuning, Machine Translation, Legal Documents

The Multilingual Corpus of World’s Constitutions (MCWC): MCWC

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Forthcoming

Standard

The Multilingual Corpus of World’s Constitutions (MCWC): MCWC. / El-Haj, Mahmoud ; Ezzini, Saad.
2024. Paper presented at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy.

Harvard

El-Haj, M & Ezzini, S 2024, 'The Multilingual Corpus of World’s Constitutions (MCWC): MCWC', Paper presented at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, 20/05/24 - 25/05/24.

APA

El-Haj, M., & Ezzini, S. (in press). The Multilingual Corpus of World’s Constitutions (MCWC): MCWC. Paper presented at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy.

Vancouver

El-Haj M, Ezzini S. The Multilingual Corpus of World’s Constitutions (MCWC): MCWC. 2024. Paper presented at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy.

Author

El-Haj, Mahmoud ; Ezzini, Saad. / The Multilingual Corpus of World’s Constitutions (MCWC): MCWC. Paper presented at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy. 10 p.

Bibtex

@conference{cde22b6b202a4973bc564adb5925222f,

title = "The Multilingual Corpus of World{\textquoteright}s Constitutions (MCWC): MCWC",

abstract = "The “Multilingual Corpus of World{\textquoteright}s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high-quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC{\textquoteright}s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.",

keywords = "Constitutions, Corpus, Fine-tuning, Machine Translation, Legal Documents",

author = "Mahmoud El-Haj and Saad Ezzini",

year = "2024",

month = mar,

day = "25",

language = "English",

note = "The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024; Conference date: 20-05-2024 Through 25-05-2024",

url = "https://lrec-coling-2024.org/",

}

RIS

TY - CONF

T1 - The Multilingual Corpus of World’s Constitutions (MCWC)

T2 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

AU - El-Haj, Mahmoud

AU - Ezzini, Saad

PY - 2024/3/25

Y1 - 2024/3/25

N2 - The “Multilingual Corpus of World’s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high-quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC’s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.

AB - The “Multilingual Corpus of World’s Constitutions” (MCWC) is a rich resource available in English, Arabic, and Spanish, encompassing constitutions from various nations. This corpus serves as a vital asset for the NLP community, facilitating advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. To ensure comprehensive coverage, for constitutions not originally available in Arabic and Spanish, we employed a fine-tuned state-of-the-art machine translation model. MCWC prepares its data to ensure high-quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. MCWC’s diverse multilingual content and commitment to data quality contribute to advancements in legal text analysis within the NLP community, facilitating exploration of constitutional texts and multilingual data analysis.

KW - Constitutions

KW - Corpus

KW - Fine-tuning

KW - Machine Translation

KW - Legal Documents

M3 - Conference paper

Y2 - 20 May 2024 through 25 May 2024

ER -
