Final published version, 215 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
}
TY - GEN
T1 - ALEXSIS-PT
T2 - The 29th International Conference on Computational Linguistics
AU - North, Kai
AU - Zampieri, Marcos
AU - Ranasinghe, Tharindu
N1 - Conference code: 29
PY - 2022/10/12
Y1 - 2022/10/12
AB - Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
M3 - Conference contribution/Paper
T3 - COLING Proceedings
SP - 6057
EP - 6062
BT - Proceedings of the 29th International Conference on Computational Linguistics
PB - International Committee on Computational Linguistics
Y2 - 12 October 2022 through 17 October 2022
ER -