Rights statement: The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-014-9274-3
Accepted author manuscript, 1.48 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution to Journal/Magazine › Journal article › peer-review
TY - JOUR
T1 - Creating language resources for under-resourced languages
T2 - methodologies, and experiments with Arabic
AU - El-Haj, Mahmoud
AU - Kruschwitz, Udo
AU - Fox, Chris
N1 - The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-014-9274-3
PY - 2015/9
Y1 - 2015/9
N2 - Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
AB - Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
KW - Resources
KW - Summarisation
KW - Arabic
KW - Under-resourced languages
U2 - 10.1007/s10579-014-9274-3
DO - 10.1007/s10579-014-9274-3
M3 - Journal article
VL - 49
SP - 549
EP - 580
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
SN - 1574-020X
IS - 3
ER -