Electronic data

  • 3582496 (1)

    Accepted author manuscript, 560 KB, PDF document

    Available under license: CC BY: Creative Commons Attribution 4.0 International License

Links

Text available via DOI: https://doi.org/10.1145/3582496

Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods. / Shafi, Jawad; Nawab, Rao Muhammad Adeel; Rayson, Paul.
In: ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Vol. 22, No. 6, 175, 17.06.2023, p. 175:1-175:32.

Harvard

Shafi, J, Nawab, RMA & Rayson, P 2023, 'Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods', ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 22, no. 6, 175, pp. 175:1-175:32. https://doi.org/10.1145/3582496

APA

Shafi, J., Nawab, R. M. A., & Rayson, P. (2023). Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 22(6), 175:1-175:32. Article 175. https://doi.org/10.1145/3582496

Vancouver

Shafi J, Nawab RMA, Rayson P. Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). 2023 Jun 17;22(6):175:1-175:32. 175. Epub 2023 Feb 16. doi: 10.1145/3582496

Author

Shafi, Jawad; Nawab, Rao Muhammad Adeel; Rayson, Paul. / Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods. In: ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). 2023; Vol. 22, No. 6. pp. 175:1-175:32.

Bibtex

@article{61f41cd9ed5a4188a849f5c015852a8f,
title = "Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods",
abstract = "Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. 
To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.",
author = "Jawad Shafi and Nawab, {Rao Muhammad Adeel} and Paul Rayson",
year = "2023",
month = jun,
day = "17",
doi = "10.1145/3582496",
language = "English",
volume = "22",
pages = "175:1--175:32",
journal = "ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)",
issn = "2375-4699",
publisher = "Association for Computing Machinery (ACM)",
number = "6",
}

RIS

TY - JOUR
T1 - Semantic Tagging for the Urdu Language
T2 - Annotated Corpus and Multi-Target Classification Methods
AU - Shafi, Jawad
AU - Nawab, Rao Muhammad Adeel
AU - Rayson, Paul
PY - 2023/6/17
Y1 - 2023/6/17
AB - Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task. 
To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from the proposed corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus which is free and publicly available to download.

U2 - 10.1145/3582496
DO - 10.1145/3582496
M3 - Journal article
VL - 22
SP - 175:1-175:32
JO - ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)
JF - ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)
SN - 2375-4699
IS - 6
M1 - 175
ER -