An Urdu semantic tagger - lexicons, corpora, methods and tools - Research Portal

Electronic data

2019jawadshafiphd
Final published version, 1.72 MB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Text available via DOI:

https://doi.org/10.17635/lancaster/thesis/831
Final published version

Keywords

Urdu NLP, semantic annotation tool, Semantic Tagger, Multi-Target classification, semantic lexicon, POS tagger, Urdu tokenizers, Urdu corpora, Urdu parallel corpus

An Urdu semantic tagger - lexicons, corpora, methods and tools

Research output: Thesis › Doctoral Thesis

Published

Standard

An Urdu semantic tagger - lexicons, corpora, methods and tools. / Shafi, Jawad.
Lancaster University, 2019. 266 p.

Harvard

Shafi, J 2019, 'An Urdu semantic tagger - lexicons, corpora, methods and tools', PhD, SCC (Data Science), Lancaster University. https://doi.org/10.17635/lancaster/thesis/831

APA

Shafi, J. (2019). An Urdu semantic tagger - lexicons, corpora, methods and tools. [Doctoral Thesis, SCC (Data Science), Lancaster University]. Lancaster University. https://doi.org/10.17635/lancaster/thesis/831

Vancouver

Shafi J. An Urdu semantic tagger - lexicons, corpora, methods and tools. Lancaster University, 2019. 266 p. doi: 10.17635/lancaster/thesis/831

Author

Shafi, Jawad. / An Urdu semantic tagger - lexicons, corpora, methods and tools. Lancaster University, 2019. 266 p.

Bibtex

@phdthesis{1131428ef346411caa2f10a019015cac,

title = "An Urdu semantic tagger - lexicons, corpora, methods and tools",

abstract = "Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis and semantic role labelling. These tools identify or tag only partial core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed for the Urdu language in order to provide comprehensive semantic analysis of Urdu text. This research reports on the development of an Urdu semantic tagging tool and discusses the challenges faced during this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech (POS) tagger. Results for these proposed tools are as follows: the word tokenizer achieves an $F_1$ of 94.01\% and an accuracy of 97.21\%, the sentence tokenizer an $F_1$ of 92.59\% and an accuracy of 93.15\%, and the POS tagger an accuracy of 95.14\%. The Urdu semantic tagger incorporates semantic resources (lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical or hybrid techniques.
Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings and named entities. A large multi-target annotated corpus has also been constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target Machine Learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06\% and an Accuracy of 0.94\%. The best lexical coverages of 88.59\%, 99.63\%, 96.71\% and 89.63\% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47\% on the proposed test corpus.",

keywords = "Urdu NLP, semantic annotation tool, Semantic Tagger, Multi-Target classification, semantic lexicon, POS tagger, Urdu tokenizers, Urdu corpora, Urdu parallel corpus",

author = "Jawad Shafi",

year = "2019",

month = sep,

day = "30",

doi = "10.17635/lancaster/thesis/831",

language = "English",

publisher = "Lancaster University",

school = "SCC (Data Science), Lancaster University",

}

RIS

TY - BOOK

T1 - An Urdu semantic tagger - lexicons, corpora, methods and tools

AU - Shafi, Jawad

PY - 2019/9/30

Y1 - 2019/9/30

N2 - Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis and semantic role labelling. These tools identify or tag only partial core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed for the Urdu language in order to provide comprehensive semantic analysis of Urdu text. This research reports on the development of an Urdu semantic tagging tool and discusses the challenges faced during this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech (POS) tagger. Results for these proposed tools are as follows: the word tokenizer achieves an $F_1$ of 94.01\% and an accuracy of 97.21\%, the sentence tokenizer an $F_1$ of 92.59\% and an accuracy of 93.15\%, and the POS tagger an accuracy of 95.14\%. The Urdu semantic tagger incorporates semantic resources (lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical or hybrid techniques.
Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings and named entities. A large multi-target annotated corpus has also been constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target Machine Learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06\% and an Accuracy of 0.94\%. The best lexical coverages of 88.59\%, 99.63\%, 96.71\% and 89.63\% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47\% on the proposed test corpus.

KW - Urdu NLP

KW - semantic annotation tool

KW - Semantic Tagger

KW - Multi-Target classification

KW - semantic lexicon

KW - POS tagger

KW - Urdu tokenizers

KW - Urdu corpora

KW - Urdu parallel corpus

U2 - 10.17635/lancaster/thesis/831

DO - 10.17635/lancaster/thesis/831

M3 - Doctoral Thesis

PB - Lancaster University

ER -
