From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.

Linguistics and English Language

Text available via DOI:

https://doi.org/10.1007/s10579-006-9003-7
Final published version

Keywords

Unicode - Font - Devanagari - South Asian languages/scripts - Legacy text - Encoding - Conversion - Virama - Conjunct consonant - Vowel diacritic

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia. / Hardie, Andrew.
In: Language Resources and Evaluation, Vol. 41, No. 1, 2007, p. 1-25.

Bibtex

@article{a5a89d8202724ddeb1f3ccd8d475660f,

title = "From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.",

abstract = "Much electronic text in the languages of South Asia has been published on the Internet. However, while Unicode has emerged as the favoured encoding system of corpus and computational linguists, most South Asian language data on the web uses one of a wide range of non-standard legacy encodings. This paper describes the difficulties inherent in converting text in these encodings to Unicode. Among the various legacy encodings for South Asian scripts, the most problematic are 8-bit fonts based on graphical principles (as opposed to the logical principles of Unicode). Graphical fonts typically encode several features in ways highly incompatible with Unicode. For instance, half-form glyphs used to construct conjunct consonants are typically separate code points in 8-bit fonts; in Unicode they are represented by the full consonant followed by virama. There are many more such cases. The solution described here is an approach to text conversion based on mapping rules. A small number of generalised rules (plus the capacity for more specialised rules) captures the behaviour of each character in a font, building up a conversion algorithm for that encoding. This system is embedded in a font-mapping program, outputting CES-compliant SGML Unicode. This program, a generalised text-conversion tool, has been employed extensively in corpus-building for South Asian languages.",

keywords = "Unicode - Font - Devanagari - South Asian languages/scripts - Legacy text - Encoding - Conversion - Virama - Conjunct consonant - Vowel diacritic",

author = "Andrew Hardie",

year = "2007",

doi = "10.1007/s10579-006-9003-7",

language = "English",

volume = "41",

pages = "1--25",

journal = "Language Resources and Evaluation",

issn = "1574-020X",

publisher = "Springer Netherlands",

number = "1",

}

RIS

TY - JOUR

T1 - From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia.

AU - Hardie, Andrew

PY - 2007

Y1 - 2007

AB - Much electronic text in the languages of South Asia has been published on the Internet. However, while Unicode has emerged as the favoured encoding system of corpus and computational linguists, most South Asian language data on the web uses one of a wide range of non-standard legacy encodings. This paper describes the difficulties inherent in converting text in these encodings to Unicode. Among the various legacy encodings for South Asian scripts, the most problematic are 8-bit fonts based on graphical principles (as opposed to the logical principles of Unicode). Graphical fonts typically encode several features in ways highly incompatible with Unicode. For instance, half-form glyphs used to construct conjunct consonants are typically separate code points in 8-bit fonts; in Unicode they are represented by the full consonant followed by virama. There are many more such cases. The solution described here is an approach to text conversion based on mapping rules. A small number of generalised rules (plus the capacity for more specialised rules) captures the behaviour of each character in a font, building up a conversion algorithm for that encoding. This system is embedded in a font-mapping program, outputting CES-compliant SGML Unicode. This program, a generalised text-conversion tool, has been employed extensively in corpus-building for South Asian languages.

KW - Unicode - Font - Devanagari - South Asian languages/scripts - Legacy text - Encoding - Conversion - Virama - Conjunct consonant - Vowel diacritic

U2 - 10.1007/s10579-006-9003-7

DO - 10.1007/s10579-006-9003-7

M3 - Journal article

VL - 41

SP - 1

EP - 25

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

IS - 1

ER -
