Character encoding in corpus construction.

Linguistics and English Language

Electronic data

character_encoding.pdf
125 KB, PDF document

Keywords

character encoding, Unicode, corpus creation

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Chapter

Published

A. M. McEnery
R. Z. Xiao

Publication date	2005
Host publication	Developing Linguistic Corpora : A Guide to Good Practice
Editors	M. Wynne
Place of Publication	Oxford, UK
Publisher	AHDS
Number of pages	0
Original language	English

Abstract

This chapter first briefly reviews the history of character encoding. It then discusses standard and non-standard native encoding systems and evaluates the efforts to unify these character codes, before turning to Unicode and the various Unicode Transformation Formats (UTFs). In conclusion, we recommend that Unicode (UTF-8, to be precise) be used in corpus construction.

Bibliographic note

Standards Documentation
