Unicode Basics
Contents
- Introduction to Unicode
- Traditional Character Sets and Unicode
- Glyphs versus Characters
- Overview of Unicode
- Character Encoding Forms and Schemes for Unicode
- Overview of UTF-16
- Overview of UTF-8
- Overview of UTF-32
- Overview of SCSU
- Other Unicode Encodings
- Programming using UTFs
- Serialized Formats
- The Unicode Standard Is An Industry Standard
Introduction to Unicode
Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application executable to work for a global audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other modern systems, provides Internationalization solutions based on Unicode.
This chapter is intended as an introduction to codepages in general and Unicode in particular. For further information, see:
Go to the online ICU demos to see how a Unicode-based server application can handle text in many languages and many encodings.
Traditional Character Sets and Unicode
Representing text-format data in computers is a matter of defining a set of characters and assigning each of them a number and a bit representation. Underlying this basic idea are three related concepts:
-
A character set or repertoire is an unordered collection of characters that can be represented by numeric values.
-
A coded character set maps characters from a character set or repertoire to numeric values.
-
A character encoding scheme defines the representation of numeric values from one or more coded character sets in bits and bytes.
For simple encodings such as ASCII, the last two concepts are basically the same: ASCII assigns 128 characters and control codes to consecutive numbers from 0 to 127. These characters and control codes are encoded as simple, unsigned, binary integers. Therefore, ASCII is both a coded character set and a character encoding scheme.
ASCII only encodes 128 characters, 33 of which are control codes rather than graphic, displayable characters. It was designed to represent English-language text for an American user base, and is therefore insufficient for representing text in almost any language other than American English. In fact, most traditional encodings were limited to one or few languages and scripts.
ASCII offered a natural way to extend it: Designed in the 1960’s to work in systems with 7-bit bytes while most computers and Internet protocols since the 1970’s use 8-bit bytes, the extra bit allowed another 128 byte values to represent more characters. Various encodings were developed that supported different languages. Some of these were based on ASCII, others were not.
Languages such as Japanese need to encode considerably more than 256 characters. Various encoding schemes enable large character sets with thousands or tens of thousands of characters to be represented. Most of those encodings are still byte-based, which means that many characters require two or more bytes of storage space. A process must be developed to interpret some byte values.
Various character sets and encoding schemes have been developed independently, cover only one or few languages each, and are incompatible. This makes it very difficult for a single system to handle text in more than one language at a time, and especially difficult to do so in a way that is interoperable across different systems.
Generally, the minimum requirement for the interoperable exchange of text data is that the encoding (character set & encoding scheme) must be properly specified in the document and in the protocol. For example, email/SMTP and HTML/HTTP provide the means to specify the “charset”, as it is called in Internet standards. However, very often the encoding is not specified, specified incorrectly, or the sender and receiver disagree on its implementation.
The ISO 2022 encoding scheme was created to store text in many different languages. It allows other encodings to be embedded by first announcing them and then switching between them. Full support for all features and possible encodings with ISO 2022 requires complicated processing and the need to support many encodings. For East Asian languages, subsets were developed that cover only one language or a few at a time, but they are much more manageable. ISO 2022 is not well-suited for use in internal processing. It is designed for data exchange.
Glyphs versus Characters
Programmers often need to distinguish between characters and glyphs. A character is the smallest semantic unit in a writing system. It is an abstract concept such as the letter A or the exclamation point. A glyph is the visual presentation of one or more characters, and is often dependent on adjacent characters.
There is not always a one-to-one mapping between characters and glyphs. In many languages (Arabic is a prime example), the way a character looks depends heavily on the surrounding characters. Standard printed Arabic has as many as four different printed representations (glyphs) for every letter of the alphabet. In many languages, two or more letters may combine together into a single glyph (called a ligature), or a single character might be displayed with more than one glyph.
Despite the different visual variants of a particular letter, it still retains its identity. For example, the Arabic letter heh has four different visual representations in common use. Whichever one is used, it still keeps its identity as the letter heh. It is this identity that Unicode encodes, not the visual representation. This also cuts down on the number of independent character values required.
Overview of Unicode
Unicode was developed as a single-coded character set that contains support for all languages in the world. The first version of Unicode used 16-bit numbers, which allowed for encoding 65,536 characters without complicated multibyte schemes. With the inclusion of more characters, and following implementation needs of many different platforms, Unicode was extended to allow more than one million characters. Several other encoding schemes were added. This introduced more complexity into the Unicode standard, but far less than managing a large number of different encodings.
Starting with Unicode 2.0 (published in 1996), the Unicode standard began assigning numbers from 0 to 10ffff16,which requires 21 bits but does not use them completely. This gives more than enough room for all written languages in the world. The original repertoire covered all major languages commonly used in computing. Unicode continues to grow, and it includes more scripts.
The design of Unicode differs in several ways from traditional character sets and encoding schemes:
-
Its repertoire enables users to include text efficiently in almost all languages within a single document.
-
It can be encoded in a byte-based way with one or more bytes per character, but the default encoding scheme uses 16-bit units that allow much simpler processing for all common characters.
-
Many characters, such as letters with accents and umlauts, can be combined from the base character and accent or umlaut modifiers. This combining reduces the number of different characters that need to be encoded separately. “Precomposed” variants for characters that existed in common character sets at the time were included for compatibility.
-
Characters and their usage are well-defined and described. While traditional character sets typically only provide the name or a picture of a character and its number and byte encoding, Unicode has a comprehensive database of properties available for download. It also defines a number of processes and algorithms for dealing with many aspects of text processing to make it more interoperable.
The early inclusion of all characters of commonly used character sets makes Unicode a useful “pivot” point for converting between traditional character sets, and makes it feasible to process non-Unicode text by first converting into Unicode, process the text, and convert it back to the original encoding without loss of data.
The first 128 Unicode code point values are assigned to the same characters as in US-ASCII. For example, the same number is assigned to the same character. The same is true for the first 256 code point values of Unicode compared to ISO 8859-1 (Latin-1) which itself is a direct superset of US-ASCII. This makes it easy to adapt many applications to Unicode because the numbers for many syntactically important characters are the same.
Character Encoding Forms and Schemes for Unicode
Unicode assigns characters a number from 0 to 10FFFF16, giving enough elbow room to allow for unambiguous encoding of every character in common use. Such a character number is called a “code point”.
Unicode code points are just non-negative integer numbers in a certain range. They do not have an implicit binary representation or a width of 21 or 32 bits. Binary representation and unit widths are defined for encoding forms.
For internal processing, the standard defines three encoding forms, and for file storage and protocols, some of these encoding forms have encoding schemes that differ in their byte ordering. The difference between an encoding form and an encoding scheme is that an encoding form maps the character set codes to values that fit into internal data types (like a short in C), while an encoding scheme maps to bits and bytes. For traditional encodings, they are the same since the encoding forms already map to bytes.
The different Unicode encoding forms are optimized for a variety of different uses:
-
UTF-16, the default encoding form, maps a character code point to either one or two 16-bit integers.
-
UTF-8 is a byte-based encoding that offers backwards compatibility with ASCII-based, byte-oriented APIs and protocols. A character is stored with 1, 2, 3, or 4 bytes.
-
UTF-32 is the simplest, but most memory-intensive encoding form: It uses one 32-bit integer per Unicode character.
-
SCSU is an encoding scheme that provides a simple compression of Unicode text. It is designed only for input and output, not for internal use.
ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters (with code points 1000016..10FFFF16). Older versions of ICU provided only partial support for supplementary characters.
For input/output, character encoding schemes define a byte serialization of text. UTF-8 is itself both an encoding form, and an encoding scheme because it is byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one that serializes the code units in big-endian byte order (most significant byte first), and one that serializes the code units in little-endian byte order (least significant byte first). The corresponding encoding schemes are called UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
The names “UTF-16” and “UTF-32” are ambiguous. Depending on context, they refer either to character encoding forms where 16/32-bit words are processed and are naturally stored in the platform endianness, or they refer to the IANA-registered charset names, i.e., to character encoding schemes or byte serializations. In addition to simple byte serialization, the charsets with these names also use optional Byte Order Marks (see Serialized Formats below).
Overview of UTF-16
The default encoding form of the Unicode Standard uses 16-bit code units. Code point values for the most common characters are in the range of 0 to FFFF16 and are encoded with just one 16-bit unit of the same value. Code points from 1000016 to 10FFFF16 are encoded with two code units that are often called “surrogates”, and they are called a “surrogate pair” when, together, they correctly encode one Unicode character. The first surrogate in a pair must be in the range D80016 to DBFF16, and the second one must be in the range DC0016 to DFFF16. Every Unicode code point has only one possible UTF-16 encoding with either one code unit that is not a surrogate or with a correct pair of surrogates. The code point values D80016 to DFFF16 are set aside just for this mechanism and will never, by themselves, be assigned any characters.
Most commonly used characters have code points below FFFF16, but Unicode 3.1 assigns more than 40,000 supplementary characters that make use of surrogate pairs in UTF-16.
Note that comparing UTF-16 strings lexically based on their 16-bit code units does not result in the same order as comparing the code points. This is not usually an issue since only rarely-used characters are affected. Most processes do not rely on the same results in such comparisons. Where necessary, a simple modification to a string comparison can be performed that still allows efficient code unit-based comparisons and makes them compatible with code point comparisons. ICU has C and C API functions for this.
Overview of UTF-8
To meet the requirements of byte-oriented, ASCII-based systems, the Unicode Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that preserves ASCII transparency.
UTF-8 maintains transparency for all the ASCII code values (0..127). These values do not appear in any byte of a transformed result except as the direct representation of the ASCII values. Thus, ASCII text is also UTF-8 text.
Characteristics of UTF-8 include:
-
Unicode code points 0 to 7F16 are each encoded with a single byte of the same value. Therefore, ASCII characters take up 50% less space with UTF-8 encoding than with UTF-16.
-
All other code points are encoded with multibyte sequences, with the first byte (lead byte) indicating the number of bytes that follow (trail bytes). This results in very efficient parsing. The lead bytes are in the range c016 to fd16, the trail bytes are in the range 8016 to bf16. The byte values fe16 and FF16 are never used.
-
UTF-8 is relatively compact and resource conservative in its use of the bytes required for encoding text in European scripts, but uses 50% more space than UTF-16 for East Asian text. Code points up to 7FF16 take up two bytes, code points up to FFFF16 take up three (50% more memory than UTF-16), and all others four.
-
Binary comparisons of UTF-8 strings based on their bytes result in the same order as comparing code point values.
Overview of UTF-32
The UTF-32 encoding form always uses one single 32-bit integer per Unicode code point. This results in a very simple encoding.
The drawback is its memory consumption: Since code point values use only 21 bits, one-third of the memory is always unused, and since most commonly used characters have code point values of up to FFFF16, they take up only one 16-bit unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).
UTF-32 is mainly used in APIs that are defined with the same data type for both code points and code units. Modern versions of the C standard library that support Unicode use a 32-bit wchar_t
with UTF-32 semantics.
Overview of SCSU
SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of Unicode text for both input and output. It is a simple compression that transforms the text into a byte stream. It typically uses one byte per character in small scripts, and two bytes per character in large, East Asian scripts.
It is usually shorter than any of the UTFs. However, SCSU is stateful, which makes it unsuitable for internal processing. It also uses all possible byte values, which might require additional processing for protocols such as SMTP (email).
See also https://www.unicode.org/reports/tr6/ .
Other Unicode Encodings
Other Unicode encodings have been developed over time for various purposes. Most of them are implemented in ICU, see source/data/mappings/convrtrs.txt
-
BOCU-1: Binary-Ordered Compression of Unicode An encoding of Unicode that is about as compact as SCSU but has a much smaller amount of state. Unlike SCSU, it preserves code point order and can be used in 8bit emails without a transfer encoding. BOCU-1 does not preserve ASCII characters in ASCII-readable form. See Unicode Technical Note #6 .
-
UTF-7: Designed for 7bit emails; simple and not very compact. Since email systems have been 8-bit safe for several years, UTF-7 is not necessary any more and not recommended. Most ASCII characters are readable, others are base64-encoded. See RFC 2152 .
-
IMAP-mailbox-name: A variant of UTF-7 that is suitable for expressing Unicode strings as ASCII characters for Unix filenames. The name “IMAP-mailbox-name” is specific to ICU! See RFC 2060 INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4rev1 section 5.1.3. Mailbox International Naming Convention.
-
UTF-EBCDIC: An EBCDIC-friendly encoding that is similar to UTF-8. See Unicode Technical Report #16 . As of ICU 2.6, UTF-EBCDIC is not implemented in ICU.
-
CESU-8: Compatibility Encoding Scheme for UTF-16: 8-Bit An incompatible variant of UTF-8 that preserves 16-bit-Unicode (UTF-16) string order instead of code point order. Not for open interchange. See Unicode Technical Report #26 .
Programming using UTFs
Programming using any of the UTFs is much more straightforward than with traditional multi-byte character encodings, even though UTF-8 and UTF-16 are also variable-width encodings.
Within each Unicode encoding form, the code unit values for singletons (code units that alone encode characters), lead units, and for trailing units are all disjointed. This has crucial implications for implementations. The following lists these implications:
-
Determines the number of units for one code point using the lead unit. This is especially important for UTF-8, where there can be up to 4 bytes per character.
-
Determines boundaries. If ICU users randomly access text, you can always determine the nearest code-point boundaries with a small number of machine instructions.
-
Does not have any overlap. If ICU users search for string A in string B, you never get a false match on code points. Users do not need to convert to code points for string searching. False matches never occurs since the end of one sequence is never the same as the start of another sequence. Overlap is one of the biggest problems with common multi-byte encodings like Shift-JIS. All the UTFs avoid this problem.
-
Uses simple iteration. Getting the next or previous code point is straightforward, and only takes a small number of machine instructions.
-
Can use UTF-16 encoding, which is actually fully symmetric. ICU users can determine from any single code unit whether it is the first, last, or only one for a code point. Moving (iterating) in either direction through UTF-16 text is equally fast and efficient.
-
Uses slow indexing by code points. This indexing procedure is a disadvantage of all variable-width encodings. Except in UTF-32, it is inefficient to find code unit boundaries corresponding to the nth code point or to find the code point offset containing the nth code unit. Both involve scanning from the start of the text or from a last known boundary. ICU, like most common APIs, always indexes by code units. It counts code units and not code points.
Conversion between different UTFs is very fast. Unlike converting to and from legacy encodings like Latin-2, conversion between UTFs does not require table look-ups.
ICU provides two basic data type definitions for Unicode. UChar32
is a 32-bit type for code points, and used for single Unicode characters. It may be signed or unsigned. It is the same as wchar_t
if it is 32 bits wide. UChar
is an unsigned 16-bit integer for UTF-16 code units. It is the base type for strings (UChar *
), and it is the same as wchar_t
if it is 16 bits wide.
Some higher-level APIs, used especially for formatting, use characters closer to a representation for a glyph. Such “user characters” are also called “graphemes” or “grapheme clusters” and require strings so that combining sequences can be included.
Serialized Formats
In files, input, output, and network protocols, text must be accompanied by the specification of its character encoding scheme for a client to be able to interpret it correctly. (This is called a “charset” in Internet protocols.) However, an encoding scheme specification is not necessary if the text is only used within a single platform, protocol, or application where it is otherwise clear what the encoding is. (The language and text directionality should usually be specified to enable spell checking, text-to-speech transformation, etc.)
The discussion of encoding specifications in this section applies to standard Internet protocols where charset name strings are used. Other protocols may use numeric encoding identifiers and assign different semantics to those identifiers than Internet protocols.
Typically, the encoding specification is done in a protocol- and document format-dependent way. However, the Unicode standard offers a mechanism for tagging text files with a “signature” for cases where protocols do not identify character encoding schemes.
The character ZERO WIDTH NO-BREAK SPACE (FEFF16) can be used as a signature by prepending it to a file or stream. The alternative function of U FEFF as a format control character has been copied to U 2060 WORD JOINER, and U FEFF should only be used for Unicode signatures.
The different character encoding schemes generate different, distinct byte sequences for U FEFF:
-
UTF-8: EF BB BF
-
UTF-16BE: FE FF
-
UTF-16LE: FF FE
-
UTF-32BE: 00 00 FE FF
-
UTF-32LE: FF FE 00 00
-
SCSU: 0E FE FF
-
BOCU-1: FB EE 28
-
UTF-7: 2B 2F 76 ( 38 39 2B 2F ) - UTF-EBCDIC: DD 73 66 73
ICU provides the function ucnv_detectUnicodeSignature()
for Unicode signature detection.
There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and CESU-8 encode U FEFF and in fact all BMP code points with the same bytes. The opportunity for misidentification of one as the other is one of the reasons why CESU-8 should only be used in limited, closed, specific environments.
In UTF-16 and UTF-32, where the signature also distinguishes between big-endian and little-endian byte orders, it is also called a byte order mark (BOM). The signature works for UTF-16 since the code point that has the byte-swapped encoding, FFFE16, will never be a valid Unicode character. (It is a “non-character” code point.) In Internet protocols, if an encoding specification of “UTF-16” or “UTF-32” is used, it is expected that there is a signature byte sequence (BOM) that identifies the byte ordering, which is not the case for the encoding scheme/charset names with “BE” or “LE”.
If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE, respectively.
A signature is not part of the content, and must be stripped when processing. For example, blindly concatenating two files will give an incorrect result.
If a signature was detected, then the signature “character” U FEFF should be removed from the Unicode stream after conversion. Removing the signature bytes before conversion could cause the conversion to fail for stateful encodings like BOCU-1 and UTF-7.
Whether a signature is to be recognized or not depends on the protocol or application.
-
If a protocol specifies a charset name, then the byte stream must be interpreted according to how that name is defined. Only the “UTF-16” and “UTF-32” names include recognition of the byte order marks that are specific to them (and the ICU converters for these names do this automatically). None of the other Unicode charsets are defined to include any signature/BOM handling.
-
If no charset name is provided, for example for text files in most filesystems, then applications must usually rely on heuristics to determine the file encoding. Many document formats contain an embedded or implicit encoding declaration, but for plain text files it is reasonable to use Unicode signatures as simple and reliable heuristics. This is especially common on Windows systems. However, some tools for plain text file handling (e.g., many Unix command line tools) are not prepared for Unicode signatures.
The Unicode Standard Is An Industry Standard
The Unicode standard is an industry standard and parallels ISO 10646-1. Around 1993, these two standards were effectively merged into the same character set standard. Both standards have the same character repertoire and the same encoding forms and schemes.
One difference used to be that the ISO standard defined code point values to be from 0 to 7FFFFFFF16, not just up to 10FFFF16. The ISO work group decided to add an amendment to the standard. The amendment removes this difference by declaring that no characters will ever be assigned code points above 10FFFF16. The main reason for the ISO work group’s decision is interoperability between the UTFs. UTF-16 can not encode any code points above this limit.
This means that the code point space for both Unicode and ISO 10646 is now the same! These changes to ISO 10646 have been made recently and should be complete in the edition ISO 10646:2003 which also combines all parts of the standard into one.
The former, larger code space is the reason why the ISO definition of UTF-8 specifies sequences of five and six bytes to cover that whole range.
Another difference is that the ISO standard defines encoding forms “UCS-4” and “UCS-2”. UCS-4 is essentially UTF-32 with a theoretical upper limit of 7FFFFFFF16, using 31 out of the 32 bits. However, in practice, the ISO committee has accepted that the characters above 10FFFF will not be encoded, so there is essentially no difference between the forms. The “4” stands for “four-byte form”.
UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF, excluding the surrogate code points. Thus, it cannot represent the characters with code points above FFFF (called supplementary characters).
There is no conversion necessary between UCS-2 and UTF-16. The difference is only in the interpretation of surrogates.
The standards differ in what kind of information they provide: The Unicode standard provides more character properties and describes algorithms etc., while the ISO standard defines collections, subsets and similar.
The standards are synchronized, and the respective committees work together to add new characters and assign code point values.