Unicode Study Notes

Unicode encoding model

The four levels of the Unicode Character Encoding Model can be summarized as:

ACR: Abstract Character Repertoire
the set of characters to be encoded, for example, some alphabet or symbol set

CCS: Coded Character Set
a mapping from an abstract character repertoire to a set of nonnegative integers

CEF: Character Encoding Form
a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers

CES: Character Encoding Scheme
a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes

In addition to the four individual levels, there are two other useful concepts:

CM: Character Map
a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation

TES: Transfer Encoding Syntax
a reversible transform of encoded data, which may or may not contain textual data

Unicode Character Encoding Model -- unicode.org

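The four levels
1. Abstract character repertoire (ACR): the set of abstract characters to encode, for example the letters and symbols we write every day.
2. Coded character set (CCS): maps every abstract character to a nonnegative integer (a code point).
3. Character encoding form (CEF): converts the integers from the previous level into sequences of code units.
4. Character encoding scheme (CES): serializes the code units from the previous level into a sequence of bytes, fixing the byte order where it matters. Examples: UTF-8, UTF-16BE, UTF-16LE.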

Basic concepts

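Code point (U+0000 to U+10FFFF)
- Notation: U+1FFFF, that is, U+ followed by a hexadecimal number.
- A code point is a number that identifies an abstract character. Some code points, such as the surrogates, are not characters on their own.
- The same code point can occupy a different amount of space in different encodings. For example, every code point takes 4 bytes in UTF-32, while in UTF-8 a code point takes a variable 1 to 4 bytes.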
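As a rough illustration of the points above, the following Python sketch prints a few code points in U+ notation together with the number of bytes each one takes in UTF-8, UTF-16, and UTF-32. The helper names `u_notation` and `encoded_sizes` are made up for this example; the big-endian codec variants are used only so that no BOM is added to the byte counts.

```python
def u_notation(ch: str) -> str:
    """Format the code point of a single character as U+XXXX (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


def encoded_sizes(ch: str) -> dict:
    """Number of bytes the character occupies in each encoding (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


# One ASCII letter, one BMP character, one supplementary-plane character.
for cp in (0x41, 0x4E2D, 0x10437):
    ch = chr(cp)
    print(u_notation(ch), encoded_sizes(ch))
```

UTF-32 stays at 4 bytes for every code point, while the UTF-8 and UTF-16 sizes depend on how large the code point is.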
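Code unit

Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
- Within an encoding form, the code unit is the smallest group of bits used as a building block for characters; one character may need several code units. A code unit is 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.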
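To make the idea of code units concrete, here is a small sketch that splits the same text into 8-bit, 16-bit, and 32-bit code units. The function name `code_units` is hypothetical; it relies only on Python's built-in codecs and the standard `struct` module.

```python
import struct


def code_units(text: str, width_bits: int) -> list:
    """Return the code units of `text` as integers for UTF-8 (8), UTF-16 (16) or UTF-32 (32)."""
    codec, fmt = {
        8: ("utf-8", "B"),
        16: ("utf-16-be", "H"),
        32: ("utf-32-be", "I"),
    }[width_bits]
    data = text.encode(codec)
    count = len(data) // (width_bits // 8)
    # Big-endian unpacking matches the "-be" codecs used above.
    return list(struct.unpack(f">{count}{fmt}", data))


ch = chr(0x10437)
print([hex(u) for u in code_units(ch, 8)])   # four 8-bit code units
print([hex(u) for u in code_units(ch, 16)])  # two 16-bit code units
print([hex(u) for u in code_units(ch, 32)])  # one 32-bit code unit
```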
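UTF-32 (UCS-4)
- Fixed length of 4 bytes (32 bits) for every code point, whether it is in the BMP or in a supplementary plane such as the SMP; shorter values are padded with leading zeros.
UCS-2
- Fixed length of 2 bytes (16 bits). It can only represent the BMP.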
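UTF-16
- Its predecessor is UCS-2, which cannot represent code points outside the BMP (for example, those in the SMP); UTF-16 was created to remove that limitation.
- 2 or 4 bytes (16 or 32 bits) per code point.
- Structure:
  1. U+0000..U+D7FF and U+E000..U+FFFF // BMP code points, each stored directly as a single 16-bit code unit
  2. U+D800..U+DFFF // surrogates, used in pairs to represent code points above U+FFFF (such as the SMP)
    - Two 2-byte code units.
    - High surrogate: the first 2 bytes, in the range U+D800..U+DBFF.
    - Low surrogate: the second 2 bytes, in the range U+DC00..U+DFFF.
    - Calculation (example: 0x10437):
      1. Subtract 0x10000: result = 0x00437, in binary 0000 0000 0100 0011 0111.
      2. Split the result into the high 10 bits (0000000001 = 0x0001) and the low 10 bits (0000110111 = 0x0037).
      3. high surrogate = 0x0001 + 0xD800 = 0xD801
      4. low surrogate = 0x0037 + 0xDC00 = 0xDC37
      5. So the UTF-16 representation of 0x10437 is the code unit pair 0xD801 0xDC37.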
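- Storage
  - Because each code unit spans more than one byte, there are two possible byte orders.
  - UTF-16BE // big-endian (assumed by default when no BOM or other byte-order information is present)
  - UTF-16LE // little-endian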
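The surrogate calculation and the two byte orders described above can be sketched in Python as follows. The function name `to_surrogate_pair` is an illustrative choice, not part of any library; the result is cross-checked against Python's built-in `utf-16-be` and `utf-16-le` codecs.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # step 1: a 20-bit value
    high = 0xD800 + (offset >> 10)      # steps 2-3: top 10 bits added to 0xD800
    low = 0xDC00 + (offset & 0x3FF)     # steps 2 and 4: bottom 10 bits added to 0xDC00
    return high, low


high, low = to_surrogate_pair(0x10437)
print(hex(high), hex(low))

# The same two code units serialized in each byte order.
big_endian = high.to_bytes(2, "big") + low.to_bytes(2, "big")
little_endian = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(big_endian.hex(), little_endian.hex())

# Cross-check against the built-in codecs.
assert big_endian == chr(0x10437).encode("utf-16-be")
assert little_endian == chr(0x10437).encode("utf-16-le")
```

The code units are identical in both cases; only the order of the two bytes inside each 16-bit unit changes.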
UTF-8
- Pattern
- Example
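The notes leave the UTF-8 pattern and example unfilled. As a hedged sketch, and not the author's own material, the function below follows the standard UTF-8 bit layout (one leading byte that signals the sequence length, followed by continuation bytes of the form 10xxxxxx). The name `utf8_encode` is made up for this example, and the output is compared with Python's built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xE9, 0x4E2D, 0x10437):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")  # agrees with the built-in codec
    print(f"U+{cp:04X}", encoded.hex(" "))
```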
Byte order mark(BOM)

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:

What byte order, or endianness, the text stream is stored in;

The fact that the text stream is Unicode, to a high level of confidence;

Which of several Unicode encodings that text stream is encoded as.

-- from Byte order mark(BOM)

Avoid using a BOM where possible.
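For streams that do carry a BOM, a consumer can inspect the first few bytes to guess the encoding. The following is a minimal sketch using the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` and the order of the table are choices made for this example.

```python
import codecs

# UTF-32 marks are checked first because the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if `data` starts with a known BOM, else (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = "\ufeffhello".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))
```

This is only a heuristic: UTF-16LE text whose first character after the BOM is U+0000 would be mistaken for UTF-32LE, and data without a BOM gives no answer at all.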

References

Unicode Character Encoding Model -- unicode.org
Code point
Code Point and Code Unit (代码点和代码单元)
Plane
Basic Multilingual Plane
Supplementary Multilingual Plane
UTF-32
UTF-16
What is a “surrogate pair” in Java? -- stackoverflow
Surrogates
UTF-8
Byte order mark(BOM)
