Corrigendum #1: UTF-8 Shortest Form
Corrigendum: Corrigendum #1: UTF-8 Shortest Form
Effective Date: 2000-Nov-09 [85-M12]
Applicable Versions: 3.0.0 and 3.0.1
Fixed Version: 3.1.0 2001-March
Result Documented In: Chapter 3, Conformance
The conformance clause C12 in
The Unicode Standard,
Version 3.0 forbids the generation of "non-shortest form"
UTF-8, and forbids the interpretation of illegal sequences, but
not the interpretation of "non-shortest form". Where software does
interpret the non-shortest forms, security issues can arise. For
example:
- Process A performs security checks, but does not check for
non-shortest forms.
- Process B accepts the byte sequence from process A,
and transforms it into UTF-16 while interpreting non-shortest forms.
- The UTF-16 text may then contain characters that should have been
filtered out by process A.
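The scenario above can be sketched in a few lines of Python. This is an illustration only: `process_a_filter` and `lenient_decode_2byte` are hypothetical names invented here, the "security check" is a deliberately naive test for the byte 2F, and the lenient decoder handles only one- and two-byte forms.

```python
def process_a_filter(data: bytes) -> bool:
    """Hypothetical process A: accept input only if it contains no '/' byte (0x2F)."""
    return b"/" not in data


def lenient_decode_2byte(data: bytes) -> str:
    """Hypothetical process B: decodes 1- and 2-byte forms without a shortest-form check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # The lead bytes C0 and C1 are accepted here, which is exactly the problem.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte 0x{b:02X} at offset {i}")
    return "".join(out)


# <C0, AF> is a non-shortest form of U+002F '/'.
payload = b"..\xC0\xAF..\xC0\xAFetc"

passed_check = process_a_filter(payload)      # process A sees no 0x2F byte
decoded_by_b = lenient_decode_2byte(payload)  # process B produces '/' anyway

# Python's built-in strict decoder treats the same bytes as an error.
try:
    payload.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

Here `passed_check` is true while `decoded_by_b` contains the slashes that process A was meant to exclude. A decoder that treats <C0, AF> as an error closes the gap.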
To address this issue, the Unicode Technical Committee has modified
the definition of UTF-8 to forbid conformant implementations from
interpreting non-shortest forms for
BMP characters,
and clarified some of the conformance clauses.
These modifications make use of updated notation: see the
Glossary for any
unfamiliar terms.
Change C12 to the following:
C12
(a) When a process generates
data in a Unicode Transformation Format, it shall not emit
ill-formed code unit sequences.
(b) When a process interprets data in a Unicode
Transformation Format, it shall treat illegal
code unit sequences as an error condition.
(c) A conformant process shall not interpret illegal UTF code
unit sequences as characters.
(d) Irregular UTF code unit sequences shall not be used for encoding
any other information.
Add the following notes after C12:
- The definition of each UTF specifies the illegal code unit
sequences in that UTF. For example, the definition of UTF-8 (D36)
specifies that code unit sequences such as <C0, AF> are illegal.
- Internally, a particular function might be used that does not
check for illegal code unit sequences. However, a conformant process
can use that function only on data that has already been
certified to not contain any illegal code unit sequences.
- Processes that require unique representation must not interpret
irregular UTF code unit sequences as characters. They may, for
example, reject or remove those sequences.
- Processes may transform irregular code unit sequences into the
equivalent well-formed code unit sequences.
- Conformant processes cannot interpret illegal code unit
sequences. However, the conformance clauses do not, for example,
prevent utility programs from operating on "mangled" text. For
example, a UTF-8 file could have had CRLF sequences introduced at
every 80 bytes by a bad mailer program. This could result in some
UTF-8 byte sequences being interrupted by CRLFs, producing illegal
byte sequences. This mangled text is no longer UTF-8. It is
permissible for a conformant program to repair such text, recognizing
that the mangled text was originally well-formed UTF-8 byte sequences.
However, such repair of mangled data is a special case, and must not
be used in circumstances where it would cause security problems.
Delete the second sentence in the note under D32:
For example, UTF-8 allows nonshortest code value sequences
to be interpreted: a UTF-8 conformant process may map the code value sequence
C0 80 (11000000₂ 10000000₂) to the Unicode value
U+0000, even though a UTF-8 conformant process shall never
generate that code value sequence -- it shall generate the sequence 00
(00000000₂) instead.
Modify D36 as follows, and add a note:
D36
(a) UTF-8 is the Unicode
Transformation Format that serializes a Unicode code point as a
sequence of one to four bytes, as specified in Table 3.1, UTF-8
Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that
does not match the patterns listed in Table 3.1B, Legal UTF-8
Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence
where the first three bytes correspond to a high surrogate, and the
next three bytes correspond to a low surrogate. As a consequence of
C12, these irregular UTF-8 sequences shall not be generated by a
conformant process.
- In UTF-8, <004D 0061 0072 006B> is serialized as <4D 61 72 6B>.
- The problematic "non-shortest form" byte sequences in UTF-8
were those where BMP characters could be represented in more than one
way. These sequences are illegal, since they are not allowed by Table
3.1B.
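As an illustration of D36(c), the following Python sketch recognizes an irregular six-byte sequence and transforms it into the equivalent four-byte form, one of the options that the notes after C12 permit. The function names are hypothetical and chosen for this example; the sketch assumes the input is exactly one candidate sequence rather than a longer stream.

```python
def decode_3byte(b0: int, b1: int, b2: int) -> int:
    """Apply the three-byte pattern 1110zzzz 10yyyyyy 10xxxxxx to raw byte values."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_irregular(seq: bytes) -> bool:
    """True if seq is three bytes for a high surrogate followed by three for a low one."""
    if len(seq) != 6:
        return False
    # Both halves must have the three-byte shape before they are decoded.
    for i in (0, 3):
        if (seq[i] & 0xF0) != 0xE0:
            return False
        if (seq[i + 1] & 0xC0) != 0x80 or (seq[i + 2] & 0xC0) != 0x80:
            return False
    high = decode_3byte(*seq[0:3])
    low = decode_3byte(*seq[3:6])
    return 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF


def regularize(seq: bytes) -> bytes:
    """Transform an irregular six-byte sequence into the well-formed four-byte form."""
    if not is_irregular(seq):
        raise ValueError("not an irregular UTF-8 sequence")
    high = decode_3byte(*seq[0:3])
    low = decode_3byte(*seq[3:6])
    scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(scalar).encode("utf-8")


# <D800 DC00> written as two three-byte sequences, i.e. an irregular form of U+10000.
irregular = b"\xED\xA0\x80\xED\xB0\x80"
regular = regularize(irregular)  # b"\xF0\x90\x80\x80"
```

A process that requires unique representation could instead reject the input whenever `is_irregular` returns true.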
Retain the paragraph and table immediately below D36, but
replace the last sentence in the paragraph.
Table 3.1 specifies the bit distribution from a Unicode character
(or surrogate pair) into the one- to four-byte values of the
corresponding UTF-8 sequence. Note that the four-byte form for
surrogate pairs involves an addition of 10000₁₆, to account
for the starting offset to the encoded values referenced by
surrogates. For a discussion of the difference in the formulation
of UTF-8 in ISO/IEC 10646, see Section C.3, UCS Transformation
Formats. The definition of UTF-8 in Annex D of ISO/IEC
10646-1:2000 also allows for the use of five- and six-byte sequences
to encode characters that are outside the range of the Unicode
character set; those five- and six-byte sequences are illegal for the
use of UTF-8 as a transformation of Unicode characters.
Table 3.1. UTF-8 Bit Distribution
Scalar Value               | UTF-16                              | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
00000000 0xxxxxxx          | 00000000 0xxxxxxx                   | 0xxxxxxx |          |          |
00000yyy yyxxxxxx          | 00000yyy yyxxxxxx                   | 110yyyyy | 10xxxxxx |          |
zzzzyyyy yyxxxxxx          | zzzzyyyy yyxxxxxx                   | 1110zzzz | 10yyyyyy | 10xxxxxx |
000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx
- Where uuuuu = wwww + 1 (to account for addition
of 10000₁₆ as in Section 3.7, Surrogates).
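The bit distribution in Table 3.1 can be written out directly with shifts and masks. The following Python sketch is one way to do so; `encode_scalar` and `surrogate_pair_to_scalar` are hypothetical names for this example, and the second function shows the addition of 10000₁₆ that relates the UTF-16 column to the scalar value.

```python
def encode_scalar(cp: int) -> bytes:
    """Encode one Unicode scalar value following the rows of Table 3.1."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([        # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def surrogate_pair_to_scalar(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair; adding 0x10000 yields uuuuu = wwww + 1."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The surrogate pair <D834 DD1E> denotes U+1D11E, serialized as <F0 9D 84 9E>.
scalar = surrogate_pair_to_scalar(0xD834, 0xDD1E)
encoded = encode_scalar(scalar)
```

Because each range of scalar values is routed to exactly one row, the encoder always emits the shortest form for the value it is given.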
Delete the two text paragraphs after Table 3.1. (The relevant
portions have been elevated into definitions or conformance clauses.)
When converting a Unicode scalar value to UTF-8, the
shortest form that can represent those values shall be used. This
practice preserves uniqueness of encoding. For example, the Unicode
binary value <0000000000000001> is encoded as <00000001>, not as
<11000000 10000001>. The latter is an example of an irregular UTF-8
byte sequence. Irregular UTF-8 sequences shall not be used for
encoding any other information.
When converting from UTF-8 to a Unicode scalar value,
implementations do not need to check that the shortest encoding is
being used. This simplifies the conversion algorithm.
Replace them by the following table and text:
Table 3.1B. Legal UTF-8 Byte Sequences
Code Points        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
U+0000..U+007F     | 00..7F   |          |          |
U+0080..U+07FF     | C2..DF   | 80..BF   |          |
U+0800..U+0FFF     | E0       | A0..BF   | 80..BF   |
U+1000..U+FFFF     | E1..EF   | 80..BF   | 80..BF   |
U+10000..U+3FFFF   | F0       | 90..BF   | 80..BF   | 80..BF
U+40000..U+FFFFF   | F1..F3   | 80..BF   | 80..BF   | 80..BF
U+100000..U+10FFFF | F4       | 80..8F   | 80..BF   | 80..BF
Table 3.1B lists all of the byte sequences that are legal in
UTF-8. A range of byte values such as A0..BF indicates that any byte
from A0 to BF (inclusive) is legal in that position. Any byte value
outside of the ranges listed is illegal. For example, the byte sequence
<C0, AF> is illegal since C0 is not legal in the 1st Byte column.
The byte sequence <E0, 9F, 80> is illegal since in the row where
E0 is legal as a first byte, 9F is not legal as a second byte. The byte
sequence <F4, 80, 83, 92> is legal, since every byte in that
sequence matches a byte range in a row of the table (the last row).
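Table 3.1B lends itself to a table-driven check. The Python sketch below transcribes the rows as byte ranges and tests whether an entire byte string consists of legal sequences; `LEGAL_ROWS` and `is_legal_utf8` are hypothetical names for this example, not part of any library.

```python
# Rows of Table 3.1B: one (low, high) range per byte position.
LEGAL_ROWS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_legal_utf8(data: bytes) -> bool:
    """True if data is a concatenation of sequences that each match a table row."""
    i = 0
    while i < len(data):
        for row in LEGAL_ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
            ):
                i += len(row)  # consume the matched sequence
                break
        else:
            return False  # no row matched at offset i
    return True


# The three examples from the text.
c0_af = is_legal_utf8(bytes([0xC0, 0xAF]))                # illegal first byte
e0_9f_80 = is_legal_utf8(bytes([0xE0, 0x9F, 0x80]))       # illegal second byte after E0
f4_row = is_legal_utf8(bytes([0xF4, 0x80, 0x83, 0x92]))   # matches the last row
```

The first-byte ranges of the rows do not overlap, so at most one row can match at any offset. Note that the E1..EF row, as given in this table, also matches three-byte encodings of surrogate code points; those are treated separately as irregular sequences under D36(c), and Python's built-in strict decoder rejects them.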
Add to Appendix C: Relationship to ISO/IEC 10646, Section C.3:
UCS Transformation Formats, at the end of the subsection UTF-8:
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also
allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character set;
those five- and six-byte sequences are illegal for the use of UTF-8 as
a transformation of Unicode characters. ISO/IEC 10646 does not allow
mapping of unpaired surrogates, nor U+FFFE and U+FFFF (but it does
allow other
noncharacters).