Technical Reports | |
Editors | Ken Whistler ([email protected]), Asmus Freytag ([email protected]) |
Date | 2022-11-09 |
This Version | https://www.unicode.org/reports/tr23/tr23-15.html |
Previous Version | https://www.unicode.org/reports/tr23/tr23-13.html |
Latest Version | https://www.unicode.org/reports/tr23/ |
Revision | 15 |
This document presents a conceptual model of character properties defined in the Unicode Standard. The model also covers properties for enumerated character sequences as well as string functions.
This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
This report presents a general overview and typology of character properties and property values, as well as those of properties of enumerated character sequences and string functions. This description of the Unicode character property model is not intended to supersede the normative information on properties in The Unicode Standard [Unicode], nor the existing body of technical reports and documentation files in the Unicode Character Database [UCDDoc] that provide detailed descriptions for particular character properties or properties of enumerated character sequences and string functions. Instead it focuses on the overall model behind and common aspects of all of these.
This report specifically covers formal character properties, which are those attributes of characters specified according to the definitions set forth in this report. Such formal character properties are only a subset of character properties in the generic sense, and they further subdivide into those properties defined in the Unicode Standard or Unicode Character Database, and those defined by related standards. Also included in the scope are formal properties of enumerated character sequences and string functions.
At its most basic, a character property relates a character to a value. Thus, a property can be considered a function that maps from code points to specific property values. These concepts can be readily extended to mapping a specific sequence of characters to a property value, or to generic string functions that algorithmically map arbitrary strings or substrings to property values. To keep the discussion simple, the basic concepts are introduced in the context of properties of individual characters or code points.
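The view of a property as a function from code points to values can be sketched in a few lines. This is only an illustration, assuming Python's standard `unicodedata` module as the source of property data; the wrapper names `general_category` and `numeric_value` are hypothetical and not defined by the Unicode Standard.

```python
import unicodedata

def general_category(cp: int) -> str:
    """Map a code point to its General_Category value (short alias)."""
    return unicodedata.category(chr(cp))

def numeric_value(cp: int):
    """Map a code point to its numeric value, or None where none is defined."""
    return unicodedata.numeric(chr(cp), None)

# Each call relates one code point to one property value.
category_of_g = general_category(0x0047)   # LATIN CAPITAL LETTER G
value_of_five = numeric_value(0x0035)      # DIGIT FIVE
value_of_g = numeric_value(0x0047)         # a letter has no numeric value
```

Returning `None` for the letter is one possible way to model an "inapplicable" value; an implementation could equally choose a dedicated sentinel.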
The Unicode Standard views character semantics as inherent to the definition of a character, and conformant processes are required to take these into account when interpreting characters.
D3 Character semantics: The semantics of a character are determined by its identity, normative properties, and behavior.
Note: Quotations from the core specification of the Unicode Standard are cited in this indented boxed style for clarity. Definition numbers or conformance clause numbers in those citations are as in the core specification.
The assignment of character semantics in the Unicode Standard is based on character behavior. Other character set standards leave it to the implementer, or to unrelated secondary standards, to assign character semantics to characters. In contrast, the Unicode Standard supplies a rich set of character attributes, called properties, for each character contained in it. Many properties are specified in relation to processes or algorithms that interpret them, in order to implement the character behavior. There are character behaviors that are specific to a particular text process and that have not been formally defined in the Unicode Standard. Implementations often provide internal definitions of character properties to achieve the desired behavior. Implementers may find many of the concepts discussed here applicable to such cases.
The interpretation of some properties (such as whether a character is a digit or not) is largely independent of context, whereas the interpretation of others (such as directionality) is applicable to a character sequence as a whole, rather than to the individual characters that compose the sequence.
Other examples that require context include title casing, and the classification of punctuation or symbols for script assignments. The line breaking rules of UAX #14 Unicode Line Breaking Algorithm [LineBreak] involve character pairs and triples, and in certain cases, longer sequences. The glyph(s) defined by a combining character sequence are the result of contextual analysis in the display shaping engine. Isolated character properties typically only tell part of the story. Characters that are constituent elements of an enumerated list of character sequences obviously exist in the context of such sequences. However, the property defined for specific, enumerated lists of sequences discussed below is different from the kind of algorithmic context discussed here. In fact, algorithms may be defined to evaluate the contexts surrounding not only individual characters or code points, but also the context surrounding certain enumerated character sequences.
In some cases, the expected character behavior depends on external context, such as the type and nature of the document, the language of the text, or the cultural expectations of the user. Properties modeling such behaviors may be specified in separate standards, as is the case for the UTS #10 Unicode Collation Algorithm [UCA]. Where a reasonably generic set of property values can be assigned, for example for [LineBreak], such properties may be defined as part of [Unicode]. Such properties and any algorithms related to them define useful default behavior, which can be further customized or tailored to meet more specific requirements.
When modeling character behavior with computer processes, formal character properties are assigned to achieve the expected results. Such modeling depends heavily on the algorithms used to produce these results. In some cases, a given character property is specified in close conjunction with a detailed specification of an algorithm. In other cases, algorithms are implied but not specified, or there are several algorithms that can make use of the same general character property, such as the classification of characters by General_Category or Indic_Syllabic_Category. Such general properties may require occasional implementation-specific adjustments in character property assignment to make all algorithms work correctly. This can usually be achieved by overriding specific properties for specific algorithms. (See also Section 4.3, "Overriding Properties via Higher-level Protocols".)
When assigning character properties for use with a given algorithm, it may be tempting to assign somewhat arbitrary values to some characters, as long as the algorithm happens to produce the expected results. Proceeding in this way hides the nature of the character and limits the re-use of character properties by related processes. Therefore, instead of tweaking the properties to simply make a particular algorithm easier, the Unicode Standard pays careful attention to the essential underlying linguistic identity of the character. However, not all aspects of a character’s identity are relevant in all circumstances, and some characters can be used in many different ways, depending on context or circumstance. This means the formal character properties alone are not sufficient to describe the complete range of desirable or acceptable character behaviors.
Note: In some cases, the relevant algorithm is not defined in the Unicode Standard. For example, the algorithm that converts strings of digits into numerical values is not defined in the Unicode Standard, but implementations will nevertheless refer to the Numeric_Value property.
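As an illustration of such an implementation-defined algorithm, the following sketch converts a string of decimal digits from any script into an integer. It assumes Python's `unicodedata.decimal`, which exposes the numeric value of decimal digit characters; the function name `parse_decimal` and its error handling are hypothetical choices, not anything specified by the Unicode Standard.

```python
import unicodedata

def parse_decimal(digits: str) -> int:
    """Convert a string of decimal digit characters to an integer.

    The positional base-10 algorithm lives in this function; only the
    per-character digit values come from the property data.
    """
    if not digits:
        raise ValueError("empty digit string")
    value = 0
    for ch in digits:
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            raise ValueError(f"not a decimal digit: U+{ord(ch):04X}")
        value = value * 10 + digit
    return value

# ASCII digits and ARABIC-INDIC DIGIT ONE..THREE denote the same number.
ascii_result = parse_decimal("123")
arabic_indic_result = parse_decimal("\u0661\u0662\u0663")
```

This sketch deliberately ignores signs, separators, and mixed-script checks, which a production parser would need to decide on.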
Code point properties are properties of code points per se: in a character encoding standard these are independent of any assignment of actual abstract characters to those code points. In most character encoding standards, these are trivial, but in the Unicode Standard they are not.
Examples of code point properties include:
These statements remain true of a code point whether or not a particular abstract character is assigned to it. For example, they track the status of code points: whether any abstract character is assigned to them or can be assigned to them, and so on. Essentially, whenever code points are designated or ranges are reserved in some way, code point properties are assigned.
Character properties are those properties that abstract characters have independent of any consideration of their encoding.
Examples of character properties, not limited to formal properties, include:
By virtue of encoding the abstract character LATIN CAPITAL LETTER G at the code point U+0047, this universe of character properties, some known and obvious, others obscure or even undiscovered, is associated with that code point.
Some of those character properties are generic and systematic enough to be useful or even necessary in the implementation of general text processing algorithms — those are the ones that the Unicode Standard formalizes as properties in the Unicode Character Database.
General text processing algorithms and the programming APIs through which they are accessed must be prepared to deal with any code point, even one that was not assigned to any character at the time the implementation was created. As a result, they nearly always need to properly handle each and every code point for any character property, even if they only associate a property value of 'unknown' or 'inapplicable' with unassigned or unsupported code points.
This requirement leads to the use of the unifying concept of Encoded Character Property in the Unicode character property model. An encoded character property combines the concept of a code point property associating ranges of code points with default values of a property, with the concept of a character property associating specific values to the assigned characters. This unified model correlates well with the reality of Unicode-based implementations, which must supply some value for each and every code point. In addition, this unified concept simplifies most of the definitions that are built on top of it, since it is no longer necessary to separately account for definitions applying to character properties vs. code point properties.
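A minimal sketch of an encoded character property, assuming Python's `unicodedata` module as the data source: the lookup is total over the code space, returning assigned values for encoded characters and default values elsewhere. The function names and the `"<no name>"` placeholder are hypothetical.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF

def encoded_general_category(cp: int) -> str:
    """Return a General_Category value for every code point.

    Unassigned code points and noncharacters receive the default value
    Cn, and surrogate code points receive Cs, so callers never need a
    separate 'is this assigned?' check before asking.
    """
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    return unicodedata.category(chr(cp))

def encoded_name(cp: int, default: str = "<no name>") -> str:
    """Return the character name, or a caller-supplied default value."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    return unicodedata.name(chr(cp), default)
```

For example, `encoded_general_category(0x10FFFF)` yields a value for a noncharacter just as `encoded_general_category(0x0047)` does for an assigned letter, which is the unification the model describes.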
Character and code point properties are defined such that all assigned characters and all code points have a defined property value, even if that value is "N/A" ("does not apply"). Assigned characters and code points each form a finite set. This is generally not true for strings. Because there is no inherent, fixed limit to the length of a string, the number of possible sequences is in principle not bounded. Some properties for strings can be described algorithmically, via String Functions, and such properties can be said to apply to every possible string. Other properties apply only to a specific set of strings which is listed explicitly. In this latter case, the properties are referred to as properties of an enumerated set of strings. These concepts are elaborated below in Section 3.6, Strings, and Section 3.7, Properties of Strings.
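The difference between a property of an enumerated set of strings and a string function can be sketched as follows. The table contents and labels are illustrative placeholders rather than data copied from the Unicode Character Database, and the names `enumerated_label` and `count_marks` are hypothetical.

```python
import unicodedata

NOT_APPLICABLE = "N/A"

# Hypothetical enumerated set: the property is defined only for the
# strings listed here. Keys and labels are placeholders.
ENUMERATED_SEQUENCE_LABEL = {
    "\u012B\u0300": "example-sequence-1",
    "#\uFE0F\u20E3": "example-sequence-2",
}

def enumerated_label(s: str) -> str:
    """Property of an enumerated set of strings: a finite lookup table."""
    return ENUMERATED_SEQUENCE_LABEL.get(s, NOT_APPLICABLE)

def count_marks(s: str) -> int:
    """String function: computed algorithmically for any string at all."""
    return sum(1 for ch in s if unicodedata.category(ch).startswith("M"))
```

The table lookup has a meaningful answer only for listed strings, whereas `count_marks` is defined for every string, including ones of arbitrary length.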
In Chapter 3, Conformance, The Unicode Standard [Unicode] defines a Normative Property as "a Unicode character property used in the specification of the standard" (definition D33) and provides the following explanation:
Specification that a character property is normative means that implementations which claim conformance to a particular version of the Unicode Standard and which make use of that particular property must follow the specifications of the standard for that property for the implementation to be conformant. For example, the Bidi_Class property is required for conformance whenever rendering text that requires bidirectional layout, such as Arabic or Hebrew.
Whenever a normative process depends on a property in a specified way, that property is designated as normative.
The fact that a given Unicode character property is normative does not mean that the values of the property will never change for particular characters. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes...
Some of the normative Unicode algorithms depend critically on particular property values for their behavior. Normalization, for example, defines an aspect of textual interoperability that many applications rely on to be absolutely stable. As a result, some of the normative properties disallow any kind of overriding by higher-level protocols. Thus the decomposition of Unicode characters is both normative and not overridable; no higher-level protocol may override these values, because to do so would result in non-interoperable results for the normalization of Unicode text. Other normative properties, such as case mapping, are overridable by higher-level protocols, because their intent is to provide a common basis for behavior. Nevertheless, they may require tailoring for particular local cultural conventions or particular implementations.
By making a property normative and non-overridable, the Unicode Standard guarantees that conformant implementations can rely on other conformant implementations to interpret the character in the same way. This is most useful for those properties where the Unicode Standard provides precise rules for the interpretation of characters based on their properties, such as the decompositions and their use by the Normalization forms [Normal].
Note: One trivial, but important example of conformant implementation is runtime access to information from the Unicode Character Database [UCD]. For normative properties exposed by a conformant implementation, conformance requires the returned values to match the values defined by the Unicode Consortium.
For some character properties, such as the general category, the Unicode Standard does not define what model of processing the property is intended to support, nor does it specify the required consequences of a character being defined as "Letter Other" as opposed to "Symbol Other", for example. In the absence of such definition, the only effect of conformance that can be rigorously tested is whether a conformant implementation of a character property function returns the correct value to its caller. However, many implementations use such normative properties for their own purposes and guaranteed access to this information helps interoperability.
For information on which properties are normative, see the documentation file for the Unicode Character Database [UCDDoc].
For more information on overriding normative properties, see Section 4.3 Overriding properties via Higher-level Protocols.
The Unicode Standard [Unicode] defines an Informative Property as "a Unicode character property whose values are provided for information only" (definition D35) and provides the following explanation:
A conformant implementation is free to use or change informative property values as it may require, while remaining conformant to the standard. An implementer has the option of establishing a protocol to convey that particular informative properties are being used in distinct ways.
Informative properties capture expert implementation experience. When an informative property is explicitly specified in the Unicode Character Database, its use is strongly recommended for implementations to encourage comparable behavior between implementations. Note that it is possible for an informative property in one version of the Unicode Standard to become a normative property in a subsequent version of the standard if its use starts to acquire conformance implications in some part of the standard. [emphasis added].
Properties may be informative for two main reasons:
In some cases, properties are too tentative to be published as informative properties. In that case they may be explicitly designated as provisional.
The Property Aliases [Alias] and Property Value Aliases [ValueAlias] define a set of names and abbreviations, called aliases, that are used to refer to properties and property values. These names can be used for XML formats of data in the Unicode Character Database [UCD], for regular-expression property tests, and other programmatic textual descriptions of Unicode data. The names themselves are not normative, except where they correspond to normative properties in the UCD. However, other standards may make normative references to both normative and informative aliases. For more information, see UTS #18: Unicode Regular Expressions [RegEx].
There is one abbreviated name and one long name for most of the properties. Additional aliases may be added at any time. The property value names are not unique across properties. For example, AL means Arabic Letter for the Bidi_Class property, AL means Above_Left for the Combining_Class property, and AL means Alphabetic for the Line_Break property. In addition, some property names may be the same as some property value names. For example, cc means the Combining_Class property, and cc also means the General_Category property value Control. The combination of property value and property name is, however, unique.
The aliases may be translated in appropriate environments, and additional aliases may be used. The case distinctions, whitespace, and '_' in the property names are not normative. Unless a specific form is required in a particular application, all forms are equivalent. For further information see Section 5.9 Matching Rules in UAX #44 Unicode Character Database [UCDDoc].
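The alias rules above can be sketched as a small lookup. This is an assumption-laden illustration: `loose_key` ignores only case, whitespace, and underscores, as stated in this section (the full matching rules in [UCDDoc] cover more cases), and the two tables are a tiny hand-written excerpt, not the complete alias files.

```python
def loose_key(alias: str) -> str:
    """Normalize an alias by ignoring case, whitespace, and '_'."""
    return "".join(
        ch for ch in alias if not ch.isspace() and ch != "_"
    ).casefold()

# Illustrative excerpt: short and long property names map to one key.
PROPERTY_ALIASES = {
    "bc": "bidiclass", "bidiclass": "bidiclass",
    "lb": "linebreak", "linebreak": "linebreak",
    "gc": "generalcategory", "generalcategory": "generalcategory",
}

# Value aliases are keyed by (property, value): "AL" alone is ambiguous,
# but the combination of property and value is unique.
VALUE_ALIASES = {
    ("bidiclass", "al"): "Arabic_Letter",
    ("linebreak", "al"): "Alphabetic",
    ("generalcategory", "cc"): "Control",
}

def resolve(property_alias: str, value_alias: str) -> str:
    """Resolve a property/value alias pair to a long value name."""
    prop = PROPERTY_ALIASES[loose_key(property_alias)]
    return VALUE_ALIASES[(prop, loose_key(value_alias))]
```

With these tables, `resolve("Bidi_Class", "AL")` and `resolve("line break", "al")` give different answers for the same value alias, which is why the lookup key has to include the property.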
[Unicode] Section 3.1 gives a prescription for referencing properties:
References to Unicode Character Properties
Properties and property values have defined names and abbreviations, such as
Property: General_Category (gc)
Property Value: Uppercase_Letter (Lu)
To reference a given property and property value, these aliases are used, as in this example:
The property value Uppercase_Letter from the General_Category property, as specified in Version 14.0.0 of the Unicode Standard.
Then cite that version of the standard, using the standard citation format that is provided for each version of the Unicode Standard.
Additional reference examples are available online.
The Unicode Character Database [UCD] is the main repository for machine-readable character properties. It consists of a number of files containing property data along with a documentation file explaining the organization of the database and the format and meaning of the property data. The main documentation file, "The Unicode Character Database" [UCDDoc], explains the overall organization of the current version of the UCD and tells which files contain which properties.
While the Unicode Consortium strives to minimize changes to character property data, occasionally the character properties for already encoded characters must be updated. When this situation occurs, the relevant data files of the Unicode Character Database are revised. The revised data files are posted on the Unicode Web site as an update version of the standard.
Visual documentation of each character's code point, character name, and reference glyph, together with excerpts from some of the character properties and augmented by additional annotations, can be found in the Character Code Charts [Charts].
In the rest of this document, as in the Unicode Standard, the term 'character property', or the term 'property' without qualifier includes both character and code point properties and their combined form, the encoded character properties.
Note: Properties classed in [UCDDoc] as type "String-valued" are string-valued properties. However, some properties classed as "Miscellaneous" are also string-valued properties.
Note: Actually, some properties classed in [UCDDoc] as type "Miscellaneous" can also be considered string-valued properties. The Jamo_Short_Name property is such an example. The distinction is that most properties currently designated to be of type "String-valued" are conceived of as mapping from some Unicode character to some other Unicode character (or sequence of characters) for the purposes of such operations as case mapping, case folding, or normalization of strings, whereas the string values of Miscellaneous properties tend to be just arbitrary strings.
The following definitions do not define character or code point properties, but properties of such properties. In the definitions in this section, the term 'code point' is used inclusively to mean code point for a code point property and character for a character property, respectively.
This section introduces definitions for strings, which are needed for the discussion of properties of strings and the role of string functions in the character property model.
The following three string-related definitions are specified in Chapter 3, Conformance, of the Unicode Standard [Unicode].
Those definitions were originally developed to focus on the identity of encoded characters and of sequences of encoded characters, in the context of specifying Unicode encoding forms and other concepts of the Unicode Standard. As such, the formal definitions do not include zero-length sequences as part of their definitions. Where these definitions are used in Chapter 3, the absence of a character is generally not pertinent to the explication.
In programming contexts, however, strings are almost always defined to include the empty string as part of the class or type definition. This is more elegant for implementations of strings and for the design of string-based APIs, including those supporting the implementation of character properties. This distinction is important for the discussion of the Unicode character property model. When the concept of character properties is extended to deal with the properties of Unicode strings, as well as single characters, implementations need to take the empty string into account.
In the Unicode character property model, the primary concern is with properties of characters (or code points), rather than the very limited concept of properties which might apply directly to code units. To avoid clumsiness of terminology, instead of using the formal definition, "coded character sequence," the term Unicode string is simply stipulated, in this context, to also refer to a coded character sequence, instead of only to a code unit sequence.
Furthermore, in the subsequent discussion of properties of strings, for simplicity of presentation, any mention of a Unicode string is also stipulated to extend to include the empty string.
None of the following definitions are found in the Unicode Standard at this point; they extend the existing definitions to cover properties for character sequences.
None of the following definitions is found in the Unicode Standard at this point; however, they are useful in the context of discussing Unicode algorithms and their relation to properties.
Dealing with offsets at the level of code units is the concern of lower-level implementation processes, which must deal with the details of character encoding forms. For the purposes of the character property model, strings are simply defined abstractly in terms of encoded character sequences and code points.
The notation toX(s) may be used for the folding, and isX(s) for the corresponding binary function, defined such that isX(s) if and only if toX(s) = s. For example, toNFC() is the folding that converts to NFC format, while isNFC() is the test for whether a string is in that format.
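The toX/isX relationship can be sketched directly, assuming Python's `unicodedata.normalize` as the folding. The names `to_nfc` and `is_nfc` are hypothetical spellings of toNFC() and isNFC().

```python
import unicodedata

def to_nfc(s: str) -> str:
    """The folding toNFC(s): convert a string to NFC."""
    return unicodedata.normalize("NFC", s)

def is_nfc(s: str) -> bool:
    """The binary function isNFC(s), defined as toNFC(s) == s."""
    return to_nfc(s) == s

decomposed = "A\u0301"   # A followed by COMBINING ACUTE ACCENT
composed = to_nfc(decomposed)
```

Defining `is_nfc` in terms of `to_nfc` mirrors the definition above; an implementation is free to use a faster test as long as it returns the same results. Both functions also accept the empty string, which is trivially in NFC.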
This technical report does not define conformance requirements, but the following subsections discuss and summarize the conformance requirements related to character properties stated in the Unicode Standard. Where applicable, the number of the corresponding conformance clause or definition is given in square brackets.
In Chapter 3, Conformance, The Unicode Standard [Unicode] states that "A process shall interpret a coded character sequence according to the character semantics established by this standard, if that process does interpret that coded character sequence." [C4] The semantics of a character are established by taking its coded representation, character name and representative glyph in context and are further defined by its normative properties and behavior. Neither character name nor representative glyphs can be relied upon absolutely; a character may have a broader range of use than the most literal interpretation of its character name, and the representative glyph is only indicative of one of a range of typical glyphs representing the same character.
Unicode algorithms are specified as an idealized series of steps (rules) performed on an input of character codes and their associated properties. [Unicode] states:
- An implementation claiming conformance to a Unicode algorithm need only guarantee that it produces the same results as those specified in the logical description of the process; it is not required to follow the actual described procedure in detail. This allows room for alternative strategies and optimizations in implementation. See [C18].
As long as the same results are achieved, the implementation is also not required to use the actual properties published in the [UCD]. Overriding a property value therefore does not necessarily imply an actual change in property assignments, merely that the conformant implementation of an algorithm now produces the same results as if the property values had been changed in the description of the ideal algorithm.
In discussing character semantics, the Unicode Standard [Unicode] makes this statement about overriding properties and character behavior:
Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics. See [D3].
Overrides by a higher-level protocol can conceptually take many forms, including, but not limited to:
Where overrides involve normative properties, specific restrictions apply, for example:
• The character combination properties and the canonical ordering behavior cannot be overridden by higher-level protocols. See [D3].
For additional examples of higher-level protocols as well as restrictions on them see section 4.3 in UAX #9: Unicode Bidirectional Algorithm [Bidi]. There are some normative properties that are fully overridable, for example General Category.
On the other hand, any and all informative properties may be overridden. However, if doing so changes the result of a Unicode Algorithm, any implementation wishing to conform to that algorithm must indicate that overrides have been applied.
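One way to picture an override by a higher-level protocol is a lookup that consults a tailoring table before the default data. This sketch assumes Python's `unicodedata` for the defaults; the class `GeneralCategoryLookup`, its `is_tailored` flag, and the example tailoring are all hypothetical.

```python
import unicodedata

class GeneralCategoryLookup:
    """General_Category lookup with optional per-character overrides."""

    def __init__(self, overrides=None):
        # Copy the tailoring so later changes by the caller have no effect.
        self._overrides = dict(overrides or {})

    def __call__(self, ch: str) -> str:
        # An override wins; otherwise fall back to the default value.
        return self._overrides.get(ch, unicodedata.category(ch))

    @property
    def is_tailored(self) -> bool:
        """True when overrides are in effect, so callers can disclose it."""
        return bool(self._overrides)

default_lookup = GeneralCategoryLookup()
# Hypothetical tailoring: treat ZERO WIDTH SPACE as a space separator.
tailored_lookup = GeneralCategoryLookup({"\u200B": "Zs"})
```

The published property data are untouched; only the results seen by the algorithm using `tailored_lookup` change, and the `is_tailored` flag gives the implementation a hook for indicating that overrides have been applied.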
Updates to properties of the Unicode Character Database can be required for three reasons:
While the Unicode Consortium endeavors to keep the values of all character properties as stable as possible, some circumstances may arise that require changing them. Changing a character's property assignment may impact existing implementations and is therefore done judiciously and with great care, only when there is no better alternative.
In particular, as Unicode encodes less well-documented scripts, such as those for minority languages, the exact character properties and behavior may not be known when the script is first encoded. The properties for such characters are expected to be changed as information becomes available.
As implementation experience grows, it may become necessary to readjust property values. As much as possible, such readjustments are compatible with established practice. Occasionally, a character property is changed to prevent incorrect generalizations of a character's use based on its nominal property values. For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General Category=Zs), but is now classified as a Format Control (gc=Cf) to distinguish this line break control from space characters.
In other cases, there may have been unintentional mistakes in the original information that require corrections.
The [UTC] carefully weighs the costs of a change against the benefit of the correction. In addition, all updates to properties are subject to the stability guarantees described in the next section.
Unicode guarantees the stability of character assignments; that is, the identity of a character encoded at a given location will remain the same. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.
For example, the representative glyph for U+0041 "A" could not be changed to "B"; the general category for U+0041 "A" could not be changed to Ll (lowercase letter); and the decomposition mapping for U+00C1 (Á) could not be changed to <U+0042, U+0301> (B, ´).
In addition, for some properties, one or more of the following aspects are guaranteed to be invariant:
For the most up-to-date specification of all stability guarantees in effect see the Unicode Character Encoding Stability Policy [Stability]. Note that the status of a property as normative does not imply a stability guarantee.
Stability of assignment is the characteristic of an immutable property. For example, once a character is encoded, its code point and name are immutable properties. An immutable property allows software and documents to refer to its values without needing to track future updates to the Standard. One side effect of an immutable property is that errors in property values cannot be fixed. For example, mistakes in naming are annotated in the Unicode character names list in a note or by using an alias, but the formal name remains unchanged, even in cases of clear-cut typographical errors.
Because Code_Point is an immutable property, if a character is ever found to be unnecessary, or a mistaken duplicate of an existing character, it will not be removed. Instead, it can be given an additional property, deprecated, and its use strongly discouraged. However, the interpretation of all existing documents containing the character remains the same.
Stability of result is the characteristic of a stable property. For example, once a character is encoded, its canonical combining class and decomposition (canonical or compatibility) are stable with respect to normalization. Stability with respect to normalization is defined in such a way that if a string contains only characters from a given version of the Unicode Standard (say Unicode 3.2), and it is put into a normalized form in accordance with that version of Unicode, then it will be in normalized form when normalized according to any future version of Unicode.
However, unlike character code and character name, some properties that are guaranteed to be stable may be corrected in exceptional circumstances that are clearly defined by the Unicode Character Encoding Stability Policy [Stability]. In addition to other requirements, the correction must be of an obvious mistake, such as a typographical error, and any alternative would have to violate the stability of the identity of the character in question. Allowing such carefully restricted exceptions obviates the need for encoding duplicate characters simply to correct clerical or other clear-cut errors in property assignments.
For most properties, additional property values may be created and assigned to both new and existing characters. For example, additional line breaking classes will be assigned if characters are discovered to require line breaking behavior that cannot be expressed with the existing set of classes. For other properties, the set of values is guaranteed to be fixed, or their range is limited. For example, the set of values for the General_Category or Bidirectional_Class is fixed, while combining classes are limited to the values 0 to 254.
For example, all characters other than those of General Category M* have the combining class 0.
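Both constraints on combining classes (the range 0 to 254, and class 0 for everything outside General Category M*) can be checked against the data shipped with Python. This is a rough sketch; the function name `combining_class_violations` is hypothetical, and the result reflects whichever Unicode version the local `unicodedata` module embeds.

```python
import sys
import unicodedata

def combining_class_violations():
    """Scan every code point and collect any that break either constraint."""
    bad = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        ccc = unicodedata.combining(ch)
        if not 0 <= ccc <= 254:
            bad.append((cp, "combining class out of range"))
        elif ccc != 0 and not unicodedata.category(ch).startswith("M"):
            bad.append((cp, "non-mark with nonzero combining class"))
    return bad

print(len(combining_class_violations()))
```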
In principle, the way the property information is presented in the Unicode Character Database is independent of the way this information is defined. However, as the Unicode Standard gets updated, it becomes easier for implementations to track updates if file formats remain unchanged and other aspects of the way the data are organized can remain stable. For the majority of properties, such stability is an informal goal of the development process, but in a few cases, some aspects of the data organization are covered by formal stability guarantees.
For example, Canonical and Compatibility mappings are always in canonical order, and the resulting recursive decomposition will also be in canonical order. Canonical mappings are also always limited either to a single value or to a pair. The second character in the pair cannot itself have a canonical mapping.
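The shape constraints on canonical mappings can be sketched with `unicodedata.decomposition()`, which returns the raw mapping as space-separated hexadecimal code points and prefixes compatibility mappings with a tag such as `<compat>`. The helper names below are hypothetical.

```python
import unicodedata

def canonical_mapping(ch):
    """Return the canonical mapping of ch as a list of characters, or None
    if ch has no mapping or only a (tagged) compatibility mapping."""
    raw = unicodedata.decomposition(ch)
    if not raw or raw.startswith("<"):
        return None
    return [chr(int(cp, 16)) for cp in raw.split()]

def obeys_shape_rules(ch):
    """Check: one or two code points; the second of a pair has no mapping."""
    mapping = canonical_mapping(ch)
    if mapping is None:
        return True
    if len(mapping) not in (1, 2):
        return False
    return len(mapping) == 1 or canonical_mapping(mapping[1]) is None

print(canonical_mapping("\u00C5"))   # pair: A + combining ring above
print(canonical_mapping("\u212B"))   # singleton: ANGSTROM SIGN -> U+00C5
```

Because the second element of a pair never decomposes further, a recursive decomposition only ever has to descend into the first element.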
As an alternative to the legacy conventions of semicolon-separated text files, the Unicode Character Database is now also available as a single XML file. See UAX #42 Unicode Character Database in XML [XML].
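To illustrate that the property values are independent of the file format, the sketch below reads the same character from a `UnicodeData.txt`-style line and from an abbreviated XML fragment. The fragment is hand-written for this example and carries only a few attributes; the attribute names (`cp`, `na`, `gc`, `ccc`, `bc`) and the namespace follow our reading of UAX #42 and should be checked against [XML] before use.

```python
import xml.etree.ElementTree as ET

UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# One line in the legacy semicolon-separated layout of UnicodeData.txt.
LEGACY_LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

# Abbreviated, hand-written fragment; real <char> elements have many more attributes.
XML_FRAGMENT = (
    '<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>'
    '<char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" ccc="0" bc="L"/>'
    "</repertoire></ucd>"
)

def from_legacy(line):
    # Fields 0-4: code point, name, General_Category, combining class, Bidi class.
    f = line.split(";")
    return {"cp": int(f[0], 16), "name": f[1], "gc": f[2], "ccc": int(f[3]), "bc": f[4]}

def from_xml(fragment):
    char = ET.fromstring(fragment).find(f"{UCD_NS}repertoire/{UCD_NS}char")
    return {"cp": int(char.get("cp"), 16), "name": char.get("na"),
            "gc": char.get("gc"), "ccc": int(char.get("ccc")), "bc": char.get("bc")}

print(from_legacy(LEGACY_LINE) == from_xml(XML_FRAGMENT))
```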
In an ideal world, all character properties would be perfectly self-consistent, and related properties would be consistent with each other over the entire range of code points. However, the Unicode Standard is the product of many compromises. It has to strike a balance between uniformity of treatment for similar characters, and compatibility with existing practice for characters inherited from legacy encodings. Because of this balancing act, one can expect a certain number of anomalies in character properties.
Sometimes it may be advantageous for an implementation to purposefully override some of the anomalous property values, increasing the efficiency and uniformity of algorithms, as long as the results they produce do not conflict with those specified by the normative properties of this standard. See Chapter 4, Character Properties in [Unicode] for some examples.
Property values assigned to new characters added to the Unicode Standard are generally defined so that related characters are given consistent values, unless deliberate exceptions are needed. For some properties, definite links between that property and one or more other properties are defined. For example, for the Line_Break property, many line break classes are defined in relation to General Category values.
There are some properties that are interrelated or that are derived from a combination of other properties, with or without a list of explicit exceptions. When properties are assigned to newly assigned characters, or when properties are adjusted, it is necessary to take into account all existing relevant properties, any derivational relations to derived properties, and all property stability guarantees.
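A derived property of this kind can be sketched as a rule over another property plus an explicit list. The example below loosely follows the pattern used for Alphabetic (letter and letter-number categories plus a contributory list), but it is deliberately simplified: the two-entry exception set is a tiny hand-picked excerpt, and the function name `is_alphabetic_sketch` is hypothetical. It is not a substitute for the derived data files.

```python
import unicodedata

# General_Category values that contribute to the derived property.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Illustrative excerpt of an explicit list; the real list is far longer.
OTHER_ALPHABETIC_EXCERPT = {
    0x0345,  # COMBINING GREEK YPOGEGRAMMENI (General_Category Mn)
    0x24B6,  # CIRCLED LATIN CAPITAL LETTER A (General_Category So)
}

def is_alphabetic_sketch(ch):
    """Derived value = category rule OR membership in the explicit list."""
    return (unicodedata.category(ch) in ALPHABETIC_CATEGORIES
            or ord(ch) in OTHER_ALPHABETIC_EXCERPT)

for ch in ("A", "\u0345", "\u2160", "1"):
    print(f"U+{ord(ch):04X}", is_alphabetic_sketch(ch))
```

Changing a character's General_Category would silently change the derived value as well, which is why such dependencies have to be reviewed whenever property assignments are adjusted.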
Some of the information provided about characters in the Unicode Character Database constitutes provisional data. Provisional property data may capture partial or preliminary information. Such data may contain errors or omissions, or otherwise not be ready for systematic use; however, provisional property data are included in the data files for distribution partly to encourage review and improvement of the information. For example, a number of the tags in the Unihan database provide provisional property values of various sorts about Han characters.
Occasionally, as the standard matures, and new characters, properties or algorithms are defined, the information presented in an existing property may be better represented via other properties, or it may no longer make sense to extend the property to new characters. Such a property may then no longer be maintained in future versions of the Unicode Standard. In that case, it will be designated as stabilized. For backwards compatibility, a stabilized property will remain part of the Unicode Character Database, but will not be updated or corrected.
An example of a stabilized property is Hyphen.
Limited properties apply to only a subset of characters. Where these properties are implemented as a partition of the Unicode code space, the characters to which the property does not apply are given a special value denoting that the property does not apply. The "not applicable" value may be the explicit value "NA" or, for some properties, take other values such as "XX".
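A limited property implemented as a partition can be sketched as a function that returns a real value for the characters it covers and the "not applicable" value everywhere else. The example uses Hangul_Syllable_Type; the ranges are written from our understanding of the published data and should be verified against the Unicode Character Database before being relied on.

```python
def hangul_syllable_type(cp: int) -> str:
    """Return L, V, T, LV, or LVT for Hangul jamo and syllables, else "NA"."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"    # leading consonants
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"    # vowels
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"    # trailing consonants
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return "NA"       # the property does not apply to this code point

for cp in (0x1100, 0x1161, 0x11A8, 0xAC00, 0xAC01, 0x0041):
    print(f"U+{cp:04X}", hangul_syllable_type(cp))
```

Returning "NA" rather than raising an error keeps the function total over the code space, which is what makes it a partition.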
Implementations often need specific properties for all code points, including those that are unassigned. To meet this need, the Unicode Standard assigns default properties to ranges of unassigned code points.
All implementations of the Unicode Standard should endeavor to handle additions to the character repertoire gracefully. In some cases this may require that an implementation attempts to 'anticipate' likely property values for code points for which characters have not yet been defined, but where surrounding characters exist that make it probable that similar characters will be assigned to the code point in question.
There are three strategies:
Each of these strategies has advantages and drawbacks, and none can guarantee that the behavior of an implementation that is conformant to a prior version of the Unicode Standard will support characters added in a later version of the Unicode Standard in precisely the same way as an implementation that is conformant to the later version. The most that can be hoped for is that the earlier implementation will behave more gracefully in such circumstances.
In principle, default values are temporary: they are superseded by final assignments once characters are assigned to a given code point.
For noncharacter code points, a character property function would return the same value as the default value for unassigned characters.
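For General_Category, this behavior can be observed with Python's `unicodedata` module. This is a minimal sketch: it assumes that U+40000 is still unassigned in the Unicode version embedded in the local Python build, and the helper name `describe` is hypothetical.

```python
import unicodedata

def describe(cp):
    """Return (General_Category, name or None) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch, None)

# Two noncharacters and one code point assumed to be unassigned.
for cp in (0xFDD0, 0xFFFF, 0x40000):
    print(f"U+{cp:04X}", describe(cp))
```

All three report the default General_Category value Cn and have no name, so a caller cannot tell a noncharacter from an unassigned code point through this property alone.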
Sometimes, a determination and assignment of property values can be made, but the information on which it was based may be incomplete or preliminary. In such cases, the property value may be changed when better information becomes available. Currently, there is no machine-readable way to provide information about the confidence of a property assignment; however, the text of the Standard or a Technical Report defining the property may provide general indications of preliminary status of property assignments where they are known.
This is distinct from provisional properties, where the entire property is preliminary.
[Alias] | Property Aliases https://www.unicode.org/unicode/Public/UCD/latest/ucd/PropertyAliases.txt |
[Bidi] | Unicode Standard Annex #9: The Unicode Bidirectional Algorithm https://www.unicode.org/reports/tr9/ |
[Charts] | The online code charts can be found at https://www.unicode.org/charts/ An index to characters names with links to the corresponding chart is found at https://www.unicode.org/charts/charindex.html |
[EAW] | Unicode Standard Annex #11: East Asian Width https://www.unicode.org/reports/tr11/ |
[FAQ] | Unicode Frequently Asked Questions https://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary https://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[LineBreak] | Unicode Standard Annex #14: Unicode Line Breaking Algorithm https://www.unicode.org/reports/tr14/ |
[Normal] | Unicode Standard Annex #15: Unicode Normalization Forms https://www.unicode.org/unicode/reports/tr15/ |
[RegEx] | Unicode Technical Standard #18: Unicode Regular Expressions https://www.unicode.org/unicode/reports/tr18/ |
[Stability] | Unicode Character Encoding Stability Policy https://www.unicode.org/policies/stability_policy.html |
[UCA] | Unicode Technical Standard #10: Unicode Collation Algorithm https://www.unicode.org/reports/tr10/ |
[UCD] | About the Unicode Character Database https://www.unicode.org/ucd/ For an overview of the Unicode Character Database |
[UCDDoc] | Unicode Standard Annex #44: Unicode Character Database https://www.unicode.org/reports/tr44/ For documentation of the contents of the Unicode Character Database and its associated files |
[Unicode] | The Unicode Standard For the latest version see: https://www.unicode.org/versions/latest/ For Version 15.0 see: The Unicode Consortium. The Unicode Standard, Version 15.0.0 (Mountain View, CA: The Unicode Consortium, 2022. ISBN 978-1-936213-32-0). https://www.unicode.org/versions/Unicode15.0.0/ |
[Unihan] | Unicode Standard Annex #38: Unicode Han Database (Unihan) https://www.unicode.org/reports/tr38/ The database itself is available online at https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip (large download) |
[UTC] | The Unicode Technical Committee For more information see https://www.unicode.org/consortium/utc.html |
[UTS51] | Unicode Technical Standard #51: Unicode Emoji https://www.unicode.org/reports/tr51/ |
[ValueAlias] | Property Value Aliases https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt |
[XML] | Unicode Standard Annex #42: Unicode Character Database in XML https://www.unicode.org/reports/tr42/ The XML version of the database is available online at https://www.unicode.org/Public/UCD/latest/ucdxml/ |
Asmus Freytag was the initial author of this report, with additional content provided by Ken Whistler.
The editors wish to thank Mark Davis for his extensive contributions and insightful comments, and Dr. Julie Allen for extensive copy-editing. Ivan Panchenko provided a careful copyedit and list of typos to fix for Revision 15.
The following summarizes modifications from the previous version of this document.
Revision 15 [AF, KW]
Previous revisions can be accessed with the “Previous Version” link in the header.
© 2022 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.