ICU Data
Contents
- Overview
- ICU and CLDR Data
- ICU Data Directory
- Default ICU Data
- Building and Linking against ICU data
- Time Zone Data
- Application Data
- Alignment
- Flexibility vs. Installation vs. Performance
- How Data Loading Works
- User Data Caching
- Directory Separator Characters
- Sharing ICU Data Between Platforms
- Customizing ICU’s Data Library
- Customizing ICU’s Data Library for ICU 63 or earlier
- ICU Data File Formats
- Public Data Files
- ICU.dat package files
- Resource bundles
- Unicode conversion mapping tables
- Conversion (charset) aliases
- Unicode Character Data (Properties; for Java only: hardcoded in C common library)
- Unicode Character Data (Case mappings; for Java only: hardcoded in C common library)
- Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library)
- Unicode Character Data (Normalization since ICU 4.4) & custom normalization data
- Unicode Character Data (Character names)
- Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8)
- Unicode Character Data (Text layout properties since ICU 64)
- Unicode Character Data (Emoji properties since ICU 70)
- Collation data (root collation & tailorings; ICU 53 & later)
- Rule-based break iterator data
- Dictionary-based break iterator data (ICU 50 & later)
- Rule-based transform (transliterator) data
- Time zone data (ICU 4.4 & later)
- StringPrep profile data
- Confusables data
- Public Data Files (old versions)
- Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library)
- Unicode Character Data (Property [value] aliases before ICU 4.8)
- Collation data (UCA, code points to weights; ICU 52 & earlier)
- Collation data (Inverse UCA, weights->code points; ICU 52 & earlier)
- Dictionary-based break iterator data (ICU 49 & earlier)
- Time zone data (Before ICU 4.4)
- Non-File API Binary Data
- Test-Only Data Files
- Other Data Structures
- Public Data Files
- ICU4J Resource Information
Overview
ICU makes use of a wide variety of data tables to provide many of its services. Examples include converter mapping tables, collation rules, transliteration rules, break iterator rules and dictionaries, and other locale data. Additional data can be provided by users, either as customizations of ICU’s data or as new data altogether.
This section describes how ICU data is stored and located at run time. It also describes how ICU data can be customized to suit the needs of a particular application.
For simple use of ICU’s predefined data, this section on data management can safely be skipped. The data is built into a library that is loaded along with the rest of ICU. No specific action or setup is required of either the application program or the execution environment.
Update: as of ICU 64, the standard data library is over 20 MB in size. We have introduced a new tool, the ICU Data Build Tool, to give you more control over what goes into your ICU locale data file.
Note: ICU for C by default comes with pre-built data. The source data files are included as an “icu*data.zip” file starting in ICU4C 49. Previously, they were not included unless ICU was downloaded from the source repository.
ICU and CLDR Data
Most of ICU’s data is sourced from CLDR, the Common Locale Data Repository project. Do not file bugs against ICU to request data changes in CLDR; see the CLDR project’s page instead. Because most ICU data files are autogenerated from CLDR, manually editing them is not usually recommended.
Data which is NOT sourced from CLDR includes:
- Conversion Data
- Break Iterator Dictionary Data (Thai, CJK, etc.)
- Break Iterator Rule Data (as of this writing, it is manually kept in sync with the CLDR datasets)
For information on building ICU data from CLDR, see the cldr-icu-readme.
ICU Data Directory
The ICU data directory is the default location for all ICU data. Any requests for data items that do not include an explicit directory path will be resolved to files located in the ICU data directory.
The ICU data directory is determined as follows:
- If the application has called the function u_setDataDirectory(), use the directory specified there; otherwise:
- If the environment variable ICU_DATA is set, use that; otherwise:
- If the C preprocessor variable ICU_DATA_DIR was set at the time ICU was built, use its compiled-in value.
- Otherwise, the ICU data directory is an empty string. This is the default behavior for ICU using a shared library for its data and provides the highest data loading performance.
Note: u_setDataDirectory() is not thread-safe. Call it before calling ICU APIs from multiple threads. If you use both u_setDataDirectory() and u_init(), then use u_setDataDirectory() first.
Earlier versions of ICU supported two additional schemes: setting a data directory relative to the location of the ICU shared libraries, and on Windows, taking a location from the registry. These have both been removed to make the behavior more predictable and easier to understand.
The ICU data directory does not need to be set in order to reference the standard built-in ICU data. Applications that just use standard ICU capabilities (converters, locales, collation, etc.) but do not build and reference their own data do not need to specify an ICU data directory.
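The lookup order above can be modeled in a few lines of C. This is a minimal sketch, not ICU's implementation: resolve_data_dir() is a hypothetical helper, and it assumes that an unset source is passed as NULL (how ICU treats an empty ICU_DATA value is not covered here).

```c
#include <stddef.h>

/*
 * Hypothetical model of the data directory lookup order.
 * Each argument is NULL when that source has not been set.
 */
static const char *resolve_data_dir(const char *set_by_app,
                                    const char *env_value,
                                    const char *compiled_in) {
    if (set_by_app != NULL) return set_by_app;   /* u_setDataDirectory() was called */
    if (env_value != NULL) return env_value;     /* ICU_DATA environment variable */
    if (compiled_in != NULL) return compiled_in; /* ICU_DATA_DIR from build time */
    return "";                                   /* default: empty string */
}
```

A caller could pass getenv("ICU_DATA") from <stdlib.h> as the second argument to model a real environment.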
Multiple-Item ICU Data Directory Values
The ICU data directory string can contain multiple directories as well as .dat path/filenames. They must be separated by the path separator that is used on the platform, for example a semicolon (;) on Windows. Data files will be searched in all directories and .dat package files in the order of the directory string. For details, see the example in the section How Data Loading Works.
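To illustrate how such a string breaks down into entries, here is a small sketch that extracts one entry at a time. The helper nth_entry() and the sample paths are made up for illustration and are not part of ICU; the separator is passed in by the caller, since it differs by platform.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copies the index-th entry (0-based) of a separator-delimited list into out.
 * Returns 1 on success, 0 if there is no such entry or it does not fit.
 */
static int nth_entry(const char *list, char sep, size_t index,
                     char *out, size_t out_size) {
    const char *start = list;
    for (;;) {
        const char *end = strchr(start, sep);
        size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);
        if (index == 0) {
            if (len + 1 > out_size) return 0;  /* entry too long for out */
            memcpy(out, start, len);
            out[len] = '\0';
            return 1;
        }
        if (end == NULL) return 0;             /* ran out of entries */
        start = end + 1;
        --index;
    }
}
```

Entries are visited left to right, which matches the rule that directories and .dat files are searched in the order in which they appear in the string.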
Default ICU Data
The default ICU data consists of the data needed for the converters, collators, locales, etc. that are provided with ICU. Default data must be present in order for ICU to function.
The default data is most commonly built into a shared library that is installed with the other ICU libraries. Nothing is required of the application for this mechanism to work. ICU provides additional options for loading the default data if more flexibility is required.
Here are the steps followed by ICU to locate its default data. This procedure happens only once per process, at the time an ICU data item is first requested.
- If the application has called the function udata_setCommonData(), use the data that was provided. The application specifies the address in memory of an image of an ICU common format data file (either in shared-library format or .dat package file format).
- Examine the contents of the default ICU data shared library. If it contains data, use that data. If the data library is empty (a stub library), proceed to the next step. (A data shared library must always be present in order for ICU to successfully link and load. A stub data library is used when the actual ICU common data is to be provided from another source.)
- Dynamically load (memory map, typically) a common format (.dat) file containing the default ICU data. Loading is described in the section How Data Loading Works. The path to the data is of the form “icudt<version><flag>”, where <version> is the two-digit ICU version number, and <flag> is a letter indicating the internal format of the file (see the Sharing ICU Data Between Platforms section).
Once the default ICU data has been located, loading of individual data items proceeds as described in the section How Data Loading Works.
Building and Linking against ICU data
When using ICU’s configure or runConfigureICU tool to build, several different methods of packaging are available.
Note: in all cases, you must link all ICU tools and applications against a “data library”: either a data library containing the ICU data, or against the “stubdata” library located in icu/source/stubdata. For example, even if ICU is built in “files” mode, you must still link against the “stubdata” library or an undefined symbol error occurs.
- --with-data-packaging=library: This mode builds a shared library (DLL or .so). This is the simplest mode to use, and is the default. To use: link your application against the common and data libraries. This is the only directly supported behavior on Windows builds.
- --with-data-packaging=static: This option builds ICU data as a single (large) static library. This mode is more complex to use. If you encounter errors, you may need to build ICU multiple times.
- --with-data-packaging=auto: With this option, configure will pick library unless the options --enable-static and --disable-shared are also given, in which case it will pick static instead.
- --with-data-packaging=files: With this option, ICU outputs separate individual files (.res, .cnv, etc.) which will be loaded at runtime. Read the rest of this document, especially the sections that discuss the ICU directory path.
- --with-data-packaging=archive: With this option, ICU outputs a single “icudt__.dat” file containing ICU data. Read the rest of this document, especially the sections that discuss the ICU directory path.
Time Zone Data
Because time zone data requires frequent updates in response to countries changing their transition dates for daylight saving time, ICU provides additional options for loading time zone data from separate files, thus avoiding the need to update a combined ICU data package. Further information is found under Time Zones.
Application Data
ICU-based applications can ship and use their own data for localized strings, custom conversion tables, etc. Each data item file must have a package name as a prefix, and this package name must match the basename of a .dat package file, if one is used. The package name must be used in ICU APIs, for example in udata_setAppData() (instead of udata_setCommonData(), which is only used for ICU’s own data) and in the pathname argument of ures_open().
The only real difference from ICU’s own data is that application data cannot be loaded simply by specifying a NULL value for the path arguments of ICU APIs, and application data will not be used by APIs that do not have path/package name arguments at all.
The most important APIs that allow application data to be used are for Resource Bundles, which are most often used for localized strings and other data. There are also functions like ucnv_openPackage() that allow application data to be specified, and the udata.h API can be used to load any data with minimum requirements on the binary format, and without ICU interpreting the contents of the data.
The pkgdata tool, which is used to package the data into various formats (e.g. shared library), has an option (--without-assembly or -w) to not use assembly code when building and packaging the application-specific data into a shared library. Building the data with assembly code, which is enabled by default, is faster and more efficient; however, some platform-specific issues may arise. The --without-assembly option may be necessary on certain platforms (e.g. Linux) which have trouble properly loading application data when it was built with assembly code and is packaged as a shared library.
Alignment
ICU data is designed to be 16-aligned, with natural alignment of values inside the data structure, so that the data is usable as is when memory-mapped. (“16-aligned” means that the start address is a multiple of 16 bytes.)
Memory-mapping (as well as memory allocation) provides at least 16-alignment on modern platforms. Some CPUs require n-alignment of types of size n bytes (and crash on unaligned reads), other CPUs usually operate faster on data that is aligned properly.
Some of the ICU code explicitly checks for proper alignment.
The icupkg tool places data items into the .dat file at start offsets that are multiples of 16 bytes.
When using genccode to directly write a .o/.obj file, or to write assembler code, it specifies at least 16-alignment. When using genccode to write C code, it prepends the data with a double value which should yield at least 8-alignment on most platforms (usually sizeof(double)=8).
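A check for 16-alignment, as defined above, can be written as a test on the low four bits of the address. The following is an illustrative sketch only; is_16_aligned() and sample_data are hypothetical names, and this is not the code ICU itself uses for its alignment checks.

```c
#include <stdint.h>

/* Returns 1 if the address p is a multiple of 16 bytes, otherwise 0. */
static int is_16_aligned(const void *p) {
    return ((uintptr_t)p & 0xF) == 0;
}

/* Sample buffer forced to 16-alignment with the C11 _Alignas specifier. */
static _Alignas(16) unsigned char sample_data[32];
```

With this buffer, sample_data and sample_data + 16 are 16-aligned, while sample_data + 8 is only 8-aligned, which is the weaker guarantee described for C code generated with a leading double.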
Flexibility vs. Installation vs. Performance
There are choices that affect ICU data loading and depend on application requirements.
Data in Shared Libraries/DLLs vs. .dat package files
Building ICU data into shared libraries (--with-data-packaging=library) is the most convenient packaging method because shared libraries (DLLs) are easily found if they are in the same directory as the application libraries, or if they are on the system library path. The application installer usually just copies the ICU shared libraries in the same place. On the other hand, shared libraries are not portable.
Packaging data into .dat files (--with-data-packaging=archive) allows them to be shared across platforms, but they must either be loaded by the application and set with udata_setCommonData() or udata_setAppData(), or they must be in a known location that is included in the ICU data directory string. This requires the application installer, or the application itself at runtime, to locate the ICU and/or application data by setting the ICU data directory (see the ICU Data Directory section above) or by loading the data and providing it to one of the udata_setXYZData() functions.
Unlike shared libraries, .dat package files can be taken apart into separate data item files with the decmn ICU tool. This allows post-installation modification of a package file. The gencmn and pkgdata ICU tools can then be used to reassemble the .dat package file.
For more information about .dat package files see the section Sharing ICU Data Between Platforms below.
Data Overriding vs. Loading Performance
If the ICU data directory string is empty, then ICU will not attempt to load data from the file system. It is then only possible to load data from the linked-in shared library or via udata_setCommonData() and udata_setAppData(). This is inflexible but provides the highest performance.
If the ICU data directory string is not empty, then data items are searched in all directories and matching .dat files mentioned before checking in already-loaded package files. This allows overriding of packaged data items with single files after installation but costs some time for filesystem accesses. This is usually done only once per data item; see User Data Caching below.
Single Data Files vs. Packages
Single data files (--with-data-packaging=files) are easy to replace and can override items inside data packages. However, it is usually desirable to reduce the number of files during installation, and package files use less disk space than many small files.
How Data Loading Works
ICU data items are referenced by three names - a path, a name and a type. The following are some examples:
path | name | type |
---|---|---|
c:\some\path\dataLibName | test | dat |
no path | cnvalias | icu |
no path | cp1252 | cnv |
no path | en | res |
no path | uprops | icu |
Items with ‘no path’ specified are loaded from the default ICU data.
Application data items include a path, and will be loaded from user data files, not from the ICU default data. For application data, the path argument need not contain an actual directory, but must contain the application data’s package name after the last directory separator character (or by itself if there is no directory). If the path argument contains a directory, then it is logically prepended to the ICU data directory string and searched first for data. The path argument can contain at most one directory. (Path separators like semicolon (;) are not handled here.)
Note: The ICU data directory string itself may contain multiple directories and path/filenames to .dat package files. See the ICU Data Directory section.
It is recommended to not include the directory in the path argument but to make sure via setting the application data or the ICU data directory string that the data can be located. This simplifies program maintenance and improves robustness.
See the API descriptions for the functions udata_open() and udata_openChoice() for additional information on opening ICU data from within an application.
Data items can exist as individual files, or a number of them can be packaged together in a single file for greater efficiency in loading and convenience of distribution. The combined files are called Common Files.
Based on the supplied path and name, ICU searches several possible locations when opening data. To make things more concrete in the following descriptions, the following values of path, name and type are used:
path = "c:\\some\\path\\dataLibName"
name = "test"
type = "res"
In this case, “dataLibName” is the “package name” part of the path argument, and “c:\some\path\” is the directory part of it.
The search sequence for the data for “test.res” is as follows (the first successful loading attempt wins):
- Try to load the file “dataLibName_test.res” from c:\some\path\.
- Try to load the file “dataLibName_test.res” from each of the directories in the ICU data directory string.
- Try to locate the data package for the package name “dataLibName”.
  - Try to locate the data package in the internal cache.
  - Try to load the package file “dataLibName.dat” from c:\some\path\.
  - Try to load the package file “dataLibName.dat” from each of the directories in the ICU data directory string.
The first steps, loading the data item from an individual file, are omitted if no directory is specified in either the path argument or the ICU data directory string.
Package files are loaded at most once and then cached. They are identified only by their package name. Whenever a data item is requested from a package and that package has been loaded before, then the cached package is used immediately instead of searching through the filesystem.
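The search sequence can be made concrete by listing the file names that would be tried, in order. The sketch below is a simplified model under stated assumptions, not ICU's loader: list_candidates() and join_path() are hypothetical helpers, the ICU data directory string is assumed to be already split into plain directories, and the package cache lookup that happens before the .dat files are tried is only noted in a comment.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_PATH_LEN 256

/* Joins dir and file, adding sep only if dir does not already end with it. */
static void join_path(char *out, size_t out_size,
                      const char *dir, char sep, const char *file) {
    size_t n = strlen(dir);
    if (n > 0 && dir[n - 1] != sep) {
        snprintf(out, out_size, "%s%c%s", dir, sep, file);
    } else {
        snprintf(out, out_size, "%s%s", dir, file);
    }
}

/*
 * Writes, in search order, the files that would be tried for one data item.
 * path_dir:  directory part of the path argument, or NULL if there is none.
 * data_dirs: directories taken from the ICU data directory string.
 * Returns the number of candidates written to out.
 */
static int list_candidates(const char *path_dir, const char *package,
                           const char *name, const char *type,
                           const char *const *data_dirs, int n_dirs, char sep,
                           char out[][SKETCH_PATH_LEN], int max_out) {
    char item[SKETCH_PATH_LEN];
    char pkg[SKETCH_PATH_LEN];
    int count = 0;
    int pass, i;

    snprintf(item, sizeof item, "%s_%s.%s", package, name, type);
    snprintf(pkg, sizeof pkg, "%s.dat", package);

    /* Pass 0: individual item files. Pass 1: the .dat package file.
       (Between the two passes, the real lookup checks the package cache.) */
    for (pass = 0; pass < 2; ++pass) {
        const char *file = (pass == 0) ? item : pkg;
        if (path_dir != NULL && count < max_out) {
            join_path(out[count++], SKETCH_PATH_LEN, path_dir, sep, file);
        }
        for (i = 0; i < n_dirs && count < max_out; ++i) {
            join_path(out[count++], SKETCH_PATH_LEN, data_dirs[i], sep, file);
        }
    }
    return count;
}
```

Called with the directory part of the example path argument, the package name “dataLibName”, the name “test”, the type “res” and one data directory, the function lists two individual-file candidates followed by two package-file candidates.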
Note: ICU versions before 2.2 always searched data packages before looking for individual files, which made it impossible to override packaged data items. See the ICU 2.2 download page and the readme for more information about the changes.
User Data Caching
Once loaded, data package files are cached, and stay loaded for the duration of the process. Any requests for data items from an already loaded data package file are routed directly to the cached data. No additional search for loadable files is made.
The user data cache is keyed by the base file name portion of the requested path, with any directory portion stripped off and ignored. Using the previous example, for the path name “c:\some\path\dataLibName”, the cache key is “dataLibName”. After this is cached, a subsequent request for “dataLibName”, no matter what directory path is specified, will resolve to the cached data.
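The derivation of the cache key can be sketched as a base-name extraction. cache_key() is a hypothetical helper, not an ICU function, and as a simplification it accepts both ‘/’ and ‘\’ as directory separators regardless of platform.

```c
#include <string.h>

/*
 * Returns a pointer to the base name inside path: the text after the last
 * '/' or '\\', or the whole string when it contains no separator.
 */
static const char *cache_key(const char *path) {
    const char *slash = strrchr(path, '/');
    const char *backslash = strrchr(path, '\\');
    const char *last = slash;
    if (backslash != NULL && (last == NULL || backslash > last)) {
        last = backslash;
    }
    return (last != NULL) ? last + 1 : path;
}
```

Two requests that differ only in their directory portion therefore map to the same key, which is why a later request can resolve to data cached under a different directory.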
Data can be explicitly added to the cache of common format data by means of the udata_setAppData() function. This function takes as input the path (name) and a pointer to a memory image of a .dat file. The data is added to the cache, causing any subsequent requests for data items from that file name to be routed to the cache.
Only data package files are cached. Separate data files that contain just a single data item are not cached; for these, multiple requests to ICU to open the data will result in multiple requests to the operating system to open the underlying file.
However, most ICU services (Resource Bundles, conversion, etc.) themselves cache loaded data, so that data is usually loaded only once until the end of the process (or until u_cleanup() or ucnv_flushCache() or similar are called).
There is no mechanism for removing or updating cached data files.
Directory Separator Characters
If a directory separator (generally ‘/’ or ‘\’) is needed in a path parameter, use the form that is native to the platform. The ICU header "putil.h" defines U_FILE_SEP_CHAR appropriately for the platform.
Note: On Windows, the directory separator must be ‘\’ for any paths passed to ICU APIs. This is different from native Windows APIs, which generally allow either ‘/’ or ‘\’.
Sharing ICU Data Between Platforms
ICU’s default data is (at the time of this writing) about 8 MB in size. Because it is normally built as a shared library, the file format is specific to each platform (operating system). The data libraries cannot be shared between platforms even though the actual data contents are identical.
By distributing the default data in the form of common format .dat files rather than as shared libraries, a single data file can be shared among multiple platforms. This is beneficial if a single distribution of the application (a CD, for example) includes binaries for many platforms, and the size requirements for replicating the ICU data for each platform are a problem.
ICU common format data files are not completely interchangeable between platforms. The format depends on these properties of the platform:
- Byte Ordering (little endian vs. big endian)
- Base character set - ASCII or EBCDIC
This means, for example, that ICU data files are interchangeable between Windows and Linux on X86 (both are ASCII little endian), or between Macintosh and Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC and Solaris on X86 (different byte ordering).
The single letter following the version number in the file name of the default ICU data file encodes the properties of the file as follows:
icudt19l.dat Little Endian, ASCII
icudt19b.dat Big Endian, ASCII
icudt19e.dat Big Endian, EBCDIC
(There are no little endian EBCDIC systems. All non-EBCDIC encodings include an invariant subset of ASCII that is sufficient to enable these files to interoperate.)
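The naming scheme can be sketched as a small mapping from platform properties to the letter in the file name. The helper names below are hypothetical and not part of ICU; the version number 19 is taken from the file names listed above.

```c
#include <stdio.h>

/*
 * Maps platform properties to the one-letter suffix: 'l', 'b' or 'e'.
 * Returns '\0' for little endian EBCDIC, which has no file name letter.
 */
static char data_format_letter(int big_endian, int ebcdic) {
    if (ebcdic) return big_endian ? 'e' : '\0';
    return big_endian ? 'b' : 'l';
}

/* Detects the byte order of the machine running this code. */
static int host_is_big_endian(void) {
    const unsigned int probe = 1;
    return *(const unsigned char *)&probe == 0;
}

/* Writes a name such as "icudt19l.dat" into out; returns 0 on success, -1 on error. */
static int data_file_name(char *out, size_t out_size,
                          int version, int big_endian, int ebcdic) {
    char letter = data_format_letter(big_endian, ebcdic);
    int n;
    if (letter == '\0') return -1;
    n = snprintf(out, out_size, "icudt%d%c.dat", version, letter);
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For an ASCII platform, data_file_name(buf, sizeof buf, 19, host_is_big_endian(), 0) produces the name that matches the byte order of the host.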
The packaging of the default ICU data as a .dat file rather than as a shared library is requested by using an option in the configure script at build time. Nothing is required at run time; ICU finds and uses whatever form of the data is available.
Note: When the ICU data is built in the form of shared libraries, the library names have platform-specific prefixes and suffixes. On Unix-style platforms, all the libraries have the “lib” prefix and one of the usual (“.dll”, “.so”, “.sl”, etc.) suffixes. Other than these prefixes and suffixes, the library names are the same as the above .dat files.
Customizing ICU’s Data Library
ICU includes a standard library of data that is about 16 MB in size. Most of this consists of conversion tables and locale information. The data itself is normally placed into a single shared library.
Update: as of ICU 64, the standard data library is over 20 MB in size. We have introduced a new tool, the ICU Data Build Tool, to replace the makefiles explained below and give you more control over what goes into your ICU locale data file.
Adding Converters to ICU
The first step is to obtain or create a .ucm (source) mapping data file for the desired converter. A large archive of converter data is maintained by the ICU team at https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
We will use solaris-eucJP-2.7.ucm, available from the repository mentioned above, as an example.
Build the Converter
Converter source files are compiled into binary converter files (.cnv files) by using the ICU tool makeconv. For the example, you can use this command:
makeconv -v solaris-eucJP-2.7.ucm
Some of the .ucm files from the repository will need additional header information before they can be built. Use the error messages from the makeconv tool, .ucm files for similar converters, and the ICU user guide documentation of .ucm files as a guide when making changes. For the solaris-eucJP-2.7.ucm example, we will borrow the missing header fields from source/data/mappings/ibm-33722_P12A-2000.ucm, which is the standard ICU eucJP converter data.
The ucm file format is described in the “Conversion Data” chapter of this user guide.
After adjustment, the header of the solaris-eucJP-2.7.ucm file contains these items:
<code_set_name> "solaris-eucJP-2.7"
<subchar> \x3F
<uconv_class> "MBCS"
<mb_cur_max> 3
<mb_cur_min> 1
<icu:state> 0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
<icu:state> a1-fe
<icu:state> a1-e4
<icu:state> a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
<icu:state> a1-fe
The binary converter file produced by the makeconv tool is solaris-eucJP-2.7.cnv.
Installation
Copy the new .cnv file to the desired location for use. Set the environment variable ICU_DATA to the directory containing the data, or, alternatively, from within an application, tell ICU the location of the new data with the function u_setDataDirectory() before using the new converter.
If ICU is already obtaining data from files rather than a shared library, install the new file in the same location as the existing ICU data file(s), and don’t change/set the environment variable or data directory.
If you do not want to add a converter to ICU’s base data, you can also generate a conversion table with makeconv, use pkgdata to generate your own package, and use ucnv_openPackage() to open a converter with that conversion table from the generated package.
Building the new converter into ICU
The need to install a separate file and inform ICU of the data directory can be avoided by building the new converter into ICU’s standard data library. Here is the procedure for doing so:
- Move the .ucm file(s) for the converter(s) to be added (solaris-eucJP-2.7.ucm for our example) into the directory source/data/mappings/
- Create, or edit, if it already exists, the file source/data/mappings/ucmlocal.mk. Add this line:
  UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm
  Any number of converters can be listed. Extend the list to new lines with a backslash at the end of the line. The ucmlocal.mk file is described in more detail in source/data/mappings/ucmfiles.mk. (Even though they use very different build systems, ucmlocal.mk is used for both the Windows and UNIX builds.)
- Add the converter name and aliases to source/data/mappings/convrtrs.txt. This will allow your converter to be shown in the list of available converters when you call the ucnv_getAvailableName() function. The file syntax is described within the file.
- Rebuild the ICU data. For Windows, from MSVC choose the makedata project from the GUI, then build the project. For UNIX,
  cd icu/source/data; gmake
When opening an ICU converter (ucnv_open()), the converter name cannot be qualified with a path that indicates the directory or common data file containing the corresponding converter data. The required data must be present either in the main ICU data library or as a separate .cnv file located in the ICU data directory. This is different from opening resources or other types of ICU data, which do allow a path.
Adding Locale Data to ICU’s Data
If you have data for a locale that is not included in ICU’s standard build, then you can add it to the build in a very similar way as with conversion tables above. The ICU project provides a large number of additional locales in its locale repository on the web. Most of this locale data is derived from the CLDR (Common Locale Data Repository) project.
Dropping the txt file into the correct place in the source tree is sufficient to add it to your ICU build. You will need to re-configure in order to pick it up.
Customizing ICU’s Data Library for ICU 63 or earlier
The ICU data library can be easily customized, either by adding additional converters or locales, or by removing some of the standard ones for the purpose of saving space.
Note: ICU for C by default comes with pre-built data. The source data files are included as an “icu*data.zip” file starting in ICU4C 49. Previously, they were not included unless ICU was downloaded from the source repository. Alternatively, the Data Customizer may be used to customize the pre-built data.
ICU can load data from individual data files as well as from its default library, so building a customized library when adding additional data is not strictly necessary. Adding to ICU’s library can simplify application installation by eliminating the need to include separate files with an application distribution, and the need to tell ICU where they are installed.
Reducing the size of ICU’s data by eliminating unneeded resources can make sense on small systems with limited or no disk, but for desktop or server systems there is no real advantage to trimming. ICU’s data is memory mapped into an application’s address space, and only those portions of the data actually being used are ever paged in, so there are no significant RAM savings. As for disk space, with the large size of today’s hard drives, saving a few MB is not worth the bother.
By default, ICU builds with a large set of converters and with all available locales. This means that any extra items added must be provided by the application developer. There is no extra ICU-supplied data that could be specified.
Details
The converters and resources that ICU builds are listed in the following configuration files. They are only available when building from ICU’s source code repository. Normally, the standard ICU distribution does not include these files.
File | Description |
---|---|
source/data/locales/resfiles.mk | The standard set of locale data resource bundles |
source/data/locales/reslocal.mk | User-provided file with additional resource bundles |
source/data/coll/colfiles.mk | The standard set of collation data resource bundles |
source/data/coll/collocal.mk | User-provided file with additional collation resource bundles |
source/data/brkitr/brkfiles.mk | The standard set of break iterator data resource bundles |
source/data/brkitr/brklocal.mk | User-provided file with additional break iterator resource bundles |
source/data/translit/trnsfiles.mk | The standard set of transliterator resource files |
source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files |
source/data/mappings/ucmcore.mk | Core set of conversion tables for MIME/Unix/Windows |
source/data/mappings/ucmfiles.mk | Additional, large set of conversion tables for a wide range of uses |
source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables |
source/data/mappings/ucmlocal.mk | User-provided file with additional conversion tables |
source/data/misc/miscfiles.mk | Miscellaneous data, like timezone information |
These files function identically for both Windows and UNIX builds of ICU. ICU will automatically update the list of installed locales returned by uloc_getAvailable() whenever resfiles.mk or reslocal.mk are updated and the ICU data library is rebuilt. These files are only needed while building ICU. If any of these files are removed or renamed, the size of the ICU data library will be reduced.
The optional files reslocal.mk and ucmlocal.mk are not included as part of a standard ICU distribution. Thus these customization files do not need to be merged or updated when updating versions of ICU.
Both reslocal.mk and ucmlocal.mk are makefile includes, so the usual rules for makefiles apply. A line may be continued by ending it with a backslash. Lines beginning with a # are comments. See ucmfiles.mk and resfiles.mk for additional information.
Reducing the Size of ICU’s Data: Conversion Tables
The size of the ICU data file in the standard build configuration is about 8 MB. The majority of this is used for conversion tables. ICU comes with so many conversion tables because many ICU users need to support many encodings from many platforms. There are conversion tables for EBCDIC and DOS codepages, for ISO 2022 variants, and for small variations of popular encodings.
Important: ICU provides full internationalization functionality without any conversion table data. The common library contains code to handle several important encodings algorithmically: US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e., US-ASCII, ISO-8859-1, and all Unicode charsets; see source/data/mappings/convrtrs.txt for the current list).
Therefore, the easiest way to reduce the size of ICU’s data by a lot (without limitation of I18N support) is to reduce the number of conversion tables that are built into the data file.
The conversion tables are listed for the build process in several makefiles source/data/mappings/ucm*.mk, roughly grouped by how commonly they are used. If you remove or rename any of these files, then the ICU build will exclude the conversion tables that are listed in that file. Beginning with ICU 2.0, all of these makefiles including the main one are optional. If you remove all of them, then ICU will include only very few conversion tables for “fallback” encodings (see note below).
If you remove or rename all ucm*.mk files, then ICU’s data is reduced to about 3.6 MB. If you remove all these files except for ucmcore.mk, then ICU’s data is reduced to about 4.7 MB, while keeping support for a core set of common MIME/Unix/Windows encodings.
Note: If you remove the conversion table for an encoding that could be a default encoding on one of your platforms, then ICU will not be able to instantiate a default converter. In this case, ICU 2.0 and up will automatically fall back to a “lowest common denominator” and load a converter for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be good enough for converting strings that contain only “ASCII” characters (see the comment about “invariant characters” in utypes.h). When ICU is built with a reduced set of conversion tables, some tests that check converter behavior against known features of particular encodings will fail. Building the testdata will also fail if you remove conversion tables that it needs (to test non-ASCII/Unicode resource bundle source files, for example). You can ignore these failures. If you want to run the tests, build with the standard set of conversion tables.
Reducing the Size of ICU’s Data: Locale Data
If you need to reduce the size of ICU’s data even further, then you need to remove other files or parts of files from the build as well.
There are a number of different subdirectories of ‘data’ containing locale data split out by section. Each subdirectory has its own .mk file listing the locales which will be built. Subdirectories include lang for language names and curr for currency names.
You can remove data for entire locales by removing their files from source/data/locales/resfiles.mk or the appropriate other .mk file. ICU will then use the data of the parent locale instead, which is root.txt. If you remove all resource bundles for a given language and its country/region/variant sublocales, do not remove root.txt! Also, do not remove a parent locale if child locales exist. For example, do not remove “en” while retaining “en_US”.
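The parent/child rule above can be checked mechanically before rebuilding. The following is a minimal sketch, not part of the ICU build: the function name `missing_parents` is hypothetical, and it assumes the parent of a locale is found by dropping the last underscore-separated subtag (with `root` as the parent of a bare language). Real ICU data can declare other parents for some locales, so treat the result as a hint only.

```python
def missing_parents(locale_files):
    """Return (locale, parent) pairs where a retained locale lacks its parent.

    Assumption: the parent is obtained by truncating the last "_" subtag,
    and a bare language code falls back to "root".
    """
    present = {name[:-4] for name in locale_files if name.endswith(".txt")}
    problems = []
    for locale in sorted(present):
        if locale == "root":
            continue
        parent = locale.rsplit("_", 1)[0] if "_" in locale else "root"
        if parent not in present:
            problems.append((locale, parent))
    return problems


# Hypothetical trimmed list: "en.txt" was removed but "en_US.txt" was kept.
kept = ["root.txt", "en_US.txt", "ja.txt", "ja_JP.txt"]
print(missing_parents(kept))  # [('en_US', 'en')]
```

An empty result means every retained locale still has its assumed parent in the list.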
Reducing the Size of ICU’s Data: Collation Data
Collation data (for sorting, searching and alphabetic indexes) is also large, especially the collation data for East Asian languages because they define multiple orderings of tens of thousands of Han characters. You can remove the collation data for those languages by removing references to those locales from source/data/coll/colfiles.mk. When you do that, the collation for those languages will fall back to the root collator, that is, you lose language-specific behavior.
A much less radical approach is to keep the collation data tables but remove the tailoring rule strings from which they were built. Those rule strings are rarely used at runtime. For documentation about their use and how to remove them see the section “Building on Existing Locales” in the Collation Customization chapter.
Adding Locale Data to ICU’s Data
To add a locale, write a resource bundle file for it with a structure like the existing locale resource bundles (e.g. source/data/locales/ja.txt, ru_RU.txt, kok_IN.txt) and add it by writing a file source/data/locales/reslocal.mk as described above. In this file, define the list of additional resource bundles as
GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...
Starting in ICU 2.2, these added locales are automatically listed by uloc_getAvailable().
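Because reslocal.mk is an ordinary makefile include, a long bundle list can be split over several lines with backslash continuations. The sketch below only illustrates that file shape; the helper name `render_reslocal` is hypothetical, and the bundle names are the placeholders used above, not real files.

```python
def render_reslocal(bundles):
    """Build the text of a reslocal.mk listing extra resource bundles.

    Each bundle goes on its own line; every line except the last ends
    with a backslash so that make joins them into one assignment.
    """
    lines = ["# Locally added resource bundles (placeholder names)."]
    lines.append("GENRB_SOURCE_LOCAL=" + " \\\n ".join(bundles))
    return "\n".join(lines) + "\n"


print(render_reslocal(["myLocale.txt", "other.txt"]), end="")
```

The generated text would then be saved as source/data/locales/reslocal.mk before rebuilding the data library.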
ICU Data File Formats
ICU uses several kinds of data files with specific source (plain text) and binary data formats. The following lists provide links to descriptions of those formats.
Each ICU data object begins with a header before the actual, specific data. The header consists of a 16-bit header length value, the two “magic” bytes DA 27 and a UDataInfo structure which specifies the data object’s endianness, charset family, format, data version, etc.
(This is not the case for the trie structures, which are not stand-alone, loadable data objects.)
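The header layout described above can be read with a few lines of code. This is a sketch under stated assumptions: the field order and sizes follow the UDataInfo declaration in unicode/udata.h as I understand it (size, reserved word, isBigEndian, charsetFamily, sizeofUChar, reserved byte, then three 4-byte fields), so check them against your ICU version. The function name `parse_icu_header` and the sample bytes are made up for illustration.

```python
import struct


def parse_icu_header(data):
    """Decode the common header at the start of an ICU data object."""
    if len(data) < 24:
        raise ValueError("too short for an ICU data header")
    # Bytes 2 and 3 are single bytes, so they can be checked before
    # the endianness of the file is known.
    if data[2] != 0xDA or data[3] != 0x27:
        raise ValueError("magic bytes DA 27 not found")
    big_endian = data[8] != 0          # UDataInfo.isBigEndian (assumed offset)
    order = ">" if big_endian else "<"
    (header_size,) = struct.unpack_from(order + "H", data, 0)
    info_size, _reserved = struct.unpack_from(order + "HH", data, 4)
    return {
        "header_size": header_size,
        "info_size": info_size,
        "big_endian": big_endian,
        "charset_family": data[9],     # 0 = ASCII, 1 = EBCDIC
        "sizeof_uchar": data[10],
        "data_format": bytes(data[12:16]),
        "format_version": tuple(data[16:20]),
        "data_version": tuple(data[20:24]),
    }


# Synthetic little-endian header with made-up version numbers,
# padded with zero bytes up to the declared header length of 32.
sample = struct.pack("<HBBHHBBBB4s4B4B",
                     32, 0xDA, 0x27, 20, 0, 0, 0, 2, 0,
                     b"ResB", 1, 4, 0, 0, 1, 4, 0, 0) + bytes(8)
info = parse_icu_header(sample)
print(info["header_size"], info["data_format"])  # 32 b'ResB'
```

The actual, format-specific data begins at offset `header_size` from the start of the object.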
Public Data Files
ICU.dat package files
- Source format: (list of files provided as input to the icupkg tool, or on the gencmn tool command line)
- Binary format: .dat: source/tools/toolutil/pkg_gencmn.cpp
- Generator tool: icupkg or gencmn
Resource bundles
- Source format: .txt: icuhtml/design/bnf_rb.txt
- Binary format: .res: source/common/uresdata.h
- Generator tool: genrb
Unicode conversion mapping tables
- Source format: .ucm: Conversion Data chapter
- Binary format: .cnv: source/common/ucnvmbcs.h
- Generator tool: makeconv
Conversion (charset) aliases
- Source format: source/data/mappings/convrtrs.txt: contains format description. The command “uconv -l --canon” will also generate the alias table from the currently used copy of ICU.
- Binary format: cnvalias.icu: source/common/ucnv_io.cpp
- Generator tool: gencnval
Unicode Character Data (Properties; for Java only: hardcoded in C common library)
- Source format: source/data/unidata/ppucd.txt: Preparsed UCD
- Binary format: uprops.icu: tools/unicode/c/genprops/corepropsbuilder.cpp
- Generator tool: genprops
Unicode Character Data (Case mappings; for Java only: hardcoded in C common library)
- Source format: source/data/unidata/*.txt: Unicode Character Database
- Binary format: ucase.icu: tools/unicode/c/genprops/casepropsbuilder.cpp
- Generator tool: genprops
Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library)
- Source format: source/data/unidata/*.txt: Unicode Character Database
- Binary format: ubidi.icu: tools/unicode/c/genprops/bidipropsbuilder.cpp
- Generator tool: genprops
Unicode Character Data (Normalization since ICU 4.4) & custom normalization data
- Source format: source/data/unidata/norm2/*.txt: Files derived from the Unicode Character Database, or custom data.
- Binary format: .nrm: source/common/normalizer2impl.h
- Generator tool: gennorm2
Unicode Character Data (Character names)
- Source format: source/data/unidata/UnicodeData.txt: Unicode Character Database
- Binary format: unames.icu: tools/unicode/c/genprops/namespropsbuilder.cpp
- Generator tool: genprops
Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8)
- Source format: UCD Property*Aliases.txt: Unicode Character Database
- Binary format: pnames.icu: source/common/propname.h
- Generator tool: genprops
Unicode Character Data (Text layout properties since ICU 64)
- Source format: source/data/unidata/ppucd.txt: Preparsed UCD
- Binary format: ulayout.icu: tools/unicode/c/genprops/layoutpropsbuilder.cpp
- Generator tool: genprops
Unicode Character Data (Emoji properties since ICU 70)
Emoji properties of code points moved out of uprops.icu. Emoji properties of strings added.
- Source format: source/data/unidata/emoji-sequences.txt and source/data/unidata/emoji-zwj-sequences.txt: UTS #51 Data Files
- Binary format: uemoji.icu: tools/unicode/c/genprops/emojipropsbuilder.cpp
- Generator tool: genprops
Collation data (root collation & tailorings; ICU 53 & later)
- Source format: Original data from allkeys_CLDR.txt in CLDR Root Collation Data Files processed into source/data/unidata/FractionalUCA.txt by tool at unicode.org maintained by Mark Davis (call the Main class with option writeFractionalUCA); source tailorings (text rules) in source/data/coll/*.txt resource bundles: Collation Customization chapter.
- Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/collationdatareader.h
- Generator tool: genuca, genrb
Rule-based break iterator data
- Source format: .txt: Boundary Analysis chapter
- Binary format: .brk: source/common/rbbidata.h
- Generator tool: genbrk
Dictionary-based break iterator data (ICU 50 & later)
- Source format: .txt: gendict.cpp comments
- Binary format: .dict: see source/common/dictionarydata.h
- Generator tool: gendict
Rule-based transform (transliterator) data
- Source format: .txt (in resource bundles): Transform Rule Tutorial chapter
- Binary format: Uses genrb to make binary format
- Generator tool: Does not apply
Time zone data (ICU 4.4 & later)
- Source format: source/data/misc/zoneinfo64.txt: ftp://elsie.nci.nih.gov/pub/ tzdata .tar.gz
- Binary format: zoneinfo64.res (generated by genrb and tzcode tools).
- Generator tool: Does not apply
StringPrep profile data
- Source format: source/data/sprep/rfc3491.txt:
- Binary format: .spp: source/tools/gensprep/store.c
- Generator tool: gensprep
Confusables data
- Source format: source/data/unidata/confusables.txt, source/data/unidata/confusablesWholeScript.txt
- Binary format: confusables.cfu: source/i18n/uspoof_impl.h
- Generator tool: gencfu
Public Data Files (old versions)
Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library)
- Source format: source/data/unidata/*.txt: Unicode Character Database
- Binary format: unorm.icu: source/common/unormimp.h
- Generator tool: gennorm
Unicode Character Data (Property [value] aliases before ICU 4.8)
- Source format: source/data/unidata/Property*Aliases.txt: Unicode Character Database
- Binary format: pnames.icu: source/common/propname.h (ICU 4.6)
- Generator tool: genpname
Collation data (UCA, code points to weights; ICU 52 & earlier)
- Source format: Same as in ICU 53
- Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52)
- Generator tool: genuca, genrb
Collation data (Inverse UCA, weights->code points; ICU 52 & earlier)
- Source format: Processed from FractionalUCA.txt like ICU 52 ucadata.icu
- Binary format: invuca.icu: source/i18n/ucol_imp.h (ICU 52)
- Generator tool: genuca
Dictionary-based break iterator data (ICU 49 & earlier)
- Source format: .txt: genctd.cpp comments
- Binary format: .ctd: see CompactTrieHeader in source/common/triedict.cpp
- Generator tool: genctd
Time zone data (Before ICU 4.4)
- Source format: source/data/misc/zoneinfo.txt (ICU 4.2): ftp://elsie.nci.nih.gov/pub/ tzdata .tar.gz
- Binary format: zoneinfo64.res (generated by genrb and tzcode tools).
- Generator tool: Does not apply
Non-File API Binary Data
Converter selector data
- Source format: none
- Binary format: source/common/ucnvsel.cpp
- Generator tool: ucnvsel_open()
Test-Only Data Files
test.icu (for udata API testing)
- Source format: none (fixed output from gentest when not using -r or -j options)
- Binary format: test.icu: see createData() in source/tools/gentest/gentest.c
- Generator tool: gentest
Other Data Structures
UCPTrie (C)/CodePointTrie (Java) (maps code points to integers)
- Source format: (public builder API)
- Binary format: ICU Code Point Tries design doc, icu4c/source/common/ucptrie_impl.h
- Generator tool: (builder class)
UTrie2 (C)/Trie2 (Java) (maps code points to integers)
- Source format: (internal builder API)
- Binary format: ICU Code Point Tries design doc, icu4c/source/common/utrie2_impl.h
- Generator tool: (builder class)
BytesTrie (maps byte sequences to 32-bit integers)
- Source format: (public builder API)
- Binary format: BytesTrie design doc, icu4c/source/common/unicode/bytestrie.h
- Generator tool: (builder class)
UCharsTrie (C)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers)
- Source format: (public builder API)
- Binary format: UCharsTrie design doc, icu4c/source/common/unicode/ucharstrie.h
- Generator tool: (builder class)
ICU4J Resource Information
Starting with release 2.1, ICU4J includes its own resource information which is completely independent of the JRE resource information. (Note: in ICU4J 2.8 to 3.4, time zone information depends on the underlying JRE.) The new ICU4J information is equivalent to the information in ICU4C and many resources are, in fact, the same binary files that ICU4C uses.
By default the ICU4J distribution includes all of the standard resource information. It is located under the directory com/ibm/icu/impl/data. Depending on the service, the data is in different locations and in different formats. Note: This will continue to change from release to release, so clients should not depend on the exact organization of the data in ICU4J.
- The primary locale data is under the directory icudt38b, as a set of “.res” files whose names are the locale identifiers. Locale naming is documented in the com.ibm.icu.util.ULocale class, and the use of these names in searching for resources is documented in com.ibm.icu.util.UResourceBundle.
- The collation data is under the directory icudt38b/coll, as a set of “.res” files.
- The rule-based transliterator data is under the directory icudt38b/translit as a set of “.res” files. (Note: the Han transliterator test data is no longer included in the core icu4j.jar file by default.)
- The rule-based number format data is under the directory icudt38b/rbnf as a set of “.res” files.
- The break iterator data is directly under the data directory, as a set of “.brk” files, named according to the type of break and the locale where there are locale-specific versions.
- The holiday data is under the data directory, as a set of “.class” files, named “HolidayBundle_” followed by the locale ID.
- The character property data as well as assorted normalization data and default Unicode collation algorithm (UCA) data is found under the data directory as a set of “.icu” files.
- The character set converter data is under the directory icudt38b/, as a set of “.cnv” files. These files are currently included only in icu-charset.jar.
- The time zone data is named zoneinfo.res under the directory icudt38b.
Some of the data files alias or otherwise reference data from other data files. One reason for this is that some locale names have changed. For example, he_IL used to be iw_IL. In order to support both names but not duplicate the data, one of the resource files refers to the other file’s data. In other cases, a file may alias a portion of another file’s data in order to save space. Currently ICU4J provides no tool for revealing these dependencies.
Note: Java’s Locale class silently converts the language code “he” to “iw” when you construct the Locale (for versions of Java through Java 5). Thus Java cannot be used to locate resources that use the “he” language code. ICU, on the other hand, does not perform this conversion in ULocale, and instead uses aliasing in the locale data to represent the same set of data under different locale ids.
Resource files that use locale ids form a hierarchy, with up to four levels: a root, language, region (country), and variant. Searches for locale data attempt to match as far down the hierarchy as possible. For example, “he_IL” will match he_IL, but “he_US” will match he (since there is no US variant for he), and “xx_YY” will match root (the default fallback locale) since there is no xx language code in the locale hierarchy. Again, see java.util.ResourceBundle for more information.
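The matching described above can be sketched as a loop that strips one subtag at a time. This is a simplified illustration, not ICU4J's actual lookup code: the function name `resolve_bundle` is hypothetical, and the set of available bundles is reduced to the three names from the example.

```python
def resolve_bundle(requested, available):
    """Return the most specific available bundle for a locale id.

    Simplification: candidates are produced by repeatedly dropping the
    last "_"-separated subtag; anything unmatched ends at "root".
    """
    candidate = requested
    while candidate:
        if candidate in available:
            return candidate
        # "he_US" -> "he"; "he" -> "" (which ends the loop)
        candidate = candidate.rpartition("_")[0]
    return "root"


available = {"root", "he", "he_IL"}
for locale_id in ("he_IL", "he_US", "xx_YY"):
    print(locale_id, "->", resolve_bundle(locale_id, available))
# he_IL -> he_IL
# he_US -> he
# xx_YY -> root
```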
Currently ICU4J provides no tool for revealing these dependencies between data files, so trimming the data directly in the ICU4J project is a hit-or-miss affair. The key point when you remove data is to make sure to remove all dependencies on that data as well. For example, if you remove he.res, you need to remove he_IL.res, since it is lower in the hierarchy, and you must remove iw.res, since it references he.res, and iw_IL.res, since it depends on it (and also references he_IL.res).
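Since no tool reports these dependencies, one option is to record them by hand and compute everything that must go together. The sketch below assumes two hand-maintained maps, one for the locale hierarchy and one for alias references; the function name `removal_closure` and both maps are hypothetical and only encode the he/iw example from the paragraph above.

```python
def removal_closure(start, children, referenced_by):
    """Collect every file that must be removed along with `start`.

    children:      file -> files lower in the locale hierarchy
    referenced_by: file -> files that alias or reference it
    """
    to_remove, pending = set(), [start]
    while pending:
        name = pending.pop()
        if name in to_remove:
            continue
        to_remove.add(name)
        pending.extend(children.get(name, ()))
        pending.extend(referenced_by.get(name, ()))
    return to_remove


# Hand-written dependency data for the example only.
children = {"he.res": ["he_IL.res"], "iw.res": ["iw_IL.res"]}
referenced_by = {"he.res": ["iw.res"], "he_IL.res": ["iw_IL.res"]}
print(sorted(removal_closure("he.res", children, referenced_by)))
# ['he.res', 'he_IL.res', 'iw.res', 'iw_IL.res']
```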
Unfortunately, the jar tool in the JDK provides no way to remove items from a jar file. Thus you have to extract the resources, remove the ones you don’t want, and then create a new jar file with the remaining resources. See the jar tool information for how to do this. Before re-creating the jar file, be sure to thoroughly test your application with the remaining resources, making sure each required resource is present.
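As an alternative to extracting and re-creating the archive with the jar tool, the same result can be approximated by copying the jar entry by entry and skipping the unwanted ones. The sketch below uses Python's standard zipfile module, relying on the fact that a jar file is a ZIP archive. The function name `copy_jar_without` is hypothetical, and the sketch does not handle signed jars, whose signature files would no longer match the contents.

```python
import zipfile


def copy_jar_without(src_path, dst_path, unwanted):
    """Write a copy of the jar at src_path that omits the given entries.

    `unwanted` holds full entry names inside the archive. The source
    jar is left unchanged.
    """
    unwanted = set(unwanted)
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename not in unwanted:
                # Passing the ZipInfo keeps the entry name and timestamp.
                dst.writestr(item, src.read(item.filename))
```

A call might look like `copy_jar_without("icu4j.jar", "icu4j-trimmed.jar", ["com/ibm/icu/impl/data/icudt38b/he.res"])`, where both jar names and the entry path are examples only. Test the application against the trimmed jar before deploying it.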
Using additional resource files with ICU4J
Note: Resource file formats can change across releases of ICU4J!
The format of ICU4J resources is not part of the API. Clients who develop their own resources for use with ICU4J should be prepared to regenerate them when they move to new releases of ICU4J.
We are still developing ICU4J’s resource mechanism. Currently it is not possible to mix ICU’s new binary .res resources with traditional Java-style .class or .txt resources. We might allow for this in a future release, but since the resource data and format are not formally supported, you run the risk of incompatibilities with future releases of ICU4J.
Resource data in ICU4J is checked in to the repository as a jar file containing the resource binaries, icudata.jar. This means that inspecting the contents of these resources is difficult. They currently are compiled from ICU4C .txt file data. You can view the contents of the ICU4C text resource files to understand the contents of the ICU4J resources.
The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build directory when the ‘core’ target is built. Building the ‘resources’ target will force the resources to once again be extracted. Extraction will overwrite any corresponding resource files already in that directory.
Building ICU4J Resources from ICU4C
Requirements
- Compilers and tools required for building ICU4C.
- J2SE SDK version 5 or above
Procedure
- Download and build ICU4C on a Windows or Linux machine. For instructions on downloading and building ICU4C, see the ICU4C build documentation.
- Follow the remaining instructions in the ICU4J Readme.