diff --git a/.gitattributes b/.gitattributes
index 27993ee855e..a1ca97c2592 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -17,6 +17,7 @@ icu4c/docs/nameConv.html svneol=native#text/html
 icu4c/docs/number.html svneol=native#text/html
 icu4c/docs/supp_loc.html svneol=native#text/html
 icu4c/docs/tzClasses.html svneol=native#text/html
+icu4c/docs/udata.html svneol=native#text/html
 icu4c/docs/utilCL.html svneol=native#text/html
 icu4c/license.html svneol=native#text/html
 icu4c/readme.html svneol=native#text/html
diff --git a/icu4c/docs/udata.html b/icu4c/docs/udata.html
new file mode 100644
index 00000000000..09f49e136fa
--- /dev/null
+++ b/icu4c/docs/udata.html
@@ -0,0 +1,153 @@
+
+
+
+This is a raw draft.
+
+ICU data, when stored in files, is loaded from the file system
+directory that is returned by u_getDataDirectory().
+That directory is determined by trying the following sources in order:
+ 1. getenv("ICU_DATA") - the contents of the ICU_DATA environment variable
+ 2. the value "Path" of the registry key
+    HKEY_LOCAL_MACHINE "SOFTWARE\\IBM\\Unicode\\Data"
+ 3. relative to the location that icuuc.dll or libicu-uc.so or similar
+    is loaded from: if it is loaded from /some/path/lib/libicu-uc.so, then
+    the path will be /some/path/lib/../share/icu/1.3.1/
+    where "1.3.1" is an example of the version of the ICU library that
+    is trying to locate the data directory
+ 4. relative to the location where icuuc.dll or libicu-uc.so or similar
+    is found by searching the PATH or LIBPATH
+    as appropriate; the relative path is determined as above
+ 5. (system drive)/share/icu/1.3.1/,
+    where (system drive) is empty or a path to the system drive, like
+    "D:\" on Windows or OS/2
+
+When ICU data is loaded using the udata API functions,
+there is a defined sequence of file locations and entry point names that are
+used to locate the data. See the description in icu/source/common/udata.h for
+details. Note that the exact lookup depends on the implementation
+of this API and may differ by platform and by build configuration.
+See also icu/source/common/udata.c for implementation details.
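+
+The fallback order for the data directory can be pictured with a minimal
+sketch. It is not the ICU implementation: resolve_data_dir() and
+dir_from_library_path() are hypothetical helpers made up for illustration,
+the library path is passed in by the caller instead of being discovered,
+and the registry and PATH/LIBPATH steps are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of ICU: derive the data directory from the
 * full path of the loaded library, e.g.
 * "/some/path/lib/libicu-uc.so" -> "/some/path/lib/../share/icu/1.3.1/".
 * Returns 1 on success, 0 if the path has no directory part or does not fit. */
static int dir_from_library_path(const char *libPath, const char *version,
                                 char *out, size_t outSize) {
    const char *slash = strrchr(libPath, '/');
    int dirLen, n;
    if (slash == NULL) {
        return 0;
    }
    dirLen = (int)(slash - libPath);
    n = snprintf(out, outSize, "%.*s/../share/icu/%s/", dirLen, libPath, version);
    return n > 0 && (size_t)n < outSize;
}

/* Hypothetical sketch of the fallback order. A NULL or empty envValue means
 * that ICU_DATA is not set; a NULL libPath means the library location is
 * unknown. The result is written to out and returned. */
static const char *resolve_data_dir(const char *envValue, const char *libPath,
                                    const char *version,
                                    char *out, size_t outSize) {
    assert(out != NULL && outSize > 0 && version != NULL);
    /* First choice: the contents of the environment variable. */
    if (envValue != NULL && envValue[0] != '\0') {
        snprintf(out, outSize, "%s", envValue);
        return out;
    }
    /* Next: a path relative to where the library was loaded from. */
    if (libPath != NULL && dir_from_library_path(libPath, version, out, outSize)) {
        return out;
    }
    /* Last resort: (system drive)/share/icu/<version>/ with an empty drive. */
    snprintf(out, outSize, "/share/icu/%s/", version);
    return out;
}
```

+A caller would pass getenv("ICU_DATA") as the first argument. Each step is
+tried only if all earlier ones produced nothing, which is the point of the
+ordered list above.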
+
+Data files for ICU and for applications loading their data with ICU
+should have a memory-mappable format. This means that the data should be
+laid out in the file in an immediately useful way, so that the code that uses
+the data does not need to parse it or copy it to allocated memory and
+build additional structures (like Hashtables).
+Here are some points to consider:
+
+ * Use unewdata.h/.c
+   to write the data.
+ * Use explicitly sized integer types such as int32_t, not an ambiguous int.
+ * Avoid boolean (bool_t, bool) values
+   and use explicitly sized integer values instead,
+   because the size of the boolean type may vary.
+ * In char[] strings, write only "invariant"
+   characters - avoid anything that is not common among all ASCII-
+   or EBCDIC-based encodings. This avoids incompatibilities and
+   real, heavyweight codepage conversions.
+   Even on the same platform, the default encoding may not always
+   be the same one, and every "non-invariant" character
+   may change.
+
+Data files with formats as described above should be portable among
+machines with the same set of relevant properties:
+
+ * Byte order, if the file contains multi-byte integers such as uint16_t, int32_t.
+ * Character set family, if the file contains char[] strings.
+   Such strings should contain only "invariant characters", but
+   even then are portable only among machines with the same character set
+   family, i.e., they must share for example the ASCII or EBCDIC
+   graphic characters.
+ * Unicode code unit size, if the file contains UChar[] strings.
+   In principle, Unicode characters are stored using UTF-8, UTF-16, or UTF-32.
+   Thus, Unicode strings are directly compatible if the code unit size is the same.
+   ICU uses only UTF-16 at this point.
+
+All of these properties can be verified by checking the
+UDataInfo structure of the data, which is best done
+in a UDataMemoryIsAcceptable() function passed into
+the udata_openChoice() API function.
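+
+The kind of check such a function performs can be sketched without ICU
+headers. PlatformInfoSketch below is a hypothetical stand-in, not ICU's
+UDataInfo, and is_acceptable_sketch() does not have the signature of a
+real UDataMemoryIsAcceptable() callback; the sketch only models byte order,
+character set family, and code unit size. Treating every non-ASCII
+machine as EBCDIC is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for the character set family. */
enum { FAMILY_ASCII = 0, FAMILY_EBCDIC = 1 };

/* Hypothetical stand-in for the platform fields of a data header. */
typedef struct {
    uint8_t isBigEndian;   /* 1 if multi-byte integers are big-endian */
    uint8_t charsetFamily; /* FAMILY_ASCII or FAMILY_EBCDIC */
    uint8_t sizeofUChar;   /* size of one Unicode code unit in bytes */
} PlatformInfoSketch;

/* Describe the machine that is running this code. */
static PlatformInfoSketch host_info(void) {
    PlatformInfoSketch host;
    const uint16_t probe = 0x0102;
    uint8_t firstByte;
    /* The first byte in memory is 0x01 only on a big-endian machine. */
    memcpy(&firstByte, &probe, 1);
    host.isBigEndian = (uint8_t)(firstByte == 0x01);
    /* 'A' is 0x41 in ASCII-based encodings; assume EBCDIC otherwise. */
    host.charsetFamily = (uint8_t)(('A' == 0x41) ? FAMILY_ASCII : FAMILY_EBCDIC);
    /* UTF-16 code units, as the text says ICU uses. */
    host.sizeofUChar = 2;
    return host;
}

/* Return 1 if data described by fileInfo can be used as-is on this host. */
static int is_acceptable_sketch(const PlatformInfoSketch *fileInfo) {
    PlatformInfoSketch host = host_info();
    assert(fileInfo != NULL);
    return fileInfo->isBigEndian == host.isBigEndian
        && fileInfo->charsetFamily == host.charsetFamily
        && fileInfo->sizeofUChar == host.sizeofUChar;
}
```

+A real callback would read the corresponding values from the UDataInfo
+structure that the udata API hands to it, and would usually also check
+the data format and version before accepting the data.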
+If a data file is loaded on a machine with different relevant properties
+than the machine where the data file was generated, then the code that
+uses the data could adapt by detecting the differences and reformatting the
+data on the fly or in a copy in memory.
+This would improve portability of the data files but significantly
+decrease performance.
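+
+As an illustration of reformatting into a copy in memory, here is a
+minimal sketch that byte-swaps 32-bit values when the byte order of the
+file and of the host differ. swap32() and load_uint32_array() are
+hypothetical names, not ICU functions, and the two byte-order flags are
+assumed to come from the file header and from a host check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value. */
static uint32_t swap32(uint32_t v) {
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* Copy count values from the mapped file data (src) into a private
 * buffer (dst), swapping each value only if the byte orders differ. */
static void load_uint32_array(uint32_t *dst, const uint32_t *src, size_t count,
                              int fileIsBigEndian, int hostIsBigEndian) {
    size_t i;
    /* Normalize both flags to 0 or 1 before comparing them. */
    const int sameOrder = (!fileIsBigEndian == !hostIsBigEndian);
    assert(dst != NULL && src != NULL);
    for (i = 0; i < count; ++i) {
        dst[i] = sameOrder ? src[i] : swap32(src[i]);
    }
}
```

+The extra copy and the per-value work are the performance cost that the
+paragraph above refers to: data that matches the host can be used in
+place, data that does not match has to be touched value by value.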
+
+"Relevant" properties are those that affect the portability of the
+data in the particular file.
+
+For example, a flat (memory-mapped) binary data file
+that contains 16-bit and 32-bit integers and is
+created for a typical big-endian Unix machine can be used
+on an OS/390 system or any other big-endian machine.
+If the file also contains char[] strings,
+then it can be easily shared among all big-endian and
+ASCII-based machines, but not with (e.g.) an OS/390.
+OS/390 and OS/400 systems, however, could easily share such
+a data file with each other, because both are big-endian and EBCDIC-based.
+
+To make sure that the relevant platform properties of
+the data file and the loading machine match, the
+udata_openChoice() API function should be used with a
+UDataMemoryIsAcceptable() function that checks for
+these properties.
+
+Some data file loading mechanisms prevent using data files generated on
+a different platform to begin with, especially data files packaged as DLLs
+(shared libraries).
+ + +... Use icu/source/tools/toolutil/unewdata.h|.c
to write data files,
+can include a copyright statement or other comment...