ICU-20088 Move User Guide to Markdown

See #919
This commit is contained in:
Craig Cornelius 2020-08-05 18:00:48 +00:00 committed by Markus Scherer
parent 0b815fb8c3
commit ec45aaf1a2
82 changed files with 26506 additions and 11 deletions


@ -110,7 +110,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`.
(If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.)
Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../userguide/boundaryanalysis/break-rules.md) for an explanation of rule syntax and behavior.
The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include:


@ -0,0 +1,437 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Break Rules
## Introduction
ICU locates boundary positions within text by means of rules, which are a form
of regular expressions. The form of the rules is similar, but not identical,
to the boundary rules from the Unicode specifications
[[UAX-14](https://unicode.org/reports/tr14/),
[UAX-29](https://unicode.org/reports/tr29/)], and there is a reasonably close
correspondence between the two.
Taken as a set, the ICU rules describe how to move forward to the next boundary,
starting from a known boundary.
ICU includes rules for the standard boundary types (word, line, etc.).
Applications may also create customized break iterators from their own rules.
ICU's built-in rules are located at
[icu/icu4c/source/data/brkitr/rules/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
These can serve as examples when writing your own, and as a starting point for
customizations.
### Rule Tutorial
Rules most commonly describe a range of text that should remain together,
unbroken. For example, this rule

```
[\p{Letter}]+;
```

matches a run of one or more letters, and would cause them to remain unbroken.
The part within `[`brackets`]` follows normal ICU [UnicodeSet pattern
syntax](../strings/unicodeset.md).
The qualifier, '`+`' in this case, can be one of
| Qualifier | Meaning |
| --------- | ------------------------ |
| empty | Match exactly once |
| `?` | Match zero or one time |
| `+` | Match one or more times |
| `*` | Match zero or more times |
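Putting the qualifiers to use, here is a hypothetical rule (for illustration
only; it is not taken from the standard rule files):

```
# A run of letters, an optional single hyphen, then zero or more digits.
[\p{Letter}]+ [\-]? [0-9]*;
```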
#### Variables
A variable names a set or rule sub-expression. They are useful for documenting
what something represents, and for simplifying complex expressions by breaking
them up.
"Variable" is something of a misnomer; variables cannot be reassigned, but are
closer to named constant expressions.
They start with a '`$`', both in the definition and use.

```
# Variable Definition
$ASCIILetNum = [A-Za-z0-9];

# Variable Use
$ASCIILetNum+;
```

#### Comments and Semicolons
'`#`' begins a comment, which extends to the end of a line.
Comments may stand alone, or appear after another statement on a line.
All rule statements or expressions are terminated by semicolons.
#### Chained Matching
Most ICU rule sets use the concept of "chained matching". The idea is that a
complete match can be composed from multiple pieces, with each piece coming from
an individual rule of a rule set.
This idea is unique to ICU break rules; it is not a concept found in other
regular-expression-based matchers. Some of the Unicode standard break rules
would be difficult to implement without it.
Starting with an example,

```
!!chain;
$word_char = [\p{Letter}];
$word_joiner = [_-];

$word_char+;
$word_char $word_joiner $word_char;
```

These rules will match "`abc`", "`hello_world`", "`hi-there`", and
"`a-bunch_of-joiners-here`".
They will not match "`-abc`", "`multiple__joiners`", or "`tail-`".
A full match is composed of pieces or submatches, possibly from different rules,
with adjacent submatches linked by at least one overlapping character.
In the example below, matching "`hello_world`",

* '`1`' shows matches of the first rule, `$word_char+`
* '`2`' shows matches of the second rule, `$word_char $word_joiner $word_char`

```
hello_world
11111 11111
    222
```

There is an overlap of the matched regions, which causes the chaining mechanism
to join them into a single overall match.
The mechanism is a good match to, for example, [Unicode's word break
rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), where rules
WB5 through WB13 combine to piece together longer words from multiple short
segments.
`!!chain;` enables chaining in a rule set. It is disabled by default for
backward compatibility: very old versions of ICU did not support it, and it was
originally introduced as an option.
#### Parentheses and Alternation
Rule expressions can contain parentheses and '`|`' operators, representing
alternation or "or" operations. This follows conventional regular expression
behavior.
For example, the following would match a simplified identifier:

```
$Letter ($Letter | $Digit)*;
```

#### String and Character Literals
Similarly to common regular expressions, literal characters that do not have
other special meaning represent themselves. So the rule

```
Hello;
```

would match the literal input "`Hello`".
In practice, nearly all break rules are composed from `[`sets`]` based on Unicode
character properties; literal characters in rules are very rare.
To prevent random typos in rules from being treated as literals, use this
option:

```
!!quoted_literals_only;
```

With the option, the naked `Hello` becomes a rule syntax error, while a quoted
`"Hello"` still matches the literal input.
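A minimal sketch of the option in context (hypothetical rules, for illustration
only; not taken from the standard rule files):

```
!!quoted_literals_only;

'Hello';          # quoted literals still match the text Hello
[\p{Letter}]+;    # sets are unaffected
# A naked Hello; here would now be a syntax error.
```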
`!!quoted_literals_only` is strongly recommended for all rule sets. The random
typo problem is very real, and surprisingly hard to recognize and debug.
#### Explicit Break Rules
A rule containing a slash (`/`) will force a boundary when it matches, even when
other rules or chaining would otherwise lead to a longer match. Also called Hard
Break Rules, these have the form

```
pre-context / post-context;
```

where the pre and post-context look like normal break rules. Both the pre and
post context are required, and must not allow a zero-length match. There should
be no overlap between characters that end a match of the pre-context and those
that begin a match of the post-context.
Chaining into a hard break rule operates normally. There is no chaining out of a
hard break rule; when the post-context matches, a break is forced immediately.
Note: future versions of ICU may loosen the restrictions on explicit break
rules. The behavior of rules with missing or overlapping contexts is subject to
change.
#### Chaining Control
Chaining into a rule can be disallowed by beginning that rule with a '`^`'. Rules
so marked can begin a match after a preceding boundary or at the start of text,
but cannot extend a match via chaining from another rule.
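As a hypothetical illustration (not taken from the standard rule files):

```
$Letters = [\p{Letter}];
$Digits = [0-9];

$Letters+;
# The leading ^ means this rule can begin a match only at a preceding
# boundary or at the start of text; $Letters+ cannot chain into it.
^$Digits+;
```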
~~The !!LBCMNoChain; statement modifies chaining behavior by preventing chaining
from one rule to another from occurring on any character whose Line Break
property is Combining Mark. This option is subject to change or removal, and
should not be used in general. Within ICU, it is used only with the line break
rules. We hope to replace it with something more general.~~
> :point_right: **Note**: `!!LBCMNoChain` is deprecated, and will be removed completely from a future
version of ICU.
## Rule Status Values
Break rules can be tagged with a number, which is called the *rule status*.
After a boundary has been located, the status number of the specific rule that
determined the boundary position is available to the application through the
function `getRuleStatus()`.
For the predefined word boundary rules, status values are available to
distinguish between boundaries associated with words, numbers, and those around
spaces or punctuation. Similarly for line break boundaries, status values
distinguish between mandatory line endings (new line characters) and break
opportunities that are appropriate points for line wrapping. Refer to the ICU
API documentation for the C header file `ubrk.h` or to Java class
`RuleBasedBreakIterator` for a complete list of the predefined boundary
classifications.
When creating custom sets of break rules, integer status values can be
associated with boundary rules in whatever way will be convenient for the
application. There is no need to remain restricted to the predefined values and
classifications from the standard rules.
It is possible for a set of break rules to contain more than a single rule that
produces some boundary in an input text. In this event, `getRuleStatus()` will
return the numerically largest status value from the matching rules, and the
alternate function `getRuleStatusVec()` will return a vector of the values from
all of the matching rules.
In the source form of the break rules, status numbers appear at the end of a
rule, enclosed in `{`braces`}`.
Hard break rules that also have a status value place the status at the end, for
example:

```
pre-context / post-context {1234};
```

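As a hypothetical sketch (the status numbers below are arbitrary, chosen by the
rule author):

```
$Letters = [\p{Letter}];
$Digits = [0-9];

$Letters+ {200};    # a boundary ending a run of letters reports status 200
$Digits+ {100};     # a boundary ending a run of digits reports status 100
```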
### Word Dictionaries
For some languages that don't normally use spaces between words, break iterators
are able to supplement the rules with dictionary based breaking. Some languages,
Thai or Lao, for example, use a dictionary for both word and line breaking.
Others, such as Japanese, use a dictionary for word breaking, but not for line
breaking.
To enable dictionary use,
1. The break rules must select, as unbroken chunks, ranges of text to be passed
off to the word dictionary for further subdivision.
2. The break rules must define a character class named `$dictionary` that
contains the characters (letters) to be handled by the dictionary.
The dictionary implementation, on receiving a range of text, will map it to a
specific dictionary based on script, and then delegate to that dictionary for
subdividing the range into words.
See, for example, this snippet from the [line break
rules](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/brkitr/rules/line.txt):

```
#   Dictionary character set, for triggering language-based break engines. Currently
#   limited to LineBreak=Complex_Context (SA).
$dictionary = [$SA];
```

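The two requirements above can be sketched minimally as follows (hypothetical
rules, for illustration only; the real rule files are considerably more
involved):

```
# 1. Define the $dictionary set: the characters to be handled
#    by a dictionary-based break engine.
$dictionary = [\p{Line_Break=Complex_Context}];

# 2. Keep runs of dictionary characters together as single chunks,
#    so the dictionary can subdivide them into words.
$dictionary+;
```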
## Rule Options
| Option | Description |
| --------------- | ----------- |
| `!!chain` | Enable rule chaining. Default is no chaining. |
| `!!forward` | The rules that follow are for forward iteration. Forward rules are now the only type of rules needed or used. |
### Deprecated Rule Options
| Deprecated Option | Description |
| --------------- | ----------- |
| ~~`!!reverse`~~ | ~~*[deprecated]* The rules that follow are for reverse iteration. No longer needed; any rules in a Reverse rule section are ignored.~~ |
| ~~`!!safe_forward`~~ | ~~*[deprecated]* The rules that follow are for safe forward iteration. No longer needed; any rules in such a section are ignored.~~ |
| ~~`!!safe_reverse`~~ | ~~*[deprecated]* The rules that follow are for safe reverse iteration. No longer needed; any rules in such a section are ignored.~~ |
| ~~`!!LBCMNoChain`~~ | ~~*[deprecated]* Disable chaining when the overlap character matches `\p{Line_Break=Combining_Mark}`~~ |
## Rule Syntax
Here is the syntax for the boundary rules. (The EBNF Syntax is given below.)
| Rule Name | Rule Values | Notes |
| ---------- | ----------- | ----- |
| rules | statement+ | |
| statement | assignment \| rule \| control | |
| control | (`!!forward` \| `!!reverse` \| `!!safe_forward` \| `!!safe_reverse` \| `!!chain`) `;` | |
| assignment | variable `=` expr `;` | 5 |
| rule | `^`? expr (`{`number`}`)? `;` | 8,9 |
| number | [0-9]+ | 1 |
| break-point | `/` | 10 |
| expr | expr-q \| expr `\|` expr \| expr expr | 3 |
| expr-q | term \| term `*` \| term `?` \| term `+` |
| term | rule-char \| unicode-set \| variable \| quoted-sequence \| `(` expr `)` \| break-point |
| rule-special | *any printing ascii character except letters or numbers* \| white-space |
| rule-char | *any non-escaped character that is not rule-special* \| `.` \| *any escaped character except* `\p` *or* `\P` |
| variable | `$` name-start-char name-char* | 7 |
| name-start-char | `_` \| \p{L} |
| name-char | name-start-char \| \\p{N} |
| quoted-sequence | `'` *(any char except single quote or line terminator or two adjacent single quotes)*+ `'` |
| escaped-char | *See “Character Quoting and Escaping” in the [UnicodeSet](../strings/unicodeset.md) chapter* |
| unicode-set | See [UnicodeSet](../strings/unicodeset.md) | 4 |
| comment | unescaped `#` (*any char except new-line*)\* new-line | 2 |
| s | unescaped \p{Z}, tab, LF, FF, CR, NEL | 6 |
| new-line | LF, CR, NEL | 2 |
### Rule Syntax Notes
1. The number associated with a rule that actually determined a break position
is available to the application after the break has been returned. These
numbers are *not* Perl regular expression repeat counts.
2. Comments are recognized and removed separately from otherwise parsing the
rules. They may appear wherever a space would be allowed (and ignored.)
3. The implicit concatenation of adjacent terms has higher precedence than the
`|` operation. "`ab|cd`" is interpreted as "`(ab)|(cd)`", not as "`a(b|c)d`" or
"`(((ab)|c)d)`"
4. The syntax for [unicode-set](../strings/unicodeset.md) is defined (and parsed) by the `UnicodeSet` class.
It is not repeated here.
5. For `$`variables that will be referenced from inside of a `UnicodeSet`, the
definition must consist only of a Unicode Set. For example, when variable `$a`
is used in a rule like `[$a$b$c]`, then this definition of `$a` is OK:
`$a=[:Lu:];`, while this one, `$a=abcd;`, would cause an error when `$a` was
used.
6. Spaces are allowed nearly anywhere, and are not significant unless escaped.
Exceptions to this are noted.
7. No spaces are allowed within a variable name. The variable name `$dictionary`
is special. If defined, it must be a Unicode Set, the characters of which
will trigger the use of word dictionary based boundaries.
8. A leading `^` on a rule prevents chaining into that rule. It can only match
immediately after a preceding boundary, or at the start of text.
9. `{`nnn`}` appearing at the end of a rule is a Rule Status number, not a repeat
count as it would be with conventional regular expression syntax.
10. A `/` in a rule specifies a hard break point. If the rule matches, a
boundary will be forced at the position of the `/` within the match.
### EBNF Syntax used for the RBBI rules syntax description
| syntax | description |
| -- | ------------------------- |
| a? | zero or one instance of a |
| a+ | one or more instances of a |
| a* | zero or more instances of a |
| a \| b | either a or b, but not both |
| `a` "`a`" | the literal string between the quotes or displayed as `monospace` |
## Planned Changes and Removed or Deprecated Rule Features
1. Reverse rules could formerly be indicated by beginning them with an
exclamation `!`. This syntax is deprecated, and will be removed from a
future version of ICU.
2. `!!LBCMNoChain` was a global option that specified that characters with the
line break property of "Combining Character" would not participate in rule
chaining. This option was always considered internal, is deprecated and will
be removed from a future version of ICU.
3. Naked rule characters. Plain text, in the context of a rule, is treated as
literal text to be matched, much like normal regular expressions. This turns
out to be very error prone, has been the source of bugs in released versions
of ICU, and is not useful in implementing normal text boundary rules. A
future version will reject literal text that is not escaped.
4. Exact reverse rules and safe forward rules: planned changes to the break
engine implementation will remove the need for exact reverse rules and safe
forward rules.
5. `{bof}` and `{eof}`, appearing within `[`sets`]`, match the beginning or ending of
the input text, respectively. This is an internal (not documented) feature
that will probably be removed in a future version of ICU. They are currently
used by the standard rules for word, line and sentence breaking. An
alternative is probably needed. The existing implementation is incomplete.
## Additional Sample Code
**C/C++**: See
[icu/source/samples/break/](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/break/)
in the ICU source distribution for code samples showing the use of ICU boundary
analysis.
## Details about Dictionary-Based Break Iteration
> :point_right: **Note**: This section dates from August 2012.
> It is probably out of date; for example, `brkfiles.mk` does not exist anymore.
Certain Unicode characters have a "dictionary" bit set in the break iteration
rules, and text made up of these characters cannot be handled by the rules-based
break iteration code for lines or words. Rather, they must be handled by a
dictionary-based approach. The ICU approach is as follows:
Once the Dictionary bit is detected, the set of characters with that bit is
handed off to "dictionary code." This code then inspects the characters more
carefully, and splits them by script (Thai, Khmer, Chinese, Japanese, Korean).
If text in this script has not yet been handled, it loads the appropriate
dictionary from disk, and initializes a specialized "BreakEngine" class for that
script.
There are three such specialized classes: Thai, Khmer and CJK.
Thai and Khmer use very similar approaches. They look through a dictionary that
is not weighted by word frequency, and attempt to find the longest total "match"
that can be made in the text.
For Chinese and Japanese text, on the other hand, we have a unified dictionary
(due to the fact that both use some of the same characters, it is difficult to
distinguish them) that contains information about word frequencies. The
algorithm to match text then uses dynamic programming to find the set of breaks
it considers "most likely" based on the frequency of the words created by the
breaks. This algorithm could also be used for Thai and Khmer, but we do not have
sufficient data to do so. This algorithm could also be used for Korean, but once
again we do not have the data to do so.
Code of interest is in `source/common/dictbe.{h, cpp}`, `source/common/brkeng.{h,
cpp}`, `source/common/dictionarydata.{h, cpp}`. The dictionaries use the `BytesTrie`
and `UCharsTrie` as their data store. The binary form of these dictionaries is
produced by the `gendict` tool, which has source in `source/tools/gendict`.
In order to add new dictionary implementations, a few changes have to be made.
First, you should create a new subclass of `DictionaryBreakEngine` or
`LanguageBreakEngine` in `dictbe.cpp` that implements your algorithm. Then, in
`brkeng.cpp`, you should add logic to create this dictionary break engine if we
encounter the appropriate script - which should only be 3 or so lines of code at
the most. Lastly, you should add the correct data file. If your data is to be
represented as a `.dict` file - as is recommended, and in fact required if you
don't want to make substantial code changes to the engine loader - you need to
simply add a file in the correct format for gendict to the `source/data/brkitr`
directory, and add its name to the list of `BRK_DICT_SOURCE` in
`source/data/brkitr/brkfiles.mk`. This will cause your dictionary (say, `foo.txt`)
to be added as a `UCharsTrie` dictionary with the name foo.dict. If you want your
dictionary to be a `BytesTrie` dictionary, you will need to specify a transform
within the `Makefile`. To do so, find the part of `source/data/Makefile.in` and
`source/data/makedata.mak` that deals with `thaidict.dict` and `khmerdict.dict` and
add a similar set of lines for your script. Lastly, in
`source/data/brkitr/root.txt`, add a line to the dictionaries `{}` section of the
form:

```
shortscriptname:process(dependency){"dictionaryname.dict"}
```

For example, for Katakana:

```
Kata:process(dependency){"cjdict.dict"}
```

Make sure to add appropriate tests for the new implementation.


@ -0,0 +1,529 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Boundary Analysis
## Overview of Text Boundary Analysis
Text boundary analysis is the process of locating linguistic boundaries while
formatting and handling text. Examples of this process include:
1. Locating appropriate points to word-wrap text to fit within specific margins
while displaying or printing.
2. Locating the beginning of a word that the user has selected.
3. Counting characters, words, sentences, or paragraphs.
4. Determining how far to move the text cursor when the user hits an arrow key
(Some characters require more than one position in the text store and some
characters in the text store do not display at all).
5. Making a list of the unique words in a document.
6. Figuring out if a given range of text contains only whole words.
7. Capitalizing the first letter of each word.
8. Locating a particular unit of the text (For example, finding the third word
in the document).
The `BreakIterator` classes were designed to support these kinds of tasks. The
BreakIterator objects maintain a location between two characters in the text.
This location will always be a text boundary. Clients can move the location
forward to the next boundary or backward to the previous boundary. Clients can
also check if a particular location within a source text is on a boundary or
find the boundary which is before or after a particular location.
## Four Types of BreakIterator
ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:
1. Character Boundary
2. Word Boundary
3. Line-break Boundary
4. Sentence Boundary
Each type of boundary is found in accordance with the rules specified by Unicode
Standard Annex #29, *Unicode Text Segmentation*
(<https://unicode.org/reports/tr29/>), or Unicode Standard Annex #14, *Unicode
Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>).
### Character Boundary
The character-boundary iterator locates the boundaries according to the rules
defined in <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>.
These boundaries try to match what a user would think of as a "character"—a
basic unit of a writing system for a language—which may be more than just a
single Unicode code point.
The letter `Ä`, for example, can be represented in Unicode either with a single
code-point value or with two code-point values (one representing the `A` and
another representing the umlaut `¨`). The character-boundary iterator will treat
either representation as a single character.
End-user characters, as described above, are also called grapheme clusters, in
an attempt to limit the confusion caused by multiple meanings for the word
"character".
### Word Boundary
The word-boundary iterator locates the boundaries of words, for purposes such as
double click selection or "Find whole words" operations.
Word boundaries are identified according to the rules in
<https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word
dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for
locating word breaks take into account the alphabets and conventions used by
different languages.
Here's an example of a sentence, showing the boundary locations that will be
identified by a word break iterator:
> :point_right: **Note**: TODO: An example needs to be added here.
### Line-break Boundary
The line-break iterator locates positions that would be appropriate points to
wrap lines when displaying the text. The boundary rules are defined here:
<https://www.unicode.org/reports/tr14/>
This example shows the differences in the break locations produced by word and
line break iterators:
> :point_right: **Note**: TODO: An example needs to be added here.
### Sentence Boundary
A sentence-break iterator locates sentence boundaries according to the rules
defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries>
## Dictionary-Based BreakIterator
Some languages are written without spaces, and word and line breaking requires
more than rules over character sequences. ICU provides dictionary support for
word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.
Use of the dictionaries is automatic when text in one of the dictionary
languages is encountered. There is no separate API, and no extra programming
steps required by applications making use of the dictionaries.
## Usage
To locate boundaries in a document, create a BreakIterator using the
`BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()`
function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`,
depending on the type of iterator wanted. These factory methods also take a
parameter that specifies the locale for the language of the text to be processed.
When creating a `BreakIterator`, a locale is also specified, and the behavior of
the BreakIterator obtained may be specialized in some way for that locale. For
most locales the default break iterator behavior is used.
Applications also may register customized BreakIterators for use in specific
locales. Once such a break iterator has been registered, any requests for break
iterators for that locale will return copies of the registered break iterator.
ICU may cache service instances. Therefore, registration should be done during
startup, before opening services by locale ID.
In the general-usage-model, applications will use the following basic steps to
analyze a piece of text for boundaries:
1. Create a `BreakIterator` with the desired behavior
2. Use the `setText()` method to set the iterator to analyze a particular piece
of text.
3. Locate the desired boundaries using the appropriate combination of `first()`,
`last()`, `next()`, `previous()`, `preceding()`, and `following()` methods.
The `setText()` method can be called more than once, allowing reuse of a
BreakIterator on new pieces of text. Because the creation of a `BreakIterator` can
be relatively time-consuming, it makes good sense to reuse them when practical.
The iterator always points to a boundary position between two characters. The
numerical value of the position, as returned by `current()` is the zero-based
index of the character following the boundary. Thus a position of zero
represents a boundary preceding the first character of the text, and a position
of one represents a boundary between the first and second characters.
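As a concrete illustration, iterating over the text "hello world" with a word
break iterator (assuming default behavior), the reported boundary positions are
0, 5, 6, and 11:

```
h e l l o _ w o r l d      (_ marks the space character)
0         5 6         11
```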
The `first()` and `last()` methods reset the iterator's current position to the
beginning or end of the text (the beginning and the end are always considered
boundaries). The `next()` and `previous()` methods advance the iterator one boundary
forward or backward from the current position. If `next()` or `previous()` runs
off the beginning or end of the text, it returns `DONE`. The `current()`
method returns the current position.
The `following()` and `preceding()` methods are used for random access, to move the
iterator to an arbitrary position within the text. Since a BreakIterator always
points to a boundary position, the `following()` and `preceding()` methods will
never set the iterator to point to the position specified by the caller (even if
it is, in fact, a boundary position). `BreakIterator` will, however, set the
iterator to the nearest boundary position before or after the specified
position.
`isBoundary()` returns true if the specified position is a boundary.
### Thread Safety
`BreakIterator`s are not thread safe. This is inherent in their design: break
iterators are stateful, holding a reference to and position in the text, meaning
that a single instance cannot operate in parallel on multiple texts.
For concurrent break iteration, each thread must use its own break iterator.
These can be obtained by creating separate break iterators of the desired type,
or by initially creating a master break iterator and then creating a clone for
each thread.
### Line Breaking Strictness, a CSS Property
CSS has the concept of "[Line Breaking
Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
property specifies the strictness of line-breaking rules applied within an
element: especially how wrapping interacts with punctuation and symbols. ICU
line break iterators can choose a strictness using locale tags:
| Locale | Behavior |
| ------------ | ----------- |
| `en@lb=strict` <br/> `ja@lb=strict` | Breaks text using the most stringent set of line-breaking rules |
| `en@lb=normal` <br/> `ja@lb=normal` | Breaks text using the most common set of line-breaking rules. |
| `en@lb=loose` <br/> `ja@lb=loose` | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. |
### Sentence Break Filters
Sentence breaking can return false positives, i.e. an indication that a sentence
ends at an incorrect position, in the presence of abbreviations. For example,
consider the sentence

> In the meantime Mr. Weston arrived with his small ship.

The default sentence break rules show a false boundary following "Mr."

ICU includes lists of common abbreviations that can be used to filter out these
false sentence boundaries. Filtering is enabled by the presence of the `ss`
locale tag when creating the break iterator.
| Locale | Behavior |
| ---------------- | ------------------------------------------------------- |
| `en` | no filtering |
| `en@ss=standard` | Filter based on common English language abbreviations. |
| `es@ss=standard` | Filter with common Spanish abbreviations. |
Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
French, Italian and Portuguese.
## Accuracy
ICU's break iterators are based on the default boundary rules described in the
Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
[29](https://www.unicode.org/reports/tr29/). These are relatively
simple boundary rules that can be implemented efficiently, and are sufficient
for many purposes and languages. However, some languages and applications will
require a more sophisticated linguistic analysis of the text in order to find
boundaries with good accuracy. Such an analysis is not directly available from
ICU at this time.
Break Iterators based on custom, user-supplied boundary rules can be created and
used by applications with requirements that are not met by the standard default
boundary rules.
## BreakIterator Boundary Analysis Examples
### Print out all the word-boundary positions in a UnicodeString
**In C++:**
```c++
void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}
```
**In C:**
```c
void listWordBoundaries(const UChar* s, int32_t len) {
    UBreakIterator* bi;
    int32_t p;
    UErrorCode err = U_ZERO_ERROR;
    bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
    if (U_FAILURE(err)) return;
    p = ubrk_first(bi);
    while (p != UBRK_DONE) {
        printf("Boundary at position %d\n", p);
        p = ubrk_next(bi);
    }
    ubrk_close(bi);
}
```
### Get the boundaries of the word that contains a double-click position
**In C++:**
```c++
void wordContaining(BreakIterator& wordBrk,
                    int32_t idx,
                    const UnicodeString& s,
                    int32_t& start,
                    int32_t& end) {
    // This function assumes that an appropriate BreakIterator is
    // stored in an object or a global variable somewhere; when
    // possible, programmers should avoid having the create() and
    // delete calls in a function of this nature.
    if (s.isEmpty())
        return;
    wordBrk.setText(s);
    start = wordBrk.preceding(idx + 1);
    end = wordBrk.next();
    // NOTE: for this and similar operations, use preceding() and next()
    // as shown here, not following() and previous(). preceding() is
    // faster than following() and next() is faster than previous().
    // NOTE: By using preceding(idx + 1) above, we're adopting the convention
    // that if the double-click comes right on top of a word boundary, it
    // selects the word that _begins_ on that boundary (preceding(idx) would
    // instead select the word that _ends_ on that boundary).
}
```
**In C:**
```c
void wordContaining(UBreakIterator* wordBrk,
int32_t idx,
const UChar* s,
int32_t sLen,
int32_t* start,
int32_t* end,
UErrorCode* err) {
if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
*err = U_ILLEGAL_ARGUMENT_ERROR;
return;
}
ubrk_setText(wordBrk, s, sLen, err);
if (U_SUCCESS(*err)) {
*start = ubrk_preceding(wordBrk, idx + 1);
*end = ubrk_next(wordBrk);
}
}
```
### Check for Whole Words
Use the following to check if a range of text is a "whole word":
**In C++:**
```c++
UBool isWholeWord(BreakIterator& wordBrk,
const UnicodeString& s,
int32_t start,
int32_t end) {
if (s.isEmpty())
return FALSE;
wordBrk.setText(s);
if (!wordBrk.isBoundary(start))
return FALSE;
return wordBrk.isBoundary(end);
}
```
**In C:**
```c
UBool isWholeWord(UBreakIterator* wordBrk,
const UChar* s,
int32_t sLen,
int32_t start,
int32_t end,
UErrorCode* err) {
UBool result = FALSE;
if (wordBrk == NULL || s == NULL) {
*err = U_ILLEGAL_ARGUMENT_ERROR;
return FALSE;
}
ubrk_setText(wordBrk, s, sLen, err);
if (U_SUCCESS(*err)) {
result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end);
}
return result;
}
```
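For completeness, the same whole-word check can be sketched in Java. The sketch below uses the JDK's `java.text.BreakIterator`, which shares the `setText()`/`isBoundary()` shape with ICU4J's `com.ibm.icu.text.BreakIterator`; the class name `WholeWord` is illustrative, not an ICU API:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WholeWord {
    // Returns true if [start, end) lies exactly on word boundaries.
    public static boolean isWholeWord(String s, int start, int end) {
        if (s.isEmpty()) {
            return false;
        }
        BreakIterator wordBrk = BreakIterator.getWordInstance(Locale.US);
        wordBrk.setText(s);
        return wordBrk.isBoundary(start) && wordBrk.isBoundary(end);
    }

    public static void main(String[] args) {
        System.out.println(isWholeWord("hello world", 0, 5)); // true
        System.out.println(isWholeWord("hello world", 0, 3)); // false
    }
}
```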
### Count the words in a document (C++ only)
```c++
int32_t countWords(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) {
    bi.setText(s);
    int32_t count = 0;
    while (start != BreakIterator::DONE) {
        int32_t breakType = bi.getRuleStatus();
if (breakType != UBRK_WORD_NONE) {
// Exclude spaces, punctuation, and the like.
// A status value UBRK_WORD_NONE indicates that the boundary does
// not start a word or number.
//
++count;
}
start = bi.next();
}
return count;
}
```
The function `getRuleStatus()` returns an enum giving additional information on
the text preceding the last break position found. Using this value, it is
possible to distinguish between numbers, words, words containing kana
characters, words containing ideographic characters, and non-word characters,
such as spaces or punctuation. The sample uses the break status value to filter
out, and not count, boundaries associated with non-word characters.
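ICU4J exposes the same `getRuleStatus()` mechanism on `com.ibm.icu.text.BreakIterator`. The plain JDK `BreakIterator` does not, so a self-contained Java sketch has to approximate the filter by checking whether a segment contains a letter or digit; this is an approximation of, not a substitute for, the rule status values described above:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordCount {
    // Counts word segments, skipping spaces and punctuation by checking
    // each segment for at least one letter or digit (an approximation of
    // ICU's UBRK_WORD_NONE filtering).
    public static int countWords(String s) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
        bi.setText(s);
        int count = 0;
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
                start = end, end = bi.next()) {
            for (int i = start; i < end; i++) {
                if (Character.isLetterOrDigit(s.charAt(i))) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world! 42")); // 3
    }
}
```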
### Word-wrap a document (C++ only)
The sample function below wraps a paragraph so that each line is less than or
equal to 72 characters. The function fills in an array passed in by the caller
with the starting offsets of
each line in the document. Also, it fills in a second array to track how many
trailing white space characters there are in the line. For simplicity, it is
assumed that an outside process has already broken the document into paragraphs.
For example, it is assumed that every string the function is passed has a single
newline at the end only.
```c++
int32_t wrapParagraph(const UnicodeString& s,
const Locale& locale,
int32_t lineStarts[],
int32_t trailingwhitespace[],
int32_t maxLines,
UErrorCode &status) {
int32_t numLines = 0;
int32_t p, q;
const int32_t MAX_CHARS_PER_LINE = 72;
UChar c;
BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
if (U_FAILURE(status)) {
delete bi;
return 0;
}
bi->setText(s);
p = 0;
while (p < s.length()) {
// jump ahead in the paragraph by the maximum number of
// characters that will fit
q = p + MAX_CHARS_PER_LINE;
// if this puts us on a white space character, a control character
// (which includes newlines), or a non-spacing mark, seek forward
// and stop on the next character that is not any of these things
// since none of these characters will be visible at the end of a
// line, we can ignore them for the purposes of figuring out how
// many characters will fit on the line)
if (q < s.length()) {
c = s[q];
while (q < s.length()
&& (u_isspace(c)
|| u_charType(c) == U_CONTROL_CHAR
|| u_charType(c) == U_NON_SPACING_MARK
)) {
++q;
c = s[q];
}
}
// then locate the last legal line-break decision at or before
// the current position ("at or before" is what causes the "+ 1")
q = bi->preceding(q + 1);
// if this causes us to wind back to where we started, then the
// line has no legal line-break positions. Break the line at
// the maximum number of characters
if (q == p) {
p += MAX_CHARS_PER_LINE;
lineStarts[numLines] = p;
trailingwhitespace[numLines] = 0;
++numLines;
}
// otherwise, we got a good line-break position. Record the start of this
// line (p) and then seek back from the end of this line (q) until you find
// a non-white space character (same criteria as above) and
// record the number of white space characters at the end of the
// line in the other results array
else {
lineStarts[numLines] = p;
int32_t nextLineStart = q;
for (q--; q > p; q--) {
c = s[q];
if (!(u_isspace(c)
|| u_charType(c) == U_CONTROL_CHAR
|| u_charType(c) == U_NON_SPACING_MARK)) {
break;
}
}
trailingwhitespace[numLines] = nextLineStart - q -1;
p = nextLineStart;
++numLines;
}
if (numLines >= maxLines) {
break;
}
}
delete bi;
return numLines;
}
```
Most text editors would not break lines based on the number of characters on a
line. Even with a monospaced font, there are still many Unicode characters that
are not displayed and therefore should be filtered out of the calculation. With
a proportional font, character widths are added up until a maximum line width is
exceeded or an end of the paragraph marker is reached.
Trailing white space does not need to be counted in the line-width measurement
because it does not need to be displayed at the end of a line. The sample code
above returns an array of trailing white space values because an external
rendering process needs to be able to measure the length of the line (without
the trailing white space) to justify the lines. For example, if the text is
right-justified, the invisible white space would be drawn outside the margin.
The line would actually end with the last visible character.
In either case, the basic principle is to jump ahead in the text to the location
where the line would break (without taking word breaks into account). Then, move
backwards using the preceding() method to find the last legal breaking position
before that location. Iterating straight through the text with next() method
will generally be slower.
## ICU BreakIterator Data Files
The source code for the ICU break rules for the standard boundary types is
located in the directory
[icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
These files will be built, and the corresponding binary state tables
incorporated into ICU's data, by the standard ICU4C build process.
The dictionary word lists used by word break, and for some languages, line break
are in
[icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries).
The same data is used by both ICU4C and ICU4J. In the normal ICU build process,
the source data is processed into a binary form using ICU4C, and the resulting
binary tables are incorporated into ICU4J.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Collation API Details
This section describes some of the usage conventions for the ICU Collation
Service API.
## Collator Instantiation
To use the Collation Service, you must instantiate a `Collator`. The
Collator defines the properties and behavior of the sort ordering. The Collator
can be repeatedly referenced until all collation activities have been performed.
The Collator can then be closed and removed.
### Instantiating the Predefined Collators
ICU comes with a large set of predefined collators that are suited for
specific locales. Most ICU locales have a predefined collator. In the worst
case, the CLDR default set of rules, which is mostly equivalent to the UCA
default ordering (DUCET), is used.
The default sort order itself is designed to work well for many languages.
(For example, there are no tailorings for the standard sort orders for
English, German, French, etc.)
To instantiate a predefined collator, use the APIs `ucol_open`, `createInstance`
and `getInstance` for C, C++ and Java respectively. The C API takes a locale ID
(or language tag) string argument, C++ takes a Locale object, and Java takes a
Locale or ULocale.
For some languages, multiple collation types are available; for example,
"de-u-co-phonebk" / "de@collation=phonebook". They can be enumerated via
`Collator::getKeywordValuesForLocale()`. See also the list of available collation
tailorings in the online [ICU Collation
Demo](http://demo.icu-project.org/icu-bin/collation.html).
Starting with ICU 54, collation attributes can be specified via locale keywords
as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in
language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive.
See the [LDML Collation spec, Collation
Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
and the [data
file](https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml) listing
the valid collation keywords and their values. (The deprecated attributes
kh/colHiraganaQuaternary and vt/variableTop are not supported.)
For the [old locale extension
syntax](http://www.unicode.org/reports/tr35/tr35.html#Old_Locale_Extension_Syntax),
the data file's alias names are used (first alias, if defined, otherwise the
name): "de@collation=phonebook;colCaseLevel=yes;kv=space"
For the language tag syntax, the non-alias names are used, and "true" values can
be omitted: "de-u-co-phonebk-kc-kv-space"
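As a sketch, a keyword-bearing language tag can be turned into a locale and passed to the collator factory. This example uses the JDK's `java.text.Collator` so that it is self-contained; with ICU4J the same pattern applies to `com.ibm.icu.text.Collator`, which is the implementation that actually honors the collation keywords. The helper name `forTag` is illustrative:

```java
import java.text.Collator;
import java.util.Locale;

public class KeywordCollator {
    // Builds a locale from a BCP 47 tag carrying collation keywords,
    // e.g. "de-u-co-phonebk", and requests a collator for it.
    public static Collator forTag(String languageTag) {
        Locale loc = Locale.forLanguageTag(languageTag);
        return Collator.getInstance(loc);
    }

    public static void main(String[] args) {
        Collator coll = forTag("de-u-co-phonebk");
        System.out.println(coll != null); // a collator is always returned
    }
}
```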
This example demonstrates the instantiation of a collator.
**C:**
```C
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
if(U_SUCCESS(status)) {
/* close the collator*/
ucol_close(coll);
}
```
**C++:**
```C++
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
if(U_SUCCESS(status)) {
//close the collator
delete coll;
}
```
**Java:**
```Java
Collator col = null;
try {
col = Collator.getInstance(Locale.US);
} catch (Exception e) {
System.err.println("English collation creation failed.");
e.printStackTrace();
}
```
### Instantiating Collators Using Custom Rules
If the ICU predefined collators are not appropriate for your intended usage, you
can
define your own set of rules and instantiate a collator that uses them. For more
details, please see [the section on collation
customization](customization/index.md).
This example demonstrates the instantiation of a collator.
**C:**
```C
UErrorCode status = U_ZERO_ERROR;
U_STRING_DECL(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
UCollator *coll;
U_STRING_INIT(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
coll = ucol_openRules(rules, -1, UCOL_ON, UCOL_DEFAULT_STRENGTH, NULL, &status);
if(U_SUCCESS(status)) {
/* close the collator*/
ucol_close(coll);
}
```
**C++:**
```C++
UErrorCode status = U_ZERO_ERROR;
UnicodeString rules(u"&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E");
Collator *coll = new RuleBasedCollator(rules, status);
if(U_SUCCESS(status)) {
//close the collator
delete coll;
}
```
**Java:**
```Java
RuleBasedCollator coll = null;
String ruleset = "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E";
try {
coll = new RuleBasedCollator(ruleset);
} catch (Exception e) {
System.err.println("Customized collation creation failed.");
e.printStackTrace();
}
```
## Compare
Two of the most used functions in the ICU collation API, `ucol_strcoll` and `ucol_getSortKey`, have their counterparts in both Win32 and ANSI APIs:
ICU C | ICU C++ | ICU Java | ANSI/POSIX | WIN32
----------------- | --------------------------- | -------------------------- | ---------- | -----
`ucol_strcoll` | `Collator::compare` | `Collator.compare` | `strcoll` | `CompareString`
`ucol_getSortKey` | `Collator::getSortKey` | `Collator.getCollationKey` | `strxfrm` | `LCMapString`
&nbsp; | `Collator::getCollationKey` | &nbsp; | &nbsp; |
For more sophisticated usage, such as user-controlled language-sensitive text
searching, an iterating interface to collation is provided. Please refer to the
section below on `CollationElementIterator` for more details.
The `ucol_compare` function compares one pair of strings at a time. Comparing two
strings is much faster than calculating sort keys for both of them. However, if
comparisons should be done repeatedly on a very large number of strings, generating
and storing sort keys can improve performance. In all other cases (such as quick
sort or bubble sort of a
moderately-sized list of strings), comparing strings works very well.
The C API used for comparing two strings is `ucol_strcoll`. It requires two
`UChar *` strings and their lengths as parameters, as well as a pointer to a valid
`UCollator` instance. The result is a `UCollationResult` constant, which can be one
of `UCOL_LESS`, `UCOL_EQUAL` or `UCOL_GREATER`.
The C++ API offers the method `Collator::compare` with several overloads.
Acceptable input arguments are `UChar *` with length of strings, or `UnicodeString`
instances. The result is a member of the `UCollationResult` or `EComparisonResult` enums.
The Java API provides the method `Collator.compare` with one overload. Acceptable
input arguments are Strings or Objects. The result is an int value, which is
less than zero if source is less than target, zero if source and target are
equal, or greater than zero if source is greater than target.
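A minimal sketch of that sign convention, using the JDK's `java.text.Collator` (ICU4J's `Collator.compare` follows the same convention; the helper `order` is illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class CompareDemo {
    // Maps the int result of compare() onto the three possible orderings.
    public static String order(String source, String target) {
        Collator coll = Collator.getInstance(Locale.US);
        int result = coll.compare(source, target);
        if (result < 0) {
            return "LESS";
        } else if (result > 0) {
            return "GREATER";
        }
        return "EQUAL";
    }

    public static void main(String[] args) {
        System.out.println(order("apple", "banana")); // LESS
        System.out.println(order("apple", "apple"));  // EQUAL
    }
}
```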
There are also several convenience functions and methods returning a boolean
value, such as `ucol_greater`, `ucol_greaterOrEqual`, `ucol_equal` (in C)
`Collator::greater`, `Collator::greaterOrEqual`, `Collator::equal` (in C++) and
`Collator.equals` (in Java).
### Examples
**C:**
```C
UChar *s [] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if(U_SUCCESS(status)) {
    for(i=listSize-1; i>=1; i--) {
        for(j=0; j<i; j++) {
            if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_LESS) {
                swap(s[j], s[j+1]);
            }
        }
    }
ucol_close(coll);
}
```
**C++:**
```C++
UnicodeString s [] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
uint32_t i, j;
if(U_SUCCESS(status)) {
for(i=listSize-1; i>=1; i--) {
for(j=0; j<i; j++) {
if(coll->compare(s[j], s[j+1]) == UCOL_LESS) {
swap(s[j], s[j+1]);
}
}
}
delete coll;
}
```
**Java:**
```Java
String s [] = { /* list of Unicode strings */ };
try {
Collator coll = Collator.getInstance(Locale.US);
    for (int i = s.length - 1; i >= 1; i--) {
        for (int j = 0; j < i; j++) {
            if (coll.compare(s[j], s[j+1]) < 0) {
                swap(s[j], s[j+1]);
            }
        }
    }
} catch (Exception e) {
System.err.println("English collation creation failed.");
e.printStackTrace();
}
```
## GetSortKey
The C API provides the `ucol_getSortKey` function, which requires (apart from a
pointer to a valid `UCollator` instance), an original `UChar` pointer, together with
its length. It also requires a pointer to a receiving buffer and its length.
The C++ API provides the `Collator::getSortKey` method with similar parameters as
the C version. It also provides `Collator::getCollationKey`, which produces a
`CollationKey` object instance (a wrapper around a sort key).
The Java API provides only the `Collator.getCollationKey` method, which produces a
`CollationKey` object instance (a wrapper around a sort key).
Sort keys are generally only useful in databases or other circumstances where
function calls are extremely expensive. See [Sortkeys vs
Comparison](concepts.md#sortkeys-vs-comparison).
### Sort Key Features
ICU writes sort keys as sequences of bytes.
Each sort key ends with one 00 byte and does not contain any other 00 byte. The
terminating 00 byte is included in the length of the sort key as returned by the
API (unlike any other ICU API where terminating NUL bytes or characters are not
counted as part of the length).
Sort key byte sequences must be compared with an unsigned-byte comparison, as
with `strcmp()`.
Comparing the sort keys of two strings from the same collator yields the same
ordering as using the collator to compare the two strings directly. That is:
`strcmp(coll.getSortKey(str1), coll.getSortKey(str2))` is equivalent to
`coll.compare(str1, str2)`.
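This equivalence can be sketched with the JDK's `java.text.CollationKey`, which documents the same guarantee (the helper `sameOrder` is illustrative):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class SortKeyEquivalence {
    // Checks that comparing collation keys orders two strings the same
    // way as comparing the strings directly with the collator.
    public static boolean sameOrder(String a, String b) {
        Collator coll = Collator.getInstance(Locale.US);
        CollationKey ka = coll.getCollationKey(a);
        CollationKey kb = coll.getCollationKey(b);
        return Integer.signum(ka.compareTo(kb))
                == Integer.signum(coll.compare(a, b));
    }

    public static void main(String[] args) {
        System.out.println(sameOrder("apple", "Banana")); // true
    }
}
```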
Sort keys from different collators (different locale or strength or any other
attributes/settings) are not comparable.
Sort keys can be "merged" as described in [UTS #10 Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys), via
`ucol_mergeSortkeys()` or Java `CollationKey.merge()`.
* Since CLDR 1.9/ICU 4.6, the same effect can be achieved by concatenating
strings with U+FFFE between them. The concatenation has the same sort order
as the merged sort keys.
* However, it is not guaranteed that the sort key of the concatenated strings
is the same as the merged result of the individual sort keys. (That is,
merge(getSortKey(str1), getSortKey(str2)) may differ from getSortKey(str1 +
'\\uFFFE' + str2).)
* In particular, a future version of ICU is likely to generate shorter sort
keys when concatenating strings with U+FFFE between them (by using
compression across the U+FFFE weights).
* *The recommended way to achieve "merged" sorting is via strings with
U+FFFE.*
Any further analysis or parsing of sort keys is not supported.
Sort keys will change from one ICU version to another; therefore, if sort keys
are stored in a database or other persistent storage, then each upgrade requires
their regeneration.
* The details of the underlying data change with every Unicode and CLDR
version.
* Sort keys are also subject to enhancements and bug fixes in the builder and
implementation code.
* On the other hand, the sort *order* is much more stable. It is subject to
deliberate changes to the default Unicode collation order, which is kept
quite stable, and subject to deliberate changes in CLDR data as new data is
added and feedback on existing data is taken into account.
Implementation notes: (Not supported as permanent constraints on sort keys)
Byte 02 was unique as a merge separator for some versions of ICU before ICU 53.
Since ICU 53, 02 is also used in regular collation weights where there
is no conflict (to expand the number of available short weights).
Byte 01 has been unique as a level separator. This is not strictly necessary for
non-primary levels. (A level's compressible "common" weight as its level
separator would yield shorter sort keys.) However, the current implementation of
`ucol_mergeSortkeys()` relies on it. (Also, test code currently examines sort keys
for finding the strength of a comparison difference.) This may change in the
future, especially if `ucol_mergeSortkeys()` were to become deprecated.
Level separators are likely to be equivalent to single-byte weights (possibly
compressible): Multi-byte level separators would noticeably lengthen sort keys
for short strings.
The byte values used in several ICU versions for sort keys and collation
elements are documented in the [“Special Byte Values” design
doc](http://site.icu-project.org/design/collation/bytes) on the ICU site.
### Sort Key Output Buffer
`ucol_getSortKey()` can operate in 'preflighting' mode, which returns the amount
of memory needed to store the resulting sort key. This mode is automatically
activated when the output buffer size passed in is zero. If the sort key turns
out to be longer than the buffer provided, the function slips back into
preflighting mode; overall performance is then poorer than calling the function
with a zero-size output buffer in the first place. If the size of the sort key
returned is greater than the size of the buffer provided, the content of the
result buffer is undefined. In that case, the result buffer can be reallocated
to the proper size and the sort key generator function called again.
The best way to generate a series of sort keys is to do the following:
1. Create a big temporary buffer on the stack. Typically, this buffer is
allocated only once, and reused with every sort key generated. There is no
need to keep it as small as possible. A recommended size for the temporary
buffer is four times the length of the longest string processed.
2. Start the loop. Call `ucol_getSortKey()` to find out how big the sort key
buffer should be, and fill in the temporary buffer at the same time.
3. If the temporary buffer is too small, allocate or reallocate more space.
Fill in the sort key values in the overflow buffer.
4. Allocate the sort key buffer with the size returned by `ucol_getSortKey()` and
call memcpy to copy the sort key content from the temp buffer to the sort
key buffer.
5. Loop back to step 1 until you are done.
6. Delete the overflow buffer if you created one.
### Example
```C
void GetSortKeys(const UCollator* coll,
                 const UChar* const* source, uint32_t arrayLength)
{
    uint8_t buffer[1000];              /* stack buffer, reused for each key */
    uint8_t* currBuffer = buffer;
    int32_t bufferLen = sizeof(buffer);
    int32_t expectedLen = 0;
    uint32_t i;
    for (i = 0; i < arrayLength; ++i) {
        expectedLen = ucol_getSortKey(coll, source[i], -1, currBuffer, bufferLen);
        if (expectedLen > bufferLen) {
            /* the buffer was too small: grow it and generate the key again */
            if (currBuffer == buffer) {
                currBuffer = (uint8_t*)malloc(expectedLen);
            } else {
                currBuffer = (uint8_t*)realloc(currBuffer, expectedLen);
            }
            bufferLen = expectedLen;
            expectedLen = ucol_getSortKey(coll, source[i], -1, currBuffer, bufferLen);
        }
        processSortKey(i, currBuffer, expectedLen);
    }
    if (currBuffer != buffer && currBuffer != NULL) {
        free(currBuffer);
    }
}
```
> :point_right: **Note** Although the API allows you to call
> `ucol_getSortKey` with `NULL` to see what the
> sort key length is, it is strongly recommended that you NOT determine the length
> first, then allocate and fill the sort key buffer. If you do, it requires twice
> the processing since computing the length has to do the same calculation as
> actually getting the sort key. Instead, the example shown above uses a stack buffer.
### Using Iterators for String Comparison
ICU4C's `ucol_strcollIter` API allows for comparing two strings that are supplied
as character iterators (`UCharIterator`). This is useful when you need to compare
differently encoded strings using `strcoll`. In that case, converting the strings
first would probably be wasteful, since `strcoll` usually gives the result
before whole strings are processed. This API is implemented only as a C function
in ICU4C. There are no equivalent C++ or ICU4J functions.
```C
...
/* we are arriving with two char*: utf8Source and utf8Target, with their
* lengths in utf8SourceLen and utf8TargetLen
*/
UCharIterator sIter, tIter;
uiter_setUTF8(&sIter, utf8Source, utf8SourceLen);
uiter_setUTF8(&tIter, utf8Target, utf8TargetLen);
compareResultUTF8 = ucol_strcollIter(myCollation, &sIter, &tIter, &status);
...
```
### Obtaining Partial Sort Keys
When using different sort algorithms, such as radix sort, sometimes it is useful
to process strings only as much as needed to feed into the sorting algorithm.
For that purpose, ICU provides the `ucol_nextSortKeyPart` API, which also takes
character iterators. This API allows for iterating over subsequent pieces of an
uncompressed sort key. Between calls to the API you need to save a 64-bit state.
Following is an example of simulating a string compare function using the
partial sort key API. Your usage model is likely to look quite different.
```C
static UCollationResult compareUsingPartials(UCollator *coll,
const UChar source[], int32_t sLen,
const UChar target[], int32_t tLen,
int32_t pieceSize, UErrorCode *status) {
int32_t partialSKResult = 0;
UCharIterator sIter, tIter;
uint32_t sState[2], tState[2];
int32_t sSize = pieceSize, tSize = pieceSize;
int32_t i = 0;
uint8_t sBuf[16384], tBuf[16384];
if(pieceSize > 16384) {
*status = U_BUFFER_OVERFLOW_ERROR;
return UCOL_EQUAL;
}
*status = U_ZERO_ERROR;
sState[0] = 0; sState[1] = 0;
tState[0] = 0; tState[1] = 0;
while(sSize == pieceSize && tSize == pieceSize && partialSKResult == 0) {
uiter_setString(&sIter, source, sLen);
uiter_setString(&tIter, target, tLen);
sSize = ucol_nextSortKeyPart(coll, &sIter, sState, sBuf, pieceSize, status);
tSize = ucol_nextSortKeyPart(coll, &tIter, tState, tBuf, pieceSize, status);
partialSKResult = memcmp(sBuf, tBuf, pieceSize);
}
if(partialSKResult < 0) {
return UCOL_LESS;
} else if(partialSKResult > 0) {
return UCOL_GREATER;
} else {
return UCOL_EQUAL;
}
}
```
### Other Examples
A longer example is presented in the 'Examples' section. Here is an illustration
of the usage model.
**C:**
```C
#define MAX_KEY_SIZE 100
#define MAX_LIST_LENGTH 5

/* qsort comparator: sort keys are zero-terminated byte strings */
static int compareKeys(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

const char* text[MAX_LIST_LENGTH] = {
    "Quick",
    "fox",
    "Moving",
    "trucks",
    "riddle"
};
UChar s[MAX_LIST_LENGTH][20];
uint8_t keys[MAX_LIST_LENGTH][MAX_KEY_SIZE];
int32_t i;
UErrorCode status = U_ZERO_ERROR;
UCollator *coll;

for (i = 0; i < MAX_LIST_LENGTH; i++) {
    u_uastrcpy(s[i], text[i]);
}
coll = ucol_open("en_US", &status);
if (U_SUCCESS(status)) {
    for (i = 0; i < MAX_LIST_LENGTH; i++) {
        ucol_getSortKey(coll, s[i], -1, keys[i], MAX_KEY_SIZE);
    }
    qsort(keys, MAX_LIST_LENGTH, MAX_KEY_SIZE, compareKeys);
    ucol_close(coll);
}
```
**C++:**
```C++
#define MAX_LIST_LENGTH 5
const UnicodeString s [] = {
"Quick",
"fox",
"Moving",
"trucks",
"riddle"
};
CollationKey keys[MAX_LIST_LENGTH];
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
if(U_SUCCESS(status)) {
    for(int32_t i = 0; i < MAX_LIST_LENGTH; i++) {
        coll->getCollationKey(s[i], keys[i], status);
    }
    // sort the keys array with CollationKey::compareTo(), e.g. via std::sort
    delete coll;
}
```
**Java:**
```Java
String s [] = {
"Quick",
"fox",
"Moving",
"trucks",
"riddle"
};
CollationKey keys[] = new CollationKey[s.length];
try {
Collator coll = Collator.getInstance(Locale.US);
for (int i = 0; i < s.length; i ++) {
keys[i] = coll.getCollationKey(s[i]);
}
Arrays.sort(keys);
}
catch (Exception e) {
System.err.println("Error creating English collator");
e.printStackTrace();
}
```
## CollationElementIterator
A collation element iterator can only be used in one direction. This is
established at the time of the first call to retrieve a collation element. Once
`ucol_next` (C), `CollationElementIterator::next` (C++) or
`CollationElementIterator.next` (Java) are invoked,
`ucol_previous` (C),
`CollationElementIterator::previous` (C++) or `CollationElementIterator.previous`
(Java) should not be used (and vice versa). The direction can be changed
immediately after `ucol_first`, `ucol_last`, `ucol_reset` (in C),
`CollationElementIterator::first`, `CollationElementIterator::last`,
`CollationElementIterator::reset` (in C++) or `CollationElementIterator.first`,
`CollationElementIterator.last`, `CollationElementIterator.reset` (in Java) is
called, or when it reaches the end of string while traversing the string.
When `ucol_next` is called at the end of the string buffer, `UCOL_NULLORDER` is
always returned with any subsequent calls to `ucol_next`. The same applies to
`ucol_previous`.
An example of how iterators are used is the Boyer-Moore search implementation,
which can be found in the samples section.
### API Example
**C:**
```C
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
UChar text[20];
UCollationElements *collelemitr;
uint32_t collelem;
u_uastrcpy(text, "text");
collelemitr = ucol_openElements(coll, text, -1, &status);
collelem = 0;
do {
collelem = ucol_next(collelemitr, &status);
} while (collelem != UCOL_NULLORDER);
ucol_closeElements(collelemitr);
ucol_close(coll);
```
**C++:**
```C++
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale::getUS(), status);
UnicodeString text("text");
CollationElementIterator *collelemitr = coll->createCollationElementIterator(text);
uint32_t collelem = 0;
do {
collelem = collelemitr->next(status);
} while (collelem != CollationElementIterator::NULLORDER);
delete collelemitr;
delete coll;
```
**Java:**
```Java
try {
RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance(Locale.US);
String text = "text";
CollationElementIterator collelemitr = coll.getCollationElementIterator(text);
int collelem = 0;
do {
collelem = collelemitr.next();
} while (collelem != CollationElementIterator.NULLORDER);
} catch (Exception e) {
System.err.println("Error in collation iteration");
e.printStackTrace();
}
```
## Setting and Getting Attributes
The general attribute setting APIs are `ucol_setAttribute` (in C) and
`Collator::setAttribute` (in C++). These APIs take an attribute name and an
attribute value. If the name and the value pass a syntax and range check, the
property of the collator is changed. If the name and value do not pass a syntax
and range check, however, the state is not changed and the error code variable
is set to an error condition. The Java version does not provide general
attribute setting APIs; instead, each attribute has its own setter API of
the form `RuleBasedCollator.setATTRIBUTE_NAME(arguments)`.
The attribute getting APIs are `ucol_getAttribute` (C) and `Collator::getAttribute`
(C++). Both APIs require an attribute name as an argument and return an
attribute value if a valid attribute name was supplied. If a valid attribute
name was not supplied, however, they return an undefined result and set the
error code. As with the setters, the Java version provides no generic getter
API; instead, each attribute has its own getter of the form
`RuleBasedCollator.getATTRIBUTE_NAME()`.
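For instance, strength has its own setter in Java. The sketch below uses the JDK's `java.text.Collator` for self-containment (ICU4J's `RuleBasedCollator` adds further per-attribute setters, such as `setCaseLevel()`); the helper name is illustrative:

```java
import java.text.Collator;
import java.util.Locale;

public class Attributes {
    // At SECONDARY strength, tertiary (case) differences are ignored,
    // so "abc" and "ABC" compare as equal.
    public static int caseInsensitiveCompare(String a, String b) {
        Collator coll = Collator.getInstance(Locale.US);
        coll.setStrength(Collator.SECONDARY);
        return coll.compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitiveCompare("abc", "ABC") == 0); // true
    }
}
```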
## References

1. Ken Whistler, Markus Scherer: "Unicode Technical Standard #10, Unicode
   Collation Algorithm" (<http://www.unicode.org/unicode/reports/tr10/>)
2. ICU Design doc: "Collation v2"
   (<http://site.icu-project.org/design/collation/v2>)
3. Mark Davis: "ICU Collation Design Document"
   (<https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/collation/ICU_collation_design.htm>)
4. The Unicode Standard, chapter 5, "Implementation Guidelines"
   (<http://www.unicode.org/unicode/uni2book/ch05.pdf>)
5. Laura Werner: "Efficient text searching in Java: Finding the right string
   in any language"
   (<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>)
6. Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization
   Forms" (<http://www.unicode.org/unicode/reports/tr15/>)
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Collation Service Architecture
This section describes the design principles, architecture and coding
conventions of the ICU Collation Service.
## Collator
To use the Collation Service, a Collator must first be instantiated. A
Collator is a data structure or object that maintains all of the property
and state information necessary to define and support the specific collation
behavior provided. Examples of properties described in the Collator are the
locale, whether normalization is to be performed, and how many levels of
collation are to be evaluated. Examples of the state information described in
the Collator include the direction of a Collation Element Iterator (forward
or backward) and the status of the last API executed.
The Collator is instantiated either by referencing a locale or by defining a
custom set of rules (a tailoring).
The Collation Service uses the paradigm:
1. Open a Collator,
2. Use while necessary,
3. Close the Collator.
Collator instances cannot be shared among threads; instead, open a separate
collator for each thread. The safe clone function is supported for cloning
collators in a thread-safe fashion.
The Collation Service follows the ICU conventions for locale designation
when opening collators:
1. NULL means the default locale.
2. The empty locale name ("") means the root locale.
The Collation Service adheres to the ICU conventions described in the
"[ICU Architectural Design](../design.md) " section of the users guide.
In particular:
1. The standard error code convention is usually followed. (Functions that do
not take an error code parameter do so for backward compatibility.)
2. The string length convention is followed: when passing a `UChar *`, the
length is required in a separate argument. If -1 is passed for the length,
it is assumed that the string is zero terminated.
### Collation locale and keyword handling
When a collator is created from a locale, the collation service (like all ICU
services) must map the requested locale to the localized collation data
available to ICU at the time. It does so using the standard ICU locale fallback
mechanism. See the fallback section of the [locale
chapter](../locale/index.md) for more details.
If you pass in a regular locale, like "en_US", the collation service first
searches with fallback for the "collations/default" key. The first such key it finds
will have an associated string value; this is the keyword name for the collation
that is default for this locale. If the search falls all the way back to the
root locale, the collation service will use the "collations/default" key there,
which has the value "standard".
If there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the
collation service searches with fallback for "collations/phonebook". If the
search is successful, the collation service uses the string value it finds to
instantiate a Collator. If the search fails because no such key is present in
any of ICU's locale data (e.g., "de@collation=funky"), the service returns a
collator implementing the default tailoring of the locale.
If the fallback is all the way to the root locale, then
the return `UErrorCode` is `U_USING_DEFAULT_WARNING`.
## Input values for collation
Collation deals with processing strings. ICU generally requires that all
strings be in UTF-16 format, and that any required conversion be done before
ICU functions are used. In the case of collation, there are APIs
that can also take instances of character iterators (`UCharIterator`)
or UTF-8 directly.
Theoretically, character iterators can iterate strings
in any encoding. ICU currently provides character iterator implementations for
UTF-8 and UTF-16BE (useful when processing data from a big endian platform on a
little endian machine). Note, however, that using iterators with the
collation APIs has a performance impact. They are best used in situations where it
is not desirable to convert whole strings before the operation, such as when
using a string compare function.
## Collation Elements
As discussed in the introduction, there are many possible orderings for sorted
text, depending on language and other factors. Ideally, there is a way to
describe each ordering as a set of rules for calculating numeric values for each
string of text. The collation process then becomes one of simply comparing these
numeric values.
This essentially describes the way the Collation Service works. To implement
a particular sort ordering, first the relationship between each character or
character sequence is derived. For example, a Spanish ordering defines the
letter sequence "CH" to be between the letters "C" and "D". As also discussed in
the introduction, to order strings properly requires that comparison of base
letters must be considered separately from comparison of accents. Letter case
must also be considered separately from either base letters or accents. Any
ordering specification language must provide a way to define the relationships
between characters or character sequences on multiple levels. ICU supports this
by using "<" to describe a relationship at the primary level, using "<<" to
describe a relationship at the secondary level, and using "<<<" to describe a
relationship at the tertiary level. Here are some example usages:
Symbol | Example | Description
------ | -------- | -----------
`<` | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch"
`<<` | `a << ä` | Make a secondary (accent) difference between "a" and "ä"
`<<<` | `a <<< A` | Make a tertiary (case) difference between "a" and "A"
A more complete description of the ordering specification symbols and their
meanings is provided in the section on Collation Tailoring.
Once a sort ordering is defined by specifying the desired relationships between
characters and character sequences, ICU can convert these relationships to a
series of numerical values (one for each level) that satisfy these same
relationships.
This series of numeric values, representing the relative weighting of a
character or character sequence, is called a Collation Element (CE).
One possible encoding of a Collation Element is a 32-bit value consisting of
a 16-bit primary weight, an 8-bit secondary weight,
2 case bits, and a 6-bit tertiary weight.
The sort weight of a string is represented by the collation elements of its
component characters and character sequences. For example, the sort weight of
the string "apple" would consist of its component Collation Elements, as shown
here:
"Apple" | "Apple" Collation Elements
------- | --------------------------
a | `[1900.05.05]`
p | `[3700.05.05]`
p | `[3700.05.05]`
l | `[2F00.05.05]`
e | `[2100.05.05]`
In this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an
8-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of
05 (hex).
String comparison is performed by comparing the collation elements of each
string. Each of the primary weights are compared. If a difference is found, that
difference determines the relationship between the two strings. If no
differences are found, the secondary weights are compared and so forth.
With ICU it is possible to specify how many levels should be compared. For some
applications, it can be desirable to compare only primary levels or to compare
only primary and secondary levels.
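Conceptually, level-by-level comparison can be sketched in a few lines. This is an illustration of the principle only, not the ICU implementation or API; the CE values are taken from the tables in this chapter.

```python
# Illustrative sketch only, not the ICU API: compare two sequences of
# collation elements level by level, up to a chosen strength.
def compare_ces(ces1, ces2, strength=3):
    """Each CE is a (primary, secondary, tertiary) tuple of weights."""
    for level in range(strength):  # 0 = primary, 1 = secondary, 2 = tertiary
        # Collect the non-zero weights for this level (a zero weight is
        # ignorable at that level).
        w1 = [ce[level] for ce in ces1 if ce[level]]
        w2 = [ce[level] for ce in ces2 if ce[level]]
        if w1 != w2:
            return -1 if w1 < w2 else 1  # the first difference decides
    return 0

# CEs for "apple" and "äpple" (a + combining diaeresis):
apple = [(0x1900, 0x05, 0x05), (0x3700, 0x05, 0x05), (0x3700, 0x05, 0x05),
         (0x2F00, 0x05, 0x05), (0x2100, 0x05, 0x05)]
a_umlaut = [(0x1900, 0x05, 0x05), (0x0000, 0x9D, 0x05), (0x3700, 0x05, 0x05),
            (0x3700, 0x05, 0x05), (0x2F00, 0x05, 0x05), (0x2100, 0x05, 0x05)]

compare_ces(apple, a_umlaut, strength=1)  # equal at the primary level
compare_ces(apple, a_umlaut, strength=2)  # secondary (accent) difference
```

With primary strength the diaeresis CE contributes nothing, so the two strings compare equal; at secondary strength and above, the accent difference decides.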
## Sort Keys
If a string is to be compared thousands or millions of times,
it can be more efficient to use sort keys.
Sort keys are useful in situations where a large amount of data is indexed
and frequently searched. The sort key is generated once and used in subsequent
comparisons, rather than repeatedly generating the string's Collation Elements.
The comparison of sort keys is a very efficient and simple binary compare of strings of
unsigned bytes.
An important property of ICU sort keys is that you can obtain the same results
by comparing 2 strings as you do by comparing the sort keys of the 2 strings
(provided that the same ordering and related collation attributes are used).
An ICU sort key is a pre-processed sequence of bytes generated from a Unicode
string. The weights for each comparison level are concatenated, separated by a
"0x01" byte between levels.
The entire sequence is terminated with a 0x00 byte for convenience in C APIs.
(This 0x00 terminator is counted in the sort key length —
unlike regular strings where the NUL terminator is excluded from the string length.)
ICU actually compresses the sort keys so that they take the
minimum storage in memory and in databases.
<!-- TODO: (diagram was missing in Google Sites already)
The diagram below represents an uncompressed sort key in ICU for ease of understanding. -->
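The layout described above can be illustrated with a toy key builder. This is a sketch of the uncompressed structure only; real ICU sort keys are compressed and use different byte ranges.

```python
# Toy sort key: concatenate the non-zero weights of each level, with a
# 0x01 separator between levels and a 0x00 terminator at the end.
# (Real ICU keys are compressed; this only illustrates the structure.)
def toy_sort_key(ces, strength=3):
    key = bytearray()
    for level in range(strength):
        if level > 0:
            key.append(0x01)  # level separator
        for ce in ces:
            w = ce[level]
            if w:  # zero weights are dropped
                key += w.to_bytes(2 if level == 0 else 1, "big")
    key.append(0x00)  # terminator, counted in the key length
    return bytes(key)

# CEs for "apple" and "äpple" from the tables in this chapter. A plain
# byte-wise compare of the keys gives the same answer as a string
# compare: "apple" < "äpple" because of the secondary difference.
apple = [(0x1900, 0x05, 0x05), (0x3700, 0x05, 0x05), (0x3700, 0x05, 0x05),
         (0x2F00, 0x05, 0x05), (0x2100, 0x05, 0x05)]
a_umlaut = [(0x1900, 0x05, 0x05), (0x0000, 0x9D, 0x05), (0x3700, 0x05, 0x05),
            (0x3700, 0x05, 0x05), (0x2F00, 0x05, 0x05), (0x2100, 0x05, 0x05)]
toy_sort_key(apple) < toy_sort_key(a_umlaut)  # → True
```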
### Sort key size
One of the more important issues when considering using sort keys is the sort
key size. Unfortunately, it is very hard to give a fast, exact answer to the
question "What is the maximum size of sort keys generated for
strings of size X?" This problem is twofold:
1. The maximum size of the sort key depends on the size of the collation
elements that are used to build it. The size of collation elements varies
greatly and depends both on the alphabet in question and on the locale used.
2. Compression is used in building sort keys. Most 'regular' sequences of
characters produce very compact sort keys.
If one assumes the worst case and uses overly large buffers, a lot of space will
be wasted. However, if you use buffers that are too small, you will lose performance if
generated sort keys are longer than the supplied buffers too often
(since each such key requires a reallocation).
A good strategy
for this problem would be to manually manage a large buffer for storing sortkeys
and keep a list of indices to sort keys in this buffer (see the "large buffers"
[Collation Example](examples.md#using-large-buffers-to-manage-sort-keys)
for more details).
Here are some rules of thumb; please do not rely on them. If you are looking
at the East Asian locales, you probably want to go with 5 bytes per code point.
For Thai, 3 bytes per code point should be sufficient. For all the other locales
(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point.
These values are based on average lengths of sort keys generated with tertiary
strength. If you need quaternary and identical strength (you should not), add 3
bytes per code point to each of these.
### Partial sort keys
In some cases, most notably when implementing [radix
sorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only
parts of sort keys at a time. ICU4C 2.6+ provides the `ucol_nextSortKeyPart` API
for producing parts of sort keys. These partial sort keys may or may not be
compressed; that is, they may or may not be compatible with regular sort keys.
### Merging sort keys
Sometimes, it is useful to be able to merge sort keys. One example is having
separate sort keys for first and last names. If you need to perform an operation
that requires a sort key generated on the whole name, instead of concatenating
strings and regenerating sort keys, you should merge the sort keys. The merging
is done by merging the corresponding levels while inserting a terminator between
merged parts. The reserved sort key byte value for the merge terminator is 0x02.
For more details see [UCA section 1.6, Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Interleaved_Levels).
* C API: unicode/ucol.h `ucol_mergeSortkeys()`
* Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)`
CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is
intended to allow concatenating strings like firstName+\\uFFFE+lastName to yield
the same results as merging their individual sort keys.
This has been fully implemented in ICU since version 53.
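The level-wise merge can be sketched as follows, using the toy uncompressed layout from earlier (0x01 level separators, 0x00 terminator). Real ICU keys are compressed, so this only illustrates the structure; use `ucol_mergeSortkeys()` in practice.

```python
# Merge two sort keys level by level, inserting the reserved 0x02 merge
# separator between the parts of each level (toy layout: levels are
# separated by 0x01 and the key ends with 0x00).
def merge_sort_keys(k1, k2):
    levels1 = k1.rstrip(b"\x00").split(b"\x01")
    levels2 = k2.rstrip(b"\x00").split(b"\x01")
    merged = b"\x01".join(a + b"\x02" + b for a, b in zip(levels1, levels2))
    return merged + b"\x00"

# Two two-level toy keys, e.g. for a first and a last name:
first = b"\x19\x00\x01\x05\x00"   # primary 0x1900, secondary 0x05
last  = b"\x37\x00\x01\x05\x00"   # primary 0x3700, secondary 0x05
merge_sort_keys(first, last)
# → b"\x19\x00\x02\x37\x00\x01\x05\x02\x05\x00"
```

All primary weights come before any secondary weight, with 0x02 marking the boundary between the merged parts within each level.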
### Generating bounds for a sort key (prefix matching)
Having sort keys for strings allows for easy creation of bounds: sort keys that
are guaranteed to be smaller or larger than any sort key from a given range. For
example, if bounds are produced for the sort key of the string "smith", strings between the
upper and lower bounds with one level would include "Smith", "SMITH", "sMiTh".
Two kinds of upper bounds can be generated - the first one will match only
strings of equal length, while the second one will match all the strings with
the same initial prefix.
CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum
primary weight, so that for example the string "smith\\uFFFF" can be used as the
upper bound rather than modifying the sort key for "smith".
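The idea can be illustrated with plain code-point order standing in for a collator (with ICU one would compare sort keys or use the collator itself):

```python
# Appending U+FFFF to a prefix yields an upper bound for prefix matching:
# no assigned character sorts above it, so every string starting with the
# prefix falls between the two bounds.
prefix = "smith"
lower, upper = prefix, prefix + "\uFFFF"

names = ["smit", "smith", "smithson", "smythe"]
matches = [n for n in names if lower <= n <= upper]
# → ["smith", "smithson"]
```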
## Collation Element Iterator
The collation element iterator is used for traversing Unicode string collation
elements one at a time. It can be used to implement language-sensitive text
search algorithms like Boyer-Moore.
For most applications, the two API categories, compare and sort key, are
sufficient. Most people do not need to manipulate collation elements directly.
Example:
Consider iterating over "apple" and "äpple". Here are sequences of collation
elements:
String 1 | String 1 Collation Elements
-------- | ---------------------------
a | `[1900.05.05]`
p | `[3700.05.05]`
p | `[3700.05.05]`
l | `[2F00.05.05]`
e | `[2100.05.05]`
String 2 | String 2 Collation Elements
-------- | ---------------------------
a | `[1900.05.05]`
\\u0308 | `[0000.9D.05]`
p | `[3700.05.05]`
p | `[3700.05.05]`
l | `[2F00.05.05]`
e | `[2100.05.05]`
The resulting CEs are typically masked according to the desired strength, and
zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength)
zeroes out the secondary and tertiary differences. The collator then
ignores these null differences and declares a match. For more details see the
paper "Efficient text searching in Java™: Finding the right string in any
language" by Laura Werner (
<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>).
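The masking step can be sketched with the 32-bit CE layout described earlier (16-bit primary, 8-bit secondary, 8-bit case/tertiary). This is an illustration of the matching principle, not the ICU iterator API:

```python
# Mask 32-bit collation elements to the desired strength and discard
# elements that become zero.
def masked_ces(ces, mask):
    return [ce & mask for ce in ces if ce & mask]

# CEs for "apple" and "äpple" from the tables above, packed as 32-bit values:
apple    = [0x19000505, 0x37000505, 0x37000505, 0x2F000505, 0x21000505]
a_umlaut = [0x19000505, 0x00009D05, 0x37000505, 0x37000505, 0x2F000505,
            0x21000505]

# With the primary-strength mask the combining diaeresis CE becomes zero
# and is dropped, so the two strings match:
masked_ces(apple, 0xFFFF0000) == masked_ces(a_umlaut, 0xFFFF0000)  # → True
```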
## Collation Attributes
The Collation Service has a number of attributes whose values can be changed
during run time. These attributes affect both the functionality and the
performance of the Collation Service. This section describes these
attributes and, where possible, their performance impact. Performance
indications are only approximate and timings may vary significantly depending on
the CPU, compiler, etc.
Although string comparison by ICU and comparison of each string's sort key give
the same results, attribute settings can impact the execution time of each
method differently. To be precise in the discussion of performance, this section
refers to the API employed in the measurement. The `ucol_strcoll` function is the
API for string comparison. The `ucol_getSortKey` function is used to create sort
keys.
> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`,
> that can be used to set any attribute to its default value
> (which is inherited from the UCA and the tailoring).
### Attribute Types
#### Strength level
Collation strength, or the maximum collation level used for comparison, is set
by using the `UCOL_STRENGTH` attribute. Valid values are:
1. `UCOL_PRIMARY`
2. `UCOL_SECONDARY`
3. `UCOL_TERTIARY` (default)
4. `UCOL_QUATERNARY`
5. `UCOL_IDENTICAL`
#### French collation
The `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary
differences in reverse order. Valid values are:
1. `UCOL_OFF` (default): compares secondary differences in the order they appear
in the string.
2. `UCOL_ON`: causes secondary differences to be considered in reverse order, as
it is done in the French language.
#### Normalization mode
The `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`,
controls whether text normalization is performed on the input strings. Valid
values are:
1. `UCOL_OFF` (default): turns off normalization check
2. `UCOL_ON` : normalization is checked and the collator performs normalization
if it is needed.
X | FCD | NFC | NFD
--------------------- | --- | --- | ---
A-ring | Y | Y |
Angstrom | Y | |
A + ring | Y | | Y
A + grave | Y | Y |
A-ring + grave | Y | |
A + cedilla + ring | Y | | Y
A + ring + cedilla | | |
A-ring + cedilla | | Y |
With normalization mode turned on, the `ucol_strcoll` function slows down by 10%.
In addition, the time to generate a sort key also increases by about 25%.
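Why the check matters can be seen with Python's standard `unicodedata` module; this illustrates canonical equivalence, not the ICU implementation:

```python
import unicodedata

# Three canonically equivalent spellings of the same letter:
a_ring      = "\u00C5"   # Å, precomposed
a_plus_ring = "A\u030A"  # A + combining ring above
angstrom    = "\u212B"   # Angstrom sign

# All three normalize to the same NFD form, so a collator that checks
# and normalizes its input will sort them identically.
forms = {unicodedata.normalize("NFD", s)
         for s in (a_ring, a_plus_ring, angstrom)}
len(forms)  # → 1
```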
#### Alternate handling
This attribute allows shifting of the variable characters (usually spaces and
punctuation, in the UCA also most symbols) from the primary to the quaternary
strength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For
details see [UCA: Variable
Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML:
Collation
Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
and [“Ignore Punctuation” Options](customization/ignorepunct.md).
1. `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as
all the other characters
2. `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at
the primary, secondary and tertiary levels and their primary strengths will
be shifted to the quaternary level.
#### Case Ordering
Some conventions require uppercase letters to sort before lowercase ones, while
others require the opposite. This attribute is controlled by the value of the
`UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary
weights along with other appearance characteristics (like circling of letters).
The case-first attribute allows for emphasizing the case property of the
letters by reordering the tertiary weights to sort either uppercase or
lowercase first. This difference gets the most significant bit in the weight.
Valid values for this attribute are:
1. `UCOL_OFF` (default): leave tertiary weights unaffected
2. `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort
before uppercase
3. `UCOL_UPPER_FIRST` : causes uppercase letters to sort first
The case-first attribute does not affect the performance substantially.
#### Case level
When this attribute is set, an additional level is formed between the secondary
and tertiary levels, known as the Case Level. The case level is used to
distinguish large and small Japanese Kana characters. The case level can also be
used in other situations, for example to distinguish certain Pinyin characters.
Case level is controlled by `UCOL_CASE_LEVEL` attribute. Valid values for this
attribute are
1. `UCOL_OFF` (default): no additional case level
2. `UCOL_ON` : adds a case level
#### Hiragana Quaternary
*This setting is deprecated and ignored in recent versions of ICU.*
Hiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points
will sort before everything else on the quaternary level. If set to `UCOL_OFF`
Hiragana letters are treated the same as all the other code points. This setting
can be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute.
You probably won't need to use it.
#### Variable Top
Variable Top is a boundary that decides whether code points will be treated
as variable (shifted to the quaternary level in the **shifted** mode) or
non-ignorable. Special APIs are used for setting the variable top. It can
essentially be set either to a code point or to a primary strength value.
## Performance
ICU collation is designed to be fast, small and customizable. Several techniques
are used to enhance the performance:
1. Providing optimized processing for Latin characters.
2. Comparing strings incrementally and stopping at the first significant
difference.
3. Tuning to eliminate unnecessary file access or memory allocation.
4. Providing efficient preflight functions that allow fast determination of
sort key size.
5. Using a single, shared copy of UCA in memory for the read-only default sort
order. Only small tailoring tables are kept in memory for locale-specific
customization.
6. Compressing sort keys efficiently.
7. Making the sort order be data-driven.
In general, the best performance from the Collation Service is expected by
doing the following:
1. After opening a collator, keep and reuse it until done. Do not open new
collators for the same sort order. (Note the restriction on
multi-threading.)
2. Use `ucol_strcoll` etc. when comparing strings. If it is necessary to
compare strings thousands or millions of times,
create the sort keys first and compare the sort keys instead.
Generating the sort keys of two strings is about 5-10
times slower than just comparing them directly.
3. Follow the best practice guidelines for generating sort keys. Do not call
`ucol_getSortKey` twice to first size the key and then allocate the sort key
buffer and repeat the call to the function to fill in the buffer.
### Performance and Storage Implications of Attributes
Most people use the default attributes when comparing strings or when creating
sort keys. When they do want to customize the ordering, the most common options
are the following:
`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\
Used to ignore space and punctuation characters
`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\
Used to ignore space and punctuation characters except when there is no previous letter, accent, or case/variable difference.
`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\
Used to change the ordering of upper vs. lower case letters (as
well as small vs. large kana)
`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\
Used to ignore only the accent differences.
`UCOL_NORMALIZATION_MODE == UCOL_ON`\
Force to always check for normalization. This
is used if the input text may not be in FCD form.
`UCOL_FRENCH_COLLATION == UCOL_OFF`\
This is only useful for languages like French and Catalan that may turn this attribute on.
(It is the default only for Canadian French ("fr-CA").)
In String Comparison, most of these options have little or no effect on
performance. The only noticeable one is normalization, which can cost 10%-40% in
performance.
For Sort Keys, most of these options either leave the storage alone or reduce
it. Shifting can reduce the storage by about 10%-20%; case level + primary-only
can decrease it by about 20% to 40%. Using no French accents can reduce the storage
by about 38%, but only for languages like French and Catalan that turn it on by
default. On the other hand, using Shifted + Quaternary can increase the storage by
10%-15%. (The Identical Level also increases the length, but this option is not
recommended).
> :point_right: **Note** All of the above numbers are based on
> tests run on a particular machine, with a particular set of data.
> (The data for each language is a large number of names
> in that language in the format `<first_name>, <last_name>`.)
> The performance and storage may vary, depending on the particular computer,
> operating system, and data.
## Versioning
Sort keys are often stored on disk for later reuse. A common example is the use
of keys to build indexes in databases. When comparing keys, it is important to
know that both keys were generated by the same algorithms and weightings.
Otherwise, identical strings with keys generated on two different dates, for
example, might compare as unequal. Sort keys can be affected by new versions of
ICU or its data tables, new sort key formats, or changes to the Collator.
Starting with release 1.8.1, ICU provides a versioning mechanism to identify the
version information of the following (among others):
1. The run-time executable
2. The collation element content
3. The Unicode/UCA database
4. The tailoring table
The version information of Collator is a 32-bit integer. If a new version of ICU
has changes affecting the content of collation elements, the version information
will be changed. In that case, using the new version of the ICU collator will
require regenerating any saved or stored sort keys.
However, it is possible to modify ICU code or data without changing relevant version numbers,
so it is safer to regenerate sort keys any time after any part of ICU has been updated.
Since ICU4C 1.8.1,
it is possible to build your program so that it uses more than one version of
ICU (only in C/C++, not in Java). Therefore, you could use the current version
for the features you need and use the older version for collation.
## Programming Examples
See the [Collation Examples](examples.md) chapter for an example of how to
compare and create sort keys with the default locale in C, C++ and Java.
# Collation Concepts
The previous section demonstrated many of the requirements imposed on string
comparison routines that try to correctly collate strings according to
conventions of more than a hundred different languages, written in many
different scripts. This section describes the principles and architecture behind
the ICU Collation Service.
## Sortkeys vs Comparison
Sort keys are most useful in databases, where the overhead of calling a function
for each comparison is very large.
Generating a sort key from a Collator is many times more expensive than doing a
compare with the Collator (for common use cases), at least when the two functions
are called from Java or C. So for those languages, unless a very large
number of comparisons is needed, it is better to call the compare function.
Here is an example, with a little back-of-the-envelope calculation. Let's
suppose that with a given language on a given platform, the compare performance
(CP) is 100 times faster than the sortKey performance (SP), and that you are doing a
binary search of a list with 1,000 elements. The binary key comparison performance
is BP. We'd do about 10 comparisons, getting:
compare: 10 \* CP
sortkey: 1 \* SP + 10 \* BP
Even if BP is free, compare would be better. One has to get up to where log2(n)
= 100 before they break even.
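A quick arithmetic check of this, under the stated assumptions (SP is 100 times CP, and the binary key compare BP is taken as free):

```python
import math

CP = 1          # cost of one compare call (arbitrary unit)
SP = 100 * CP   # assumed cost of generating one sort key
BP = 0          # assume the binary key compare is free

def compare_cost(n):
    return math.log2(n) * CP        # ~log2(n) compares in a binary search

def sortkey_cost(n):
    return SP + math.log2(n) * BP   # one key generation, then free compares

compare_cost(1000) < sortkey_cost(1000)  # direct compares win at n = 1000
# Break-even requires log2(n) == 100, i.e. n == 2**100.
```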
But even this calculation is only a rough guide. First, the binary comparison is
not completely free. Secondly, the performance of compare function varies
radically with the source data. We optimized for maximizing performance of
collation in sorting and binary search, so comparing strings that are "close" is
optimized to be much faster than comparing strings that are "far away". That
optimization is important because normal sort/lookup operations compare close
strings far more often -- think of binary search, where the last few comparisons
are always with the closest strings. So even the above calculation is not very
accurate.
## Comparison Levels
In general, when comparing and sorting objects, some properties can take
precedence over others. For example, in geometry, you might consider first the
number of sides a shape has, followed by the number of sides of equal length.
This causes triangles to be sorted together, then rectangles, then pentagons,
etc. Within each category, the shapes would be ordered according to whether they
had 0, 2, 3 or more sides of the same length. However, this is not the only way
the shapes can be sorted. For example, it might be preferable to sort shapes by
color first, so that all red shapes are grouped together, then blue, etc.
Another approach would be to sort the shapes by the amount of area they enclose.
Similarly, character strings have properties, some of which can take precedence
over others. There is more than one way to prioritize the properties.
For example, a common approach is to distinguish characters first by their
unadorned base letter (for example, without accents, vowels or tone marks), then
by accents, and then by the case of the letter (upper vs. lower). Ideographic
characters might be sorted by their component radicals and then by the number of
strokes it takes to draw the character.
An alternative ordering would be to sort these characters by strokes first and
then by their radicals.
The ICU Collation Service supports many levels of comparison (named "Levels",
but also known as "Strengths"). Having these categories enables ICU to sort
strings precisely according to local conventions. However, by allowing the
levels to be selectively employed, searching for a string in text can be
performed with various matching conditions.
Performance optimizations have been made for ICU collation with the default
level settings. Specific performance impacts are discussed in the Performance
section below.
Following is a list of the names for each level and an example usage:
1. Primary Level: Typically, this is used to denote differences between base
characters (for example, "a" < "b"). It is the strongest difference. For
example, dictionaries are divided into different sections by base character.
This is also called the level-1 strength.
2. Secondary Level: Accents in the characters are considered secondary
differences (for example, "as" < "às" < "at"). Other differences between
letters can also be considered secondary differences, depending on the
language. A secondary difference is ignored when there is a primary
difference anywhere in the strings. This is also called the level-2
strength.
Note: In some languages (such as Danish), certain accented letters are
considered to be separate base characters. In most languages, however, an
accented letter only has a secondary difference from the unaccented version
of that letter.
3. Tertiary Level: Upper and lower case differences in characters are
distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
addition, a variant of a letter differs from the base form on the tertiary
level (such as "A" and "Ⓐ"). Another example is the difference between large
and small Kana. A tertiary difference is ignored when there is a primary or
secondary difference anywhere in the strings. This is also called the
level-3 strength.
4. Quaternary Level: When punctuation is ignored (see Ignoring Punctuation
(§)) at level 1-3, an additional level can be used to distinguish words with
and without punctuation (for example, "ab" < "a-b" < "aB"). This difference
is ignored when there is a primary, secondary or tertiary difference. This
is also known as the level-4 strength. The quaternary level should only be
used if ignoring punctuation is required or when processing Japanese text
(see Hiragana processing (§)).
5. Identical Level: When all other levels are equal, the identical level is
used as a tiebreaker. The Unicode code point values of the NFD form of each
string are compared at this level, in case there is no difference at
levels 1-4. For example, Hebrew cantillation marks are only distinguished
at this level. This level should be used sparingly, as a difference between
two strings only at the code point level is extremely rare. Using this
level substantially decreases the performance for both incremental
comparison and sort key generation (as well as increasing the sort key
length). It is also known as the level-5 strength.
## Backward Secondary Sorting
Some languages require words to be ordered on the secondary level according to
the *last* accent difference, as opposed to the *first* accent difference. This
was previously the default for all French locales, based on some French
dictionary ordering traditions, but is currently only applicable to Canadian
French (locale **fr_CA**), for conformance with the [Canadian sorting
standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
ordering is only noticeable for a small number of pairs of real words. For more
information see [UCA: Contextual
Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).
Example:
Forward secondary | Backward secondary
----------------- | ------------------
cote | cote
coté | côte
côte | coté
côté | côté
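The effect can be reproduced with made-up secondary weights (1 = no accent, 2 = acute, 3 = circumflex; these values are invented for illustration and are not ICU's):

```python
# Compare only the secondary (accent) weights of each word, read either
# forward or backward; the primaries are equal for all four words.
SEC = {"e": 1, "o": 1, "é": 2, "ô": 3}

def secondary_key(word, backward=False):
    weights = [SEC[c] for c in word if c in SEC]
    if backward:
        weights.reverse()   # consider the *last* accent difference first
    return weights

words = ["cote", "coté", "côte", "côté"]
sorted(words, key=secondary_key)
# → ['cote', 'coté', 'côte', 'côté']   (forward secondary)
sorted(words, key=lambda w: secondary_key(w, backward=True))
# → ['cote', 'côte', 'coté', 'côté']   (backward secondary)
```

Reversing the weight sequence is exactly what moves "côte" ahead of "coté", matching the table above.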
## Contractions
A contraction is a sequence consisting of two or more letters. It is considered
a single letter in sorting.
For example, in the traditional Spanish sorting order, "ch" is considered a
single letter. All words that begin with "ch" sort after all other words
beginning with "c", but before words starting with "d".
Other examples of contractions are "ch" in Czech, which sorts after "h", and
"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
respectively.
Example:
Order without contraction | Order with contraction "lj" sorting after letter "l"
------------------------- | ----------------------------------------------------
la | la
li | li
lj | lk
lja | lz
ljz | lj
lk | lja
lz | ljz
ma | ma
Contracting sequences such as the above are not very common in most languages.
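The table above can be reproduced by tokenizing contractions before assigning weights; the numeric weights here are invented for illustration:

```python
# Treat "lj" as a single letter whose primary weight falls between
# "l" (followed by anything) and "m".
WEIGHTS = {"a": 10, "i": 90, "j": 100, "k": 110, "l": 120, "lj": 125,
           "m": 130, "z": 260}

def tokens(word):
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] in WEIGHTS:   # longest match: take the contraction
            out.append(word[i:i + 2])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def sort_key(word):
    return [WEIGHTS[t] for t in tokens(word)]

words = ["la", "li", "lj", "lja", "ljz", "lk", "lz", "ma"]
sorted(words, key=sort_key)
# → ['la', 'li', 'lk', 'lz', 'lj', 'lja', 'ljz', 'ma']
```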
> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
> if a completely ignorable code point
> appears in text in the middle of a contraction, it will not break the contraction.
> For example, in Czech sorting, the sequence c, U+0000, h will sort as if it were "ch".
## Expansions
If a letter sorts as if it were a sequence of more than one letter, it is called
an expansion.
For example, in German phonebook sorting (de@collation=phonebook or BCP 47
de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae."
All words starting with "ä" will sort between words starting with "ad" and words
starting with "af".
In the case of Unicode encoding, characters can often be represented either as
pre-composed characters or in decomposed form. For example, the letter "à" can
be represented in its decomposed (a+\`) and pre-composed (à) form. Most
applications do not want to distinguish text by the way it is encoded. A search
for "à" should find all instances of the letter, regardless of whether the
instance is in pre-composed or decomposed form. Therefore, either form of the
letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.
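An expansion can be sketched by replacing the letter with its multi-letter equivalent before comparison. The mapping below covers only a few letters of the German phonebook tailoring and is for illustration only:

```python
import unicodedata

def phonebook_primary(word):
    """Sketch of a primary-level key with German phonebook expansions."""
    word = unicodedata.normalize("NFC", word)   # fold decomposed input first
    expansions = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    return "".join(expansions.get(c, c) for c in word.lower())

words = ["Afrika", "Ast", "Äpfel", "Abend"]
print(sorted(words, key=phonebook_primary))
# ['Abend', 'Äpfel', 'Afrika', 'Ast']  ("ä" sorts as "ae", between "ab" and "af")
```

The NFC step illustrates the point made above: pre-composed and decomposed input must yield the same key.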
## Contractions Producing Expansions
It is possible to have contractions that produce expansions.
One example occurs in Japanese, where a vowel followed by the prolonged sound
mark is treated as equivalent to the doubled-vowel version:

カアー <<< カーア and\
キイー <<< キーイ
> :point_right: **Note** Since ICU 2.0 the Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contractions producing expansions.
## Normalization
In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed form. There are other types of
equivalence possible with Unicode as well, including canonical and compatibility
equivalence. The process of normalization ensures that text is written in a
predictable way, so that searches are not made unnecessarily complicated by
having to match on equivalences. Not all text is normalized, however, so it is
useful to have a collation service that can handle text that is not normalized,
and do so efficiently.
The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.
In practice, most data that is encountered is in normalized or semi-normalized
form already. The ICU Collation Service is designed so that it can process a
wide range of normalized or un-normalized text without a need for normalization
processing. When a case is encountered that requires normalization, the ICU
Collation Service drops into code specific to this purpose. This maximizes
performance for the majority of text that does not require normalization.
In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service can turn Normalization Checking either on or off. If
Normalization Checking is turned off, it is the user's responsibility to ensure
that all text is already in the appropriate form. Since this holds for the great
majority of the world's languages, normalization checking is turned off by
default for most locales.
If the text requires normalization processing, Normalization Checking should be
on. Any language that uses multiple combining characters such as Arabic, ancient
Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
to be on, or the text to go through a normalization process before collation.
For more information about Normalization related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15.](http://www.unicode.org/reports/tr15/)
> :point_right: **Note** ICU supports two modes of normalization: on and off.
> The java.text.\* classes offer a compatibility decomposition mode, which is not supported in ICU.
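The canonical equivalence involved can be seen with Python's `unicodedata` module; a collator with normalization checking on produces the same result for both encodings below:

```python
import unicodedata

precomposed = "\u00E0"     # "à" as a single code point
decomposed = "a\u0300"     # "a" followed by a combining grave accent

print(precomposed == decomposed)   # False: the code point sequences differ
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

Normalization checking exists precisely so that these two encodings collate identically without the caller having to normalize first.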
## Ignoring Punctuation
In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.
These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except for the quaternary level. If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are then distinguished at the fourth level based on their
punctuation (if any). If the comparison function ignores differences at the
fourth level, then strings that differ by punctuation only are compared as
equal.
The following table shows the results of sorting a list of terms in three
different ways. In the first column, the punctuation characters (space " " and
hyphen "-") are not ignored (" " < "-" < "b"). In the second column, punctuation
characters are ignored in the first 3 levels and compared only on the fourth
level. In the third column, punctuation characters are ignored in the first 3
levels and the fourth level is not considered, so punctuated terms are
equivalent to the identical terms without punctuation.
For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.
Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird | black bird | **black bird**
black Bird | black-bird | **black-bird**
black birds | blackbird | **blackbird**
black-bird | black Bird | black Bird
black-Bird | black-Bird | black-Bird
black-birds | blackBird | blackBird
blackbird | black birds | black birds
blackBird | black-birds | black-birds
blackbirds | blackbirds | blackbirds
> :point_right: **Note** The strings shown in bold in the last column are
> compared as equal by the ICU Collator.\
> Since ICU 2.2, and as prescribed by the UCA, primary ignorable code points that
> follow shifted code points are completely ignored. This means that an accent
> following a space compares as if it were a space alone.
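These behaviors can be sketched with a layered sort key: base letters, then a case level, then the raw string as a quaternary tiebreaker. This plain-Python illustration handles only space and hyphen and is not ICU's implementation:

```python
def quaternary_key(s):
    """Sketch: ignore space/hyphen on the first levels; keep the raw string
    as a last-resort (quaternary) tiebreaker."""
    stripped = s.replace(" ", "").replace("-", "")
    case_level = tuple(c.isupper() for c in stripped)   # lowercase sorts first
    return (stripped.lower(), case_level, s)

def tertiary_key(s):
    """Same, but without the quaternary level: punctuated terms compare equal."""
    return quaternary_key(s)[:2]

words = ["blackbird", "black Bird", "black-bird", "black bird"]
print(sorted(words, key=quaternary_key))
# ['black bird', 'black-bird', 'blackbird', 'black Bird']
print(tertiary_key("black bird") == tertiary_key("black-bird"))  # True
```

The first three results tie on letters and case, so the raw strings (with their punctuation) break the tie, matching the middle column of the table above.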
## Case Ordering
The tertiary level is used to distinguish text by case, by small versus large
Kana, and other letter variants as noted above.
Some applications prefer to emphasize case differences so that words starting
with the same case sort together. Some Japanese applications require the
difference between small and large Kana be emphasized over other tertiary
differences.
The UCA does not provide a means to separate out either case or Kana differences
from the remaining tertiary differences. However, the ICU Collation Service has
two options that help customize case and/or Kana differences. Both options are
turned off by default.
### CaseFirst
The Case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The Case-first option can be set to
make either lowercase sort before
uppercase or uppercase sort before lowercase.
Note: The case-first option does not constitute a separate level; it is simply a
reordering of the tertiary level.
ICU makes use of the following three case categories for sorting:
1. uppercase: "ABC"
2. mixed case: "Abc", "aBc"
3. normal (lowercase or no case): "abc", "123"
Mixed case is always sorted between uppercase and normal case when the
"case-first" option is set.
### CaseLevel
The Case Level option makes a separate level for case differences. This is an
extra level positioned between secondary and tertiary. The case level is used in
Japanese to make the difference between small and large Kana more important than
the other tertiary differences. It also can be used to ignore other tertiary
differences, or even secondary differences. This is especially useful in
matching. For example, if the strength is set to primary only (level-1) and the
case level is turned on, the comparison ignores accents and tertiary differences
except for case. The contents of the case level are affected by the case-first
option.
The case level is independent from the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on. This
provides for comparison that takes into account the case differences, while at
the same time ignoring accents and tertiary differences other than case. This
may be used in searching.
Example:
**Case-first off, Case level off**
apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit
**Lowercase-first, Case level off**
apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit
**Uppercase-first, Case level off**
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich
**Lowercase-first, Case level on**
apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit
**Uppercase-first, Case level on**
Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich
## Script Reordering
Script reordering allows scripts and some other groups of characters to be moved
relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.
By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script blocks. These special groups of
characters are space, punctuation, symbol, currency, and digit. Script groups
can be intermingled with these special non-script groups if those special groups
are explicitly specified in the reordering.
The special code `others` stands for any script that is not explicitly mentioned
in the list. Anything that is after others will go at the very end of the list
in the order given. For example, `[Grek, others, Latn]` will result in an
ordering that puts all scripts other than Greek and Latin between them.
### Examples:
Note: All examples below use the string equivalents for the scripts and reorder
codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.
**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`
**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by:\
set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`
That is, setting a reordering always modifies
the DUCET/CLDR order, replacing whatever was previously set, rather than adding
on to it. In order to cumulatively modify an ordering, you have to retrieve the
existing ordering, modify it, and then set it.
**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`
**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`
**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`
**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`
followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR
**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`
followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale which may or may not be DUCET/CLDR
**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`
followed by:\
set reorder code - `[]`\
result - original reordering for the locale which may or may not be DUCET/CLDR
**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error
Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.
ICU 4.8-54:
* Scripts were reordered in groups, each normally starting with a [Recommended
Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
* Reorder codes moved as a group (were “equivalent”) if their scripts shared a
primary-weight lead byte.
* For example, Hebr and Phnx were “equivalent” reordering codes and were
reordered together. Their order relative to each other could not be changed.
* Only one code out of any group could be reordered, not multiple codes from
  the same group.
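The resolution of a reorder-code list against the default order can be sketched as follows. This reproduces the results of the examples above, but it is an illustration rather than ICU's algorithm; in particular it does not validate primary-equal script groups (Example 9's error case):

```python
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]

def resolve_reorder(codes):
    """Sketch: special groups not mentioned stay in front, in their default
    order; 'others' stands for every script not explicitly listed."""
    if "others" not in codes:
        codes = codes + ["others"]
    head = [g for g in SPECIAL_GROUPS if g not in codes]
    return head + codes

print(resolve_reorder(["Grek"]))
# ['space', 'punctuation', 'symbol', 'currency', 'digit', 'Grek', 'others']
print(resolve_reorder(["Grek", "others", "Hani", "symbol", "Tglg"]))
# ['space', 'punctuation', 'currency', 'digit', 'Grek', 'others', 'Hani', 'symbol', 'Tglg']
```

Note how `symbol`, once explicitly listed, drops out of the leading special groups, exactly as in Example 6.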
## Sorting of Japanese Text (JIS X 4061)
Japanese standard JIS X 4061 requires two changes to the collation procedures:
special processing of Hiragana characters and (for performance reasons) prefix
analysis of text.
### Hiragana Processing
The JIS X 4061 standard requires more levels than the UCA provides. To offer a
conformant sorting order, ICU uses the quaternary level to distinguish between
Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
symbols on the quaternary level, thus causing Hiragana sequences to sort before
the corresponding Katakana sequences.
### Prefix Analysis
Another characteristic of sorting according to JIS X 4061 is the large number
of contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana codepoints to be treated as
contractions, which reduces performance. The solution we adopted introduces the
prefix concept which allows us to improve the performance of Japanese sorting.
More about this can be found in the [customization
chapter](customization/index.md) .
## Thai/Lao reordering
UCA requires that certain Thai and Lao prevowels be reordered with a code point
following them. This option is always on in the ICU implementation, as
prescribed by the UCA.
This rule takes effect when:

1. A Thai vowel of the range U+0E40..U+0E44 precedes a Thai consonant of the
   range U+0E01..U+0E2E, or
2. A Lao vowel of the range U+0EC0..U+0EC4 precedes a Lao consonant of the
   range U+0E81..U+0EAE.

In these cases the vowel is placed after the consonant for collation purposes.
> :point_right: **Note** There is a difference between the java.text.\* classes and ICU in regard to Thai
> reordering. The java.text.\* classes allow tailorings to turn off reordering by
> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
> prevowels.
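The reordering can be sketched as a preprocessing pass over the text. This illustrates the rule itself, not ICU's actual collation-element handling:

```python
def reorder_prevowels(text):
    """Sketch: swap a Thai prevowel (U+0E40..U+0E44) or Lao prevowel
    (U+0EC0..U+0EC4) with the consonant that follows it."""
    def is_prevowel(c):
        return "\u0E40" <= c <= "\u0E44" or "\u0EC0" <= c <= "\u0EC4"
    def is_consonant(c):
        return "\u0E01" <= c <= "\u0E2E" or "\u0E81" <= c <= "\u0EAE"
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if is_prevowel(chars[i]) and is_consonant(chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

# เ (U+0E40) followed by ก (U+0E01): the vowel collates after the consonant
print(reorder_prevowels("\u0E40\u0E01") == "\u0E01\u0E40")   # True
```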
## Space Padding
In many database products, fields are padded with null. To get correct results,
the input to a Collator should omit any superfluous trailing padding spaces. The
problem arises with contractions, expansions, or normalization. Suppose that
there are two fields, one containing "aed" and the other with "äd". German
phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
"aed". But if both fields are padded with spaces to a length of 3, then this
will reverse the order, since the first will compare as if it were one character
longer. In other words, when you start with strings 1 and 2
1 | a | e | d | \<space\>
-- | -- | -- | --------- | ---------
2 | ä | d | \<space\> | \<space\>
they end up being compared on a primary level as if they were 1' and 2'
1' | a | e | d | \<space\> | &nbsp;
-- | -- | -- | -- | --------- | ---------
2' | a | e | d | \<space\> | \<space\>
Since 2' has an extra character (the extra space), it counts as having a primary
difference when it shouldn't. The correct result occurs when the trailing
padding spaces are removed, as in 1" and 2"
1" | a | e | d
-- | -- | -- | --
2" | a | e | d
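The padding effect can be demonstrated with a small sketch; the "ä" → "ae" replacement stands in for the German phonebook primary level and is for illustration only:

```python
def primary_key(field, strip_padding):
    """Sketch: German phonebook primary level ("ä" compares as "ae")."""
    if strip_padding:
        field = field.rstrip(" ")
    return field.replace("ä", "ae")

padded, expanded = "äd  ", "aed "   # both fields padded with spaces to width 4

# Without stripping, the extra space counts as a spurious primary difference:
print(primary_key(padded, False) < primary_key(expanded, False))    # False
# With trailing padding removed, the fields are primary-equal, as intended:
print(primary_key(padded, True) == primary_key(expanded, True))     # True
```

As the tables above show, stripping the trailing padding before calling the Collator restores the correct comparison.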
## Collator naming scheme
***Starting with ICU 54, the following naming scheme and its API functions are
deprecated.*** Use ucol_open() with language tag collation keywords instead (see
[Collation API Details](api.md)). For example,
ucol_open("de-u-co-phonebk-ka-shifted", &errorCode) for German Phonebook order
with "ignore punctuation" mode.
When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications. It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
can be used to specify that the desired options are: UCA version 4.0.0; ignore
spaces, punctuation and symbols; use Swedish linguistic conventions; compare
case-insensitively.
A number of attribute values are common across different attributes; these
include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
otherwise stated, the examples use the UCA alone with default settings.
> :point_right: **Note** In order to achieve uniqueness, a collator name always
> has the attribute abbreviations sorted.
### Main References
1. For a full list of supported locales in ICU, see [Locale
Explorer](http://demo.icu-project.org/icu-bin/locexp) , which also contains
an on-line demo showing sorting for each locale. The demo allows you to try
different attribute values, to see how they affect sorting.
2. To see tabular results for the UCA table itself, see the [Unicode Collation
Charts](http://www.unicode.org/charts/collation/) .
3. For the UCA specification, see [UTS #10: Unicode Collation
Algorithm](http://www.unicode.org/reports/tr10/) .
4. For more detail on the precise effects of these options, see [Collation
Customization](customization/index.md) .
#### Collator Naming Attributes
Attribute | Abbreviation | Possible Values
---------------------- | ------------ | ---------------
Locale | L | \<language\>
Script | Z | \<script\>
Region | R | \<region\>
Variant | V | \<variant\>
Keyword | K | \<keyword\>
&nbsp; | &nbsp; | &nbsp;
Strength | S | 1, 2, 3, 4, I, D
Case_Level | E | X, O, D
Case_First | C | X, L, U, D
Alternate | A | N, S, D
Variable_Top | T | \<hex digits\>
Normalization Checking | N | X, O, D
French | F | X, O, D
Hiragana | H | X, O, D
#### Collator Naming Attribute Descriptions
The **Locale** attribute is typically the most
important attribute for correct sorting and matching, according to the user
expectations in different countries and regions. The default UCA ordering will
only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
Otherwise, you need to supply the locale to UCA in order to properly collate
text for a given language. Thus a locale needs to be supplied so as to choose a
collator that is correctly **tailored** for that locale. The choice of a locale
will automatically preset the values for all of the attributes to something that
is reasonable for that locale. Thus most of the time the other attributes do not
need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.
In short attribute names,
`<language>_<script>_<region>_<variant>@collation=<keyword>` is
represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
all the elements are required. Valid values for locale elements are general
valid values for RFC 3066 locale naming.
**Example:**\
**Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
**Locale="de" (German)** "Köpfe" < "Kypper"
The **Strength** attribute determines whether accents or case are taken into
account when collating or matching text. (In writing systems without case or
accents, it controls similarly important features.) The default strength setting
usually does not need to be changed for collating (sorting), but often needs to
be changed when **matching** (e.g. SELECT). The possible values include Default
(D), Primary (1), Secondary (2), Tertiary (3), Quaternary (4), and Identical (I).
For example, people may choose to ignore accents or ignore accents and case when
searching for text.
Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to be
Shifted, then the Quaternary strength (4) can be used to break ties among
whitespace, punctuation, and symbols that would otherwise be ignored. If very
fine distinctions among characters are required, then the Identical strength (I)
can be used (for example, Identical Strength distinguishes between the
**Mathematical Bold Small A** and the **Mathematical Italic Small A**. For more
examples, look at the cells with white backgrounds in the collation charts).
However, using levels higher than Tertiary (up to the Identical strength)
results in significantly longer sort keys, and slower string comparison
performance for equal strings.
**Example:**\
**S=1** role = Role = rôle\
**S=2** role = Role < rôle\
**S=3** role < Role < rôle
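The three-level behavior can be sketched in plain Python (base letters, accents, case). This illustrates the Strength setting only and is not ICU's implementation:

```python
import unicodedata

def collation_key(word, strength):
    """Sketch: level 1 = base letters, level 2 = accents, level 3 = case."""
    nfd = unicodedata.normalize("NFD", word)
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    key = [base.lower()]
    if strength >= 2:
        key.append(tuple(ord(c) for c in nfd if unicodedata.combining(c)))
    if strength >= 3:
        key.append(tuple(c.isupper() for c in base))
    return tuple(key)

print(sorted(["rôle", "Role", "role"], key=lambda w: collation_key(w, 3)))
# ['role', 'Role', 'rôle']
print(collation_key("role", 1) == collation_key("rôle", 1))   # True: S=1 ignores accents
print(collation_key("role", 2) == collation_key("Role", 2))   # True: S=2 ignores case
```

Dropping a level from the key makes the corresponding difference invisible, which is exactly how lowering the Strength widens the set of strings that compare equal.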
The **Case_Level** attribute is used when ignoring accents **but not** case. In
such a situation, set Strength to Primary and Case_Level to On. In most locales,
this setting is Off by default. There is a small impact on string comparison
performance and sort key length if this attribute is set to On.
**Example:**\
**S=1, E=X** role = Role = rôle\
**S=1, E=O** role = rôle < Role
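A primary-strength comparison with a separate case level can be sketched as follows (a plain-Python illustration, not ICU's implementation):

```python
import unicodedata

def primary_with_case_level(word):
    """Sketch of S=1 with E=O: base letters plus a case level; accents and
    other tertiary differences are ignored."""
    base = "".join(c for c in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(c))
    return (base.lower(), tuple(c.isupper() for c in base))

print(primary_with_case_level("role") == primary_with_case_level("rôle"))  # True
print(primary_with_case_level("role") < primary_with_case_level("Role"))   # True
```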
The **Case_First** attribute is used to control whether
uppercase letters come before lowercase letters or vice versa, in the absence of
other differences in the strings. The possible values are Uppercase_First (U)
and Lowercase_First (L), plus the standard Default and Off. There is almost no
difference between the Off and Lowercase_First options in terms of results, so
typically users will not use Lowercase_First: only Off or Uppercase_First.
(People interested in the detailed differences between X and L should consult
[Collation Customization](customization/index.md).)
Specifying either L or U won't affect string comparison performance, but will
affect the sort key length.
**Example:**\
**C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
**C=U** "China" < "china" < "Denmark" < "denmark"
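The reordering within the tertiary level can be sketched by making the case bit the leading part of the tiebreaker (an illustration only; primary and secondary levels stay untouched):

```python
def case_first_key(word, upper_first=False):
    """Sketch: base letters first; the case bit leads the tertiary tiebreaker."""
    case_bits = tuple(c.islower() if upper_first else c.isupper() for c in word)
    return (word.lower(), case_bits)

words = ["Denmark", "china", "denmark", "China"]
print(sorted(words, key=case_first_key))
# ['china', 'China', 'denmark', 'Denmark']
print(sorted(words, key=lambda w: case_first_key(w, upper_first=True)))
# ['China', 'china', 'Denmark', 'denmark']
```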
The **Alternate** attribute is used to control the handling of
the so-called **variable** characters in the UCA: whitespace, punctuation and
symbols. If Alternate is set to Non-Ignorable (N), then differences among these
characters are of the same importance as differences among letters. If Alternate
is set to Shifted (S), then these characters are of only minor importance. The
Shifted value is often used in combination with Strength set to Quaternary. In
such a case, white-space, punctuation, and symbols are considered when comparing
strings, but only if all other aspects of the strings (base letters, accents,
and case) are identical. If Alternate is not set to Shifted, then there is no
difference between a Strength of 3 and a Strength of 4.
For more information and examples, see
[Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
the UCA.
The reason the Alternate values are not simply On and Off is that
additional Alternate values may be added in the future.
The UCA option
**Blanked** is expressed with Strength set to 3, and Alternate set to Shifted.
The default for most locales is Non-Ignorable. If Shifted is selected, it may be
slower if there are many strings that are the same except for punctuation; sort
key length will not be affected unless the strength level is also increased.
**Example:**\
**S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
**S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
**S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA
The **Variable_Top** attribute is only meaningful if the
Alternate attribute is not set to Non-Ignorable. In such a case, it controls
which characters count as ignorable. The \<hex\> value specifies the "highest"
character sequence (in UCA order) weight that is to be considered ignorable.
Thus, for example, if a user wanted white-space to be ignorable, but not any
visible characters, they would use the value Variable_Top=0020 (space). The hex
digits should specify only a single character. All characters of the same primary
weight are equivalent, so Variable_Top=3000 (ideographic space) has the same
effect as Variable_Top=0020.
This setting (alone) has little impact on string comparison performance; setting
it lower or higher will make sort keys slightly shorter or longer respectively.
**Example:**\
**S=3, A=S** di Silva = diSilva < U.S.A. = USA\
**S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA
The **Normalization** setting determines whether
text is thoroughly normalized or not in comparison. Even if the setting is off
(which is the default for many locales), text as represented in common usage
will compare correctly (for details, see [UTN
#5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in
non-canonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string
comparison performance cost if this attribute is On, depending on the frequency
of sequences that require normalization. There is no significant effect on sort
key length. If the input text is known to be in NFD or NFKD normalization form,
there is no need to enable this Normalization option.
**Example:**\
**N=X** ä = a + ◌̈ < ä + ◌̣ < a + ◌̣ + ◌̈\
**N=O** ä = a + ◌̈ < ä + ◌̣ = a + ◌̣ + ◌̈
Some **French** dictionary ordering traditions sort strings with
different accents from the back of the string. This attribute is automatically
set to On for the Canadian French locale (fr_CA). Users normally would not need
to explicitly set this attribute. There is a string comparison performance cost
when it is set On, but sort key length is unaffected.
**Example:**\
**F=X** cote < coté < côte < côté\
**F=O** cote < côte < coté < côté
Compatibility with JIS X 4061 requires the introduction of an
additional level to distinguish **Hiragana** and Katakana characters. If
compatibility with that standard is required, then this attribute is set On, and
the strength should be set to at least Quaternary.
This attribute is an implementation detail of the CLDR Japanese tailoring. The
implementation might change to use a different mechanism to achieve the same
Japanese sort order. Since ICU 50, this attribute is not settable any more.
**Example:**\
**H=X, S=4** きゅう = キュウ < きゆう = キユウ\
**H=O, S=4** きゅう < キュウ < きゆう < キユウ
> :point_right: **Note** If attributes in a collator name are not overridden,
> it is assumed that they are the same as for the given locale.
> For example, a collator opened with an empty
> string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.
### Summary of Value Abbreviations
Value | Abbreviation
------------- | ------------
Default | D
On | O
Off | X
Primary | 1
Secondary | 2
Tertiary | 3
Quaternary | 4
Identical | I
Shifted | S
Non-Ignorable | N
Lower-First | L
Upper-First | U
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# “Ignore Punctuation” Options
By default, spaces and punctuation characters add primary (base character)
differences. Such characters sort less-than digits and letters. For example, the
default collation yields “De Anza” < “de-luge” < “deanza”.
UCA/CLDR/ICU provide several options for “ignore punctuation” collation
settings, also known as Variable Weighting or Alternate Handling. These options
change the sorting behavior of “variable” characters algorithmically. “Variable”
characters are those with low (but non-zero) primary weights up to a threshold,
the “variable top”. By default, CLDR and ICU treat spaces and punctuation as
variable. (This can be changed via API.) The DUCET also includes most symbols.
## Non-Ignorable
The default behavior in CLDR & ICU, shown above, is to not ignore punctuation
(alternate=non-ignorable) but to map variable characters to their normal primary
collation elements.
All of the following options cause variable characters to be ignored on levels
1..3. Only when strings compare equal up to the tertiary level may variable
characters make a difference, depending on the options.
See also
* [UCA: Variable
Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting)
* [LDML: Setting
Options](http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html#Setting_Options)
Here is an overview of the sorting results with these options.
Non-ignorable | Blanked | Shifted | Shift-Trimmed | Variable-After
------------- | ------------ | ------- | ------------- | --------------
delug | delug | delug | delug | delug
de-luge | de-luge | de-luge | *deluge* | *deluge*
delu-ge | delu-ge (*) | delu-ge | de-luge | deluge-
*deluge* | *deluge* (*) | *deluge* | delu-ge | delu-ge
Deluge | deluge- (*) | deluge- | deluge- | de-luge
deluge- | Deluge | Deluge | Deluge | Deluge
Items with (*) compare equal to the preceding ones, and their relative order
is arbitrary. These only occur in the Blanked column. This table shows the
results of a stable sort algorithm with the non-ignorable column as input.
## Blanked
The simplest option is to “ignore punctuation” completely, as if all variable
characters (and following combining marks) had been removed from the input
strings before comparing them.
For example: “De Anza” = “De-Anza” = “DeAnza”.
In ICU, this option is selected with alternate=shifted and
strength=primary|secondary|tertiary. (ICU does not support Blanked combined with
strength=identical.)
The implementation “blanks” out all weights of the variable characters'
collation elements.
*With all of the following options, variable characters are ignored on levels
1..3 but add distinctions on level 4 (quaternary level).*
## Shifted
Among strings that compare tertiary-equal, that is, they contain the same
letters, accents and casing:
* Sorts all variable characters less-than (before) regular characters.
* Appending a variable character makes a string sort *greater-than* the string
without it.
* *Inserting* a variable character makes a string sort *less-than* the string
without it.
* Inserting a variable character *earlier* in a string makes it sort
*less-than* inserting the variable character *later* in the string.
The result is similar to [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys) (with shorter
prefixes sorting less-than longer ones), like in last-name+first-name sorting,
except only among tertiary-equal strings.
For example: “de-luge” < “delu-ge” < “deluge” < “deluge-”.
In ICU, this option is selected with alternate=shifted and
strength=quaternary|identical.
The implementation “shifts” the primary weight p of the collation element \[p,
s, t, q\] of each variable character down three levels: \[0, 0, 0, p\]. Regular
characters with primary collation elements get a high quaternary weight, higher
than that of any variable character.
Note that this behavior is different from collation on secondary and tertiary
level, because normal collation elements get low secondary & tertiary weights
but high quaternary weights. Adding an accent difference anywhere makes a string
sort greater-than the string without it, and adding an accent difference earlier
makes it sort greater-than adding it later. For example, “deanza” < “deanzä” <
“deänza” < “dëanza”. (Compare the ‘ä’/‘ë’ positions here with the - positions
above.)
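This shifting can be sketched directly. Here the variables are just space and hyphen, each regular character stands in for its primary collation element, and the secondary/tertiary levels are omitted for brevity:

```python
VARIABLES = set(" -")
HIGH = 0x110000   # quaternary weight above any variable's primary weight

def shifted_key(s):
    """Sketch of 'shifted': a variable's primary weight moves to level 4
    ([p,s,t,q] -> [0,0,0,p]); regular characters get a high quaternary weight."""
    primary, quaternary = [], []
    for c in s:
        if c in VARIABLES:
            quaternary.append(ord(c))
        else:
            primary.append(c)
            quaternary.append(HIGH)
    return (primary, quaternary)

words = ["deluge", "deluge-", "de-luge", "delu-ge"]
print(sorted(words, key=shifted_key))
# ['de-luge', 'delu-ge', 'deluge', 'deluge-']
```

An earlier hyphen yields an earlier low quaternary weight, so it sorts less-than a later one, and the string with the appended hyphen sorts last, matching the example above.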
## Shift-Trimmed
*Note: This method is not currently implemented in ICU.*
Among strings that compare tertiary-equal:
* Sorts variable characters sometimes less-than, sometimes greater-than
regular characters.
* Inserting a variable character anywhere makes a string sort *greater-than*
the string without it. (The string without variable characters gets an empty
quaternary level.)
* Inserting a variable character *earlier* in a string makes it sort
*less-than* inserting the variable character *later* in the string.
For example: “deluge” < “de-luge” < “delu-ge” < “deluge-”.
The Shift-Trimmed method works like Shifted, except that *trailing*
high-quaternary weights (from regular characters) are removed (trimmed).
Compared with Shifted, the Shift-Trimmed method sorts strings without variable
characters before ones with variable characters added, rather than producing the
equivalent of [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys).
Shift-Trimmed is more complicated to implement than all of the other options:
When comparing strings, a lookahead (or equivalent) is needed to determine
whether a non-variable character gets a zero quaternary weight (if no variables
follow) or a high quaternary weight (if at least one variable follows). When
building sort keys, trailing high/common quaternary weights are trimmed (backed
out) at the end of the quaternary level.
## Variable-After
*Note: This method is not currently implemented in ICU.*
Among strings that compare tertiary-equal:
* Sorts all variable characters greater-than (after) regular characters.
* Inserting a variable character anywhere makes a string sort *greater-than*
the string without it. (Like Shift-Trimmed.)
* Inserting a variable character *earlier* in a string makes it sort
*greater-than* inserting the variable character *later* in the string. (Like
accent differences.)
For example: “deluge” < “deluge-” < “delu-ge” < “de-luge”.
The implementation “shifts” the primary weight p of the collation element \[p,
s, t, q\] of each variable character down three levels: \[0, 0, 0, p\]. Regular
characters with primary collation elements get a *low* quaternary weight,
*lower* than that of any variable character. This is consistent with collation
on secondary and tertiary levels but unlike [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys).
This method extends the [UCA well-formedness condition
2](http://www.unicode.org/reports/tr10/#WF2) to apply to quaternary weights.
(UCA versions before UCA 6.2 did not limit WF2 to secondary & tertiary weights,
which meant that several of the Variable Weighting options technically created
ill-formed quaternary weights.)
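Both unimplemented options can likewise be sketched with a toy quaternary model in plain Java (not an ICU API; the weight values and the choice of `-` as the only variable character are invented for the example). Since the sample words are equal through the tertiary level once `-` is ignored, only the quaternary level is compared here:

```java
import java.util.*;

public class VariableWeightToy {
    // "trimmed": regular characters get a high quaternary weight, but trailing
    //            high weights are removed (Shift-Trimmed).
    // "after":   regular characters get a low quaternary weight (Variable-After).
    static List<Integer> quaternaries(String s, boolean trimmed) {
        int regularQ = trimmed ? 0xFF : 0x01;  // '-' itself weighs 0x2D
        List<Integer> q = new ArrayList<>();
        for (char c : s.toCharArray()) q.add(c == '-' ? (int) c : regularQ);
        if (trimmed)                           // trim trailing regular weights
            while (!q.isEmpty() && q.get(q.size() - 1) == 0xFF)
                q.remove(q.size() - 1);
        return q;
    }

    static Comparator<String> byQuaternary(boolean trimmed) {
        return (a, b) -> {
            List<Integer> qa = quaternaries(a, trimmed), qb = quaternaries(b, trimmed);
            for (int i = 0; i < Math.min(qa.size(), qb.size()); i++) {
                int d = Integer.compare(qa.get(i), qb.get(i));
                if (d != 0) return d;
            }
            return Integer.compare(qa.size(), qb.size());
        };
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("deluge-", "delu-ge", "de-luge", "deluge");
        List<String> t = new ArrayList<>(words), v = new ArrayList<>(words);
        t.sort(byQuaternary(true));   // Shift-Trimmed
        v.sort(byQuaternary(false));  // Variable-After
        System.out.println(t); // [deluge, de-luge, delu-ge, deluge-]
        System.out.println(v); // [deluge, deluge-, delu-ge, de-luge]
    }
}
```

Note how trimming makes the string without variables get an empty quaternary level (sorting first), while the low regular weights of Variable-After make an earlier variable sort *later*, as in the examples above.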


<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Collation Examples
## Simple Collation Sample Customization
The following program demonstrates how to compare and create sort keys with
default locale.
In **C:**
```C
#include <stdio.h>
#include <memory.h>
#include <string.h>
#include "unicode/ustring.h"
#include "unicode/utypes.h"
#include "unicode/uloc.h"
#include "unicode/ucol.h"
#define MAXBUFFERSIZE 100
#define BIGBUFFERSIZE 5000
UBool collateWithLocaleInC(const char* locale, UErrorCode *status)
{
UChar dispName [MAXBUFFERSIZE];
int32_t bufferLen = 0;
UChar source [MAXBUFFERSIZE];
UChar target [MAXBUFFERSIZE];
UCollationResult result = UCOL_EQUAL;
uint8_t sourceKeyArray [MAXBUFFERSIZE];
uint8_t targetKeyArray [MAXBUFFERSIZE];
int32_t sourceKeyOut = 0,
targetKeyOut = 0;
UCollator *myCollator = 0;
if (U_FAILURE(*status))
{
return FALSE;
}
u_uastrcpy(source, "This is a test.");
u_uastrcpy(target, "THIS IS A TEST.");
myCollator = ucol_open(locale, status);
if (U_FAILURE(*status)){
char dispNameC[MAXBUFFERSIZE];
UErrorCode dispStatus = U_ZERO_ERROR; /* *status already holds the failure code */
bufferLen = uloc_getDisplayName(locale, 0, dispName, MAXBUFFERSIZE, &dispStatus);
u_austrcpy(dispNameC, dispName); /* convert UTF-16 display name for fprintf */
/* Report the error with the display name... */
fprintf(stderr,
"Failed to create the collator for : \"%s\"\n", dispNameC);
return FALSE;
}
result = ucol_strcoll(myCollator, source, u_strlen(source), target, u_strlen(target));
/* result is UCOL_LESS: the strings differ only in case, a tertiary difference */
if (result != UCOL_LESS)
{
fprintf(stderr,
"Comparing two strings with only case differences in C failed.\n");
return FALSE;
}
/* To compare them with just primary differences */
ucol_setStrength(myCollator, UCOL_PRIMARY);
result = ucol_strcoll(myCollator, source, u_strlen(source), target, u_strlen(target));
/* result is 0 */
if (result != 0)
{
fprintf(stderr,
"Comparing two strings with no differences in C failed.\n");
return FALSE;
}
/* Now, do the same comparison with keys */
sourceKeyOut = ucol_getSortKey(myCollator, source, -1, sourceKeyArray, MAXBUFFERSIZE);
targetKeyOut = ucol_getSortKey(myCollator, target, -1, targetKeyArray, MAXBUFFERSIZE);
result = 0;
result = strcmp((const char *)sourceKeyArray, (const char *)targetKeyArray);
if (result != 0)
{
fprintf(stderr,
"Comparing two strings with sort keys in C failed.\n");
return FALSE;
}
ucol_close(myCollator);
return TRUE;
}
```
In **C++:**
```C++
#include <stdio.h>
#include <string>
#include "unicode/unistr.h"
#include "unicode/utypes.h"
#include "unicode/locid.h"
#include "unicode/coll.h"
#include "unicode/tblcoll.h"
#include "unicode/coleitr.h"
#include "unicode/sortkey.h"
UBool collateWithLocaleInCPP(const Locale& locale, UErrorCode& status)
{
UnicodeString dispName;
UnicodeString source("This is a test.");
UnicodeString target("THIS IS A TEST.");
Collator::EComparisonResult result = Collator::EQUAL;
CollationKey sourceKey;
CollationKey targetKey;
Collator *myCollator = 0;
if (U_FAILURE(status))
{
return FALSE;
}
myCollator = Collator::createInstance(locale, status);
if (U_FAILURE(status)){
locale.getDisplayName(dispName);
std::string dispNameUTF8;
dispName.toUTF8String(dispNameUTF8); /* convert UnicodeString for fprintf */
/* Report the error with the display name... */
fprintf(stderr,
"Failed to create the collator for : \"%s\"\n", dispNameUTF8.c_str());
return FALSE;
}
result = myCollator->compare(source, target);
/* result is Collator::LESS: the strings differ only in case, a tertiary difference */
if (result != Collator::LESS)
{
fprintf(stderr,
"Comparing two strings with only case differences in C++ failed.\n");
return FALSE;
}
/* To compare them with just primary differences */
myCollator->setStrength(Collator::PRIMARY);
result = myCollator->compare(source, target);
/* result is 0 */
if (result != 0)
{
fprintf(stderr,
"Comparing two strings with no differences in C++ failed.\n");
return FALSE;
}
/* Now, do the same comparison with keys */
myCollator->getCollationKey(source, sourceKey, status);
myCollator->getCollationKey(target, targetKey, status);
result = Collator::EQUAL;
result = sourceKey.compareTo(targetKey);
if (result != 0)
{
fprintf(stderr,
"Comparing two strings with sort keys in C++ failed.\n");
return FALSE;
}
delete myCollator;
return TRUE;
}
```
### Main Function
```C++
extern "C" UBool collateWithLocaleInC(const char* locale, UErrorCode *status);
int main()
{
UErrorCode status = U_ZERO_ERROR;
fprintf(stdout, "\n");
if (collateWithLocaleInCPP(Locale("en", "US"), status) != TRUE)
{
fprintf(stderr,
"Collate with locale in C++ failed.\n");
} else
{
fprintf(stdout, "Collate with Locale C++ example worked!!\n");
}
status = U_ZERO_ERROR;
fprintf(stdout, "\n");
if (collateWithLocaleInC("en_US", &status) != TRUE)
{
fprintf(stderr,
"Collate with locale in C failed.\n");
} else
{
fprintf(stdout, "Collate with Locale C example worked!!\n");
}
return 0;
}
```
In **Java:**
```Java
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.CollationElementIterator;
import com.ibm.icu.text.CollationKey;
import java.util.Locale;
public class CollateExample
{
public static void main(String arg[])
{
CollateExample example = new CollateExample();
try {
if (!example.collateWithLocale(Locale.US)) {
System.err.println("Collate with locale example failed.");
}
else {
System.out.println("Collate with Locale example worked!!");
}
} catch (Exception e) {
System.err.println("Collating with locale failed");
e.printStackTrace();
}
}
public boolean collateWithLocale(Locale locale) throws Exception
{
String source = "This is a test.";
String target = "THIS IS A TEST.";
Collator myCollator = Collator.getInstance(locale);
int result = myCollator.compare(source, target);
// result is negative: the strings differ only in case, a tertiary difference
if (result >= 0) {
System.err.println(
"Comparing two strings with only case differences failed.");
return false;
}
// To compare them with just primary differences
myCollator.setStrength(Collator.PRIMARY);
result = myCollator.compare(source, target);
// result is 0
if (result != 0) {
System.err.println(
"Comparing two strings with no differences failed.");
return false;
}
// Now, do the same comparison with keys
CollationKey sourceKey = myCollator.getCollationKey(source);
CollationKey targetKey = myCollator.getCollationKey(target);
result = sourceKey.compareTo(targetKey);
if (result != 0) {
System.err.println("Comparing two strings with sort keys failed.");
return false;
}
return true;
}
}
```
## Language-sensitive searching
String searching is a well-researched area, and there are algorithms that can
optimize the searching process. Perhaps the best known is the Boyer-Moore
method. For a full description of this concept, see Laura Werner's
text searching article
(<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>).
However, implementing collation-based search with the Boyer-Moore method
while getting correct results is very tricky,
and ICU no longer uses this method.
Please see the [String Search Service](string-search.md) chapter.
## Using large buffers to manage sort keys
A good solution for the problem of not knowing the sort key size in advance is
to allocate a large buffer and store all the sort keys there, while keeping a
list of indexes or pointers to that buffer.
The following sample code takes a pointer to an array of UChar pointers and
an array of key indexes. It allocates and fills a buffer with sort keys and
returns the maximum size of a sort key. Once you have done this for your
strings, you just need to allocate fields of the maximum size and copy your
sort keys from the buffer into them.
```C++
uint32_t
fillBufferWithKeys(UCollator *coll, UChar **source, uint32_t *keys, uint32_t sourceSize,
uint8_t **buffer, uint32_t *maxSize, UErrorCode *status)
{
if(status == NULL || U_FAILURE(*status)) {
return 0;
}
uint32_t bufferSize = 16384;
uint32_t increment = 16384;
uint32_t currentOffset = 0;
uint32_t keySize = 0;
uint32_t i = 0;
*maxSize = 0;
*buffer = (uint8_t *)malloc(bufferSize * sizeof(uint8_t));
if(*buffer == NULL) {
*status = U_MEMORY_ALLOCATION_ERROR;
return 0;
}
for(i = 0; i < sourceSize; i++) {
keys[i] = currentOffset;
keySize = ucol_getSortKey(coll, source[i], -1, *buffer+currentOffset, bufferSize-currentOffset);
if(keySize > bufferSize-currentOffset) {
uint8_t *newBuffer = (uint8_t *)realloc(*buffer, bufferSize+increment);
if(newBuffer == NULL) {
free(*buffer); /* avoid leaking the old buffer on realloc failure */
*buffer = NULL;
*status = U_MEMORY_ALLOCATION_ERROR;
return 0;
}
*buffer = newBuffer;
bufferSize += increment;
keySize = ucol_getSortKey(coll, source[i], -1, *buffer+currentOffset, bufferSize-currentOffset);
}
/* here you can hook code that does something interesting with the keySize -
* remembers the maximum or similar...
*/
if(keySize > *maxSize) {
*maxSize = keySize;
}
currentOffset += keySize;
}
return currentOffset;
}
```

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Collation FAQ
## Q. Should I turn Full Normalization on all the time?
**A.** You can if you want, but you don't typically need to. The key is that
normalization for most characters is already built into ICU's collation by
default. Everything that can be done without affecting performance is already
there, and will work with most languages. So the normalization parameter in ICU
really only changes whether full normalization is invoked.
The outlying cases are situations where a language uses multiple accents
(non-spacing marks) on the same base letter, such as Vietnamese or Arabic. In
those cases, full normalization needs to be turned on. If you use the right
locale (or language) when creating a collation in ICU, then full normalization
will be turned on or off according to what the language typically requires.
## Q. Are there any cases where I would want to override the Full Normalization setting?
**A.** The only case where you really need to worry about that parameter is for
very unusual situations, such as sorting a list of names according to
English conventions where the list contains, for example, some Vietnamese
names. One way to check for such a situation is to open a collator for each of
the languages you expect to find, and see if any of them have the full
normalization flags set.
## Q. How can collation rules mimic word sorting?
Word sort is a way of sorting in which certain punctuation characters are
completely ignored, while others are considered. The example of word sort
below ignores hyphens and apostrophes:
Word Sort | String Sort
--------- | -----------
billet | bill's
bills | billet
bill's | bills
cannot | can't
cant | cannot
can't | cant
con | co-op
coop | con
co-op | coop
This specific behavior can be mimicked using a tailoring that makes these
characters completely ignorable. In this case, an appropriate rule would be
`&\u0000 = '' = '-'`.
Please note that we do not consider such a solution correct, since different
languages have different word elements. Instead, one should use the shifted
mode for comparison.
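The word-sort column above can be reproduced with a small simulation in plain Java (an illustration of the idea, not an ICU API): the ignorable characters are stripped for the comparison, and ties are broken so that the string with fewer ignorables sorts first, which is the effect a trimmed quaternary level would have.

```java
import java.util.*;

public class WordSortToy {
    // Simulation of "word sort": '-' and '\'' are treated as completely
    // ignorable, and ties between otherwise-equal strings are broken by
    // length, so the string with fewer ignorable characters sorts first.
    static String stripIgnorable(String s) {
        return s.replace("-", "").replace("'", "");
    }

    static final Comparator<String> WORD_SORT = (a, b) -> {
        int d = stripIgnorable(a).compareTo(stripIgnorable(b));
        return d != 0 ? d : Integer.compare(a.length(), b.length());
    };

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList(
            "bill's", "billet", "bills", "can't", "cannot", "cant",
            "co-op", "con", "coop"));
        words.sort(WORD_SORT);
        System.out.println(words);
        // [billet, bills, bill's, cannot, cant, can't, con, coop, co-op]
    }
}
```

The output matches the Word Sort column of the table above.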

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Collation
## Overview
Information is displayed in sorted order to enable users to easily find the
items they are looking for. However, users of different languages might have
very different expectations of what a "sorted" list should look like. Not only
does the alphabetical order vary from one language to another, but it also can
vary from document to document within the same language. For example, phonebook
ordering might be different from dictionary ordering. String comparison is one
of the basic functions most applications require, and yet implementations often
do not match local conventions. The ICU Collation Service provides string
comparison capability with support for appropriate sort orderings for each of
the locales you need. In the event that you have a very unusual requirement, you
are also provided the facilities to customize orderings.
Starting in release 1.8, the ICU Collation Service is compliant with the Unicode
Collation Algorithm (UCA) ([Unicode Technical Standard
#10](http://www.unicode.org/unicode/reports/tr10/)) and based on the Default
Unicode Collation Element Table (DUCET) which defines the same sort order as ISO
14651.
The ICU Collation Service also contains several enhancements that are not
available in UCA. These have been adopted into the [CLDR Collation
Algorithm](http://www.unicode.org/reports/tr35/tr35-collation.html#CLDR_Collation_Algorithm).
For example:
* Additional case handling (as specified by CLDR): ICU allows case differences
to be ignored or flipped. Uppercase letters can be sorted before lowercase
letters, or vice-versa.
* Easy customization (as specified by CLDR): Services can be easily tailored
to address a wide range of collation requirements.
* The [default (root) sort
order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
has been tailored slightly for improved functionality and performance.
In other words, ICU implements the CLDR Collation Algorithm which is an
extension of the Unicode Collation Algorithm (UCA) which is an extension of ISO
14651.
There are several benefits to using the collation algorithms defined in these
standards, including:
* The algorithms have been designed and reviewed by experts in multilingual
collation, and therefore are robust and comprehensive.
* Applications that share sorted data but do not agree on how the data should
be ordered fail to perform correctly. By conforming to the CLDR/UCA/14651
standards for collation and using CLDR language-specific collation data,
independently developed applications sort data identically and perform
properly.
In addition, Unicode contains a large set of characters. This can make it
difficult for collation to be a fast operation or require collation to use
significant memory or disk resources. The ICU collation implementation is
designed to be fast, have a small memory footprint and be highly customizable.
There are many challenges when accommodating the world's languages and writing
systems and the different orderings that are used. However, the ICU Collation
Service provides an excellent means for comparing strings in a locale-sensitive
fashion.
For example, here are some of the ways languages vary in ordering strings:
* The letters A-Z can be sorted in a different order than in English. For
example, in Lithuanian, "y" is sorted between "i" and "k".
* Combinations of letters can be treated as if they were one letter. For
example, in traditional Spanish "ch" is treated as a single letter, and
sorted between "c" and "d".
* Accented letters can be treated as minor variants of the unaccented letter.
For example, "é" can be treated equivalent to "e".
* Accented letters can be treated as distinct letters. For example, "Å" in
Danish is treated as a separate letter that sorts just after "Z".
* Unaccented letters that are considered distinct in one language can be
indistinct in another. For example, the letters "v" and "w" are two
different letters according to English. However, "v" and "w" are
traditionally considered variant forms of the same letter in Swedish.
* A letter can be treated as if it were two letters. For example, in German
phonebook (or "lists of names") order "ä" is compared as if it were "ae".
* Thai requires that the order of certain letters be reversed.
* Some French dictionary ordering traditions sort accents in backwards order,
from the end of the string. For example, the word "côte" sorts before "coté"
because the acute accent on the final "e" is more significant than the
circumflex on the "o".
* Sometimes lowercase letters sort before uppercase letters. The reverse is
required in other situations. For example, lowercase letters are usually
sorted before uppercase letters in English. Danish letters are the exact
opposite.
* Even in the same language, different applications might require different
sorting orders. For example, in German dictionaries, "öf" would come before
"of". In phone books the situation is the exact opposite.
* Sorting orders can change over time due to government regulations or new
characters/scripts in Unicode.
To accommodate the many languages and differing requirements, ICU collation
supports customizing sort orderings - also known as **tailoring**. More details
regarding tailoring are discussed in the [Customization
chapter.](customization/index.md)
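As a small taste of tailoring, the classic traditional-Spanish example can be expressed with the JDK's `java.text.RuleBasedCollator`, whose rule syntax is similar to (though less powerful than) ICU's. The rule set below is a deliberately minimal sketch that defines only the letters it uses:

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class SpanishTailoring {
    public static void main(String[] args) throws ParseException {
        // Minimal rule set: "ch" is a contraction sorting between "c" and "d".
        RuleBasedCollator col =
            new RuleBasedCollator("< a < c < ch < d < h < u");
        // "ch" compares as a single letter between "c" and "d":
        System.out.println(col.compare("ch", "c") > 0);    // true
        System.out.println(col.compare("ch", "d") < 0);    // true
        System.out.println(col.compare("chac", "dac") < 0); // true: ch < d
    }
}
```

In ICU the same tailoring would normally be written on top of the root collator rather than as a complete rule set; see the Customization chapter for the full syntax.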
The basic ICU Collation Service is provided by two main categories of APIs:
* String comparison - most commonly used: APIs return result of comparing two
strings (greater than, equal or less than). This is used as a comparator
when sorting lists, building tree maps, etc.
* Sort key generation - used when a very large set of strings are
compared/sorted repeatedly: APIs return a zero-terminated array of bytes per
string known as a sort key. The keys can be compared directly using strcmp
or memcmp standard library functions, saving repeated lookup and computation
of each string's collation properties. For example, database applications
use index tables of sort keys to index strings quickly. Note, however, that
this only improves performance for large numbers of strings because sorting
via the comparison functions is very fast. For more information, see
[Sortkeys vs Comparison](concepts.md#sortkeys-vs-comparison).
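For example, sort keys can be generated once and then compared cheaply many times. The sketch below uses the JDK's `java.text` API, which mirrors the ICU4J API for this purpose:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SortKeyExample {
    public static void main(String[] args) {
        Collator col = Collator.getInstance(Locale.US);
        String[] words = { "peach", "Apple", "banana" };
        // Compute each key once; repeated comparisons then avoid
        // re-examining the strings' collation properties.
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++)
            keys[i] = col.getCollationKey(words[i]);
        Arrays.sort(keys); // compares key bytes, not the original strings
        for (CollationKey k : keys)
            System.out.println(k.getSourceString());
        // Apple, banana, peach
    }
}
```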
ICU provides an AlphabeticIndex API for generating language-appropriate
sorted-section labels like in dictionaries and phone books.
ICU also provides a higher-level [string search](icu-string-search-service.md)
API which can be used, for example, for case-insensitive or accent-insensitive
search in an editor or in a web page. ICU string search is based on the
low-level [collation element iteration](architecture.md).
## Programming Examples
Here are some [API usage conventions](api.md) for the ICU Collation Service
APIs.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# String Search Service
## Overview
String searching, also known as string matching, is a very important subject in
the wider domain of text processing and analysis. Many software applications
use the basic string search algorithms provided by their operating system.
With the popularity of the Internet, the quantity of data available from
different parts of the world has increased dramatically within a short time,
so a language-aware string search algorithm has become more important. A
bitwise match using the `u_strstr` (C), `UnicodeString::indexOf` (C++) or
`String.indexOf` (Java) APIs will not yield results that are correct for a
particular language's requirements, because all the issues that are important
to language-sensitive collation also apply to text searching. The following
issues apply to text searching:
1. Accented letters\
In English, accents are treated as minor variations of a letter. In French,
accented letters have much more significance as they can actually change the
meaning of a word. Very often, an accented letter is actually a distinct
letter. For example, letter 'å' (\\u00e5) may be just a letter 'a' with an
accent symbol to English speakers. However, it is actually a distinct letter
in Danish; in Danish searching for 'a' should generally not match 'å' and
vice versa. In some cases, such as in traditional German, an accented letter
is short-hand for something longer. In sorting, an 'ä' (\\u00e4) is treated
as 'ae'. Note that primary- and secondary-level distinctions for *searching*
may not be the same as those for sorting; in ICU, many languages provide a
special "search" collator with the appropriate level settings for search.
2. Conjoined letters\
Special handling is required when a single letter is treated equivalent to
two distinct letters and vice versa. For example, in German, the letter 'ß'
(\\u00df) is treated as 'ss' in sorting. Also, in most languages, 'æ'
(\\u00e6) is considered equivalent to the letter 'a' followed by the letter
'e'. Also, the ligatures are often treated as distinct letters by
themselves. For example, 'ch' is treated as a distinct letter between the
letter 'c' and the letter 'd' in Spanish.
3. Ignorable punctuation\
As in collation, it is important that the user is able to choose to ignore
punctuation symbols while the user searches for a pattern in the string. For
example, a user may search for "blackbird" and want to include entries such
as "black-bird".
## ICU String Search Model
The ICU string search service provides APIs similar to those of the other text
iterating services, allowing users to specify the starting position and
direction within the text string to be searched. For more information, please
see the [Boundary Analysis](../boundaryanalysis/index.md) chapter. The user
can locate one or all
occurrences of a pattern in a string. For a given collator, a pattern match is
located at the offsets <start, end> in a string if the collator finds that the
sub-string between the start and end is equal.
The string search service supports two different types of canonical match
behavior.
Let S' be the sub-string of a text string S between the offsets start and end
<start, end>.
A pattern string P matches a text string S at the offsets <start, end> if
1. **Option 1**: P matches some canonical equivalent string of S'. Suppose the
collator used for searching has a tertiary collation strength, all accents
are non-ignorable. If the pattern "a\\u0300" is searched in the target text
"a\\u0325\\u0300", a match will be found, since the target text is
canonically equivalent to "a\\u0300\\u0325"
2. **Option 2**: P matches S' and if P starts or ends with a combining mark, there
exists no non-ignorable combining mark before or after S' in S respectively.
Following the example above, the pattern "a\\u0300" will not find a match in
"a\\u0325\\u0300", since there exists a non-ignorable accent '\\u0325' in
the middle of 'a' and '\\u0300'. Even with a target text of
"a\\u0300\\u0325" a match will not be found because of the non-ignorable
trailing accent \\u0325.
One restriction is to be noted for option 1. Currently there are no composite
characters that consist of a character with a combining class greater than 0
followed by a character with a combining class equal to 0. However, if such a
character is added in the future, the string search service may not work
correctly with option 1 when such characters are encountered.
Furthermore, option 1 could generate more than one "encompassing" match. For
example, in Danish, 'å' (\\u00e5) and 'aa' are considered equivalent. So the
pattern "baad" will match "a--båd--man" (a--b\\u00e5d--man) at start offset 3
and end offset 5. However, the start offset could be 1 or 2 and the end offset
could be 6 or 7, because "-" (hyphen) is ignorable for a certain collation.
The ICU implementation always returns the offsets of the shortest match
sub-string. To be more exact, the string search added a "tightest" match
condition. In other words, if the pattern matches at offsets <start, end> as
well as offsets <start + 1, end>, the offsets <start, end> are not considered a
match. Likewise, if the pattern matches at offsets <start, end> as well as
offsets <start, end + 1>, the offsets <start, end + 1> are not considered a
match. Therefore, when option 1 is chosen with a Danish collator, 'baad' will
match in the string "a--båd--man" (a--b\\u00e5d--man) ONLY at offsets <3,5>.
The default behavior is that described in option 2 above. To obtain the behavior
described in option 1, you must set the normalization mode to ON in the collator
used for search.
> :point_right: **Note**: The "tightest match" behavior described above
> is defined as "Minimal Match" in
> [Section 8 Searching and Matching in UTS #10 Unicode Collation Algorithm](http://www.unicode.org/reports/tr10/#Searching).
> "Medial Match" and "Maximal Match" are not yet implemented by the ICU String Search service.
The string search service also supports two varieties of “asymmetric search” as
described in *[Section 8.2 Asymmetric Search in UTS #10 Unicode Collation
Algorithm](http://www.unicode.org/reports/tr10/#Asymmetric_Search)*.
With asymmetric search, for example, unaccented characters are treated as
“wildcards” that may match any character with the same primary weight. This
behavior can be applied just to characters in the search pattern, or to
characters in both the search pattern and the searched text. With the former
behavior, searching with French rules for 'e' might match 'e', 'è', 'é', 'ê',
and so on, while searching for 'é' would only match 'é'.
Either a locale or a collator can be used to specify the language-sensitive
rules for searches. When a locale is specified, a collator is created
internally, and the StringSearch instance that is created is responsible for
the ownership of the collator. All the collation attributes will be considered
during the string search operation. However, users can only set the collator
attributes using the collator APIs. Normalization is usually done within
collation and the process is outside the scope of the string search service.
As in other iterator interfaces, the string search service provides APIs to
perform string matching for the first pattern occurrence, immediate next,
previous match, and the last pattern occurrence. There are also options to allow
for overlapping matching. For example, in English, if the string is "ababab" and
the pattern is "abab", overlapping matching produces results of offsets <0, 3>
and <2, 5>. Otherwise, the mutually exclusive matching produces the result
offset <0, 3> only. To find only whole-word matches, the user can provide a
locale-specific `BreakIterator` object to a `StringSearch` instance to correctly
locate the word boundaries. For example, by default a search for "c" in the
string "abc" returns a match; this behavior can be overridden by supplying a
word `BreakIterator`.
In the ICU string search service implementation, the minimum unit of a match
is aligned to an extended grapheme cluster as defined by [UAX #29 Unicode Text
Segmentation](http://unicode.org/reports/tr29/). Therefore, all matches will
begin and end on extended grapheme cluster boundaries. If the given search
pattern starts with a non-base character, no matches will be returned.
When there are contractions in the collation sequence and the contraction
happens to span across the boundary of a match, it is not considered a match.
For example, in traditional Spanish where 'ch' is a contraction, the "har"
pattern will not match in the string "uno charo". Matches whose boundaries fall
within a discontiguous contraction will yield a result similar to those
described above, where the end of the returned match will be one character
before the immediately following base letter. In addition, only the first match
will be located if a pattern contains only combining marks and the search
string contains more than one consecutive occurrence of the pattern. For
example, if the user searches for the pattern "´" (\\u00b4) in the string
"A´´B" (A\\u00b4\\u00b4B), the result will be offsets <1, 2>.
### Example
**In C:**
```C
char *tgtstr = "The quick brown fox jumps over the lazy dog.";
char *patstr = "fox";
UChar target[64];
UChar pattern[16];
int pos = 0;
UErrorCode status = U_ZERO_ERROR;
UStringSearch *search = NULL;
u_uastrcpy(target, tgtstr);
u_uastrcpy(pattern, patstr);
search = usearch_open(pattern, -1, target, -1, "en_US",
NULL, &status);
if (U_FAILURE(status)) {
fprintf(stderr, "Could not create a UStringSearch.\n");
return;
}
for(pos = usearch_first(search, &status);
U_SUCCESS(status) && pos != USEARCH_DONE;
pos = usearch_next(search, &status))
{
fprintf(stdout, "Match found at position %d.\n", pos);
}
if (U_FAILURE(status)) {
fprintf(stderr, "Error searching for pattern.\n");
}
```
**In C++:**
```C++
UErrorCode status = U_ZERO_ERROR;
UnicodeString target("Jackdaws love my big sphinx of quartz.");
UnicodeString pattern("sphinx");
StringSearch search(pattern, target, Locale::getUS(), NULL, status);
if (U_FAILURE(status)) {
fprintf(stderr, "Could not create a StringSearch object.\n");
return;
}
for(int pos = search.first(status);
U_SUCCESS(status) && pos != USEARCH_DONE;
pos = search.next(status))
{
fprintf(stdout, "Match found at position %d.\n", pos);
}
if (U_FAILURE(status)) {
fprintf(stderr, "Error searching for pattern.\n");
}
```
**In Java:**
```Java
StringCharacterIterator target = new StringCharacterIterator(
"Pack my box with five dozen liquor jugs.");
String pattern = "box";
try {
StringSearch search = new StringSearch(pattern, target, Locale.US);
for(int pos = search.first();
pos != StringSearch.DONE;
pos = search.next())
{
System.out.println("Match found for pattern at position " + pos);
}
} catch (Exception e) {
System.err.println("StringSearch failure: " + e.toString());
}
```
## Performance and Other Implications
The ICU string search service is built on top of the ICU collation
service. Therefore, all the performance implications that apply to a collator
are also applicable to the string search service. To obtain the best
performance, use the default collator attributes described in the Performance
and Storage Implications on Attributes section in the [Collation Service
Architecture](architecture.md#-performance-and-storage-implications-on-attributes)
chapter. In addition, users need to be aware of
the following `StringSearch` specific considerations:
### Search Algorithm
ICU4C releases up to 3.8 used the Boyer-Moore search algorithm in the string
search service. There were some known issues in these previous releases.
(See ICU tickets [ICU-5024](https://unicode-org.atlassian.net/browse/ICU-5024),
[ICU-5382](https://unicode-org.atlassian.net/browse/ICU-5382),
[ICU-5420](https://unicode-org.atlassian.net/browse/ICU-5420))
In ICU4C 4.0, the string
search service was updated to use a simple linear search algorithm, which
locates a match by shifting a cursor through the target text one position at a
time, and these issues were fixed. In ICU4C 4.0.1, the Boyer-Moore search code
was reintroduced as a separate API set as a technology preview; in a later
release, this code was deleted.
The Boyer-Moore search algorithm is based on automata or combinatorial
properties of strings; it pre-processes the pattern and is known to be much
faster than linear search for longer patterns. According to a performance
evaluation of the two implementations, the Boyer-Moore search is faster than the
linear search when the pattern text is longer than 3 or 4 characters.
However, it is very tricky to get correct results with a collation-based Boyer-Moore search.
### Change Iterating Direction
The ICU string search service provides a set of very dynamic APIs that allow
users to change the iterating direction at any point. For example, users can search
for a particular word going forward by calling the `usearch_next` (C),
`StringSearch::next` (C++) or `StringSearch.next` (Java) APIs and then search
backwards at any point of the search operation by calling the `usearch_previous`
(C), `StringSearch::previous` (C++) or `StringSearch.previous` (Java) APIs. Another
way to change the iterating direction is by calling the `usearch_reset` (C),
`StringSearch::reset` (C++) or `StringSearch.reset` (Java) APIs. Though the
direction change can occur without calling the reset APIs first, this operation
comes with a reduction in speed.
> :point_right: **Note**: The backward search is not available with the
> ICU4C Boyer-Moore search technology preview introduced in ICU4C 4.0.1
> and only available with the linear search implementation.
### Thai and Lao Character Boundaries
In collation, certain Thai and Lao vowels are swapped with the next character.
For example, the text string "A ขเ" (A \\u0e02\\u0e40) is processed internally
in collation as
"A เข" (A \\u0e40\\u0e02). Therefore, if the user searches for the pattern "Aเ"
(A\\u0e40) in "A ขเ" (A \\u0e02\\u0e40) the string search service will match
starting at offset 0. Since this normalization process is internal to collation,
there is no notification that the swapping has happened. The return result
offsets in this example will be <0, 2> even though the range would encompass one
extra character.
### Case Level Search
Case level string search is currently done with the strength set to tertiary.
When searching with the strength set to primary and the case level attribute
turned on, results given may not be correct. The case level attribute is
different from tertiary strength in that accents are ignored but case
differences are not. Suppose you wanted to search for “A” in the text
“ABC\\u00C5a”. The matches found should be at offsets 0 and 3. However,
searching with primary strength and the case level attribute turned on finds
matches at 0, 3, and 4, which incorrectly includes the lower case 'a'. To ensure that case
level differences are not ignored, string search must be done with at least
tertiary strength.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Compression
## Overview of SCSU
Compressing Unicode text for transmission or storage reduces bandwidth usage
and storage requirements. The compression scheme compresses
Unicode text into a sequence of bytes by using characteristics of Unicode text.
The compressed sequence can be used on its own or as further input to a general
purpose file or disk-block based compression scheme. Note that the combination
of the Unicode compression algorithm plus disk-block based compression produces
better results than either method alone.
Strings in languages using small alphabets contain runs of characters that are
coded close together in Unicode. These runs are typically interrupted only by
punctuation characters, which are themselves coded in proximity to each other in
Unicode (usually in the Basic Latin range).
For additional detail about the compression algorithm, which has been approved
by the Unicode Consortium, please refer to [Unicode Technical Report #6 (A
Standard Compression Scheme for
Unicode)](https://www.unicode.org/unicode/reports/tr6/).
The Standard Compression Scheme for Unicode (SCSU) is used to:
* express all code points in Unicode
* approximate the storage size of traditional character sets
* facilitate the use of short strings
* provide transparency for characters between `U+0020`-`U+00FF`, as well as `CR`, `LF`
and `TAB`
* support very simple decoders
* support simple as well as sophisticated encoders
It does not attempt to avoid the use of control bytes (including `NUL`) in the
compressed stream.
The compression scheme is mainly intended for use with short to medium length
Unicode strings. The resulting compressed format is intended for storage or
transmission in bandwidth limited environments. It can be used stand-alone or as
input to traditional general purpose data compression schemes. It is not
intended as processing format or as general purpose interchange format.
## BOCU-1
A MIME compatible encoding called BOCU-1 is also available in ICU. Details about
this encoding can be found in the [Unicode Technical Note
#6](https://www.unicode.org/notes/tn6/). Both SCSU and BOCU-1 are IANA
registered names.
## Usage
The compression service in ICU is part of the conversion framework and follows
the semantics of converters. For more information on how to use ICU's conversion
service, please refer to the Usage Model section in the [Using
Converters](converters.md) chapter.
```c++
UChar germanUTF16[]={
    0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
};
uint8_t germanSCSU[]={
    0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
};
char target[100];
UChar uTarget[100];
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
int32_t len;
/* set up the SCSU converter */
conv = ucnv_open("SCSU", &status);
assert(U_SUCCESS(status));
/* compress the string using SCSU; pass explicit lengths because the
   arrays are not NUL-terminated */
len = ucnv_fromUChars(conv, target, 100, germanUTF16,
                      sizeof(germanUTF16)/sizeof(germanUTF16[0]), &status);
assert(U_SUCCESS(status));
/* expand the SCSU byte sequence back into UTF-16 */
len = ucnv_toUChars(conv, uTarget, 100, (const char *)germanSCSU,
                    sizeof(germanSCSU), &status);
assert(U_SUCCESS(status));
/* close the converter */
ucnv_close(conv);
```
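As a rough illustration of why the German example above compresses to one byte per character, here is a hedged sketch of SCSU decoding restricted to the *initial* state only: single-byte mode with dynamic window 0 positioned at U+0080. Tag bytes and window switching are deliberately not handled, and the function name is made up for this illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal SCSU decode sketch for the initial state: bytes 0x20..0x7F
 * pass through as US-ASCII, and bytes 0x80..0xFF map into dynamic
 * window 0 (offset U+0080), which makes this state transparent for
 * Latin-1 text. Returns the number of code units written, or -1 on a
 * tag byte this sketch does not support. */
int scsu_decode_window0(const uint8_t *in, size_t len, uint16_t *out) {
    for (size_t i = 0; i < len; i++) {
        uint8_t b = in[i];
        if (b >= 0x20 && b <= 0x7F) {
            out[i] = b;                       /* ASCII passthrough */
        } else if (b >= 0x80) {
            out[i] = 0x0080 + (b - 0x80);     /* dynamic window 0 */
        } else {
            return -1;                        /* tag byte: unsupported here */
        }
    }
    return (int)len;
}
```

Running this over the `germanSCSU` bytes reproduces the `germanUTF16` sequence, which is why "Öl fließt" occupies exactly nine bytes in SCSU.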

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Using Converters
## Overview
When designing applications around Unicode characters, it is sometimes required
to convert between Unicode encodings or between Unicode and legacy text data.
The vast majority of modern Operating Systems support Unicode to some degree,
but sometimes the legacy text data from older systems need to be converted to
and from Unicode. This conversion process can be done with an ICU converter.
## ICU converters
ICU provides comprehensive character set conversion services, mapping tables,
and implementations for many encodings. Since ICU uses Unicode (UTF-16)
internally, all converters convert between UTF-16 (with the endianness according
to the current platform) and another encoding. This includes Unicode encodings.
In other words, internal text is 16-bit Unicode, while "external text" used as
source or target for a conversion is always treated as a byte stream.
ICU converters are available for a wide range of encoding schemes. Most of them
are based on mapping table data that is handled by a few generic
implementations. Some encodings, especially Unicode encodings, are implemented
algorithmically in addition to (or instead of) using mapping tables; the
remaining encoding schemes are partly or entirely table-based.

All ICU converters map only single Unicode character code points to and from
single codepage character code points. ICU
converters **do not** deal directly with combining characters, bidirectional
reordering, or Arabic shaping, for example. Such processes, if required, must be
handled separately. For example, while in Unicode, the ICU BiDi APIs can be used
for bidirectional reordering after a conversion to Unicode or before a
conversion from Unicode.
ICU converters are not designed to perform any encoding autodetection. This
means that the converters do not autodetect "endianness", the six Unicode
encoding signatures, or distinguish between Shift-JIS and EUC-JP, etc. There are
two exceptions: the UTF-16 and UTF-32 converters work according to Unicode's
specification of their Character Encoding Schemes, that is, they read the BOM to
determine the actual "endianness".
The ICU mapping tables mostly come from an [IBM® codepage
repository](http://www.ibm.com/software/globalization/cdra). For non-IBM
codepages, there is typically an equivalent codepage registered with this
repository. However, the textual data format (.ucm files) is generic, and data
for other codepage mapping tables can also be added.
## Using the Default Codepage
ICU has code to determine the default codepage of the system or process. This
default codepage can be used to convert `char *` strings to and from Unicode.
Depending on system design, setup and APIs, it may not always be possible to
find a default codepage that fully works as expected. For example,
1. On Windows there are three encodings in use at the same time. Unicode
(UTF-16) is always used inside of Windows, while for `char *` encodings there
are two classes, called "ANSI" and "OEM" codepages. ICU will use the ANSI
codepage. Note that the OEM codepage is used by default for console window
output.
2. On some UNIX-type systems, non-standard names are used for encodings, or
non-standard encodings are used altogether. Although ICU supports over 200
encodings in its standard build and many more aliases for them, it will not
be able to recognize such non-standard names.
3. Some systems do not have a notion of a system or process codepage, and may
not have APIs for that.
If you have means of detecting a default codepage name that are more appropriate
for your application, then you should set that name with `ucnv_setDefaultName()`
as the first ICU function call. This makes sure that the internally cached
default converter will be instantiated from your preferred name.
Starting in ICU 2.0, when a converter for the default codepage cannot be opened,
a fallback default codepage name and converter will be used. On most platforms,
this will be US-ASCII. For z/OS (OS/390), `ibm-1047,swaplfnl` is the default
fallback codepage. For AS/400 (iSeries), `ibm-37` is the default fallback
codepage. This default fallback codepage is used when the operating system is
using a non-standard name for a default codepage, or the converter was not
packaged with ICU. The feature allows ICU to run in unusual computing
environments without completely failing.
## Usage Model
A "Converter" refers to the C structure "UConverter". Converters are cheap to
create. Any data that is shared between converters of the same kind (such as the
mappings, the name and the properties) are automatically cached and shared in
memory.
### Converter Names
Codepages with encoding schemes have been given many names by various vendors
and platforms over the years. Vendors have different ways to specify which codepage
and encoding are being used. IBM uses a CCSID (Coded Character Set IDentifier).
Windows uses a CPID (CodePage IDentifier). Macintosh has a TextEncoding. Many
Unix vendors use [IANA](http://www.iana.org/assignments/character-sets)
character set names. Many of these names are aliases to converters within ICU.
In order to help identify which names are recognized by certain platforms, ICU
provides several converter alias functions. The complete description of these
functions can be found in the [ICU API
Reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html) .
| Function Names | Short Description |
| -------------- | ----------------- |
| `ucnv_countAvailable`, `ucnv_getAvailableName` | Get a list of available converter names that can be opened. |
| `ucnv_openAllNames` | Get a list of all known converter names. |
| `ucnv_getName` | Get the name of an open converter. |
| `ucnv_countAliases`, `ucnv_getAlias` | Get the list of aliases for the specified converter. |
| `ucnv_countStandards`, `ucnv_getStandard` | Get the list of known standards. |
| `ucnv_openStandardNames` | Get a filtered list of aliases for a converter that is known by the specified standard. |
| `ucnv_getStandardName` | Get the preferred alias name specified by a given standard. |
| `ucnv_getCanonicalName` | Get the converter name from the alias that is recognized by the specified standard. |
| `ucnv_getDefaultName` | Get the default converter name that is currently used by ICU and the operating system. |
| `ucnv_setDefaultName` | Use this function to override the default converter name. |
Even though IANA specifies a list of aliases, it usually does not specify the
mappings or the actual character set for the aliases. Sometimes vendors will map
similar glyph variants to different Unicode code points or sometimes they will
assign completely different glyphs for the same codepage code point. Because of
these ambiguities, you can sometimes get `U_AMBIGUOUS_ALIAS_WARNING` for the
returned `UErrorCode` when more than one converter uses the requested alias. This
is only a warning, and the results can still be used. This `UErrorCode` value is
just a reminder that you may not get what you expected. The above functions can
help you to determine which converter you actually wanted.
EBCDIC-based converters have the option to swap the newline and linefeed
character mappings. This can be useful when transferring EBCDIC documents
between z/OS (MVS, OS/390 and the rest of the zSeries family) and another EBCDIC
machine like OS/400 on iSeries. The ",swaplfnl" option (`UCNV_SWAP_LFNL_OPTION_STRING`
from ucnv.h) can be appended to a converter alias in order to achieve this
behavior. You can view other available options in ucnv.h.
You can always skip many of these aliasing and mapping problems by just using
Unicode.
### Creating a Converter
There are four ways to create a converter:
1. **By name**: Converters can be created using different types of names. No
distinction is made when the converter is created, as to which name is being
employed. There are many types of aliases possible. Among these are
[IANA](http://www.iana.org/assignments/character-sets) ("shift_jis",
"koi8-r", or "iso-8859-3"), host specific names ("cp1252" which is the name
for a Microsoft® Windows™ or a similar IBM® codepage). Finally, ICU's own
internal canonical names for a converter can be used. These include "UTF-8"
or "ISO-8859-1" for built-in conversion types, and names such as
"ibm-949_P110-2000" (Shift-JIS with '\\' <-> '¥' mapping) or
"ibm-949_P11A-2000" (Shift-JIS with '\\' <-> '\\' mapping) for data-file
based conversions.
```C
UConverter *conv = ucnv_open("shift_jis", &myError);
```
As a convenience, converter names can be passed in as Unicode (for example,
if a user passed in the string from a Unicode-based user interface).
However, the actual names are restricted to an invariant ASCII/EBCDIC
subset.
```C
UChar *name = ...; UConverter *conv = ucnv_openU(name, &myError);
```
Converter names are case-insensitive. In addition, beginning with ICU 3.6,
leading zeroes are ignored in sequences of digits (if further digits
follow), and all non-alphanumeric characters are ignored. Thus the strings
"UTF-8", "utf_8", "u\*T@f08" and "Utf 8" are equivalent. (Before ICU 3.6,
leading zeroes were not ignored, and only spaces, dashes and underscores
were ignored.) The `ucnv_compareNames()` function provides such string
comparisons.
Unlike the names of resources or other types of ICU data, converter names
can **not** be qualified with a path that indicates the directory or common
data file containing the corresponding converter data. The requested
converter's data must be present either in the main ICU data library or as a
separate file located in the ICU data directory. However, you can always
create a package of converters with pkgdata and open a converter from the
package with `ucnv_openPackage()`
```C
UConverter *conv = ucnv_openPackage("./myPackage.dat", "customConverter", &myError);
```
2. **By number**: The design of the ICU is to accommodate codepages provided by
different vendors. For example, the IBM CDRA (Character Data Representation
Architecture which is an IBM architecture that defines a set of identifiers)
has an ID type called the CCSID (Coded Character Set Identifier). The ICU
API for opening a codepage by number must be given a vendor along with the
number. Currently, only IBM (`UCNV_IBM`) is supported. For example, the US
EBCDIC codepage (IBM #37) can be opened with the following code:
```C
ucnv_openCCSID(37, UCNV_IBM, &myErr);
```
3. **By iteration**: An application might not know ahead of time which codepage
to use, and thus might need to query ICU to determine the entire list of
installed converters. The ICU returns a list of its canonical (internal)
names. From each name, the standard IANA name can be determined, and a list of
aliases that point to that name can also be determined. For example, ICU
might return among the canonical names "ibm-367". That name itself may or
may not provide the application or its users with the information needed.
(367 is actually the decimal form of a number that is calculated by
appending certain hex digits together.) However, the IANA name can be
requested from this canonical name, which should return something like
"us-ascii". The alias list for ibm-367 can be iterated over as well, which
returns additional names like "ascii", "646", "ansi_x3.4-1968" etc. If this
is not sufficient information, once a converter is opened, it can be queried
for its type, min and max char size, etc. This information is not available
without actually opening the converter (a fairly lightweight process.)
```C
/* Returns count of the number of available names */
int count = ucnv_countAvailable();
/* get the canonical name of the 36th available converter */
const char *convName1 = ucnv_getAvailableName(36);
/* get the 3rd alias for a given codepage. */
const char *asciiAlias = ucnv_getAlias("ibm-367", 3, &myError);
/* Get the IANA name of the converter */
const char *ascii = ucnv_getStandardName("ibm-367", "IANA");
/* Get one of the non-preferred IANA names of the converter. */
UEnumeration *asciiEnum =
ucnv_openStandardNames("ibm-367", "IANA", &myError);
uenum_next(asciiEnum, &myError); /* skip preferred IANA alias */
/* get one of the non-preferred IANA aliases */
const char *ascii2 = uenum_next(asciiEnum, &myError);
uenum_close(asciiEnum);
```
4. **By using the default converter**: The default converter can be opened by
passing a NULL as the name of the converter.
```C
ucnv_open(NULL, &myErr);
```
> :point_right: **Note**: ICU chooses this converter based on the best information available to it.
The purpose of this converter is to interface with the OS using a codepage (i.e. `char *`).
Do not use it as a way of determining the best overall converter to use.
Usually any Unicode encoding form is the best way to store and send text data,
so that important data does not get lost in the conversion.\
Also, if the OS supports Unicode-based API's (such as Win32),
it is better to use only those Unicode API's.
As an example, the new Windows 2000 locales (such as Hindi) do not
define the default codepage to something that supports Hindi.
The default converter is used in expressions such as: `UnicodeString text("abc");`
to convert 'abc', and in the u_uastrcpy() C functions.\
Code operating at the [OS level](../design.md) MAY choose to
change the default converter with `ucnv_setDefaultName()`.
However, be aware that this change has inconsistent results if it is done after
ICU components are initialized.
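The loose converter-name matching described in step 1 above (case-insensitive, ignoring non-alphanumeric characters, and skipping leading zeroes in digit sequences when further digits follow) can be sketched as a normalize-then-compare helper. This is an illustrative simplification with a made-up name, not the actual `ucnv_compareNames()` implementation, which compares two names incrementally:

```c
#include <ctype.h>
#include <stddef.h>

/* Normalize a converter name for loose comparison: drop punctuation
 * and spaces, lowercase letters, and skip a leading zero in a digit
 * run when another digit follows. Simplified: the leading-zero rule
 * here only looks one character ahead. */
void loose_normalize(const char *name, char *out, size_t outSize) {
    size_t o = 0;
    int inDigits = 0, leading = 0;
    for (const char *p = name; *p != '\0' && o + 1 < outSize; p++) {
        unsigned char c = (unsigned char)*p;
        if (!isalnum(c)) continue;              /* ignore "-", "_", " ", ... */
        if (isdigit(c)) {
            if (!inDigits) { inDigits = 1; leading = 1; }
            if (leading && c == '0' && isdigit((unsigned char)p[1]))
                continue;                       /* drop leading zero */
            leading = 0;
        } else {
            inDigits = 0;
        }
        out[o++] = (char)tolower(c);
    }
    out[o] = '\0';
}
```

Under this normalization, "UTF-8", "utf_8", "u\*T@f08" and "Utf 8" all reduce to the same key, matching the equivalence described above.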
### Closing a Converter
Closing a converter frees memory occupied by that instance of the converter.
However it does not release the larger shared data tables the converter might
use. OS-level code may call `ucnv_flushCache()` to explicitly free memory occupied
by [unused tables](../design.md) .
```C
ucnv_close(conv);
```
### Converter Life Cycle
Note that a Converter is created with a certain type (for instance, ISO-8859-3)
which does not change over the life of that [object](../design.md) . Converters
should be allocated one per thread. They are cheap to create, as the shared data
doesn't need to be reallocated.
This is the typical life cycle of a converter, as shown step-by-step:
1. First, open up the converter with a specified name (or alias name).
```C
UConverter *conv = ucnv_open("shift_jis", &status);
```
2. Target here is the `char s[]` to write into, and targetSize is how big the
target buffer is. Source is the UChars that are being converted.
```C
int32_t len = ucnv_fromUChars(conv, target, targetSize, source, u_strlen(source), &status);
```
3. Clean up the converter.
```C
ucnv_close(conv);
```
### Sharing Converters Between Threads
A converter cannot be shared between threads at the same time. However, if it is
reset it can be used for unrelated chunks of data. For example, use the same
converter for converting data from Unicode to ISO-8859-3, and then reset it. Use
the same converter for converting data from ISO-8859-3 back into Unicode.
### Converting Large Quantities of Data
If it is necessary to convert a large quantity of data in smaller buffers, use
the same converter to convert each buffer. This will make sure any state is
preserved from one chunk to the next. Doing this conversion is known as
streaming or buffering, and is discussed in the Buffered Conversion section
later in this chapter.
### Cloning a Converter
Cloning a converter returns a clone of the converter object along with any
internal state that the converter might be storing. Cloning routines must be
used with extreme care when using converters for stateful or multibyte
encodings. If the converter object is carrying an internal state, and the
newly-created clone is used to convert a new chunk of text, the converter
produces incorrect results. Also note that the caller owns the cloned object and
has to call `ucnv_close()` to dispose of the object. Calling `ucnv_reset()` before
cloning will reset the converter to its original state.
```C
UConverter* newCnv = ucnv_safeClone(oldCnv, 0, &bufferSize, &err);
```
## Converter Behavior
### Conversion
1. The converters always consume the source buffer as far as possible, and
advance the source pointer.
2. The converters write to the target all converted output as far as possible,
and then write any remaining output to the internal services buffer. When
the conversion routines are called again, the internal buffer is flushed out
and written to the target buffer before proceeding with any further
conversion.
3. In conversions to Unicode from Multi-byte encodings or conversions from
Unicode involving surrogates, if a) only a partial byte sequence is
retrieved from the source buffer, b) the "flush" parameter is set to "TRUE"
and c) the end of source is reached, then the callback is called with
U_TRUNCATED_CHAR_FOUND.
### Reset
Converters can be reset explicitly or implicitly. Explicit reset is done by
calling:
1. `ucnv_reset()`: Resets the converter to initial state in both directions.
2. `ucnv_resetToUnicode()`: Resets the converter to initial state to Unicode
direction.
3. `ucnv_resetFromUnicode()`: Resets the converter to initial state from Unicode
direction.
The converters are reset implicitly when the conversion functions are called
with the "flush" parameter set to "TRUE" and the source is consumed.
### Error
#### Conversion from Unicode
Not all characters can be converted from Unicode to other codepages. In most
cases, Unicode is a superset of the characters supported by any given codepage.
The default behavior of ICU in this case is to substitute the illegal or
unmappable sequence, with the appropriate substitution sequence for that
codepage. For example, ISO-8859-1, along with most ASCII-based codepages, has
the character 0x1A (Control-Z) as the substitution sequence. When converting
from Unicode to ISO-8859-1, any characters which cannot be converted would be
replaced by 0x1A's.
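The default substitution behavior just described can be sketched for the simplest case, Unicode to ISO-8859-1, where every code point above U+00FF is unmappable and is replaced by the 0x1A (Control-Z) substitution byte. This is a hypothetical helper for illustration, not an ICU API, and it does not handle surrogate pairs:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 code units to ISO-8859-1 bytes, substituting 0x1A
 * for anything the codepage cannot represent (code points > U+00FF).
 * Returns the number of bytes written (always len here, since every
 * input unit produces exactly one output byte). */
size_t to_latin1_subst(const uint16_t *src, size_t len, uint8_t *dst) {
    for (size_t i = 0; i < len; i++)
        dst[i] = (src[i] <= 0x00FF) ? (uint8_t)src[i] : 0x1A;  /* substitute */
    return len;
}
```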
SubChar1 is sometimes used as a substitution character in MBCS conversions. For
more information on SubChar1 please see the [Conversion Data](data.md) chapter.
In stateful converters like ISO-2022-JP, if a substitution character has to be
written to the target, then an escape/shift sequence to change the state to
single byte mode followed by a substitution character is written to the target.
The substitution character can be changed by calling the `ucnv_setSubstChars()`
function with the desired codepage byte sequence. However, this has some
limitations: It only allows setting a single character (although the character
can consist of multiple bytes), and it may not work properly for some stateful
converters (like HZ or ISO 2022 variants) when setting a multi-byte substitution
character. (It will work for EBCDIC_STATEFUL ones.) Moreover, for setting a
particular character, the caller needs to know the correct byte sequence for
that character in the converter's codepage. (For example, a space (U+0020) is
encoded as 0x20 in ASCII-based codepages, 0x40 in EBCDIC-based ones, 0x00 0x20
or 0x20 0x00 in UTF-16 depending on the stream's endianness, etc.)
The `ucnv_setSubstString()` function (new in ICU 3.6) lifts these limitations. It
takes a Unicode string and verifies that it can be converted to the codepage
without error and that it is not too long (32 bytes as of ICU 3.6). The string
can contain zero, one or more characters. An empty string has the effect of
using the skip callback. See the Error Callbacks below. Stateful converters are
fully supported. The same Unicode string will give equivalent results with all
converters that support its conversion.
Internally, `ucnv_setSubstString()` stores the byte sequence from the test
conversion if the converter is stateless, or the Unicode string itself if the
converter is stateful. If the Unicode string is stored, then it is converted on
the fly during substitution, handling all state transitions.
The function `ucnv_getSubstChars()` can be used to retrieve the substitution byte
sequence if it is the default one, set by `ucnv_setSubstChars()`, or if
`ucnv_setSubstString()` stored the byte sequence for a stateless converter. The
Unicode string set for a stateful converter cannot be retrieved.
#### Conversion to Unicode
In conversion to Unicode, errors are normally due to ill-formed byte sequences:
Unused byte values, or lead bytes not followed by trail bytes according to the
encoding scheme. Well-formed but unmappable sequences are unusual but possible.
The ICU default behavior is to emit a U+FFFD REPLACEMENT CHARACTER per
offending sequence.
If the conversion table .ucm file contains a `<subchar1>` entry (such as in the
ibm-943 table), a U+001A C0 control ("SUB") is emitted for single-byte
illegal/unmappable input rather than U+FFFD REPLACEMENT CHARACTER. For details
on this behavior look for "001A" in the [Conversion Data](data.md) chapter.
* This behavior originates from mainframes with dedicated
single-byte-to-single-byte and double-to-double conversions.
* Emitting U+001A for single-byte errors can be avoided by (a) removing the
`<subchar1>` mapping, (b) using a similar conversion table that does not
have this mapping (e.g., windows-932 instead of ibm-943), or (c) writing a
custom callback function.
### Error Codes
Here are some of the `UErrorCode`s which have significant meaning for conversion:
#### U_INDEX_OUTOFBOUNDS_ERROR
In `getNextUChar()`: all source data
has been consumed without producing a Unicode character.
#### U_INVALID_CHAR_FOUND
No mapping was found from the source to the target encoding. For example, U+0398
(Capital Theta) has no mapping into ISO-8859-1, and so U_INVALID_CHAR_FOUND
will result.
#### U_TRUNCATED_CHAR_FOUND
All of the source data was read, and a
character sequence was incomplete. For example, only half of a double-byte
sequence may have been encountered. When converting FROM Unicode, this error
would occur when the conversion ends with a lead (high) surrogate (e.g., U+D800)
at the end of the source, with no corresponding trail (low) surrogate.
#### U_ILLEGAL_CHAR_FOUND
A character sequence was found in the source which is disallowed in the source
encoding scheme. For example, many MBCS encodings have only certain byte
sequences which are allowed as lead bytes. When converting from Unicode, an
illegal sequence results if a lead (high) surrogate is NOT followed immediately
by a trail (low) surrogate, or if a trail surrogate appears without its
preceding lead surrogate.
Note: Most, but not all, converters forbid surrogate code points or unpaired
surrogate code units. (Lead surrogate without trail, or trail without lead.)
Some converters permit surrogate code points/unpaired surrogates because their
charset specification permits it. For example, LMBCS, SCSU and
BOCU-1.
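The two ill-formed surrogate patterns described above can be detected with a small scan over a UTF-16 buffer. This is an illustrative sketch with a made-up name, not ICU's internal validation:

```c
#include <stddef.h>
#include <stdint.h>

/* Find the first unpaired surrogate in a UTF-16 buffer: a lead
 * surrogate (U+D800..U+DBFF) not immediately followed by a trail
 * surrogate (U+DC00..U+DFFF), or a trail surrogate with no preceding
 * lead. Returns the index of the offending code unit, or -1 if the
 * buffer is well formed. */
int find_unpaired_surrogate(const uint16_t *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {          /* lead surrogate */
            if (i + 1 < len && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;                               /* well-formed pair */
            else
                return (int)i;                     /* truncated or illegal */
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return (int)i;                         /* trail without lead */
        }
    }
    return -1;
}
```

A lead surrogate at the very end of the buffer corresponds to the U_TRUNCATED_CHAR_FOUND case; a stray trail surrogate in the middle corresponds to U_ILLEGAL_CHAR_FOUND.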
#### U_INVALID_TABLE_FORMAT
An error occurred trying to read the backing data
for the converter. The data could be corrupt, or the wrong
version.
#### U_BUFFER_OVERFLOW_ERROR
More output (target) characters were produced
than fit in the target buffer. If in `to/fromUnicode()`, then process the target
buffer and call the function again to retrieve the overflowed characters.
### Error Callbacks
What actually happens is that an "error callback function" is called at the
point where the conversion failure occurred. The function can deal with the
failed characters as it sees fit. Possible options at the callback's disposal
include ignoring the bad sequence, converting it to a different sequence, and
returning an error to the caller. The callback can also consume any data past
where the error occurred, whether or not that data would have caused an error.
Only one callback is installed at a time, per direction (to or from unicode).
A number of canned functions are provided by ICU, and an application can write
new ones. The "callbacks" are either From Unicode (to codepage), or To Unicode
(from codepage). Here is a list of the canned callbacks in ICU:
1. UCNV_**FROM_U**_CALLBACK_SUBSTITUTE: This callback is installed by default.
It will write the codepage's substitute sequence or a user-set substitute
sequence, or convert a user-set substitute UnicodeString to the codepage.
See "Error / Conversion from Unicode" above.
2. UCNV_**TO_U**_CALLBACK_SUBSTITUTE: This callback is installed by default. It
will write U+FFFD or sometimes U+001A. See "Error / Conversion to Unicode"
above.
3. UCNV_FROM_U_CALLBACK_SKIP, UCNV_TO_U_CALLBACK_SKIP: Simply ignores any
invalid characters in the input, no error is returned.
4. UCNV_FROM_U_CALLBACK_STOP, UCNV_TO_U_CALLBACK_STOP: Stop at the error.
Return the error to the caller. (When using the 'BUFFER' mode of conversion,
the source and target pointers returned can be examined to determine where
the error occurred. ucnv_getInvalidUChars() and ucnv_getInvalidChars()
return the actual text which failed).
5. UCNV_FROM_U_CALLBACK_ESCAPE, UCNV_TO_U_CALLBACK_ESCAPE: This callback is
especially useful for debugging. Missing codepage characters are replaced by
strings such as '%U094D' with the Unicode value, and missing Unicode chars
are replaced with text of the form '%X0A' where the codepage had the
unconvertible byte hex 0A.
When a callback is set, a "context" pointer is also provided. How this
pointer is created depends on the specific callback. There is usually a
createContext() function for that specific callback, where the caller can
set certain options for the callback. Consult the documentation for the
specific callback you are using. For ICU's canned callbacks, this pointer
may be set to NULL. The functions for setting a different callback also
return the old callback, and the old context pointer. These may be stored so
that the old callback is re-installed when an operation is finished.
Additionally the following options can be passed as the context parameter to
UCNV_FROM_U_CALLBACK_ESCAPE callback function to produce different outputs.
| Option | Example output |
| ------------------- | ------- |
| UCNV_ESCAPE_ICU | %U12345 |
| UCNV_ESCAPE_JAVA | \\u1234 |
| UCNV_ESCAPE_C | \\udbc9\\udd36 for Plane 1 and \\u1234 for Plane 0 codepoints |
| UCNV_ESCAPE_XML_DEC | \&#4460; number expressed in Decimal |
| UCNV_ESCAPE_XML_HEX | \&#x1234; number expressed in Hexadecimal |
Here are some examples of how to use callbacks.
```C
UConverter *u;
UErrorCode myError = U_ZERO_ERROR;
const void *oldContext, *newContext = NULL;
UConverterFromUCallback oldAction, newAction;
u = ucnv_open("shift_jis", &myError);
... /* do some conversion with u from unicode.. */
ucnv_setFromUCallBack(u, MY_FROMU_CALLBACK, newContext, &oldAction, &oldContext, &myError);
... /* do some other conversion from unicode */
/* Now, set the callback back */
ucnv_setFromUCallBack(u, oldAction, oldContext, &newAction, &newContext, &myError);
```
### Custom Callbacks
Writing a callback is somewhat involved, and will be covered more completely in
a future version of this document. One might look at the source to the provided
callbacks as a starting point, and address any further questions to the mailing
list.
Basically, a callback, unlike other ICU functions (which expect to be called with
U_ZERO_ERROR as the input), is called in an exceptional error condition. The
callback is a kind of 'last ditch effort' to rectify the error which occurred,
before it is returned back to the caller. This is why the implementation of STOP
is very simple:
```C
void UCNV_FROM_U_CALLBACK_STOP(...) { }
```
The error code such as U_INVALID_CHAR_FOUND is returned to the user. If the
callback determines that no error should be returned to the user, then the
callback must set the error code to U_ZERO_ERROR. Note that this is a departure
from most ICU functions, which are supposed to check the error code and return
immediately if it is set.
> :point_right: **Note**: See the functions `ucnv_cb_write...()` for
functions which a callback may use to perform its task.
#### Ignore Default_Ignorable_Code_Point
Unicode has a number of characters that are not by themselves meaningful but
assist with line breaking (e.g., U+00AD Soft Hyphen & U+200B Zero Width Space),
bi-directional text layout (U+200E Left-To-Right Mark), collation and other
algorithms (U+034F Combining Grapheme Joiner), or indicate a preference for a
particular glyph variant (U+FE0F Variation Selector 16). These characters are
"invisible" by default, that is, they should normally not be shown with a glyph
of their own, except in special circumstances. Examples include showing a hyphen
for when a Soft Hyphen was used for a line break, or modifying the glyph of a
character preceding a Variation Selector.
Unicode has a character property to identify such characters, as well as
currently-unassigned code points that are intended to be used for similar
purposes: Default_Ignorable_Code_Point, or "DI" for short:
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]>
Most charsets do not have most or any of these characters.
**ICU 54 and above by default skip default-ignorable code points if they are
unmappable**. (Ticket #[10551](http://bugs.icu-project.org/trac/ticket/10551))
**Older versions of ICU** replaced unmappable default-ignorable code points like
any other unmappable code points, by a question mark or whatever substitution
character is defined for the charset.
For best results, a custom from-Unicode callback can be used to ignore
Default_Ignorable_Code_Point characters that cannot be converted, so that they
are removed from the charset output rather than replaced by a visible character.
This is a code snippet for use in a custom from-Unicode callback:
```C
#include "unicode/uchar.h"
...
(from-Unicode callback)
    switch(reason) {
    case UCNV_UNASSIGNED:
        if(u_hasBinaryProperty(codePoint, UCHAR_DEFAULT_IGNORABLE_CODE_POINT)) {
            // Ignore/drop default ignorable code points that cannot be converted,
            // rather than treating them like errors/writing a substitution character etc.
            // For example, U+200B Zero Width Space,
            // U+200E Left-To-Right Mark, U+FE0F Variation Selector 16.
            *pErrorCode = U_ZERO_ERROR;
            return;
        } else {
            ...
```
## Modes of Conversion
When a converter is instantiated, it can be used to convert both in the Unicode
to Codepage direction, and also in the Codepage to Unicode direction. There are
three ways to use the converters, as well as a convenience function which does
not require the instantiation of a converter.
1. **Single-String**: Simplest type of conversion to or from Unicode. The data
is entirely contained within a single string.
2. **Character**: Converting from the codepage to a single Unicode codepoint,
one at a time.
3. **Buffer**: Convert data which may not fit entirely within a single buffer.
Usually the most efficient and flexible.
4. **Convenience**: Convert a single buffer from one codepage to another
through Unicode, without requiring the instantiation of a converter.
### 1. Single-String
Data must be contained entirely within a single string or buffer.
```C
conv = ucnv_open("shift_jis", &status);
/* Convert from Unicode to Shift JIS */
len = ucnv_fromUChars(conv, target, targetLen, source, sourceLen, &status);
ucnv_close(conv);
conv = ucnv_open("iso-8859-3", &status);
/* Convert from ISO-8859-3 to Unicode */
len = ucnv_toUChars(conv, target, targetSize, source, sourceLen, &status);
ucnv_close(conv);
```
### 2. Character
In this type, the input data is in the specified codepage. With each function
call, only the next Unicode codepoint is converted at a time. This might be the
most efficient way to scan for a certain character, or other processing of a
single character at a time, because converters are stateful. This works even for
multibyte charsets, and for stateful ones such as iso-2022-jp.
```C
conv = ucnv_open("Big-5", &status);
UChar32 target;
while(source < sourceLimit) {
target = ucnv_getNextUChar(conv, &source, sourceLimit, &status);
ASSERT(status);
processChar(target);
}
```
### 3. Buffered or Streamed
This is used in situations where a large document may be read in off of disk and
processed. Also, many codepages take multiple bytes to encode a character, or
have state. These factors make it impossible to convert arbitrary chunks of data
without maintaining state across chunks. Even conversion from Unicode may
encounter a leading surrogate at the end of one buffer, which needs to be paired
with the trailing surrogate in the next buffer.
A basic API principle of the ICU to/from Unicode functions is that they will
ALWAYS attempt to consume all of the input (source) data, unless the output
buffer is full or some other error occurs. In other words, there is no need to
ever test whether all of the source data has been consumed.
The basic loop that is used with the ICU buffer conversion routines is the same
in the to and from Unicode directions. In the following pseudocode, either
'source' (for fromUnicode) or 'target' (for toUnicode) are UTF-16 UChars.
```C
UErrorCode err = U_ZERO_ERROR;
while (... /*input data available*/ ) {
... /* read input data into buffer */
source = ... /* beginning of read data */;
sourceLimit = source + readLength; // end + 1
UBool flush = (no further input data available) // (i.e. feof())
/* loop until all source has been processed */
do {
/* set up target pointers */
target = ... /* beginning of output buffer */;
targetLimit = target + sizeOfOutput;
err = U_ZERO_ERROR; /* so that the to/from does not fail */
ucnv_to/fromUnicode(converter, &target, targetLimit, &source, sourceLimit, NULL, flush, &err);
... /* write (target-beginningOfOutputBuffer) items starting at beginning of output buffer */
} while (err == U_BUFFER_OVERFLOW_ERROR);
if(U_FAILURE(err)) {
... /* process error */
break; /* out of the 'while' loop that reads source data */
}
}
/* loop to read input data */
if(U_FAILURE(err)) {
... /* process error further */
}
```
The above code optimizes for processing entire chunks of input data. An
efficient size for the output buffer can be calculated as follows (in bytes):
```C
/* toUnicode (codepage bytes in, UChars out): */
ucnv_getMinCharSize() * inputBufferSize * sizeof(UChar)
/* fromUnicode (UChars in, codepage bytes out): */
ucnv_getMaxCharSize() * inputBufferSize
```
There are two loops used, an outer and an inner. The outer loop fetches input
data to keep the source buffer full, and the inner loop 'writes' out data to
keep the output buffer empty.
Note that while this efficiently handles data on the input side, there are some
cases where the size of the output buffer is fixed. For instance, in network
applications it is sometimes desirable to fill every output packet completely
(not including the last packet in the sequence). The above loop does not ensure
that every output buffer is completely full. For example, if a 4 UChar input
buffer was used, and a 3 byte output buffer with fromUnicode(), the loop would
typically write 3 bytes, then 1, then 3, and so on. If, instead of efficient use
of the input data, the goal is filling output buffers, a slightly different loop
can be used.
In such a scenario, the inner write does not occur unless a buffer overflow
occurs OR 'flush' is true. So, the 'write' and resetting of the target and
targetLimit pointers would only happen
`if(err == U_BUFFER_OVERFLOW_ERROR || flush == TRUE)`
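That variant of the inner loop can be sketched as follows (same variables as the loop above; the write-out and the resetting of the target pointer happen only on overflow or on the final, flushing call):

```C
do {
    err = U_ZERO_ERROR;
    ucnv_fromUnicode(converter, &target, targetLimit, &source, sourceLimit, NULL, flush, &err);
    if (err == U_BUFFER_OVERFLOW_ERROR || flush) {
        ... /* write (target - beginningOfOutputBuffer) bytes */
        target = ... /* beginning of output buffer */;
    }
} while (err == U_BUFFER_OVERFLOW_ERROR);
```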
The flush parameter on each conversion call should be set to FALSE, until the
conversion call is called for the last time for the buffer. This is because the
conversion is stateful. On the last conversion call, the flush parameter should
be set to TRUE. More details are mentioned in the API reference in
[ucnv.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html) .
### 4. Pre-flighting
Preflighting is the process of asking the conversion API for the size of target
buffer required. (For a more general discussion, see the Preflighting section
(§) in the [Strings](../strings/index.md) chapter.)
This is accomplished by calling the `ucnv_fromUChars` and `ucnv_toUChars` functions.
```C
UChar *uchar2;
char input_char_buffer[] = "This is some text";
int32_t targetsize;

/* Pass a NULL target to learn the required size ("preflighting"). */
targetsize = ucnv_toUChars(myConverter, NULL, 0, input_char_buffer, sizeof(input_char_buffer), &err);
if(err==U_BUFFER_OVERFLOW_ERROR) {
    err=U_ZERO_ERROR;
    uchar2=(UChar*)malloc(targetsize * sizeof(UChar));
    targetsize = ucnv_toUChars(myConverter, uchar2, targetsize,
        input_char_buffer, sizeof(input_char_buffer), &err);
    if(U_FAILURE(err)) {
        printf("ucnv_toUChars() FAILED %s\n", u_errorName(err));
    } else {
        printf("ucnv_toUChars() o.k.\n");
    }
}
```
> :point_right: **Note**: This is inefficient since the conversion is performed twice, once for finding
the size of target and once for writing to the target.
### 5. Convenience
ICU provides some convenience functions for conversions:
```C
/* C: one-call conversions with an already-open converter */
ucnv_toUChars(myConverter, target_uchars, targetsize, input_char_buffer, sizeof(input_char_buffer), &err);
ucnv_fromUChars(cnv, cTarget, (cTargetLimit-cTarget), uSource, (uSourceLimit-uSource), &errorCode);

/* C++: conversion via UnicodeString */
char target[100];
UnicodeString str("ABCDEF", "iso-8859-1");
int32_t targetsize = str.extract(0, str.length(), target, sizeof(target), "SJIS");
target[targetsize] = 0; /* NUL termination */
```
## Conversion Examples
See the [ICU Conversion
Examples](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ucnv/convsamp.cpp)
for more information.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Conversion Data
## Introduction
### Algorithmic vs. Data-based
In a comprehensive conversion library, there are three kinds of codepage
converter implementations: converters that use algorithms, converters that use
mapping data, and converters that use both.
1. Most codepages have a simple and straightforward structure but have an
arbitrary relationship between input and output character codes. Mapping
tables are necessary to define the conversion. If the codepage characters
use more than one byte each, then the mapping table must also define the
structure of the codepage.
2. Algorithmic converters work by transforming the input stream with built-in
algorithms and possibly small, hard coded tables. The conversion can be
complex, but the actual mapping of a character code is done numerically if
the converter is purely algorithmic.
3. In some cases, a converter needs to be algorithmic for its basic operations
but also relies on mapping data.
ICU provides converter implementations for all three groups of codepages. Since
ICU always converts, to or from Unicode, the purely algorithmic converters are
the ones for Unicode encodings (such as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,
UTF-32LE, SCSU, BOCU-1 and UTF-7). Since Unicode is based on US-ASCII and
ISO-8859-1 ("ISO Latin-1"), these encodings also use algorithmic converters for
performance reasons.
Most other codepages use simple byte sequences but are not encodings of Unicode.
They are converted with generic code using mapping data tables. ICU also
supports a few encodings, like ISO-2022 and its variants, that employ an
algorithmic structure to switch between a set of codepages. The converters for
these encodings are algorithmic but use mapping tables for the embedded
codepages.
### Stateful vs. Stateless
Character encodings are either stateful or stateless:
1. Stateless encodings define a byte sequence for each character. Complete
character byte sequences can be used in any order, and the same complete
character byte sequence always encodes the same character. It is
preferable to always encode one character using the same byte sequence.
2. Stateful encodings define byte sequences that change the state of the text
stream. Depending on the current state, the same byte sequence may encode a
different character and the same character may be encoded with different
byte sequences.
This distinction between stateless and stateful encodings is important, because
it determines if any available ICU converter implementation is used. The
following are some more important considerations related to stateless versus
stateful encodings:
1. A runtime converter object is always stateful, even for "stateless"
encodings. They are always stateful because an input buffer may end with a
partial byte sequence that is to be continued in the next input buffer in
the following conversion call. The information about this is stored in the
converter object. Similarly, if the input is Unicode text, then an input
buffer may end with the first of a pair of surrogates. The converter object
also stores overflow bytes or code units if the result of a character
mapping did not fit entirely into the output buffer.
2. Stateless encodings are stateful in our converter implementation to
interpret "complete byte sequences". They are "stateful" because many
encodings can have the same byte value used in different positions of byte
sequences for different characters; a specific byte value may be a lead byte
or a trail byte. For instance, the lead and trail byte values overlap in
codepages like Shift-JIS. If a program does not start reading at a character
boundary, it may instead interpret the byte sequences from two or more
separate characters as one character. Often, character boundaries can be
detected reliably only by reading the non-Unicode text linearly from the
beginning. This can be a problem for non-Unicode text processing, where text
insertion, deletion, and searching are common. The UTF-8/16/32 encodings do
not have this problem because the single, lead, and trail units have disjoint
values, so character boundaries can be found easily.
3. Some stateful encodings only switch between two states: one with one byte
per character and one with two bytes per character. This type of encoding is
very common in mainframe systems based on Extended Binary Coded Decimal
Interchange Code (EBCDIC) and is actually handled in ICU with almost the
same code and type of mapping tables as stateless codepages.
4. The classifications of algorithmic vs. data-based converters and of
stateless vs. stateful encodings are independent of each other: UTF-8,
UTF-16, and UTF-32 encodings are algorithmic but stateless; UTF-7 and SCSU
encodings are algorithmic and stateful; Windows-1252 and Shift-JIS encodings
are data-based and stateless; ISO-2022-JP encoding is algorithmic,
data-based, and stateful.
### Scope of this chapter
The following sections in this chapter discuss the mapping data tables that are
used in ICU. For related material, please see:
1. [ICU character set collection](http://icu-project.org/charts/charset/)
2. [Unicode Technical Report 22](http://www.unicode.org/unicode/reports/tr22/)
3. "Cross Mapping Tables" in [Unicode Online
Data](http://www.unicode.org/unicode/onlinedat/online.html)
## ICU Mapping Table Data Files
### Overview
As stated above, most ICU converters rely on character mapping tables. ICU 1.8
has one single data structure for all character mapping tables, which is used by
a generic Multi-Byte Character Set (MBCS) converter implementation. The
implementation is flexible enough to handle stateless encodings with the
following parameters:
1. Support for variable-length, byte-based encodings with 1 to 4 bytes per
character.
2. Support for all Unicode characters (code points 0..0x10ffff). Since ICU 1.8
uses the UTF-16 encoding as its Unicode encoding form, surrogate pairs are
completely supported.
3. Efficient distinction between unassigned (unmappable) and illegal byte
sequences.
4. It is not possible to convert from Unicode to byte sequences with leading
zero bytes.
5. Simple stateful encodings are also handled using only Shift-In and Shift-Out
(SI/SO) codes and one single-byte and one double-byte state.
> :point_right: **Note**: *In the context of conversion tables, "unassigned" code points or codepage byte
sequences are valid but do not have a **mapping**. This is different from
"unassigned" code points in a character set like Unicode or Shift-JIS which are
codes that do not have assigned **characters**.*
Prior to version 1.8, ICU used more specific, more limited, converter
implementations for Single Byte Character Set (SBCS), Double Byte Character Set
(DBCS), and the stateful Extended Binary Coded Decimal Interchange Code (EBCDIC)
codepages. Mapping table data is provided in text files. ICU comes with several
dozen .ucm files (UniCode Mapping, in icu/source/data/mappings/) that are
translated at build time by its makeconv tool (source code in
icu/source/tools/makeconv). The makeconv tool writes one binary, memory-mappable
.cnv file per .ucm file. The resulting .cnv files are included by default in the
common data file for use at runtime.
The format of the .ucm files is similar to the format of the UPMAP files as
provided by IBM® in the codepage repository and as used in the uconvdef tool on
AIX. UPMAP is a text file that specifies the mapping of a codepage character to
and from Unicode.
The format of the .cnv files is ICU-specific. The .cnv file format may change
between ICU versions even for the same .ucm files. The .ucm file format may be
extended to include more features.
The following sections concentrate on the .ucm file format. The .cnv file format
is described in the source code file icu/source/common/ucnvmbcs.c
and is updated together with the MBCS converter implementation.
These conversion tables can have more than one name. ICU allows multiple names
("aliases") for the same encoding. It matches a requested encoding name against
a list of names in icu/source/data/mappings/convrtrs.txt and when it finds a
match, ICU opens a converter with the name in the leftmost position in the
matching line. The name matching is not case-sensitive and ICU ignores spaces,
dashes, and underscores. At build time, the gencnval tool located in the
icu/source/tools/gencnval directory, generates a binary form of the convrtrs.txt
file as a data file for runtime for the cnvalias.icu file ("Converter Aliases
data file").
### .ucm File Format
.ucm files are line-oriented text files. Empty lines and comments starting with
'#' are ignored.
A .ucm file contains two sections:
1. a header with general specifications of the codepage
2. a mapping table section between the "CHARMAP" and "END CHARMAP" lines.
For example:
```
<code_set_name> "IBM-943"
<char_name_mask> "AXXXX"
<mb_cur_min> 1
<mb_cur_max> 2
<uconv_class> "MBCS"
<subchar> \xFC\xFC
<subchar1> \x7F
<icu:state> 0-7f, 81-9f:1, a0-df, e0-fc:1
<icu:state> 40-7e, 80-fc
#
CHARMAP
#
#
#ISO 10646 IBM-943
#_________ _________
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
...
<UFFE4> \xFA\x55 |1
<UFFE5> \x81\x8F |0
<UFFFD> \xFC\xFC |2
END CHARMAP
```
The header fields are:
1. code_set_name - The name of the codepage. The makeconv tool generates the
.cnv file name from the .ucm filename but uses this header field for the
converter name that it writes into the .cnv file for ucnv_getName. The
makeconv tool prints a warning message if this header field does not match
the file name. The file name is not case-sensitive.
2. char_name_mask - This is ignored by makeconv tool. "AXXXX" specifies that
the POSIX-style character "name" consists of one letter (Alpha) followed by
4 hexadecimal digits. Since ICU only uses Unicode character "names" (for
example, code points) the format is fixed (see below).
3. mb_cur_min - The minimum number of bytes per character.
4. mb_cur_max - The maximum number of bytes per character.
5. uconv_class - This can be either "SBCS", "DBCS", "MBCS", or
"EBCDIC_STATEFUL"
The most general converter class/type/category is MBCS, which requires that
the codepage structure has the following <icu:state> lines. The other types
of converters are subsets of MBCS. The makeconv tool uses predefined state
tables for these other converters when their structure is not explicitly
specified. The following describes how the converter types are interpreted:
a. MBCS: Generic ICU converter type, requires a state table
b. SBCS: Single-byte, 8-bit codepages
c. DBCS: Double-byte EBCDIC codepages
d. EBCDIC_STATEFUL: Mixed Single-Byte or Double-Byte EBCDIC codepages (stateful, using SI/SO)
The implied state tables for the non-MBCS types are predefined in the makeconv
tool. A state table may need to be overridden in order to allow supplementary
characters (U+10000 and up).
6. subchar - The substitution character byte sequence for this codepage. This sequence must be a valid byte sequence according to the codepage structure.
7. subchar1 - This is the single byte substitution character when subchar is defined. Some IBM converter libraries use different substitution characters for "narrow" and "wide" characters (single-byte and double-byte). ICU uses only one substitution character per codepage because it is common industry practice.
8. icu:state - See the "State Table Syntax in .ucm Files" section for a detailed description of how to specify a codepage structure.
9. icu:charsetFamily - This specifies if the codepage is ASCII or EBCDIC based.
The subchar and subchar1 fields have been known to cause some confusion. The
following conditions outline when each are used:
1. Conversion from Unicode to a codepage occurs and an unassigned code point is
found
a. If a subchar1 byte is defined and a subchar1 mapping is defined for the code point (with a |2 precision indicator),
output the subchar1
b. Otherwise output the regular subchar
2. Conversion from a codepage to Unicode occurs and an unassigned codepoint is found
a. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage, output U+001A
b. Otherwise output U+FFFD
In the CHARMAP section of a .ucm file, each line contains a Unicode code point
(like <U(*1-6 hexadecimal digits for the code point*)>), a codepage character
byte sequence (each byte like \\x*hh*, with 2 hexadecimal digits), and an optional
"precision" or "fallback" indicator.
The precision indicator either must be present in all mappings or in none of
them. The indicator is a pipe symbol | followed by a 0, 1, 2, 3, or 4 that has
the following meaning:
* |0 - A "normal", roundtrip mapping from a Unicode code point and back.
* |1 - A "fallback" mapping only from Unicode to the codepage, but not back.
* |2 - A subchar1 mapping. The code point is unmappable, and if a substitution
is performed, then the subchar1 should be used rather than the subchar.
Otherwise, such mappings are ignored.
* |3 - A "reverse fallback" mapping only from the codepage to Unicode, but not
back to the codepage.
* |4 - A "good one-way" mapping only from Unicode to the codepage, but not
back.
Fallback mappings from Unicode typically do not map codes for the same
character, but for "similar" ones. This mapping is sometimes done if a character
exists in Unicode but not in the codepage. To replace it, ICU maps a codepage
code to a similar-looking code for human-readable output. This mapping feature
is not useful for text data transmission especially in markup languages where a
Unicode code point can be escaped with its code point value. The ICU application
programming interface (API) ucnv_setFallback() controls this fallback behavior.
"Reverse fallbacks" are technically similar, but the same Unicode character can
be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.
A subset of the fallback mappings from Unicode is always used at runtime: Those
that map private-use Unicode code points. Fallbacks from private-use code points
are often introduced as replacements for previous roundtrip mappings for the
same pair of codes. These replacements are used when a Unicode version assigns a
new character that was previously mapped to that private-use code point. The
mapping table is then changed to map the same codepage byte sequence to the new
Unicode code point (as a new roundtrip) and the mapping from the old private-use
code point to the same codepage code is preserved as a fallback.
A "good one-way" mapping is like a fallback, but ICU always uses "good one-way"
mappings at runtime, regardless of the fallback API flag.
The idea is that fallbacks normally lose information, such as mapping from a
compatibility variant of a letter to the ASCII version; however, fallbacks from
PUA and reverse fallbacks are assumed to be for "the same character", just an
older code for it.
Something similar happens with from-Unicode Variation Selector sequences. It is
possible to round-trip (|0) either the unadorned character or the sequence with
a variation selector, and add a "good one-way" mapping (|4) from the other
version. That "good one-way" mapping does not lose much information, and it is
used even if the "use fallback" API flag is false. Alternatively, both mappings
could be fallbacks (|1) that should be controlled by the "use fallback"
attribute.
### State table syntax in .ucm files
The conversion to Unicode uses a state machine to achieve the above capabilities
with reasonable data file sizes. The state machine information itself is loaded
with the conversion data and defines the structure of the codepage, including
which byte sequences are valid, unassigned, and illegal. This data cannot (or
not easily) be computed from the pure mapping data. Instead, the .ucm files for
MBCS encodings have additional entries that are specific to the ICU makeconv
tool. The state tables for SBCS, DBCS, and EBCDIC_STATEFUL are implied, but they
can be overridden (see the examples below). These state tables are specified in
the header section of the .ucm file that contains the <icu:state> element. Each
line defines one aspect of the state machine. The state machine uses a table of
as many rows as there are states (= as many as there are <icu:state> lines).
Each row has 256 entries; one for each possible byte value.
The state table lines in the .ucm header conform to the following Extended
Backus-Naur Form (EBNF)-like grammar (whitespace is allowed between all tokens):
```
row=[[firstentry ','] entry (',' entry)*]
firstentry="initial" | "surrogates"
(initial state (default for state 0), output is all surrogate pairs)
```
Each state table row description (that follows the <icu:state>) begins with an
optional initial or surrogates keyword and is followed by one or more column
entries. For the purpose of codepage state tables, the states=rows in the table
are numbered beginning at 0 for the first line in the .ucm file header. The
numbers are assigned implicitly by the makeconv tool in order of the <icu:state>
lines.
A row may be empty (nothing following the <icu:state>) — that is equivalent to
"all illegal", or 0-ff.i, and is useful for trail byte states of all-illegal byte
sequences.
```
entry = range [':' nextstate] ['.' [action]]
range = number ['-' number]
nextstate = number (0..7f)
action = 'u' | 's' | 'p' | 'i'
(unassigned, state change only, surrogate pair, illegal)
number = (1- or 2-digit hexadecimal number)
```
Each column entry contains at least one hexadecimal byte value or value range
and is separated by a comma. The column entry specifies how to interpret an
input byte in the row's state. If neither a next state nor an action is
explicitly specified (only the byte range is given) then the byte value
terminates the byte sequence, results in a valid mapping to a Unicode BMP
character, and resets the state number to 0. The first line with <icu:state> is
called state 0.
The next state can be explicitly specified with a separating colon ( : )
followed by the number of the state (=number/index of the row, starting at 0).
This specification is mostly used for intermediate byte values (such as bytes
that are not the last ones in a sequence). The state machine needs to proceed to
the next state and read another byte. In this case, no other action is
specified.
If the byte value(s) terminate(s) a byte sequence, then the byte sequence
results in the following depending on the action that is announced with a period
( . ) followed by a letter:
| letter | meaning |
|--|---------|
| u | Unassigned. The byte sequence is valid but does not encode a character. |
| none | (no letter) - Valid. If no action letter is specified, then the byte sequence is valid and encodes a Unicode character up to U+ffff |
| p | Surrogate Pair. The byte sequence is valid and the result may map to a UTF-16 encoded surrogate pair |
| i | Illegal. The byte sequence is illegal. This is the default for all byte values in a row that are not otherwise specified with column entries|
| s | State change only. The byte sequence does not encode any character but may change the state number. This may be used with simple, stateful encodings (for example, SI/SO codes), but currently it is not used by ICU.|
If an action is specified without a next state, then the next state number
defaults to 0. In other words, a byte value (range) terminates a sequence if
there is an action specified for it, or when there is neither an action nor a
next state. In this case, the byte value defaults to "valid, next state is 0"
(equivalent to :0.).
If a byte value is not specified in any column entry row, then it is illegal in
the current state. If a byte value is specified in more than one column entry of
the same row, then ICU uses the last state. These specifications allow you to
assign common properties for a wide byte value range followed by a few
exceptions. This is easier than having to specify mutually exclusive ranges,
especially if many of them have the same properties.
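For example, a hypothetical codepage with single-byte characters 0x00-0x7F and two-byte characters whose lead bytes are 0x80-0xFF and whose trail bytes are 0x40-0xFE could be described with two state rows (a sketch following the syntax above):

```
<icu:state> 0-7f, 80-ff:1
<icu:state> 40-fe
```

In state 0, bytes 0x00-0x7F are final and map to BMP characters, while bytes 0x80-0xFF proceed to state 1. In state 1, bytes 0x40-0xFE are final; all other byte values are illegal by default.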
The optional keyword at the beginning of a state line has the following effect:
| keyword | effect |
|---------|--------|
| initial | The state machine can start reading byte sequences in this state. State 0 is always an initial state. Only initial states can be next states for final byte values. In an initial state, the Unicode mappings for all final bytes are also stored directly in the state table.
| surrogates | All Unicode mappings for final bytes in non-initial states are stored in a separate table of 16-bit Unicode (UTF-16) code units. Since most legacy codepages map only to Unicode code points up to U+ffff (the Basic Multilingual Plane, BMP), the default allocation per mapping result is one 16-bit unit. Individual byte values can be specified to map to surrogate pairs (= two 16-bit units) with action letter p. The surrogates keyword specifies the values for the entire state (row). Surrogate pair mapping entries can still hold single units depending on the actual mapping data, but single-unit mapping entries cannot hold a pair of units. Mapping to single-unit entries is the default because the mapping is faster, uses half as much memory in the code units table, and is sufficient for most legacy codepages.|
When converting to Unicode, the state machine starts in state number 0. In each
iteration, the state machine reads one input (codepage) byte and either proceeds
to the next state as specified, or treats it as a final byte with the specified
action and an optional non-0 next (initial) state. This means that a state table
needs to have at least as many state rows as the maximum number of bytes per
character, which is the maximum length of any byte sequence.
Exception: For EBCDIC_STATEFUL codepages, double-byte sequences start in state
1, with the SI/SO bytes switching from state 0 to state 1 or from state 1 to
state 0. See the default state table below.
### Extension and delta tables
ICU 2.8 adds an additional "extension" data structure to its conversion tables.
The new data structure supports a number of new features. When any of the
following features are used, then all mappings must use a precision indicator.
#### Converting multiple characters as a unit
Before ICU 2.8, only one Unicode code point could be converted to or from one
complete codepage byte sequence. The new data structure supports the conversion
between multiple Unicode code points and multiple complete codepage byte
sequences. (A "complete codepage byte sequence" is a sequence of bytes which is
valid according to the state table.)
Syntax: Simply write more than one Unicode code point on a mapping line, and/or
more than one complete codepage byte sequence. Plus signs (+) are optional
between code points and between bytes. For example,
ibm-1390_P110-2003.ucm contains
<U304B><U309A> \xEC\xB5 |0
and test3.ucm contains
<U101234>+<U50005>+<U60006> \x07+\x00+\x01\x02\x0f+\x09 |0
For more examples see the ICU conversion data and the
icu/source/test/testdata/test*.ucm test data files.
ICU 2.8 supports up to 19 UChars on the Unicode side of a mapping and up to 31
bytes on the codepage side.
The longest match possible is converted in order to properly handle tables where
the source sides of some mappings are prefixes of the source sides of other
mappings.
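The longest-match rule can be sketched with a small self-contained example. The two mappings below use made-up byte values and labels purely for illustration (they are not from any real table); the point is that a source which is a prefix of another source must lose to the longer match:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two hypothetical mappings (made-up byte values, not from a real
 * table); the source of the first is a prefix of the second's. */
struct mapping {
    const unsigned char *bytes;
    size_t len;
    const char *label;
};

static const unsigned char kShort[] = { 0x07 };
static const unsigned char kLong[]  = { 0x07, 0x00 };
static const struct mapping kTable[] = {
    { kShort, 1, "short" },
    { kLong,  2, "long"  },
};

/* Return the label of the longest mapping whose source matches a
 * prefix of the input, or NULL if no mapping matches. */
const char *longest_match(const unsigned char *in, size_t len) {
    const char *best = NULL;
    size_t bestLen = 0;
    for (size_t i = 0; i < sizeof kTable / sizeof kTable[0]; i++) {
        if (kTable[i].len <= len && kTable[i].len > bestLen &&
            memcmp(in, kTable[i].bytes, kTable[i].len) == 0) {
            best = kTable[i].label;
            bestLen = kTable[i].len;
        }
    }
    return best;
}
```

With input 0x07 0x00, the two-byte mapping wins even though the one-byte mapping also matches; only if the second byte differs does the one-byte mapping apply.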
As a side effect, if conversion offsets are written and a potential match
crosses buffer boundaries, then some of the initial offsets for the following
output may be unknown (-1) because their input was stored in the converter from
a previous buffer while looking for a longer match.
Conversion tables for SI/SO-stateful (usually EBCDIC_STATEFUL) codepages cannot
include mappings with SI or SO bytes or where there are SBCS characters in a
multi-character byte sequence. In other words, for these tables there must be
exactly one byte in a mapping or else a sequence of one or more DBCS characters.
#### Delta (extension-only) conversion table files
Physically, a binary conversion table (.cnv) file automatically contains both a
traditional "base table" data structure for the 1:1 mappings and a new
"extension table" for the m:n mappings if any are encountered in the .ucm file.
An extension table can also be requested manually by splitting the CHARMAP into
two. The first CHARMAP section will be used for the base table, and the second
only for the extension table. M:n mappings in the first CHARMAP will be moved to
the extension table.
In order to save space for very similar conversion tables, it is possible to
create delta .cnv files that contain only an extension table and the name of
another .cnv file with a base table. The base file must be split into two
CHARMAPs such that the base file's base table does not contain any mappings that
contradict any of the delta file's mappings.
The delta (extension-only) file uses only a single CHARMAP section. In addition,
it needs a line in the header that both causes building just a delta file and
specifies the name of the base file. For example, windows-936-2000.ucm contains
<icu:base> "ibm-1386_P100-2002"
makeconv ignores all mappings for the delta file that are also in the base
file's base table. If the two conversion tables are sufficiently similar, then
the delta file will contain only a relatively small set of mappings, which
results in a small .cnv file. At runtime, both the delta file and its base file
are loaded, and the base file's base table is used together with the extension
file. The base file works as a standalone file, using its own extension table
for its full set of mappings. The base file must be in the same ICU data package
as the delta file.
The hard part is to split the base file's mappings into base and extension
CHARMAPs such that the base table does not overlap with any delta file, while
all shared mappings should be in the base table. (The base table data structure
is more compact than the extension table data structure.)
ICU provides the ucmkbase tool in the
[ucmtools](https://github.com/unicode-org/icu-data/tree/master/charset/source/ucmtools)
collection to do this.
For example, the following illustrates how to use ucmkbase to make a base .ucm
file for three Shift-JIS conversion table variants. (ibm-943_P15A-2003.ucm
becomes the base.)
```
C:\tmp\icu\ucm>ren ibm-943_P15A-2003.ucm ibm-943_P15A-2003.orig
C:\tmp\icu\ucm>ucmkbase ibm-943_P15A-2003.orig ibm-943_P130-1999.ucm ibm-942_P12A-1999.ucm > ibm-943_P15A-2003.ucm
```
After this, the two delta .ucm files only need to get the following line added
before the start of their CHARMAPs:
```
<icu:base> "ibm-943_P15A-2003"
```
The ICU tools and runtime code handle DBCS-only conversion tables specially,
allowing them to be built into delta files with MBCS or EBCDIC_STATEFUL base
files without using their single-byte mappings, and without ucmkbase moving the
single-byte mappings of the base file into the base file's extension table. See
for example ibm-16684_P110-2003.ucm and ibm-1390_P110-2003.ucm.
#### Other enhancements
ICU 2.8 adds support for the specification of which unassigned Unicode code
points should be mapped to subchar1 rather than the default subchar. See the
discussion of subchar1 above for more details.
The extension table data structure also removes one minor limitation on ICU
conversion tables: Fallback mappings to a single byte 00 are now allowed and
handled properly. ICU versions before 2.8 could only handle roundtrips to/from
00.
### Examples for codepage state tables
The following shows the exact implied state tables for non-MBCS types. A state
table may need to be overwritten in order to allow supplementary characters
(U+10000 and up).
US-ASCII
```
0-7f
```
This single-row state table describes US-ASCII. Byte values from 0 to 0x7f are
valid and map to Unicode characters up to U+ffff. Byte values from 0x80 to 0xff
are illegal.
Shift-JIS
```
0-7f, 81-9f:1, a0-df, e0-fc:1
40-7e, 80-fc
```
This two-row state table describes the Shift-JIS structure which encodes some
characters with one byte each and others with two bytes each. Bytes 0 to 0x7f
and 0xa0 to 0xdf are valid single-byte encodings. Bytes 0x81 to 0x9f and 0xe0 to
0xfc are lead bytes. (For example, they are followed by one of the bytes that is
specified as valid in state 1). A byte sequence of 0x85 0x61 is valid while a
single byte of 0x80 or 0xff is illegal. Similarly, a byte sequence of 0x85 0x31
is illegal.
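The structure that this two-row table encodes can be modeled directly. The following self-contained C sketch illustrates only the table's semantics (it is not ICU code and performs no Unicode mapping); it checks whether a byte sequence is structurally valid Shift-JIS:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the two-row Shift-JIS state table above:
 *   state 0: 0-7f, 81-9f:1, a0-df, e0-fc:1
 *   state 1: 40-7e, 80-fc
 */
static int is_lead(unsigned char b) {   /* bytes that go to state 1 */
    return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}
static int is_single(unsigned char b) { /* valid final bytes in state 0 */
    return b <= 0x7f || (b >= 0xa0 && b <= 0xdf);
}
static int is_trail(unsigned char b) {  /* valid final bytes in state 1 */
    return (b >= 0x40 && b <= 0x7e) || (b >= 0x80 && b <= 0xfc);
}

/* Returns 1 if the byte sequence is structurally valid Shift-JIS. */
int sjis_valid(const unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        if (is_lead(s[i])) {
            /* lead byte: the next byte must be a valid trail byte */
            if (i + 1 >= len || !is_trail(s[i + 1])) {
                return 0;
            }
            i += 2; /* after the trail byte we are back in state 0 */
        } else if (is_single(s[i])) {
            i += 1; /* single-byte character */
        } else {
            return 0; /* e.g. 0x80 in state 0 is illegal */
        }
    }
    return 1;
}
```

With this model, the sequence 0x85 0x61 validates, while a lone 0x80 or the sequence 0x85 0x31 is rejected, matching the cases described above.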
EUC-JP
```
0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
a1-fe
a1-e4
a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
a1-fe.u
```
This fairly complicated state table describes EUC-JP. Valid byte sequences are
one, two, or three bytes long. Two-byte sequences have a lead byte of 0x8e and
end in state 2, or have lead bytes 0xa1 to 0xfe and end in state 1. Three-byte
sequences have a lead byte of 0x8f and continue in state 3. Some final byte
value ranges are entirely unassigned, therefore they end in state 4 with an
action letter of u for "unassigned" to save significant memory for the code
units table. Assigned three-byte sequences end in state 1 like most two-byte
sequences.
SBCS default state table:
```
0-ff
```
SBCS by default implies the structure for single-byte, 8-bit codepages.
DBCS default state table:
```
0-3f:3, 40:2, 41-fe:1, ff:3
41-fe
40
```
**Important**:
These are four states; the fourth row is empty (equivalent to 0-ff.i)!
DBCS codepages, by default, are defined with the EBCDIC double-byte structure.
Valid sequences are pairs of bytes from 0x41 to 0xfe and the one pair 0x40/0x40
for the double-byte space. The structure is defined such that all illegal byte
sequences are always two in length. Therefore, every byte in the initial state
is a lead byte.
EBCDIC_STATEFUL default state table:
```
0-ff, e:1.s, f:0.s
initial, 0-3f:4, e:1.s, f:0.s, 40:3, 41-fe:2, ff:4
0-40:1.i, 41-fe:1., ff:1.i
0-ff:1.i, 40:1.
0-ff:1.i
```
This is the structure of Mixed Single-byte and Double-byte EBCDIC codepages,
which are stateful and use the Shift-In/Shift-Out (SI/SO) bytes 0x0f/0x0e. The
initial state 0 is almost the same as for SBCS except for SI and SO. State 1 is
also an initial state and is the basis for a state-shifted version of the DBCS
structure above. All double-byte sequences return to state 1 and SI switches
back to state 0. SI and SO are also allowed in their own states with no effect.
> :point_right: **Note**: *If a DBCS or EBCDIC_STATEFUL codepage maps supplementary (non-BMP) Unicode
characters, then a modified state table needs to be specified in the .ucm file.
The state table needs to use the surrogates designation for a table row or .p
for some entries.<br/> The reuse of a final or intermediate state (shown for EUC-JP) is valid as
long as there is no cycle in the state chain. The mappings will be unique
because of the different path to the shared state (sharing a state saves some
memory; each state table row occupies 1kB in the .cnv file). This table also
shows the redefinition of byte value ranges within one state row (state number
3) as shorthand. State 3 defines bytes a1-fe to go to state 1, but the following
entries redefine and override certain bytes to go to state 4.*
An initial state never needs a surrogates designation or .p because Unicode
mapping results in initial states are stored directly in the state table, with
enough room in each cell. The size of a generated .cnv mapping table
file depends primarily on the number and distribution of the mappings and on the
number of valid, multi-byte sequences that the state table allows. Each state
table row takes up one kilobyte.
For single-byte codepages, the state table cells contain all of the Unicode
mappings. Code point results for multi-byte sequences are stored in an array
with enough room for all valid byte sequences. For all byte sequences that end
in a surrogates or .p state, two Unicode code units are allocated.
If possible, valid state table entries may be changed to .u to reduce the number
of valid, assignable sequences and to make the .cnv file smaller. If additional
states are necessary, then each additional state itself adds 1kB to the file
size, diminishing the file size savings. See the EUC-JP example above.
For codepages with up to two bytes per character, the makeconv tool
automatically compacts the bytes, if possible, by introducing one more trail
byte state. This state replaces valid entries in the original trail state with
unassigned entries and changes each lead byte entry to work with the new state
if there are no mappings with that lead byte.
For codepages with up to three or four bytes per character, compaction must be
done manually. However, if the verbose option is set on the command line, the
makeconv tool will print useful information about unassigned byte sequences.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Character Set Detection
## Overview
Character set detection is the process of determining the character set, or
encoding, of character data in an unknown format. This is, at best, an imprecise
operation using statistics and heuristics. Because of this, detection works best
if you supply at least a few hundred bytes of character data that's mostly in a
single language. In some cases, the language can be determined along with the
encoding.
Several different techniques are used for character set detection. For
multi-byte encodings, the sequence of bytes is checked for legal patterns. The
detected characters are also checked against a list of frequently used
characters in that encoding. For single-byte encodings, the data is checked
against a list of the most commonly occurring three-letter groups for each
language that can be written using that encoding. The detection process can be
configured to optionally ignore html or xml style markup, which can interfere
with the detection process by changing the statistics.
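As an illustration of the single-byte heuristic, the sketch below scores input text against a tiny, made-up list of frequent English trigrams. This is a toy model of the idea only; ICU's actual detector uses much larger per-language tables and different scoring:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A tiny, made-up list of frequent English trigrams. A real detector
 * ships much larger tables, one per language/encoding pair. */
static const char *kTrigrams[] = { "the", "and", "ing", "ent" };

/* Count how many 3-byte windows of the input appear in the trigram list.
 * A higher count means the text is more plausibly in that language. */
int trigram_hits(const char *text, size_t len) {
    int hits = 0;
    for (size_t i = 0; i + 3 <= len; i++) {
        for (size_t t = 0; t < sizeof kTrigrams / sizeof kTrigrams[0]; t++) {
            if (strncmp(text + i, kTrigrams[t], 3) == 0) {
                hits++;
                break;
            }
        }
    }
    return hits;
}
```

A detector computes a score like this for every candidate language/encoding pair and reports the best-scoring candidates, which is why a few hundred bytes of mostly single-language text give the most reliable results.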
The input data can either be a Java input stream, or an array of bytes. The
output of the detection process is a list of possible character sets, with the
most likely one first. For simplicity, you can also ask for a Java Reader that
will read the data in the detected encoding.
There is another character set detection C++ library, the [Compact Encoding
Detector](https://github.com/google/compact_enc_det), that may have a lower
error rate, particularly when working with short samples of text.
## CharsetMatch
The CharsetMatch class holds the result of comparing the input data to a
particular encoding. You can use an instance of this class to get the name of
the character set, the language, and how good the match is. You can also use
this class to decode the input data.
To find out how good the match is, you use the getConfidence() method to get a
*confidence value*. This is an integer from 0 to 100. The higher the value, the
more confidence there is in the match. For example:

```Java
CharsetMatch match = ...;
int confidence;
confidence = match.getConfidence();
if (confidence < 50) {
    // handle a poor match...
} else {
    // handle a good match...
}
```
In C, you can use the
`ucsdet_getConfidence(const UCharsetMatch *ucsm, UErrorCode *status)`
method to get a confidence value:
```C
const UCharsetMatch *ucm;
UErrorCode status = U_ZERO_ERROR;
int32_t confidence = ucsdet_getConfidence(ucm, &status);
if (confidence < 50) {
// handle a poor match...
} else {
// handle a good match...
}
```
To get the name of the character set, which can be used as an encoding name in
Java, you use the getName() method:
```Java
CharsetMatch match = ...;
byte characterData[] = ...;
String charsetName;
String unicodeData;
charsetName = match.getName();
unicodeData = new String(characterData, charsetName);
```
To get the name of the character set in C:
```C
const UCharsetMatch *ucm;
UErrorCode status = U_ZERO_ERROR;
const char *name = ucsdet_getName(ucm, &status);
```
To get the three letter ISO code for the detected language, you use the
getLanguage() method. If the language could not be determined, getLanguage()
will return null. Note that language detection does not work with all charsets,
and includes only a very small set of possible languages. It should not be used
if robust, reliable language detection is required.
```Java
CharsetMatch match = ...;
String languageCode;
languageCode = match.getLanguage();
if (languageCode != null) {
// handle the language code...
}
```
The `ucsdet_getLanguage(const UCharsetMatch *ucsm, UErrorCode *status)` method
can be used in C to get the language code. If the language could not be
determined, the method will return an empty string.
```C
const UCharsetMatch *ucm;
UErrorCode status = U_ZERO_ERROR;
const char *language = ucsdet_getLanguage(ucm, &status);
```
If you want to get a Java String containing the converted data you can use the
getString() method:
```Java
CharsetMatch match = ...;
String unicodeData;
unicodeData = match.getString();
```
If you want to limit the number of characters in the string, pass the maximum
number of characters you want to the getString() method:
```Java
CharsetMatch match = ...;
String unicodeData;
unicodeData = match.getString(1024);
```
To get a java.io.Reader to read the converted data, use the getReader() method:
```Java
CharsetMatch match = ...;
Reader reader;
StringBuffer sb = new StringBuffer();
char[] buffer = new char[1024];
int bytesRead = 0;
reader = match.getReader();
while ((bytesRead = reader.read(buffer, 0, 1024)) >= 0) {
sb.append(buffer, 0, bytesRead);
}
reader.close();
```
## CharsetDetector
The CharsetDetector class does the actual detection. It matches the input data
against all character sets, and computes a list of CharsetMatch objects to hold
the results. The input data can be supplied as an array of bytes, or as a
java.io.InputStream.
To use a CharsetDetector object, first you construct it, and then you set the
input data, using the setText() method. Because setting the input data is
separate from the construction, it is easy to reuse a CharsetDetector object:
```Java
CharsetDetector detector;
byte[] byteData = ...;
InputStream streamData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
// use detector with byte data...
detector.setText(streamData);
// use detector with stream data...
```
If you want to know which character set matches your input data with the highest
confidence, you can use the detect() method, which will return a CharsetMatch
object for the match with the highest confidence:
```Java
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
```
If you want to know which character set matches your input data in C, you can
use the `ucsdet_detect(UCharsetDetector *csd, UErrorCode *status)` method.
```C
UCharsetDetector *csd;
const UCharsetMatch *ucm;
static char buffer[BUFFER_SIZE] = {....};
int32_t inputLength = ... // length of the input text
UErrorCode status = U_ZERO_ERROR;
ucsdet_setText(csd, buffer, inputLength, &status);
ucm = ucsdet_detect(csd, &status);
```
If you want to know all of the character sets that could match your input data
with a non-zero confidence, you can use the detectAll() method, which will
return an array of CharsetMatch objects sorted by confidence, from highest to
lowest:
```Java
CharsetDetector detector;
CharsetMatch matches[];
byte[] byteData = ...;
detector = new CharsetDetector();
detector.setText(byteData);
matches = detector.detectAll();
for (int m = 0; m < matches.length; m += 1) {
// process this match...
}
```
> :point_right: **Note**: The
`ucsdet_detectAll(UCharsetDetector *csd, int32_t *matchesFound, UErrorCode *status)`
method can be used in C to detect all of the matching character sets;
matchesFound is a pointer to a variable that will be set to the number of
charsets identified that are consistent with the input data.
The CharsetDetector class also implements a crude *input filter* that can strip
out html and xml style tags. If you want to enable the input filter, which is
disabled when you construct a CharsetDetector, you use the enableInputFilter()
method, which takes a boolean. Pass in true if you want to enable the input
filter, and false if you want to disable it:
```Java
CharsetDetector detector;
CharsetMatch match;
byte[] byteDataWithTags = ...;
detector = new CharsetDetector();
detector.setText(byteDataWithTags);
detector.enableInputFilter(true);
match = detector.detect();
```
To enable an input filter in C, you can use the
`ucsdet_enableInputFilter(UCharsetDetector *csd, UBool filter)` function.
```C
UCharsetDetector *csd;
const UCharsetMatch *ucm;
static char buffer[BUFFER_SIZE] = {....};
int32_t inputLength = ... // length of the input text
UErrorCode status = U_ZERO_ERROR;
ucsdet_setText(csd, buffer, inputLength, &status);
ucsdet_enableInputFilter(csd, TRUE);
ucm = ucsdet_detect(csd, &status);
```
If you have more detailed knowledge about the structure of the input data, it is
better to filter the data yourself before you pass it to CharsetDetector. For
example, you might know that the data is from an html page that contains CSS
styles, which will not be stripped by the input filter.
You can use the inputFilterEnabled() method to see if the input filter is
enabled:
```Java
CharsetDetector detector;
detector = new CharsetDetector();
// do a bunch of stuff with detector
// which may or may not enable the input filter...
if (detector.inputFilterEnabled()) {
// handle enabled input filter
} else {
// handle disabled input filter
}
```
> :point_right: **Note**: The ICU4C API provides the
`ucsdet_isInputFilterEnabled(const UCharsetDetector *csd)` function to check
whether the input filter is enabled.
The CharsetDetector class also has two convenience methods that let you detect
and convert the input data in one step: the getReader() and getString() methods:
```Java
CharsetDetector detector;
byte[] byteData = ...;
InputStream streamData = ...;
String unicodeData;
Reader unicodeReader;
detector = new CharsetDetector();
unicodeData = detector.getString(byteData, null);
unicodeReader = detector.getReader(streamData, null);
```
> :point_right: **Note**: The second argument to the getReader() and getString() methods is a
String called declaredEncoding, which is not currently used. There is also a
setDeclaredEncoding() method, which is also not currently used.
The following code is equivalent to using the convenience methods:
```Java
CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;
InputStream streamData = ...;
String unicodeData;
Reader unicodeReader;
detector = new CharsetDetector();
detector.setText(byteData);
match = detector.detect();
unicodeData = match.getString();
detector.setText(streamData);
match = detector.detect();
unicodeReader = match.getReader();
```
## Detected Encodings
The following table shows all the encodings that can be detected. You can get
this list (without the languages) by calling the getAllDetectableCharsets()
method:
| **Character Set** | **Languages** |
| ----------------- | ------------- |
| UTF-8 | &nbsp; |
| UTF-16BE | &nbsp; |
| UTF-16LE | &nbsp; |
| UTF-32BE | &nbsp; |
| UTF-32LE | &nbsp; |
| Shift_JIS | Japanese |
| ISO-2022-JP | Japanese |
| ISO-2022-CN | Simplified Chinese |
| ISO-2022-KR | Korean |
| GB18030 | Chinese |
| Big5 | Traditional Chinese |
| EUC-JP | Japanese |
| EUC-KR | Korean |
| ISO-8859-1 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| ISO-8859-2 | Czech, Hungarian, Polish, Romanian |
| ISO-8859-5 | Russian |
| ISO-8859-6 | Arabic |
| ISO-8859-7 | Greek |
| ISO-8859-8 | Hebrew |
| ISO-8859-9 | Turkish |
| windows-1250 | Czech, Hungarian, Polish, Romanian |
| windows-1251 | Russian |
| windows-1252 | Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish |
| windows-1253 | Greek |
| windows-1254 | Turkish |
| windows-1255 | Hebrew |
| windows-1256 | Arabic |
| KOI8-R | Russian |
| IBM420 | Arabic |
| IBM424 | Hebrew |
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Conversion
## Conversion Overview
A converter is used to convert from one character encoding to another. In the
case of ICU, the conversion is always between Unicode and another encoding, or
vice-versa. A text encoding is a particular mapping from a given character set
definition to the actual bits used to represent the data.
Unicode provides a single character set that covers the major languages of the
world, and a small number of machine-friendly encoding forms and schemes to fit
the needs of existing applications and protocols. It is designed for best
interoperability with both ASCII and ISO-8859-1 (the most widely used character
sets) to make it easier for Unicode to be used in almost all applications and
protocols.
Hundreds of encodings have been developed over the years, each for small groups
of languages and for special purposes. As a result, the interpretation of text,
input, sorting, display, and storage depends on the knowledge of all the
different types of character sets and their encodings. Programs have been
written to handle either one single encoding at a time and switch between them,
or to convert between external and internal encodings.
There is no single, authoritative source of precise definitions of many of the
encodings and their names. However,
[IANA](http://www.iana.org/assignments/character-sets) is the best source for
names, and our Character Set repository is a good source of encoding definitions
for each platform.
Transferring text from one machine to another often causes some loss of
information. Different platforms can interpret the same text differently. For
example, Shift-JIS can be interpreted differently on Windows™ than on UNIX®.
Windows maps byte value 0x5C to the backslash symbol, while some UNIX machines
map that byte value to the Yen symbol. Another problem arises when a codepage
character looks like both the Unicode Greek letter Mu and the Unicode micro
symbol. Some platforms map this codepage byte sequence to one of these Unicode
characters, while other platforms map it to the other. Fallbacks can partially
fix this problem by mapping both
Unicode characters to the same codepage byte sequence. Even though some
character information is lost, the text is still readable.
ICU's converter API has the following main features:
1. Unicode surrogate support
2. Support for all major encodings
3. Consistent text conversion across all computer platforms
4. Text data can be streamed (buffered) through the API
5. Fast text conversion
6. Supports fallbacks to the codepage
7. Supports reverse fallbacks to Unicode
8. Allows callbacks for handling and substituting invalid or unmapped byte
sequences
9. Allows a user to add support for unsupported encodings
This section deals with the processes of converting encodings to and from
Unicode.
## Recommendations
1. **Use Unicode encodings whenever possible.** Together with Unicode for
internal processing, it makes completely globalized systems possible and
avoids the many problems with non-algorithmic conversions. (For a discussion
of such problems, see for example ["Character Conversions and Mapping
Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
on <http://icu-project.org/docs/> and the [XML Japanese
Profile](http://www.w3.org/TR/japanese-xml/) .)
1. Use UTF-8 and UTF-16.
2. Use UTF-16BE, SCSU and BOCU-1 as appropriate.
3. In special environments, other Unicode encodings may be used as well,
such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and
CESU-8. (For turning Unicode filenames into ASCII-only filename strings,
the IMAP-mailbox-name encoding can be used.)
4. Do not exchange text with single/unpaired surrogates.
2. **Use legacy charsets only when absolutely necessary**. For best data
fidelity:
1. ISO-8859-1 is relatively unproblematic — if its limited character
repertoire is sufficient — because it is converted trivially (1:1) to
Unicode, avoiding conversion table problems for its small set of
characters. (By contrast, proper conversion from US-ASCII requires a
check for illegal byte values 0x80..0xff, which is an unnecessary
complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly
as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)
2. If you need to communicate with a certain platform, then use the same
conversion tables as that platform itself, or at least ones that are
very, very close.
3. ICU's conversion table repository contains hundreds of Unicode
conversion tables from a number of common vendors and platforms as well
as comparisons between these conversion tables:
<http://icu-project.org/charts/charset/> .
4. Do not trust codepage documentation that is not machine-readable, for
example nice-looking charts: They are usually incomplete and out of
date.
5. ICU's default build includes about 200 conversion tables. See the [ICU
Data](../icudata.md) chapter for how to add or remove conversion tables
and other data.
6. In ICU, you can (and should) also use APIs that map a charset name
together with a standard/platform name. This allows you to get different
converters for the same ambiguous charset name (like "Shift-JIS"),
depending on the standard or platform specified. See the
[convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt)
alias table, the [Using Converters](converters.md) chapter and [API
references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html) .
7. For data exchange (rather than pure display), turn off fallback
mappings: ucnv_setFallback(cnv, FALSE);
8. For some text formats, especially XML and HTML, it is possible to set an
"escape callback" function that turns unmappable Unicode code points
into corresponding escape sequences, preventing data loss. See the API
references and the [ucnv sample
code](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/ucnv/)
.
9. **Never modify a conversion table.** Instead, use existing ones that
match precisely those in systems with which you communicate. "Modifying"
a conversion table in reality just creates a new one, which makes the
whole situation even less manageable.
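The point above that ISO-8859-1 converts trivially (1:1) to Unicode, while proper US-ASCII conversion needs a legality check for 0x80..0xff, can be shown in a few lines of C. This is a sketch of the two conversions themselves, not of any ICU API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: every byte 0x00-0xff maps to the Unicode code point with
 * the same value, so conversion is plain zero-extension. */
void latin1_to_utf16(const unsigned char *src, size_t len, uint16_t *dst) {
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}

/* US-ASCII: the same loop additionally has to reject bytes 0x80-0xff.
 * Returns 0 on success, -1 on the first illegal byte. */
int ascii_to_utf16(const unsigned char *src, size_t len, uint16_t *dst) {
    for (size_t i = 0; i < len; i++) {
        if (src[i] > 0x7f) {
            return -1;
        }
        dst[i] = src[i];
    }
    return 0;
}
```

The Latin-1 loop can never fail, which is why ISO-8859-1 avoids the conversion table problems described above; the ASCII variant must report an error path for the upper byte range.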
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Calendar Examples
## Calendar for Default Time Zone
These C++, C, and Java examples get a Calendar based on the default time zone
and add days to a date.
**C++**
```C++
UErrorCode status = U_ZERO_ERROR;
GregorianCalendar* gc = new GregorianCalendar(status);
if (U_FAILURE(status)) {
puts("Couldn't create GregorianCalendar");
return;
}
// set up the date
gc->set(2000, Calendar::FEBRUARY, 26);
gc->set(Calendar::HOUR_OF_DAY, 23);
gc->set(Calendar::MINUTE, 0);
gc->set(Calendar::SECOND, 0);
gc->set(Calendar::MILLISECOND, 0);
// Iterate through the days and print it out.
for (int32_t i = 0; i < 30; i++) {
// print out the date.
// You should use the DateFormat to properly format it
printf("year: %d, month: %d (%d in the implementation), day: %d\n",
gc->get(Calendar::YEAR, status),
gc->get(Calendar::MONTH, status) + 1,
gc->get(Calendar::MONTH, status),
gc->get(Calendar::DATE, status));
if (U_FAILURE(status)) {
puts("Calendar::get failed");
return;
}
// Add a day to the date
gc->add(Calendar::DATE, 1, status);
if (U_FAILURE(status)) {
puts("Calendar::add failed");
return;
}
}
delete gc;
```
**C**
```C
UErrorCode status = U_ZERO_ERROR;
int32_t i;
UCalendar* cal = ucal_open(NULL, -1, NULL, UCAL_GREGORIAN, &status);
if (U_FAILURE(status)) {
puts("Couldn't create GregorianCalendar");
return;
}
// set up the date
ucal_set(cal, UCAL_YEAR, 2000);
ucal_set(cal, UCAL_MONTH, UCAL_FEBRUARY); /* FEBRUARY */
ucal_set(cal, UCAL_DATE, 26);
ucal_set(cal, UCAL_HOUR_OF_DAY, 23);
ucal_set(cal, UCAL_MINUTE, 0);
ucal_set(cal, UCAL_SECOND, 0);
ucal_set(cal, UCAL_MILLISECOND, 0);
// Iterate through the days and print it out.
for (i = 0; i < 30; i++) {
// print out the date.
// You should use the udat_* API to properly format it
printf("year: %d, month: %d (%d in the implementation), day: %d\n",
ucal_get(cal, UCAL_YEAR, &status),
ucal_get(cal, UCAL_MONTH, &status) + 1,
ucal_get(cal, UCAL_MONTH, &status),
ucal_get(cal, UCAL_DATE, &status));
if (U_FAILURE(status)) {
puts("Calendar::get failed");
return;
}
// Add a day to the date
ucal_add(cal, UCAL_DATE, 1, &status);
if (U_FAILURE(status)) {
puts("Calendar::add failed");
return;
}
}
ucal_close(cal);
```
**Java**
```Java
Calendar cal = new GregorianCalendar();
if (cal == null) {
System.out.println("Couldn't create GregorianCalendar");
return;
}
// set up the date
cal.set(Calendar.YEAR, 2000);
cal.set(Calendar.MONTH, Calendar.FEBRUARY); /* FEBRUARY */
cal.set(Calendar.DATE, 26);
cal.set(Calendar.HOUR_OF_DAY, 23);
cal.set(Calendar.MINUTE, 0);
cal.set(Calendar.SECOND, 0);
cal.set(Calendar.MILLISECOND, 0);
// Iterate through the days and print it out.
for (int i = 0; i < 30; i++) {
// print out the date.
System.out.println(" year: " + cal.get(Calendar.YEAR) +
" month: " + (cal.get(Calendar.MONTH) + 1) +
" day : " + cal.get(Calendar.DATE)
);
cal.add(Calendar.DATE, 1);
}
```
These C++, C, and Java examples demonstrate converting dates from one calendar
(Gregorian) to another (Japanese).
**C++**
```C++
UErrorCode status = U_ZERO_ERROR;
UDate time;
Calendar *cal1, *cal2;
// Create a new Gregorian Calendar.
cal1 = Calendar::createInstance("en_US@calendar=gregorian", status);
if (U_FAILURE(status)) {
printf("Error creating Gregorian calendar.\n");
return;
}
// Set the Gregorian Calendar to a specific date for testing.
cal1->set(1980, UCAL_SEPTEMBER, 3);
// Display the date.
printf("Gregorian Calendar:\t%d/%d/%d\n",
cal1->get(UCAL_MONTH, status) + 1,
cal1->get(UCAL_DATE, status),
cal1->get(UCAL_YEAR, status));
if (U_FAILURE(status)) {
printf("Error getting Gregorian date.");
return;
}
// Create a Japanese Calendar.
cal2 = Calendar::createInstance("ja_JP@calendar=japanese", status);
if (U_FAILURE(status)) {
printf("Error creating Japanese calendar.\n");
return;
}
// Set the date.
time = cal1->getTime(status);
if (U_FAILURE(status)) {
printf("Error getting time.\n");
return;
}
cal2->setTime(time, status);
if (U_FAILURE(status)) {
printf("Error setting the date for Japanese calendar.\n");
return;
}
// Set the timezone
cal2->setTimeZone(cal1->getTimeZone());
// Display the date.
printf("Japanese Calendar:\t%d/%d/%d\n",
cal2->get(UCAL_MONTH, status) + 1,
cal2->get(UCAL_DATE, status),
cal2->get(UCAL_YEAR, status));
if (U_FAILURE(status)) {
printf("Error getting Japanese date.");
return;
}
delete cal1;
delete cal2;
```
**C**
```C
UErrorCode status = U_ZERO_ERROR;
UDate time;
UCalendar *cal1, *cal2;
// Create a new Gregorian Calendar.
cal1 = ucal_open(NULL, -1, "en_US@calendar=gregorian", UCAL_TRADITIONAL,
&status);
if (U_FAILURE(status)) {
printf("Couldn't create Gregorian Calendar.");
return;
}
// Set the Gregorian Calendar to a specific date for testing.
ucal_setDate(cal1, 1980, UCAL_SEPTEMBER, 3, &status);
if (U_FAILURE(status)) {
printf("Error setting date.");
return;
}
// Display the date.
printf("Gregorian Calendar:\t%d/%d/%d\n",
ucal_get(cal1, UCAL_MONTH, &status) + 1,
ucal_get(cal1, UCAL_DATE, &status),
ucal_get(cal1, UCAL_YEAR, &status));
if (U_FAILURE(status)) {
printf("Error getting Gregorian date.");
return;
}
// Create a Japanese Calendar.
cal2 = ucal_open(NULL, -1, "ja_JP@calendar=japanese", UCAL_TRADITIONAL, &status);
if (U_FAILURE(status)) {
printf("Couldn't create Japanese Calendar.");
return;
}
// Set the date.
time = ucal_getMillis(cal1, &status);
if (U_FAILURE(status)) {
printf("Error getting time.\n");
return;
}
ucal_setMillis(cal2, time, &status);
if (U_FAILURE(status)) {
printf("Error setting time.\n");
return;
}
// Display the date.
printf("Japanese Calendar:\t%d/%d/%d\n",
ucal_get(cal2, UCAL_MONTH, &status) + 1,
ucal_get(cal2, UCAL_DATE, &status),
ucal_get(cal2, UCAL_YEAR, &status));
if (U_FAILURE(status)) {
printf("Error getting Japanese date.");
return;
}
ucal_close(cal1);
ucal_close(cal2);
```
**Java**
```Java
Calendar cal1, cal2;
// Create a new Gregorian Calendar.
cal1 = new GregorianCalendar();
// Set the Gregorian Calendar to a specific date for testing.
cal1.set(1980, Calendar.SEPTEMBER, 3);
// Display the date.
System.out.println("Gregorian Calendar:\t" + (cal1.get(Calendar.MONTH) + 1) +
"/" +
cal1.get(Calendar.DATE) + "/" +
cal1.get(Calendar.YEAR));
// Create a Japanese Calendar.
cal2 = new JapaneseCalendar();
// Set the date and timezone
cal2.setTime(cal1.getTime());
cal2.setTimeZone(cal1.getTimeZone());
// Display the date.
System.out.println("Japanese Calendar:\t" + (cal2.get(Calendar.MONTH) + 1) +
"/" +
cal2.get(Calendar.DATE) + "/" +
cal2.get(Calendar.YEAR));
```

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Calendar Classes
## Overview
ICU has two main calendar classes used for parsing and formatting Calendar
information correctly:
1. Calendar
An abstract base class that defines the calendar API. This API supports
UDate to fields conversion and field arithmetic.
2. GregorianCalendar
A concrete subclass of Calendar that implements the standard calendar used
today internationally.
In addition to these, ICU has other Calendar subclasses to support
non-Gregorian calendars including:
* Japanese
* Buddhist
* Chinese
* Persian
* Indian
* Islamic
* Hebrew
* Coptic
* Ethiopic
The Calendar class is designed to support additional calendar systems in the
future.
> :point_right: **Note**: *Calendar classes are related to UDate, the TimeZone classes, and the DateFormat
classes.*
### Calendar locale and keyword handling
When a calendar object is created, via either Calendar::createInstance() or
ucal_open(), or indirectly within a date formatter, ICU looks up the default
calendar type for that locale. At present, all locales default to a Gregorian
calendar, except for the compatibility locales th_TH_TRADITIONAL and
ja_JP_TRADITIONAL. If the "calendar" keyword is supplied, this value overrides
the default for that locale.
For instance, Calendar::createInstance("fr_FR", status) will create a Gregorian
calendar, but Calendar::createInstance("fr_FR@calendar=buddhist", status) will
create a Buddhist calendar.
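The keyword behavior can be sketched with the JDK's `java.util.Calendar`, which honors the BCP 47 `ca` locale extension (the modern equivalent of the `@calendar=...` keyword). This is a JDK parallel for illustration, not ICU itself:

```Java
import java.util.Calendar;
import java.util.Locale;

public class CalendarKeyword {
    public static void main(String[] args) {
        // A plain locale gets the default calendar type: Gregorian ("gregory")
        Calendar gregorian = Calendar.getInstance(Locale.forLanguageTag("fr-FR"));
        System.out.println(gregorian.getCalendarType()); // gregory

        // The "ca" extension overrides the locale default
        Calendar buddhist =
            Calendar.getInstance(Locale.forLanguageTag("fr-FR-u-ca-buddhist"));
        System.out.println(buddhist.getCalendarType()); // buddhist
    }
}
```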
It is an error to use an invalid calendar type; doing so produces a missing
resource error.
> :point_right: **Note**: *As of ICU 2.8, the above description applies to ICU4C only. ICU4J will have
this behavior in release 3.0.*
## Usage
This section discusses how to use the Calendar class and the GregorianCalendar
subclass.
### Calendar
Calendar is an abstract base class. It defines common protocols for a hierarchy
of classes. Concrete subclasses of Calendar, for example the GregorianCalendar
class, define specific operations that correspond to a real-world calendar
system. Calendar objects (instantiations of concrete subclasses of Calendar),
embody state that represents a specific context. They correspond to a real-world
locale. They also contain state that specifies a moment in time.
The API defined by Calendar encompasses multiple functions:
1. Representation of a specific time as a UDate
2. Representation of a specific time as a set of integer fields, such as YEAR,
MONTH, HOUR, etc.
3. Conversion from UDate to fields
4. Conversion from fields to UDate
5. Field arithmetic, including adding, rolling, and field difference
6. Context management
7. Factory methods
8. Miscellaneous: field meta-information, time comparison
#### Representation and Conversion
The basic function of the Calendar class is to convert between a UDate value and
a set of integer fields. A UDate value is stored as UTC time in milliseconds,
which means it is calendar and time zone independent. UDate is the most compact
and portable way to store and transmit a date and time. Integer field values, on
the other hand, depend on the calendar system (that is, the concrete subclass of
Calendar) and the calendar object's context state.
> :point_right: **Note**: *Integer field values are needed when implementing a human interface that must
display or input a date and/or time.*
At any given time, a calendar object uses either its internal UDate or its
integer fields (depending on which has been set most recently via setTime() or
set()) to represent a specific date and time. Whatever the current internal
representation, when the caller requests a UDate or an integer field it is
computed if necessary; the caller never needs to trigger the conversion
explicitly. To perform a conversion, the caller simply sets either the UDate or
the integer fields, and then retrieves the desired data. This also applies in
situations where the caller has some integer fields and wants to obtain others.
#### Field Arithmetic
Arithmetic with UDate values is straightforward. Since the values are
millisecond scalar values, direct addition and subtraction is all that is
required. Arithmetic with integer fields is more complicated. For example, what
is the date June 4, 1999 plus 300 days? Calendar defines three basic methods (in
several variants) that perform field arithmetic: add(), roll(), and
fieldDifference().
The add() method adds positive or negative values to a specified field. For
example, calling add(Calendar::MONTH, 2) on a GregorianCalendar object set to
March 15, 1999 sets the calendar to May 15, 1999. The roll() method is similar,
but does not modify fields that are larger. For example, calling
roll(Calendar::HOUR, n) changes the hour that a calendar is set to without
changing the day. Calling roll(Calendar::MONTH, n) changes the month without
changing the year.
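The contrast between add() and roll() can be sketched with the JDK's `java.util.Calendar`, whose add() and roll() carry the same semantics (a JDK parallel, not ICU itself):

```Java
import java.util.Calendar;
import java.util.GregorianCalendar;

public class AddVsRoll {
    public static void main(String[] args) {
        // add() carries into larger fields: Dec 15, 1999 + 1 month -> Jan 15, 2000
        Calendar c1 = new GregorianCalendar(1999, Calendar.DECEMBER, 15);
        c1.add(Calendar.MONTH, 1);
        System.out.println(c1.get(Calendar.YEAR) + "-" + (c1.get(Calendar.MONTH) + 1));

        // roll() pins larger fields: Dec 15, 1999 rolled 1 month -> Jan 15, 1999
        Calendar c2 = new GregorianCalendar(1999, Calendar.DECEMBER, 15);
        c2.roll(Calendar.MONTH, 1);
        System.out.println(c2.get(Calendar.YEAR) + "-" + (c2.get(Calendar.MONTH) + 1));
    }
}
```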
The fieldDifference() method is the inverse of the add() method. It computes the
difference between a calendar's currently set time and a specified UDate in
terms of a specified field. Repeated calls to fieldDifference() compute the
difference between two UDates in terms of whatever fields the caller specifies
(for example, years, months, days, and hours). If the add() method is called
with the results of fieldDifference(when, n) , then the calendar is moved toward
field by field.
This is demonstrated in the following example:
```Java
Calendar cal = Calendar.getInstance();
cal.set(2000, Calendar.MARCH, 15);
Date date = new Date(2000-1900, Calendar.JULY, 4);
int yearDiff = cal.fieldDifference(date, Calendar.YEAR); // yearDiff == 0
int monthDiff = cal.fieldDifference(date, Calendar.MONTH); // monthDiff == 3
// At this point cal has been advanced 3 months to June 15, 2000.
int dayDiff = cal.fieldDifference(date, Calendar.DAY_OF_MONTH); // dayDiff == 19
// At this point cal has been advanced 19 days to July 4, 2000.
```
#### Context Management
A calendar object performs its computations within a specific context. The
context affects the results of conversions and arithmetic computations. When a
calendar object is created, it establishes its context using either default
values or values specified by the caller:
1. Locale-specific week data, including the first day of the week and the
minimal days in the first week. Initially, this is retrieved from the locale
resource data for the specified locale, or if none is specified, for the
default locale.
2. A TimeZone object. Initially, this is set to the specified zone object, or
if none is specified, the default TimeZone.
The context of a calendar object can be queried after the calendar is created
using calls such as getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and
getTimeZone(). The context can be changed using calls such as
setMinimalDaysInFirstWeek(), setFirstDayOfWeek(), and setTimeZone().
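A sketch of querying and changing this context, using the JDK's `java.util.Calendar`, which exposes the same accessors as ICU's Calendar:

```Java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

public class CalendarContext {
    public static void main(String[] args) {
        Calendar cal = Calendar.getInstance(Locale.US);
        // Week data comes from the locale: US weeks start on Sunday
        System.out.println(cal.getFirstDayOfWeek() == Calendar.SUNDAY);
        System.out.println(cal.getMinimalDaysInFirstWeek());
        // The time zone defaults to the host zone but can be replaced
        cal.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        System.out.println(cal.getTimeZone().getID());
    }
}
```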
#### Factory Methods
Like other format classes, the best way to create a calendar object is by using
one of the factory methods. These are static methods on the Calendar class that
create and return an instance of a concrete subclass. Factory methods should be
used to enable the code to obtain the correct calendar for a locale without
having to know specific details. The factory methods on Calendar are named
createInstance().
***MONTH field***
> :point_right: **Note**: *Calendar numbers months starting from zero, so calling cal.set(1998, 3, 15)
sets cal to April 15, 1998, not March 15, 1998. This follows the Java
convention. To avoid mistakes, use the constants defined in the Calendar class
for the months and days of the week. For example, cal.set(1998, Calendar::APRIL,
15).*
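A small sketch of the zero-based month convention, using the JDK's `java.util.Calendar`, which numbers months the same way:

```Java
import java.util.Calendar;
import java.util.GregorianCalendar;

public class ZeroBasedMonths {
    public static void main(String[] args) {
        // Month 3 is April, because months are numbered from zero
        Calendar cal = new GregorianCalendar(1998, 3, 15);
        System.out.println(cal.get(Calendar.MONTH) == Calendar.APRIL); // true
        // Using the named constant avoids the off-by-one trap
        cal.set(1998, Calendar.APRIL, 15);
        System.out.println(cal.get(Calendar.MONTH) + 1); // 4, i.e. April
    }
}
```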
#### Ambiguous Wall Clock Time Resolution
When the time offset from UTC changes, it produces an ambiguous time slot
around the transition. For example, many US locations observe daylight saving
time. On the date of transition to daylight saving time in the US, wall clock
time jumps from 1:00 AM (standard) to 2:00 AM (daylight). Therefore, wall clock
times from 1:00 AM to 1:59 AM do not exist on that date. When the input wall
time falls into this missing time slot, the ICU Calendar resolves the time
using the UTC offset before the transition by default. In this example, 1:30 AM
is interpreted as 1:30 AM standard time (which does not exist), so the final
result is 2:30 AM daylight time.
On the date of transition back to standard time, wall clock time is moved back
one hour at 2:00 AM. So wall clock times from 1:00 AM to 1:59 AM occur twice. In
this case, the ICU Calendar resolves the time using the UTC offset after the
transition by default. For example, 1:30 AM on the date is resolved as 1:30 AM
standard time.
Ambiguous wall clock time resolution behaviors can be customized by Calendar
APIs setRepeatedWallTimeOption() and setSkippedWallTimeOption(). These APIs are
available in ICU 49 or later versions.
### Gregorian Calendar
The GregorianCalendar class implements two calendar systems, the Gregorian
calendar and the Julian calendar. These calendar systems are closely related,
differing mainly in their definition of the leap year. The Julian calendar has
leap years every four years; the Gregorian calendar refines this by excluding
century years that are not divisible by 400. GregorianCalendar defines two eras,
BC (B.C.E.) and AD (C.E.).
Historically, most western countries used the Julian calendar until the 16th to
20th century, depending on the country. They then switched to the Gregorian
calendar. The GregorianCalendar class mirrors this behavior by defining a
cut-over date. Before this date, the Julian calendar algorithms are used. After
it, the Gregorian calendar algorithms are used. By default, the cut-over date is
set to October 15, 1582 C.E., which reflects the time when countries first began
adopting the Gregorian calendar. The GregorianCalendar class does not attempt
historical accuracy beyond this behavior, and does not vary its cut-over date by
locale. However, users can modify the cut-over date by using the
setGregorianChange() method.
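The effect of moving the cut-over date can be sketched with the JDK's `java.util.GregorianCalendar`, which offers the same setGregorianChange() method (a JDK parallel, not ICU itself):

```Java
import java.util.Date;
import java.util.GregorianCalendar;

public class CutoverSketch {
    public static void main(String[] args) {
        GregorianCalendar gc = new GregorianCalendar();
        // With the default 1582 cut-over, 1500 falls under Julian rules: a leap year
        System.out.println(gc.isLeapYear(1500)); // true

        // Push the cut-over to the beginning of time: proleptic Gregorian rules,
        // under which 1500 (a century year not divisible by 400) is not a leap year
        gc.setGregorianChange(new Date(Long.MIN_VALUE));
        System.out.println(gc.isLeapYear(1500)); // false
    }
}
```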
Correctly written code instantiates calendar objects using the Calendar factory
methods, and therefore holds a Calendar* pointer. Such code cannot directly
access the GregorianCalendar-specific methods that are not present in Calendar.
The correct way to handle this is to perform a dynamic cast, after testing the
type of the object using getDynamicClassID(). For example:
```C++
void setCutover(Calendar *cal, UDate myCutover) {
    UErrorCode status = U_ZERO_ERROR;
    if (cal->getDynamicClassID() == GregorianCalendar::getStaticClassID()) {
        GregorianCalendar *gc = (GregorianCalendar*)cal;
        gc->setGregorianChange(myCutover, status);
    }
}
```
> :point_right: **Note**: *This is a general technique that should be used throughout ICU in conjunction
with the factory methods.*
### Disambiguation
When computing a UDate from fields, some special circumstances can arise. There
might be insufficient information to compute the UDate (such as only year and
month but no day of the month), there might be inconsistent information (such as
"Tuesday, July 15, 1996", when July 15, 1996, is actually a Monday), or the
input time might be ambiguous because of a time zone transition.
1. **Insufficient Information**
ICU Calendar uses the default field values to specify missing fields. The
default for a field is the same as that of the start of the epoch (that is,
YEAR = 1970, MONTH = JANUARY, DAY_OF_MONTH = 1).
2. **Inconsistent Information**
If fields conflict, the calendar gives preference to fields set more
recently. For example, when determining the day, the calendar looks for one
of the following combinations of fields:
    * MONTH + DAY_OF_MONTH
    * MONTH + WEEK_OF_MONTH + DAY_OF_WEEK
    * MONTH + DAY_OF_WEEK_IN_MONTH + DAY_OF_WEEK
    * DAY_OF_YEAR
    * DAY_OF_WEEK + WEEK_OF_YEAR
For the time of day, the calendar looks for one of the following
combinations of fields:
    * HOUR_OF_DAY
    * AM_PM + HOUR
3. **Ambiguous Wall Clock Time**
When the time offset from UTC changes, it produces an ambiguous time slot
around the transition. For example, many US locations observe daylight
saving time. On the date of the transition to daylight saving time in the
US, wall clock time jumps from 1:00 AM (standard) to 2:00 AM (daylight).
Therefore, wall clock times from 1:00 AM to 1:59 AM do not exist on that
date. When the input wall time falls into this missing time slot, the ICU
Calendar resolves the time using the UTC offset before the transition by
default. In this example, 1:30 AM is interpreted as 1:30 AM standard time
(which does not exist), so the final result is 2:30 AM daylight time.
On the date of the transition back to standard time, wall clock time is
moved back one hour at 2:00 AM. So wall clock times from 1:00 AM to 1:59 AM
occur twice. In this case, the ICU Calendar resolves the time using the UTC
offset after the transition by default. For example, 1:30 AM on that date is
resolved as 1:30 AM standard time.
***Options for Ambiguous Time Resolution***
> :point_right: **Note**: *Ambiguous wall clock time resolution behaviors can be customized by the Calendar APIs setRepeatedWallTimeOption() and setSkippedWallTimeOption(). These methods are available in ICU 49 or later versions.*
***WEEK_OF_YEAR field***
> :point_right: **Note**: *Values calculated for the WEEK_OF_YEAR field range from 1 to 53. Week 1 for a year is the first week that contains at least getMinimalDaysInFirstWeek() days from that year. It depends on the values of getMinimalDaysInFirstWeek(), getFirstDayOfWeek(), and the day of the week of January 1. Weeks between week 1 of one year and week 1 of the following year are numbered sequentially from 2 to 52 or 53 (if needed).
For example, January 1, 1998 was a Thursday. If getFirstDayOfWeek() is MONDAY
and getMinimalDaysInFirstWeek() is 4 (these are the values reflecting ISO 8601
and many national standards), then week 1 of 1998 starts on December 29, 1997,
and ends on January 4, 1998. However, if getFirstDayOfWeek() is SUNDAY, then
week 1 of 1998 starts on January 4, 1998, and ends on January 10, 1998. The
first three days of 1998 are then part of week 53 of 1997.*
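The note above can be checked with a small sketch using the JDK's `java.util.GregorianCalendar`, which implements the same week-numbering rules (a JDK parallel, not ICU itself):

```Java
import java.util.Calendar;
import java.util.GregorianCalendar;

public class WeekOfYearSketch {
    public static void main(String[] args) {
        // January 1, 1998 was a Thursday.
        Calendar iso = new GregorianCalendar(1998, Calendar.JANUARY, 1);
        iso.setFirstDayOfWeek(Calendar.MONDAY);
        iso.setMinimalDaysInFirstWeek(4); // ISO 8601 week rules
        System.out.println(iso.get(Calendar.WEEK_OF_YEAR)); // 1: week 1 of 1998

        Calendar sun = new GregorianCalendar(1998, Calendar.JANUARY, 1);
        sun.setFirstDayOfWeek(Calendar.SUNDAY);
        sun.setMinimalDaysInFirstWeek(4);
        System.out.println(sun.get(Calendar.WEEK_OF_YEAR)); // 53: week 53 of 1997
    }
}
```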
## Programming Examples
Programming for calendar [examples in C++, C, and Java](examples.md) .

# Date/Time Services
## Overview of ICU System Time Zones
A time zone represents an offset applied to Greenwich Mean Time (GMT) to obtain
local time. The offset might vary throughout the year, if daylight savings time
(DST) is used, or might be the same all year long. Typically, regions closer to
the equator do not use DST. If DST is in use, then specific rules define the
point at which the offset changes and the amount by which it changes. Thus, a
time zone is described by the following information:
* An identifying string, or ID. This consists only of invariant characters
(see the file utypes.h). It typically has the format continent / city. The
city chosen is not the only city in which the zone applies, but rather a
representative city for the region. Some IDs consist of three or four
uppercase letters; these are legacy zone names that are aliases to standard
zone names.
* An offset from GMT, either positive or negative. Offsets range from
approximately minus half a day to plus half a day.
If DST is observed, then three additional pieces of information are needed:
1. The precise date and time during the year when DST begins. In the northern
hemisphere this is in the first half of the year; in the southern hemisphere
it is in the second half of the year.
2. The precise date and time during the year when DST ends. In the southern
hemisphere this is in the first half of the year; in the northern hemisphere
it is in the second half of the year.
3. The amount by which the GMT offset changes when DST is in effect. This is
almost always one hour.
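These pieces of information can be inspected with the JDK's `java.util.TimeZone`, which models them the same way (a JDK parallel, not the ICU classes documented here):

```Java
import java.util.TimeZone;

public class ZoneDescription {
    public static void main(String[] args) {
        TimeZone la = TimeZone.getTimeZone("America/Los_Angeles");
        // Offset from GMT for standard time, in milliseconds: -8 hours
        System.out.println(la.getRawOffset() / (60 * 60 * 1000)); // -8
        // Amount added while DST is in effect: almost always one hour
        System.out.println(la.getDSTSavings() / (60 * 60 * 1000)); // 1
        // Whether this zone observes DST at all
        System.out.println(la.useDaylightTime());
    }
}
```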
### System and User Time Zones
ICU supports local time zones through the classes TimeZone and SimpleTimeZone in
the C++ API. In the C API, time zones are designated by their ID strings.
Users can construct their own time zone objects by specifying the above
information to the C++ API. However, it is more typical for users to use a
pre-existing system time zone since these represent all current international
time zones in use. This document lists the system time zones, both in order of
GMT offset and in alphabetical order of ID.
Since this list changes one or more times a year, *this document only represents
a snapshot*. For the most current list of ICU system zones, use the method
TimeZone::getAvailableIDs().
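As a sketch of such a query, shown with the JDK's `java.util.TimeZone`, whose getAvailableIDs() parallels the ICU method:

```Java
import java.util.Arrays;
import java.util.TimeZone;

public class ZoneList {
    public static void main(String[] args) {
        // All zone IDs whose raw (standard) offset is GMT-8:00
        String[] ids = TimeZone.getAvailableIDs(-8 * 60 * 60 * 1000);
        System.out.println(Arrays.asList(ids).contains("America/Los_Angeles")); // true
        System.out.println(ids.length > 0);
    }
}
```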
*The zones are listed in binary sort order (that is, 'A' through 'Z' come before
'a' through 'z'). This is the same order in which the zones are stored
internally, and the same order in which they are returned by
TimeZone::getAvailableIDs(). The reason for this is that ICU locates zones using
a binary search, and the binary search relies on this sort order.*
*You might notice that zones such as Etc/GMT+1 appear to have the wrong sign for
their GMT offset. In fact, their sign is inverted because the Etc zones follow
the POSIX sign conventions. This is the way the original Olson data is set up,
and ICU reproduces the Olson data faithfully. See the Olson files for more
details.*
### References
The ICU system time zones are derived from the tz database (also known as the
“Olson” database) at [ftp://elsie.nci.nih.gov/pub](ftp://elsie.nci.nih.gov/pub)
. This is the data used across much of the industry, including by UNIX systems,
and is usually updated several times each year. ICU (since version 2.8) and base
Java (since Java 1.4) contain code and tz data supporting both current and
historic time zone usage.
## How ICU Represents Dates/Times
ICU represents dates and times using UDates. A UDate is a scalar value that
indicates a specific point in time, independent of calendar system and local
time zone. It is stored as the number of milliseconds from a reference point
known as the epoch. The epoch is midnight Universal Time Coordinated (UTC)
January 1, 1970 A.D. Negative UDate values indicate times before the epoch.
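A small sketch of the epoch convention, using the JDK's epoch-milliseconds representation, which matches UDate:

```Java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class EpochSketch {
    public static void main(String[] args) {
        // 0 milliseconds from the epoch is midnight UTC, January 1, 1970
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(fmt.format(new Date(0L)));
        // Negative values are times before the epoch: one day earlier
        System.out.println(fmt.format(new Date(-24L * 60 * 60 * 1000)));
    }
}
```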
*These classes have the same architecture as the Java classes.*
Most people only need to use the DateFormat classes for parsing and formatting
dates and times. However, for those who need to convert dates and times or
perform numeric calculations, the services described in this section can be very
useful.
To translate a UDate to a useful form, a calendar system and local time zone
must be specified. These are specified in the form of objects of the Calendar
and TimeZone classes. Once these two objects are specified, they can be used to
convert the UDate to and from its corresponding calendar fields. The different
fields are defined in the Calendar class and include the year, month, day, hour,
minute, second, and so on.
Specific Calendar objects correspond to calendar systems (such as Gregorian) and
conventions (such as the first day of the week) in use in different parts of the
world. To obtain a Calendar object for France, for example, call
Calendar::createInstance(Locale::getFrance(), status).
The TimeZone class defines the conversion between universal coordinated time
(UTC) and local time, according to real-world rules. Different TimeZone
objects correspond to different real-world time zones. For example, call
TimeZone::createTimeZone("America/Los_Angeles") to obtain an object that
implements the U.S. Pacific time zone, both Pacific Standard Time (PST) and
Pacific Daylight Time (PDT).
As previously mentioned, the Calendar and TimeZone objects must be specified
correctly together. One way of doing so is to create each independently, then
use the Calendar::setTimeZone() method to associate the time zone with the
calendar. Another is to use the Calendar::createInstance() method that takes a
TimeZone object. For example, call
Calendar::createInstance(TimeZone::createTimeZone("America/Los_Angeles"),
Locale::getUS(), status) to obtain a Calendar appropriate for use in the U.S.
Pacific time zone.
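Both approaches can be sketched with the JDK's `java.util.Calendar`, whose getInstance() overloads parallel Calendar::createInstance() (a JDK parallel, not ICU itself):

```Java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

public class PacificCalendar {
    public static void main(String[] args) {
        // Supply the zone and locale together at creation time...
        Calendar cal = Calendar.getInstance(
                TimeZone.getTimeZone("America/Los_Angeles"), Locale.US);
        System.out.println(cal.getTimeZone().getID());

        // ...or attach the zone to an existing calendar afterwards
        Calendar cal2 = Calendar.getInstance(Locale.US);
        cal2.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
        System.out.println(cal2.getTimeZone().getID());
    }
}
```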
ICU has four classes pertaining to calendars and timezones:
* [Calendar](calendar/index.md)
Calendar is an abstract base class that represents a calendar system.
Calendar objects map UDate values to and from the individual fields used in
a particular calendar system. Calendar also performs field computations such
as advancing a date by two months.
* [Gregorian Calendar](calendar/index.md) (§)
GregorianCalendar is a concrete subclass of Calendar that implements the
rules of the Julian calendar and the Gregorian calendar, which is the common
calendar in use internationally today.
* [TimeZone](timezone/index.md)
TimeZone is an abstract base class that represents a time zone. TimeZone
objects map between universal coordinated time (UTC) and local time.
* [SimpleTimeZone](timezone/index.md) (§)
SimpleTimeZone is a concrete subclass of TimeZone that implements standard
time and daylight savings time according to real-world rules. Individual
SimpleTimeZone objects correspond to real-world time zones.

# Date and Time Zone Examples
## C++ TimeZone example code
This example code illustrates some time zone operations.
```C++
UErrorCode success = U_ZERO_ERROR;
UnicodeString dateReturned, curTZNameEn, curTZNameFr;
UDate curDate;
int32_t stdOffset,dstOffset;
// Create a Time Zone with America/Los_Angeles
TimeZone *tzWest = TimeZone::createTimeZone("America/Los_Angeles");
// Print out the Time Zone Name, GMT offset etc.
curTZNameEn = tzWest->getDisplayName(Locale::getEnglish(),curTZNameEn);
u_printf("%s\n","Current Time Zone Name in English:");
u_printf("%S\n", curTZNameEn.getTerminatedBuffer());
curTZNameFr = tzWest->getDisplayName(Locale::getCanadaFrench(),curTZNameFr);
u_printf("%s\n","Current Time Zone Name in French:");
u_printf("%S\n", curTZNameFr.getTerminatedBuffer());
// Create a Calendar to get current date
Calendar* calendar = Calendar::createInstance(success);
curDate = calendar->getNow();
// Print out the Current Date/Time in the given time zone
DateFormat *dt = DateFormat::createDateInstance();
dateReturned = dt->format(curDate,dateReturned,success);
u_printf("%s\n", "Current Time:");
u_printf("%S\n", dateReturned.getTerminatedBuffer());
// Use getOffset to get the stdOffset and dstOffset for the given time
tzWest->getOffset(curDate,true,stdOffset,dstOffset,success);
u_printf("%s\n%d\n","Current Time Zone STD offset:",stdOffset/(1000*60*60));
u_printf("%s\n%d\n","Current Time Zone DST offset:",dstOffset/(1000*60*60));
u_printf("%s\n", "Current date/time is in daylight savings time?");
u_printf("%s\n", (calendar->inDaylightTime(success))?"Yes":"No");
// Use createTimeZoneIDEnumeration to get the specific Time Zone IDs
// in United States with -5 hour standard offset from GMT
stdOffset = (-5)*U_MILLIS_PER_HOUR; // U_MILLIS_PER_HOUR = 60*60*1000;
StringEnumeration *ids = TimeZone::createTimeZoneIDEnumeration(UCAL_ZONE_TYPE_CANONICAL_LOCATION,"US",&stdOffset,success);
for (int i=0; i<ids->count(success);i++) {
u_printf("%s\n",ids->next(NULL,success));
}
// Use Calendar to get the hour of the day for different time zones
int32_t hour1,hour2;
TimeZone *tzEast = TimeZone::createTimeZone("America/New_York");
Calendar * cal1 = Calendar::createInstance(tzWest,success);
Calendar * cal2 = Calendar::createInstance(tzEast,success);
hour1 = cal1->get(UCAL_HOUR_OF_DAY,success);
hour2 = cal2->get(UCAL_HOUR_OF_DAY,success);
u_printf("%s\n%d\n","Current hour of the day in North American West: ", hour1);
u_printf("%s\n%d\n","Current hour of the day in North American East: ", hour2);
delete cal1;
delete cal2;
delete ids;
delete calendar;
delete dt;
```

# ICU TimeZone Classes
## Overview
A time zone is a system that is used for relating local times in different
geographical areas to one another. For example, in the United States, Pacific
Time is three hours earlier than Eastern Time; when it's 6 P.M. in San
Francisco, it's 9 P.M. in Brooklyn. To make things simple, instead of relating
time zones to one another, all time zones are related to a common reference
point.
For historical reasons, the reference point is Greenwich, England. Local time in
Greenwich is referred to as Greenwich Mean Time, or GMT. (This is similar, but
not precisely identical, to Universal Coordinated Time, or UTC. We use the two
terms interchangeably in ICU since ICU does not concern itself with either leap
seconds or historical behavior.) Using this system, Pacific Time is expressed as
GMT-8:00, or GMT-7:00 in the summer. The offset -8:00 indicates that Pacific
Time is obtained from GMT by adding -8:00, that is, by subtracting 8 hours.
The offset differs in the summer because of daylight savings time, or DST. At
this point it is useful to define three different flavors of local time:
* **Standard Time**:
Standard Time is local time without a daylight savings time offset. For
example, in California, standard time is GMT-8:00; that is, 8 hours before
GMT.
* **Daylight Savings Time**:
Daylight savings time is local time with a daylight savings time offset.
This offset is typically one hour, but is sometimes less. In California,
daylight savings time is GMT-7:00. Daylight savings time is observed in most
non-equatorial areas.
* **Wall Time**:
Wall time is what a local clock on the wall reads. In areas that observe
daylight savings time for part of the year, wall time is either standard
time or daylight savings time, depending on the date. In areas that do not
observe daylight savings time, wall time is equivalent to standard time.
## Time Zones in ICU
ICU supports time zones through two classes:
* **TimeZone**:
`TimeZone` is an abstract base class that defines the time zone API. This API
supports conversion between GMT and local time.
* **SimpleTimeZone**:
`SimpleTimeZone` is a concrete subclass of TimeZone that implements the
standard time zones used today internationally.
Timezone classes are related to `UDate`, the `Calendar` classes, and the
`DateFormat` classes.
### Timezone Class in ICU
`TimeZone` is an abstract base class. It defines common protocol for a hierarchy
of classes. This protocol includes:
* A programmatic ID, for example, "America/Los_Angeles". This ID is used to
call up a specific real-world time zone. It corresponds to the IDs defined
in the [IANA Time Zone database](https://www.iana.org/time-zones) used by UNIX
and other systems, and has the format continent/city or ocean/city.
* A raw offset. This is the difference, in milliseconds, between a time zone's
standard time and GMT. Positive raw offsets are east of Greenwich.
* Factory methods and methods for handling the default time zone.
* Display name methods.
* An API to compute the difference between local wall time and GMT.
#### Factory Methods and the Default Timezone
The TimeZone factory method `createTimeZone()` creates and returns a `TimeZone`
object given a programmatic ID. The user does not know what the class of the
returned object is, other than that it is a subclass of `TimeZone`.
The `createAvailableIDs()` methods return lists of the programmatic IDs of all
zones known to the system. These IDs may then be passed to `createTimeZone()` to
create the actual time zone objects. ICU maintains a comprehensive list of
current international time zones, as derived from the Olson data.
`TimeZone` maintains a static time zone object known as the *default time zone*.
This is the time zone that is used implicitly when the user does not specify
one. ICU attempts to match this to the host OS time zone. The user may obtain a
clone of the default time zone by calling `createDefault()` and may change the
default time zone by calling `setDefault()` or `adoptDefault()`.
#### Display Name
When displaying the name of a time zone to the user, use the display name, not
the programmatic ID. The display name is returned by the `getDisplayName()`
method. A time zone may have three display names:
* Generic name, such as "Pacific Time".
* Standard name, such as "Pacific Standard Time".
* Daylight savings name, such as "Pacific Daylight Time".
Furthermore, each of these names may be LONG or SHORT. The SHORT form is
typically an abbreviation, e.g., "PST", "PDT".
In addition to being available directly from the `TimeZone` API, the display name
is used by the date format classes to format and parse time zones.
#### getOffset() API
`TimeZone` defines the API `getOffset()` by which the caller can determine the
difference between local time and GMT. This is a pure virtual API, so it is
implemented in the concrete subclasses of `TimeZone`.
## Updating the Time Zone Data
Time zone data changes often in response to governments around the world
changing their local rules and the areas where they apply. ICU derives its tz
data from the [IANA Time Zone Database](http://www.iana.org/time-zones).
The ICU project publishes updated timezone resource data in response to IANA
updates, and these can be used to patch existing ICU installations. Several
update strategies are possible, depending on the ICU version and configuration.
* ICU4J: Use the time zone update utility.
* ICU4C 54 and newer: Drop in the binary update files.
* ICU4C 36 and newer: the best update strategy will depend on how ICU data
loading is configured for the specific ICU installation.
* Data is loaded from a .dat package file: replace the time zone resources
in the .dat file using the icupkg tool.
* Data is loaded from a .dll or .so shared library: obtain the updated
sources for the tz resources and rebuild the data library.
* Data is loaded from individual files: drop in the updated binary .res
files.
The [ICU Data](../../icudata.md) section of this user guide gives more
information on how ICU loads resources.
The ICU resource files required for time zone data updates are posted at
<https://github.com/unicode-org/icu-data/tree/master/tzdata/icunew>. The
required resource files for ICU version 44 and newer are
* zoneinfo64.res
* windowsZones.res
* timezoneTypes.res
* metaZones.res
### ICU4C TZ update of a .dat Package File
For ICU configurations that load data from a .dat package file, replace the time
zone resources in that file.
1. Download the new .res files from
`https://github.com/unicode-org/icu-data/tree/master/tzdata/icunew/<IANA tz version>/44/<platform directory>`.
* `<IANA tz version>` is a combination of year and letter, such as "2019c".
* *"44"* is the directory for updates to ICU version 4.4 and newer.
* `<platform directory>` is "le" for little endian processors, including
all Intel processors.
* `<platform directory>` is "be" for big endian processors, including IBM
Power and Sparc.
* `<platform directory>` is "ee" for IBM mainframes using EBCDIC character
sets.
2. Check that the tool "icupkg" is available. If not already on your system,
you can get it by [downloading](https://github.com/unicode-org/icu/releases)
and building ICU, following the instructions in the ReadMe file included in
the download. Alternatively, on many Linux systems, "apt-get install
icu-devtools" will install the tool.
3. Locate the .dat file to be updated, and do the update. The commands below
are for a .dat file named icudt55l.dat.
```Shell
icupkg -a zoneinfo64.res icudt55l.dat
icupkg -a windowsZones.res icudt55l.dat
icupkg -a timezoneTypes.res icudt55l.dat
icupkg -a metaZones.res icudt55l.dat
```
In ICU versions older than 4.4 some of the time zone resources have slightly
different names. The update procedure is the same, but substitute the names
found in the desired download directory - 42, 40, 38 or 36.
### ICU4C TZ Update with Drop-in .res files (ICU 54 and newer)
With this approach, the four individual .res files are dropped in any convenient
location in the file system, and ICU is given an absolute path to the directory
containing them. For the time zone resources only, ICU will check this directory
first when loading data. This approach will work even when all other ICU data
loading is from a shared library or .dat file.
There are two ways to specify the directory:
* At ICU build time, by defining the C pre-processor variable
`U_TIMEZONE_FILES_DIR` to the run time path to the directory containing the
.res files.
* At run time, by setting the environment variable `ICU_TIMEZONE_FILES_DIR` to
the absolute path of the directory containing the .res files.
If both are defined, the environment variable `ICU_TIMEZONE_FILES_DIR` takes
precedence. If either is defined, the time zone directory will be checked first,
meaning that time zone resource files placed there will override time zone
resources that may exist in other ICU data locations.
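For example, the run-time setup might look like this (the directory path is hypothetical; any absolute path works):

```Shell
# Hypothetical directory; any absolute path works.
mkdir -p /tmp/icu-tzdata

# Copy the downloaded resource files into it, e.g.:
# cp zoneinfo64.res windowsZones.res timezoneTypes.res metaZones.res /tmp/icu-tzdata

# Point ICU at the directory; it is checked first for time zone data.
export ICU_TIMEZONE_FILES_DIR=/tmp/icu-tzdata
```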
To do the update, download the .res files appropriate for the platform, as
described for the .dat file update above, and copy them into the time zone res
file directory.
### ICU4C TZ update when ICU is configured for individual files
If the ICU-using application sets an ICU data path (or can be changed to set
one), then the time zone .res file can be placed there. Download the files as
described above and copy them to the specified directory. See the
[ICU Data](../../icudata.md) page of the user guide for more information about
the ICU data path.
### ICU4C TZ update when ICU data is built into a shared library
1. Set up the environment necessary to rebuild your specific configuration of
ICU.
2. Download the .txt file sources for the updated resources from
`https://github.com/unicode-org/icu-data/tree/master/tzdata/icunew/<IANA tz version>/44`
3. Copy the downloaded .txt files into the ICU sources for your installation,
in the subdirectory source/data/misc/
4. Rebuild ICU.
5. Copy the freshly built ICU data shared library to the desired destination.
> :point_right: **Note**: The standard ICU download package contains pre-built
ICU data. To rebuild ICU data from .txt files, you will need to replace the
contents of `icu4c/source/data` with the contents of ICU4C data.zip. See
[ICU Data Build Tool](../../icu_data/buildtool.md) for more details.
There are too many possible platform variations to be more specific about how to
rebuild ICU4C in these instructions. See the ReadMe file included with the ICU
sources for general information on building ICU.
### Update the time zone data for ICU4J
The [ICU4J Time Zone Update
Utility](http://site.icu-project.org/download/icutzu) automates the process of
updating ICU4J jar files with the latest time zone data. Instructions for use
are [here](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/master/tzdata/tzu/readme.html).
The updater will work with ICU version 3.4.2 and newer.
## Sample Code
See the [Date and Time Zone Examples](examples.md) subpage.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Universal Time Scale
## Overview
There are quite a few different conventions for binary datetime, depending on
the platform or protocol. Some of these have severe drawbacks. For example,
people using Unix time (seconds since Jan 1, 1970, usually in a 32-bit integer)
think that they are safe until near the year 2038. But cases can and do arise
where arithmetic manipulations cause serious problems. Consider the computation
of the average of two datetimes, for example: if one calculates them with
`averageTime = (time1 + time2)/2`, there will be overflow even with dates
beginning in 2004. Moreover, even if these problems don't occur, there is the
issue of conversion back and forth between different systems.
Binary datetimes differ in a number of ways: the data type, the unit, and the
epoch (origin). We'll refer to these as time scales. For example: (Sorted by
epoch and unit, descending. In Java, `int64_t`=`long` and `int32_t`=`int`.)
| Source | Data Type | Epoch | Unit |
| ------------------------------------------ | -------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------- |
| MacOS X (CFDate/NSDate) | double (1.0=1s but fractional seconds are used as well; imprecise for 0.1s etc.) | 2001-Jan-01 | seconds (and fractions thereof) |
| Unix time_t | int32_t or int64_t (signed int32_t limited to 1970..2038) | 1970-Jan-01 | seconds |
| Java Date | int64_t | 1970-Jan-01 | milliseconds |
| Joda DateTime | int64_t | 1970-Jan-01 | milliseconds |
| ICU4C UDate | double (does not use fractional milliseconds) | 1970-Jan-01 | milliseconds |
| JavaScript Date | double (does not use fractional milliseconds; JavaScript Number stores a double) | 1970-Jan-01 | milliseconds |
| Unix struct timeval (as in gettimeofday) | struct: time_t (seconds); suseconds_t (microseconds) | 1970-Jan-01 | microseconds |
| Gnome g_get_real_time() | gint64 | 1970-Jan-01 | microseconds |
| Unix struct timespec (as in clock_gettime) | struct: time_t (seconds); long (nanoseconds) | 1970-Jan-01 | nanoseconds |
| MacOS (old) | uint32_t (1904..2040) | 1904-Jan-01 | seconds |
| Excel | ? | 1899-Dec-31 | days |
| DB2 | ? | 1899-Dec-31 | days |
| Windows FILETIME | int64_t | 1601-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
| .NET DateTime | uint62 (only 0001-9999; only 62 bits; also 2-bit field for UTC/local) | 0001-Jan-01 | ticks (100 nanoseconds; finest granularity in industry) |
| ICU Universal Time Scale | int64_t | 0001-Jan-01 | same as .NET but allows 29,000 BC to 29,000 AD |
All of the epochs start at 00:00 am (the earliest possible time on the day in
question), and are usually assumed to be UTC.
The ranges, in years, for different data types are given in the following table.
The range for integer types includes the entire range expressible with positive
and negative values of the data type. The range for double is the range that
would be allowed without losing precision to the corresponding unit.
| Units | 64-bit integer | Double | 32-bit integer |
| ---------------------- | ----------------------- | -------------- | -------------- |
| 1 second | 5.84542x10<sup>11</sup> | 285,420,920.94 | 136.10 |
| 1 millisecond | 584,542,046.09 | 285,420.92 | 0.14 |
| 1 microsecond | 584,542.05 | 285.42 | 0.00 |
| 100 nanoseconds (tick) | 58,454.20 | 28.54 | 0.00 |
| 1 nanosecond | 584.5420461 | 0.2854 | 0.00 |
ICU implements a universal time scale that is similar to the
[.NET framework's System.DateTime](https://docs.microsoft.com/dotnet/api/system.datetime?view=netframework-4.8).
The universal time scale is a 64-bit integer that holds ticks since midnight,
January 1<sup>st</sup>, 0001. Negative values are supported. This has enough
range to guarantee that calculations involving dates around the present are safe.
The universal time scale always measures time according to the proleptic
Gregorian calendar. That is, the Gregorian calendar's leap year rules are used
for all times, even before 1582 when it was introduced. (This is different from
the default ICU calendar which switches from the Julian to the Gregorian
calendar in 1582. See `GregorianCalendar::setGregorianChange()` and
`ucal_setGregorianChange()`.)
ICU provides conversion functions to and from all other major time scales,
allowing datetimes in any time scale to be converted to the universal time
scale, safely manipulated, and converted back to any other datetime time scale.
## Background
So how did we decide what to use for the universal time scale? Java time has
plenty of range, but cannot represent a .NET `System.DateTime` value without
severe loss of precision. ICU4C time addresses this by using a double that is
otherwise equivalent to the Java time. However, doubles have disadvantages.
While they degrade much more gracefully in arithmetic operations, they only
have 53 bits of accuracy, which means that they will lose precision when
converting back and forth to ticks. What would really be nice would be a long
double (80 bits -- 64-bit mantissa), but that is not supported on most
systems.
The Unix extended time uses a structure with two components: time in seconds and
a fractional field (microseconds). However, this is clumsy, slow, and prone to
error (you always have to keep track of overflow and underflow in the fractional
field). `BigDecimal` would allow for arbitrary precision and arbitrary range, but
we did not want to use this as the normal type, because it is slow and does not
have a fixed size.
Because of these issues, we concluded that the .NET `System.DateTime` is the best
timescale to use. However, we use the full range allowed by the data type,
allowing for datetimes back to 29,000 BC and up to 29,000 AD. (`System.DateTime`
uses only 62 bits and only supports dates from 0001 AD to 9999 AD.) This time
scale is very fine grained, does not lose precision, and covers a range that
will meet almost all requirements. It will not handle the range that Java times
do, but frankly, being able to handle dates before 29,000 BC or after 29,000 AD
is of very limited interest.
## Constants
ICU provides routines to convert from other timescales to the universal time
scale, to convert from the universal time scale to other timescales, and to get
information about a particular timescale. In all of these routines, the
timescales are referenced using an integer constant, according to the following
table:
| Source | ICU4C | ICU4J |
| ---------------------- | --------------------------- | ---------------------- |
| Java | UDTS_JAVA_TIME | JAVA_TIME |
| Unix | UDTS_UNIX_TIME | UNIX_TIME |
| ICU4C | UDTS_ICU4C_TIME | ICU4C_TIME |
| Windows FILETIME | UDTS_WINDOWS_FILE_TIME | WINDOWS_FILE_TIME |
| .NET DateTime | UDTS_DOTNET_DATE_TIME | DOTNET_DATE_TIME |
| Macintosh (old) | UDTS_MAC_OLD_TIME | MAC_OLD_TIME |
| Macintosh | UDTS_MAC_TIME | MAC_TIME |
| Excel | UDTS_EXCEL_TIME | EXCEL_TIME |
| DB2 | UDTS_DB2_TIME | DB2_TIME |
| Unix with microseconds | UDTS_UNIX_MICROSECONDS_TIME | UNIX_MICROSECONDS_TIME |
The routine that gets a particular piece of information about a timescale takes
an integer constant that identifies the particular piece of information,
according to the following table:
| Value | ICU4C | ICU4J |
| -------------------- | ----------------------- | ------------------ |
| Precision | UTSV_UNITS_VALUE | UNITS_VALUE |
| Epoch offset | UTSV_EPOCH_OFFSET_VALUE | EPOCH_OFFSET_VALUE |
| Minimum "from" value | UTSV_FROM_MIN_VALUE | FROM_MIN_VALUE |
| Maximum "from" value | UTSV_FROM_MAX_VALUE | FROM_MAX_VALUE |
| Minimum "to" value | UTSV_TO_MIN_VALUE | TO_MIN_VALUE |
| Maximum "to" value | UTSV_TO_MAX_VALUE | TO_MAX_VALUE |
Here is what the values mean:
* Precision -- the precision of the timescale, in ticks.
* Epoch offset -- the distance from the universal timescale's epoch to the timescale's epoch, in the timescale's precision.
* Minimum "from" value -- the minimum timescale value that can safely be converted to the universal timescale.
* Maximum "from" value -- the maximum timescale value that can safely be converted to the universal timescale.
* Minimum "to" value -- the minimum universal timescale value that can safely be converted to the timescale.
* Maximum "to" value -- the maximum universal timescale value that can safely be converted to the timescale.
## Converting
You can convert from other timescale values to the universal timescale using the
"from" methods. In ICU4C, you use `utmscale_fromInt64`:
```c
UErrorCode err = U_ZERO_ERROR;
int64_t unixTime = ...;
int64_t universalTime;
universalTime = utmscale_fromInt64(unixTime, UDTS_UNIX_TIME, &err);
```
In ICU4J, you use `UniversalTimeScale.from`:
```java
long javaTime = ...;
long universalTime;
universalTime = UniversalTimeScale.from(javaTime, UniversalTimeScale.JAVA_TIME);
```
You can convert values in the universal timescale to other timescales using the
"to" methods. In ICU4C, you use `utmscale_toInt64`:
```c
UErrorCode err = U_ZERO_ERROR;
int64_t universalTime = ...;
int64_t unixTime;
unixTime = utmscale_toInt64(universalTime, UDTS_UNIX_TIME, &err);
```
In ICU4J, you use `UniversalTimeScale.to`:
```java
long universalTime = ...;
long javaTime;
javaTime = UniversalTimeScale.to(universalTime, UniversalTimeScale.JAVA_TIME);
```
That's all there is to it!
If the conversion is out of range, the ICU4C routines
will set the error code to `U_ILLEGAL_ARGUMENT_ERROR`, and the ICU4J methods will
throw `IllegalArgumentException`. In ICU4J, you can avoid out of range conversions
by using the `BigDecimal` methods:
```java
long fileTime = ...;
double icu4cTime = ...;
BigDecimal utICU4C, utFile, utUnix, unixTime, macTime;
utFile = UniversalTimeScale.bigDecimalFrom(fileTime, UniversalTimeScale.WINDOWS_FILE_TIME);
utICU4C = UniversalTimeScale.bigDecimalFrom(icu4cTime, UniversalTimeScale.ICU4C_TIME);
unixTime = UniversalTimeScale.toBigDecimal(utFile, UniversalTimeScale.UNIX_TIME);
macTime = UniversalTimeScale.toBigDecimal(utICU4C, UniversalTimeScale.MAC_TIME);
utUnix = UniversalTimeScale.bigDecimalFrom(unixTime, UniversalTimeScale.UNIX_TIME);
```
> :point_right: **Note**: Because the Universal Time Scale has a finer resolution
> than some other time scales, time values that can be represented exactly in the
> Universal Time Scale will be rounded when converting to these time scales, and
> resolution will be lost. If you convert these values back to the Universal Time
> Scale, you will not get the same time value that you started with. If the time
> scale to which you are converting uses a double to represent the time value, you
> may lose precision even though the double supports a range that is larger than
> the range supported by the Universal Time Scale.
## Formatting and Parsing
Currently, ICU does not support direct formatting or parsing of Universal Time
Scale values. If you want to format a Universal Time Scale value, you will need
to convert it to an ICU time scale value first. Use `UDTS_ICU4C_TIME` with ICU4C,
and `UniversalTimeScale.JAVA_TIME` with ICU4J.
When you parse a datetime string, the result will be an ICU time scale value.
You can convert this value to a Universal Time Scale value using `UDTS_ICU4C_TIME`
with ICU4C, and `UniversalTimeScale.JAVA_TIME` with ICU4J.
See the previous section, *Converting*, for details of how to do the conversion.
## Getting Timescale Information
To get information about a particular timescale in ICU4C, use
`utmscale_getTimeScaleValue`:
```c
UErrorCode err = U_ZERO_ERROR;
int64_t unixEpochOffset = utmscale_getTimeScaleValue(
UDTS_UNIX_TIME,
UTSV_EPOCH_OFFSET_VALUE,
&err);
```
In ICU4J, use `UniversalTimeScale.getTimeScaleValue`:
```java
long javaEpochOffset = UniversalTimeScale.getTimeScaleValue(
UniversalTimeScale.JAVA_TIME,
UniversalTimeScale.EPOCH_OFFSET_VALUE);
```
If the integer constants for selecting the timescale or the timescale value are
out of range, the ICU4C routines will set the error code to
`U_ILLEGAL_ARGUMENT_ERROR`, and the ICU4J methods will throw
`IllegalArgumentException`.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU Architectural Design
This chapter discusses the ICU design structure, the ICU versioning support, and
the introduction of namespace in C++.
## Java and ICU Basic Design Structure
The JDK internationalization components and ICU components both share the same
common basic architectures with regard to the following:
1. locales
2. data-driven services
3. ICU threading models and the open and close model
4. cloning customization
5. error handling
6. extensibility
7. resource bundle inheritance model
There are design features in ICU4C that are not in the Java Development Kit
(JDK) due
to programming language restrictions. These features include the following:
### Locales
Locale IDs are composed of language, country, and variant information. The
language is designated by an
[ISO-639](http://lcweb.loc.gov/standards/iso639-2/englangn.html) code and the
country by an
[ISO-3166](http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html)
code. For example, Italian, Italy, and Euro are designated as: `it_IT_EURO`.
### Data-driven Services
Data-driven services often use resource bundles for locale data. These services
map a key to data. The resources are designed not only to manage system locale
information but also to manage application-specific or general services data.
ICU supports string, numeric, and binary data types and can be structured into
nested arrays and tables.
This results in the following:
1. Data used by the services can be built at compile time or run time.
2. For efficient loading, system data is pre-compiled to .dll files or files
that can be mapped into memory.
3. Data for services can be added and modified without source code changes.
### ICU Threading Model and Open and Close Model
The "open and close" model supports multi-threading. It enables ICU users to use
the same kind of service for different locales, either in the same thread or in
different threads.
For example, a thread can open many collators for different languages, and
different threads can use different collators for the same locale
simultaneously. Constant data can be shared so that only the current state is
allocated for each editor.
The ICU threading model is designed to avoid contention for resources, and
enable you to use the services for multiple locales simultaneously within the
same thread. The ICU threading model, like the rest of the ICU architecture, is
the same model used for the international services in Java™.
When you use a service such as collation, the client opens the service using an
ID, typically a locale. This service allocates a small chunk of memory used for
the state of the service, with pointers to shared, read-only data in support of
that service. (In Java, you call `getInstance()` to create an object; in C++,
`createInstance()`. ICU uses the open and close metaphor in C because it is more
familiar to C programmers.)
If no locale is supplied when a service is opened, ICU uses the default locale.
Once a service is open, changing the default locale has no effect. Thus, there
are no thread synchronization issues between the default locale and open
services.
When you open a second service for the same locale, another small chunk of
memory is used for the state of the service, with pointers to the same shared,
read-only data. Thus, the majority of the memory usage is shared. When any
service is closed, then the chunk of memory is deallocated. Other connections
that point to the same shared data stay valid.
Any number of services, for the same locale or different locales, can be open
within the same thread or in different threads.
#### Thread-safe const APIs
In recent ICU releases, we have worked to make any service object *thread-safe*
(usable concurrently) *as long as all of the threads are using only const APIs*:
APIs that are declared const in C++, take a const this-like service pointer in
C, or are "logically const" in Java. This is an enhancement over the original
Java/ICU threading model. (Originally, concurrent use of even only const APIs
was not thread-safe.)
However, you cannot use a reference to an open service object in two threads at
the same time *if either of them calls any non-const API*. An individual open
service object is not thread-safe for concurrent "writes". Rather, for non-const
use, you must use the clone function to create a copy of the service you want
and then pass this copy to the second thread. This procedure allows you to use
the same service in different threads, but avoids any thread synchronization or
deadlock problems.
#### Freezable
Some classes also implement the `Freezable` interface (or similar pattern in
C++), for example `UnicodeSet` or `Collator`: An object that typically starts
out mutable can be set up and then "frozen", which makes it immutable and thus
usable concurrently because all non-const APIs are disabled. A frozen object can
never be "thawed". For example, a `Collator` can be created, various attributes
set, then frozen and then used from many threads for comparing strings and
getting sort keys.
#### Clone vs. open
Clone operations are designed to be much faster than reopening the service with
initial parameters and copying the source's state. (With objects in C++ and
Java, the clone function is also much safer than trying to recreate a service,
since you get the proper subclass.) Once a service is cloned, changes will not
affect the original source service, or vice-versa.
Thus, the normal mode of operation is to:
1. Open a service with a given locale.
2. Use the service as long as needed. However, do not keep opening and closing
a service within a tight loop.
3. Clone a service if it needs to be used in parallel in another thread.
4. Close any clones that you open as well as any instances of the services that
are owned.
> :point_right: **Note**: These service instances may be closed in any sequence.
The preceding steps are given as an example.
#### Cloning Customization
Typically, the services supplied with ICU cover the vast majority of usages.
However, there are circumstances where the service needs to be customized for a
new locale. ICU (and Java) enable you to create customized services. For
example, you can create a `RuleBasedCollator` by merging the rules for French and
Arabic to get a custom French-Arabic collation sequence. By merging these rules,
the pointer does not point to a read-only table that is shared between threads.
Instead, the pointer refers to a table that is specific to your particular open
service. If you clone the open service, the table is copied. When you close the
service, the table is destroyed.
For some services, ICU supplies registration. You can register a customized open
service under an ID; keeping a copy of that service even after you close the
original. A client in that thread or in other threads can recreate a copy of the
service by opening with that ID.
ICU may cache service instances. Therefore, registration should be done during
startup, before opening services by locale ID.
These registrations are not persistent; once your program finishes, ICU flushes
all the registrations. While you still might have multiple copies of data
tables, it is faster to create a service from a registered ID than it is to
create a service from rules.
> :point_right: **Note**: To work around the lack of persistent registration,
query the service for the parameters used to create it and then store those
parameters in a file on a disk.
For services whose IDs are locales, such as collation, the registered IDs must
also be locales. For those services (like Transliteration or Timezones) that are
cross-locale, the IDs can be any string.
Prospective future enhancements for this model are:
1. Having custom services share data tables, by making those tables reference
counted. This will reduce memory consumption and speed clone operations (a
performance enhancement chiefly useful for multiple threads using the same
customized service).
2. Expanding registration for all the international services.
3. Allowing persistent registration of services.
#### Per-client Locale ID vs Per-thread Locale ID
Some application environments operate by setting a per thread (or per process)
locale ID, and then not passing the locale ID as a parameter during processing.
If this usage model were used with ICU in a multi-threaded server, it might
result in ICU being requested to constantly open, use, and then close service
objects. Instead, it is recommended that locale IDs be associated with each
client be stored with other per-client data, along with any service objects
(such as collators or formatters) that client might use. If operations involving
a single client are short-lived, it might be more efficient to keep a pool of
service objects, organized according to locale. Then, if a particular locale's
formatter is in high demand, that formatter can be used, and then returned to
the pool.
### ICU Memory Usage
ICU4C APIs are designed to allow separate heaps for its libraries vs. the
application. This is achieved by providing functions to allocate and release
objects owned by ICU4C using only ICU4C library functions. For more details see
the Memory Usage section in the [Coding Guidelines](dev/codingguidelines.md).
### ICU Initialization and Termination
The ICU library does not normally require any explicit initialization prior to
use. An application begins use simply by calling any ICU API in the usual way.
(There is one exception to this, described below.)
In C++ programs, ICU objects and APIs may safely be used during static
initialization of other application-defined classes or objects. There are no
order-of-initialization problems between ICU and static objects from other
libraries because ICU does not rely on C++ static object initialization for its
normal operation.
When an application is terminating, it may optionally call the function
`u_cleanup(void)`, which will free any heap storage that has been allocated and
held by the ICU library. The main benefit of `u_cleanup()` occurs when using
memory leak checking tools while debugging or testing an application. Without
`u_cleanup()`, memory being held by the ICU library will be reported as leaks.
(For some platforms, the configure option `--enable-auto-cleanup` (or
defining the option `UCLN_NO_AUTO_CLEANUP` to 0) will add code which
automatically cleans up ICU when its shared library is unloaded. See comments in
`ucln_imp.h`)
#### Initializing ICU in Multithreaded Environments
There is one specialized case where extra care is needed to safely initialize
ICU. This situation will arise only when ALL of the following conditions occur:
1. The application main program is written in plain C, not C++.
2. The application is multithreaded, with the first use of ICU within the
process possibly occurring simultaneously in more than one thread.
3. The application will be run on a platform that does not handle C++ static
constructors from libraries when the main program is not in C++. Platforms
known to exhibit this behavior are Mac OS X and HP/UX. Platforms that handle
C++ libraries correctly include Windows, Linux and Solaris.
To safely initialize the ICU library when all of the above conditions apply, the
application must explicitly arrange for a first-use of ICU from a single thread
before the multi-threaded use of ICU begins (see below for basic steps in safely
initializing the ICU library). A convenient ICU operation for this purpose is
`uloc_getDefault()` , declared in the header file `unicode/uloc.h`.
#### Steps in Safely Initializing ICU in Single and Multi-threaded Environments
1. If needed, certain data loading functions, such as `u_setCommonData()`,
`u_setAppData()`, and `u_setDataDirectory()`, must be called before any other
ICU function. In addition there are some other heap, mutex, and trace
functions, such as `u_setMemoryFunctions()` and `u_setMutexFunctions()`, which
also must be called during the initial and unused state of ICU.
2. Next, `u_init()` can be called to ensure proper loading and initialization of
data that are required internally by various ICU functions. Explicit use of
this function is needed in a multi-threaded application by the main thread.
Each subsequent thread does not need to call `u_init()` again after the main
thread has successfully executed this function. In a single threaded
program, calls to this function is not needed but
recommended.
3. After the successful initialization of ICU, normal use of ICU, whether using
multiple threads or just a single one, is permitted.
4. When the application is done using ICU, the individual threads must cease
all ICU services leaving only the main thread.
5. After all but the main thread have released ICU, `u_cleanup()` can be called.
The releasing of the individual threads to ICU is necessary because
`u_cleanup()` is not thread safe. In addition, all ICU items, including
collators, resource bundles, and converters, must be closed before calling
this function. `u_cleanup()` will free/delete all memory owned by the ICU
libraries returning them to their original load state. Generally, this
function should be called only once just before an application exits.
However, applications needing to dynamically load and unload the ICU
libraries can call this function just before the library unloads.
`u_cleanup()` also clears any ICU heap functions, mutex functions, or trace
functions that may have been set for the process. If ICU is to be
reinitialized after calling `u_cleanup()`, these runtime override functions
will need to be set up again if they are still required. Great care must be
taken with `u_cleanup()`; it should be used only by code that fully controls
all use of ICU in the process. In any event, if the application doesn't exit
and requires ICU again after correctly calling `u_cleanup()`, go back to step
(1).
### Error Handling
In order for ICU to maximize portability, this version includes only the subset
of the C++ language that compiles correctly on older C++ compilers and provides a
usable C interface. Thus, there is no use of the C++ exception mechanism in the
code or Application Programming Interface (API).
To communicate errors reliably and support multi-threading, this version uses an
error code parameter mechanism. Every function that can fail takes an error-code
parameter by reference. This parameter is always the last parameter listed for
the function.
The `UErrorCode` parameter is defined as an enumerated type. Zero represents no
error, positive values represent errors, and negative values represent non-error
status codes. Macros (`U_SUCCESS` and `U_FAILURE`) are provided to check the
error code.
The `UErrorCode` parameter is an input-output parameter. Every function tests the
error code before performing any other task and immediately exits if the error
code already indicates failure. If the function fails later on, it sets the error code
appropriately and exits without performing any other work, except for any
cleanup it needs to do. If the function encounters a non-error condition that it
wants to signal, such as "encountered an unmapped character" in conversion, the
function sets the error code appropriately and continues. Otherwise, the
function leaves the error code unchanged.
Generally, only the functions that do not take a `UErrorCode` parameter, but
call functions that do, must declare a `UErrorCode` variable. Almost all
functions that take a `UErrorCode` parameter, and also call other functions that
do, merely have to propagate the error code that they were passed to the
functions they call. Code that declares a new `UErrorCode` variable must
initialize it to `U_ZERO_ERROR` before calling any other functions.
ICU enables you to call several functions (that take error codes) successively
without having to check the error code after each function. Each function
usually must check the error code before doing any other processing, since it is
supposed to stop immediately after receiving an error code. Propagating the
error-code parameter down the call chain saves the programmer from having to
declare the parameter in every instance and also mimics the C++ exception
protocol more closely.
### Extensibility
There are 3 major extensibility elements in ICU:
1. **Data Extensibility**:
The user installs new locales or conversion data to enhance the existing ICU
support. For more details, refer to the package tool chapter
(:construction: **TODO**: need link).
2. **Code Extensibility**:
The classes, data, and design are fully extensible. Examples of this
extensibility include the BreakIterator , RuleBasedBreakIterator and
DictionaryBasedBreakIterator classes.
3. **Error Handling Extensibility**:
There are mechanisms available to enhance the built-in error handling when
it is necessary. For example, you can design and create your own conversion
callback functions when an error occurs. Refer to the
[Conversion](conversion/index.md) chapter callback section for more
information.
### Resource Bundle Inheritance Model
A resource bundle is a set of \<key,value> pairs that provide a mapping from key
to value. A given program can have different sets of resource bundles; one set
for error messages, one for menus, and so on. However, the program may be
organized to combine all of its resource bundles into a single related set.
The set is organized into a tree with "root" at the top, the language at the
first level, the country at the second level, and additional variants below
these levels. The set must contain a root that has all keys that can be used by
the program accessing the resource bundles.
Except for the root, each resource bundle has an immediate parent. For example,
if there is a resource bundle `X_Y_Z`, then there must be the resource bundles:
`X_Y`, and `X`. Each child resource bundle can omit any \<key,value> pair that is
identical to its parent's pair. (Such omission is strongly encouraged as it
reduces data size and maintenance effort.) It must override any \<key,value> pair
that is different from its parent's pair. If you have a resource bundle for the
locale ID `language_country_variant`, you must also have
a bundle for the ID `language_country` and one for the ID `language`.
If a program doesn't find a key in a child resource bundle, it can be assumed
that it has the same key as the parent. The default locale has no effect on
this. The particular language used for the root is commonly English, but it
depends on the developer's preference. Ideally, the language should contain
values that minimize the need for its children to override it.
The default locale is used only when there is not a resource bundle for a given
language. For example, there may not be an Italian resource bundle. (This is
very different from the case where there is an Italian resource bundle that is
missing a particular key.) When a resource bundle is missing, ICU uses the
parent unless that parent is the root. The root is an exception because the root
language may be completely different from its children. In this case, ICU uses a
modified lookup and the default locale. The following lookup chains are
used:
**Lookup chain**: Searching for a resource bundle.
1. `en_US_<some-variant>`
2. `en_US`
3. `en`
4. `<defaultLang>_<defaultCountry>`
5. `<defaultLang>`
6. `root`
**Lookup chain**: Searching for a \<key, value> pair after
`en_US_<some-variant>` has been loaded. ICU does not use the default locale in
this case.
1. `en_US_<some-variant>`
2. `en_US`
3. `en`
4. `root`
## Other ICU Design Principles
ICU supports extensive version code and data changes and introduces namespace
usage.
### Version Numbers in ICU
Version changes show clients when parts of ICU change. ICU as a whole; its
components (such as `Collator`); each resource bundle, including all the locale
data resource bundles; and individual tagged items within a resource bundle all
have their own version numbers. Version numbers increase numerically and
lexically as changes are made.
All version numbers are used in Application Programming Interfaces (APIs) with a
`UVersionInfo` structure. The `UVersionInfo` structure is an array of four
unsigned bytes. These bytes are:
1. Major version number
2. Minor version number
3. Milli version number
4. Micro version number
Two `UVersionInfo` structures may be compared using binary comparison (`memcmp`)
to see which is larger or newer. Version numbers may be different for different
services. For instance, do not compare the ICU library version number to the ICU
collator version number.
`UVersionInfo` structures can be converted to and from string representations as
dotted integers (such as "1.4.5.0") using the `u_versionToString()` and
`u_versionFromString()` functions. String representations may omit trailing zeros.
The interpretation of version numbers depends on what is being described.
#### ICU Release Version Number (ICU 49 and later)
The first version number field contains the ICU release version number, for
example 49. Each new version might contain new features, new locale data, and
modified behavior. (See below for more information on
[ICU Binary Compatibility](#icu-binary-compatibility).)
The second field is 1 for the initial release (e.g., 49.1). The second and
sometimes third fields are incremented for binary compatible maintenance
releases.
* For maintenance releases for only either C or J, the third field is
incremented (e.g., ICU4C 49.1.1).
* For shared updates for C & J, the second field is incremented to 2 and
higher (e.g., ICU4C & ICU4J 49.2).
(The second field is 0 during development, with milestone numbers in the third
field during that time. For example, 49.0.1 for 49 milestone 1.)
#### ICU Release Version Number (ICU 1.4 to ICU 4.8)
In earlier releases, the first two version fields together indicated the ICU
release, for example 4.8. The third field was 0 for the initial release, and 1
and higher for binary compatible (bug fixes only) maintenance releases (e.g.,
4.8.1). The fourth field was used for updates specific to only one of Java, C++,
or ICU-in-Eclipse.
The second version field was *even* for formal releases ("reference releases")
(e.g., 1.6 or 4.8) and *odd* during their development (unreleased unstable
snapshot versions; e.g., 4.7). During development, the third field contained the
milestone number (e.g., 4.7.1 for 4.8 milestone 1). For very old ICU code, we
published semi-formal “enhancement” releases with odd second-field numbers
(e.g., 1.7).
Library filenames and some other internal uses already used a concatenation of
the first two fields ("48" for 4.8).
#### Resource Bundles and Elements
The data stored in resource bundles is tagged with version numbers. A resource
bundle can contain a tagged string named "Version" that declares the version
number in dotted-integer format. For example,
```Text
en {
Version { "1.0.3.5" }
...
}
```
A resource bundle may omit the "Version" element and thus will inherit a
version along the usual chain. For example, if the resource bundle **en_US**
contained no "Version" element, it would inherit "1.0.3.5" from the parent **en**
element. If inheritance passes all the way to the root resource bundle and it
contains no "Version" resource, then the resource bundle receives the default
version number 0.
Elements within a resource bundle may also contain version numbers. For example:
```Text
be {
CollationElements {
Version { "1.0.0.0" }
...
}
}
```
In this example, the CollationElements data is version 1.0.0.0. This element
version is not related to the version of the bundle.
#### Internal version numbers
Internally, data files carry format and other version numbers. These version
numbers ensure that ICU can use the data file. The interpretation depends
entirely on the data file type. Often, the major number in the format version
stays the same for backwards-compatible changes to a data file format. The minor
format version number is incremented for additions that do not violate the
backwards compatibility of the data file.
#### Component Version Numbers
ICU component version numbers may be found using:
1. `u_getVersion()` returns the version number of ICU as a whole.
2. `ures_getVersion()` and `ResourceBundle::getVersion()` return the version
number of a ResourceBundle. This is a data version number for the bundle as a
whole and subject to inheritance.
3. `u_getUnicodeVersion()` and `Unicode::getUnicodeVersion()` return the version
number of the Unicode character data that underlies ICU. This version
reflects the numbering of the Unicode releases. See
<http://www.unicode.org/> for more information.
4. `Collator::getVersion()` in C++ and `ucol_getVersion()` in C return the version
number of the Collator. This is a code version number for the collation code
and algorithm. It is a combination of version numbers for the collation
implementation, the Unicode Collation Algorithm data (which is the data that
is used for characters that are not mentioned in a locale's specific
collation elements), and the collation elements.
#### Configuration and Management
A major new feature in ICU 2.0 is the ability to link to different versions of
ICU with the same program. Using this new feature, a program can keep using ICU
1.8 collation, for example, while using ICU 2.0 for other services. ICU can now
also be unloaded to free up resources, and then reloaded when it is needed
again.
### Namespace in C++
ICU 2.0 introduced the use of a C++ namespace to avoid naming collision between
ICU exported symbols and other libraries. All the public ICU C++ classes are
defined in the "icu_VersionNumber::" namespace, which is also aliased as
namespace "icu". Starting with ICU 2.0, including any public ICU C++ header by
default includes a "using namespace icu_VersionNumber" statement. This is for
backward compatibility, and should be turned off in favor of explicitly using
`icu::UnicodeString` etc. (see [How To Use ICU](howtouseicu.md)). (If entry point
renaming is turned off, then only the unversioned "icu" namespace is used.)
Starting with ICU 49, ICU4C requires namespace support.
### Library Dependencies (C++)
It is sometimes useful to see a dependency chart between the public ICU APIs and
ICU libraries. Such a chart can be useful to people who are new to ICU or who
want only certain ICU libraries.
> :construction: **TODO**: The dependency chart is currently not available.
Here are some things to realize about the chart.
1. It gives a general overview of the ICU library dependencies.
2. Internal dependencies, like the mutex API, are left out for clarity.
3. Similar APIs were lumped together for clarity (e.g. Formatting). Some of
these dependency details can be viewed from the ICU API reference.
4. The descriptions of each API can be found in our [ICU API
reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/).
### Code Dependencies (C++)
Starting with ICU 49, the dependencies of code files (.o files compiled from
.c/.cpp) are documented in
[source/test/depstest/dependencies.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/test/depstest/dependencies.txt).
Adjacent Python code is used to parse this file and to
[verify](http://site.icu-project.org/processes/release/tasks/healthy-code#TOC-Check-library-dependencies)
that it matches the actual dependencies of the code files.
The dependency list can be used to build subset libraries. In addition, by
reducing intra-library dependencies, the code size of statically linked ICU code
has been reduced.
### ICU API categories
ICU APIs, as defined in header and class files, are either "external" or
"internal". External APIs are meant to be used by applications, while internal
APIs should be used only within ICU. APIs are marked to indicate whether they
are external or internal, as follows. Every external API has a lifecycle label,
see below.
#### External ICU4C APIs
External ICU4C APIs are
1. declared in header files in unicode folders and exported at build/install
time to an `include/unicode` folder
2. `public` or `protected`, when they are C++ class members
3. not labeled with `@internal`
Exception: Layout engine header files are not in a unicode folder, although the
public ones are still copied to the `include/unicode` folder at build/install
time. External layout engine APIs are the ones that have lifecycle labels and
not an `@internal` label.
#### External ICU4J APIs
External ICU4J APIs are
1. declared in one of the ICU4J core packages (`com.ibm.icu.lang`,
`com.ibm.icu.math`, `com.ibm.icu.text`, or `com.ibm.icu.util`).
2. `public` or `protected` class members
3. `public` or `protected` contained classes
4. not labeled with `@internal`
#### "System" APIs
"System" APIs are external APIs that are intended only for special uses for
system-level code, for example `u_cleanup()`. Normal users should not use them,
although they are public and supported. System APIs have a `@system` label
in addition to the lifecycle label that all external APIs have (see below).
#### Internal APIs
All APIs that do not fit any of the descriptions above are internal, which means
that they are for ICU internal use only and may change at any time without
notice. Some of them are member functions of public C++ or Java classes, and are
"technically public but logistically internal" for implementation reasons;
typically because programming languages don't provide sufficiently fine-grained access
control (without clumsy mechanisms). In this case, such APIs have an
`@internal` label.
### ICU API compatibility
As ICU develops, it adds external APIs - functions, classes, constants, and so
on. Occasionally it is also necessary to remove or change external APIs. In
order to make this work, we use the following process:
For all API changes (and for significant/controversial/difficult implementation
changes), we use proposals to announce and discuss them. A proposal is simply an
email to the icu-design mailing list that details what is proposed to be
changed, with an expiration date of typically a week. This gives all mailing
list members a chance to review upcoming changes, and to discuss them. A
proposal often changes significantly as a result of discussion. Most proposals
will eventually find consensus among list members; otherwise, the ICU-TC decides
what to do. If the addition or change of APIs would affect you, please subscribe
to the main [icu-design mailing list](http://icu-project.org/contacts.html) .
When a **new API** is added to ICU, it is marked as draft with a `@draft ICU
x.y` label in the API documentation, **where x.y is the ICU version when the
API *signature* was introduced or last changed**. A draft API is not guaranteed
to be stable! Although we will not make gratuitous changes, sometimes a draft
API turns out to be unsatisfactory in actual practice and may need to be
changed or even removed. Changes of "draft" APIs are subject to the proposal
process described above.
**When a `@draft ICU x.y` API is changed, it must remain `@draft` and its version
number must be updated.**
In ICU4J 3.4.2 and earlier, `@draft` APIs were also marked with Java's `@deprecated`
tag, so that uses of draft APIs in client code would be flagged by the compiler.
These uses of the `@deprecated` tag were indicated with the comment “This is a
draft API and might change in a future release of ICU.” Many clients found this
confusing and/or undesirable, so ICU4J 3.4.3 no longer marks draft APIs with
the `@deprecated` tag by default. For clients who prefer the earlier behavior,
ICU4J provides an ant build target, `restoreDeprecated`, which will update the
source files to use the `@deprecated` tag. Then clients can just rebuild the ICU4J
jar as usual.
When an API is judged to be stable and has not been changed for at least one ICU
release, it is relabeled as stable with a `@stable ICU x.y` label in the API
documentation. A stable API is expected to be available in this form for a long
time. The ICU version **x.y** indicates the last time the API *signature* was
introduced or changed. **The promotion from `@draft ICU x.y` to `@stable ICU x.y`
must not change the x.y version number.**
We occasionally make an exception and allow new APIs to be marked as
`@stable ICU x.y` in the x.y release itself if we believe that they have to
be stable. We might do this for enum constants that reflect 1:1 Unicode property
aliases and property value aliases, for a Unicode upgrade in the x.y release.
We sometimes **"broaden" a `@stable`** API function by changing its signature
in a compatible way. For example, in Java, we might change an input parameter
from a `String` to a `CharSequence`. In this case we keep the `@stable` label but
update the ICU version number to indicate the function signature change.
Even a stable API may eventually need to become deprecated or obsolete. Such
APIs are strongly discouraged from use. Typically, an improved API is introduced
at the time of deprecation/obsolescence of the old one.
1. Use of deprecated APIs is strongly discouraged, but they are retained for
backward compatibility. These are marked with labels like
`@deprecated ICU x.y Use u_abc() instead.`. **The ICU version x.y shows the
ICU release in which the API was first declared "deprecated".**
2. In ICU4J, starting with release 57, a custom Javadoc tag `@discouraged`
was added. While similar to `@deprecated` it is used when either ICU wants
to discourage a particular API from use but the JDK hasn't deprecated it or
ICU needs to keep it for compatibility reasons. These are marked with labels
like `@discouraged ICU x.y. Use u_abc() instead.`.
3. Obsolete APIs are those whose continued retention will cause severe
conflicts or user error, or whose continued support would be a very
significant maintenance burden. We make every effort to keep these to a
minimum. Obsolete APIs are marked with labels like `@obsolete ICU x.y. Use
u_abc() instead since this API will be removed in that release.`.
**The x.y indicates that we plan to remove it in ICU version x.y.**
Stable C or Java APIs will not be obsoleted because doing so would break
forward binary compatibility of the ICU library. Stable APIs may be
deprecated, but they will be retained in the library.
An "obsolete" API will remain unchanged until it is removed in the indicated
ICU release, which will usually be one year after the API was declared
obsolete. Sometimes we still keep it available for some time via a
compile-time switch but stop maintaining it. In rare occasions, an API must
be replaced right away because of naming conflicts or severe defects; in
such cases we provide compile-time switches (`#ifdef` or other mechanisms) to
select the old API.
For example, here is how an API might be tagged in various versions:
* **In ICU 0.2**: The API is newly introduced as a draft in this release.
```Text
@draft ICU 0.2
f(x)
```
* **In ICU 0.4**: The draft version number is updated, because the signature
changed.
```Text
@draft ICU 0.4
f(x, y)
```
* **In ICU 0.6**: The API is promoted from draft to stable, but the version
number does not change, as the signature is the same.
```Text
@stable ICU 0.4
f(x, y)
```
* **In ICU 1.0**: The API is "broadened" in a compatible way. For example,
changing an input parameter from char to int or from some class to a base
class. The signature is changed (so we update the ICU version number), but old
calling code continues to work unchanged (so we retain @stable if that's what
it was.)
```Text
@stable ICU 1.0
f(xbase, y)
```
* **In ICU 1.2**: The API is demoted to deprecated (or obsolete) status.
```Text
@deprecated ICU 1.2 Use g(x,y,z) instead.
f(xbase, y)
```
or, when this API is planned to be removed in ICU 1.4:
```Text
@obsolete ICU 1.4. Use g(x,y,z) instead.
f(xbase, y)
```
### ICU Binary Compatibility
ICU4C may be configured for use as a system library in an environment where
applications that are built with one version of ICU must continue to run without
change with later versions of the ICU shared library.
Here are the requirements for enabling binary compatibility for ICU4C:
1. Applications must use only APIs that are marked as stable.
2. Applications must use only plain C APIs, never C++.
3. ICU must be built with function renaming disabled.
4. Applications must be built using an ICU that was configured for binary
compatibility.
5. Use ICU version 3.0 or later.
**Stable APIs Only.** APIs in the ICU library that are tagged as being stable
will be maintained in future versions of the library. Stable functions will
continue to exist with the same signature and the same meaning, allowing
applications to continue to work without change.
Stable APIs do not guarantee that the results from every function will always be
completely identical between ICU versions (see the
[Version Numbers in ICU](#version-numbers-in-icu) section above). Bugs may be
fixed. The Unicode character data may change with new versions of the Unicode
standard. Locale data may be updated or changed, yielding different results for
operations like formatting or collation. Applications that require exact
bit-for-bit, bug-for-bug compatibility of ICU results should not rely on ICU
release-to-release binary compatibility, but should instead link against a
specific version of ICU.
To verify that an application uses only stable APIs, build it with the C
preprocessor symbols `U_HIDE_DRAFT_API` and `U_HIDE_DEPRECATED_API` defined. This
will produce build errors if any draft, deprecated or obsolete APIs are used. An
operating system level installation of ICU may set this option permanently.
**C APIs only.** Only plain C APIs remain compatible across ICU releases. The
reason C++ binary compatibility is not supported is primarily because the design
of C++ language and runtime environments present extreme technical difficulties
to doing so. Stable C++ APIs are *source* compatible, but applications using
them must be recompiled when moving between ICU releases.
**Function renaming disabled.** Function renaming is an ICU feature that allows
an application to explicitly link against a specific version of the ICU library,
and to continue to use that version even when other ICU versions exist in the
runtime environment. This is the exact opposite of release-to-release binary
compatibility: instead of being able to transparently change ICU versions, an
application is explicitly tied to one specific version.
Function renaming is enabled by default, and must be disabled at ICU build time
to enable release-to-release binary compatibility. To disable renaming, use the
configure option
```Shell
configure --disable-renaming [other configure options]
```
(Configure options may also be passed to the runConfigureICU script.)
To enable release-to-release binary compatibility, ICU must be built with
`--disable-renaming`, *and* applications must be built using the headers and
libraries that resulted from the `--disable-renaming` ICU build.
**ICU Version 3.0 or Later.** Binary compatibility of ICU releases is supported
beginning with ICU version 3.0. Older versions of ICU (2.8 and earlier) do not
provide for binary compatibility between versions.
#### Linking against multiple versions of ICU4C
This section is intended to aid software developers who are implementing or
integrating solutions based on ICU and may need to consider having multiple
versions of ICU running within the same executable (address space) at once.
Typically, users of ICU are encouraged to update to the latest stable version.
Under certain circumstances, however, behavior from earlier versions is desired,
or else, an application is linking together code which is already built against
a different version of ICU.
The major and minor numbers are the first and second numbers in a version
number, separated by a period. For example, in the version numbers 3.4.2.1,
3.4.2, or 3.4, "3" is the major number and "4" is the minor number. Normally, ICU employs
"symbol renaming", such that the C function names and C++ object names are
`#defined` to contain the major and minor numbers. So, for example, if your
application calls the function `ucnv_open()`, it will link against
`ucnv_open_3_4` if compiled against ICU 3.4, 3.4.2, or even 3.4.2.1. However, if
compiled against ICU 3.8, the same code will link against `ucnv_open_3_8`.
Similarly, `UnicodeString` is renamed to `UnicodeString_3_4`, etc. This is normally
transparent to the user, however, if you inspect the symbols of the library or
your code, you will see the modified symbols.
If there are multiple versions of ICU being linked against in one application,
it will need to link against all relevant libraries for each version, for
example, common, i18n, and data. ICU uses standard library renaming, where, for
example, `libicuuc.so` on one platform will actually be a symbolic link to
`libicuuc.so.3.4`. When multiple ICU versions are used, the application may need
to explicitly link against the exact versions of ICU being used.
To disable renaming, build ICU with `--disable-renaming` passed to configure.
Or, set the equivalent `#define U_DISABLE_RENAMING 1`. Renaming must be disabled
both in the ICU build, and in the calling application.
### ICU Data Compatibility
Starting in ICU 3.8 and later, the data library that comes with ICU is binary
compatible and structurally compatible with versions of ICU with the same major
and minor version, or a maintenance release. This allows multiple maintenance
releases of ICU to share the same data, but generally the latest maintenance
release of the data should be used.
The binary compatibility of the data refers to the resource bundle binary format
that contains the locale data, charset conversion tables and other file
formats supported by ICU. These binary formats are readable by many versions of
ICU. For example, resource bundles written with ICU 3.6 are readable by ICU 3.8.
The structural compatibility of the data refers to the structural contents of
the ICU data. The structure of the locale data may change between reference
releases, but the keys to reference specific types of data will be the same
between maintenance releases. This means that resource keys to access data
within resource bundles will work between maintenance releases of a specific
reference release. For example, an ICU 3.8 calendar will be able to use ICU
3.8.1 data, and vice versa; however, ICU 3.6 may not be able to read ICU 3.8
locale data. Generally, these keys are not accessible by ICU users because only
the ICU implementation uses these resource keys.
The contents of the data library may change between ICU maintenance releases and
give you different results due to important updates and bug fixes. An example of
an important update would be a timezone rule update for when a country changes
when daylight saving time occurs. So the results may be different between
maintenance releases.
### ICU4J Serialization Compatibility
Starting in ICU4J 3.6, ICU4J stable API classes (marked as `@stable`) implementing
`java.io.Serializable` support deserialization of their serialized objects by ICU4J 3.6
or newer versions of ICU4J. Some classes perform only shallow serialization;
therefore, it is not guaranteed that a deserialized object behaves exactly the
same as the original object across ICU4J versions. Also, when it is difficult to
maintain serialization compatibility in a certain class across different ICU4J
versions for technical or other reasons, the ICU project committee may approve
the breakage. In such an event, a note explaining the compatibility issue will be
posted on the ICU public mailing lists and also documented in the release notes
of the new ICU4J version introducing the incompatibility.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Contributions to the ICU library
## Why Contribute?
ICU is an open source library that is a de-facto industry standard for
internationalization libraries. Our goal is to provide top of the line i18n
support on all widely used platforms. By contributing your code to the ICU
library, you will get the benefit of continuing improvement by the ICU team and
the community, as well as testing and multi-platform portability. In addition,
it saves you from having to re-merge your own additions into ICU each time you
upgrade to a new ICU release.
## Current Process
See <http://site.icu-project.org/processes/contribute>.
## Historical
### Legal Issues ICU 1.8.1-57
The following process was in place up to ICU 57, when the old ICU license was
used.
### Old Process
In order for your code to be contributed, you need to assign to IBM joint
copyright ownership in the contribution. You retain joint ownership in the
contribution without restriction. (For the complete set of terms, please see the
forms mentioned below.)
The sections below describe two processes, for one-time and ongoing
contributors. In either case, please complete the form(s) electronically and
send it/them to IBM for review. After review by IBM, please print and sign the
form(s), send it/them by mail, and send the code. The code will then be
evaluated.
Please consult a legal representative if you do not understand the implications
of the copyright assignment.
### One-Time Contributors
If you would like to make a contribution only once or infrequently, please use
the *Joint Copyright Assignment - One-time Contribution* form.
(<https://github.com/unicode-org/icu-docs/blob/master/legal/contributions/Copyright_Assignment.rtf>).
The contribution will be identified by a bug ID which is unique to the
contribution and entered into the form. Therefore, please make sure that there
is an appropriate bug (or Request For Enhancement) in the ICU bug database, or
submit one.
The code contribution will be checked into a special part of the ICU source code
repository and evaluated. The ICU team may request updates, for example for
better conformance with the ICU [design](../design.md) principles,
[coding](codingguidelines.md) and testing guidelines, or performance. (See also
the Requirements (§) above.) Such updates can be contributed without exchanging
another form: An ICU team member commits related materials into the ICU source
code repository using the same bug ID that was entered into the copyright
assignment form.
### Ongoing Contributors
If you are interested in making frequent contributions to ICU, then the ICU
Project Management Committee may agree to invite you as an ongoing contributor.
Ongoing contributors may be individuals but are more typically expected to be
companies with one or more people ("authors") writing different parts of one or
more contributions.
In this case, the relationship between the contributor and the ICU team is much
closer: One or more authors belonging to the contributor will have commit access
to the ICU source code repository. With this direct access come additional
responsibilities including an understanding that the contributor will work to
follow the technical Requirements (§) above for contributions, and agreement to
adhere to the terms of the copyright assignment forms for all future
contributions.
The process for ongoing contributors involves two types of forms: Initially, and
only once, an ongoing contributor submits a *Joint Copyright Assignment by
Ongoing Contributor* form, agreeing to essentially the same terms as in the
one-time contributor form, for all future contributions. (See the form at
<https://github.com/unicode-org/icu-docs/blob/master/legal/contributions/Copyright_Assignment_ongoing.rtf>).
The contributor must also send another form, *Addendum to Joint Copyright
Assignment by Ongoing Contributor: Authors*, for the initial set and each
addition of authors to ICU contributions, **before** any contributions from
these authors are committed into the ICU source code repository. (Only new,
additional authors need to be listed on each such form.) The contributor agrees
to ensure that all of these authors agree to adhere to the terms of the
associated *Joint Copyright Assignment by Ongoing Contributor Agreement*. (See
the Authors Addendum form at
<https://github.com/unicode-org/icu-docs/blob/master/legal/contributions/Copyright_Assignment_authors.rtf>).
Some of an ongoing contributor's authors will have commit access to the ICU
source code repository. Their committer IDs need to be established before
completing the Authors Addendum form, so that these committer IDs can be entered
there. (The committer IDs should be activated only after the form is received.)
Committer authors commit materials directly into the appropriate parts of the
ICU source code repository. Contributions from an ongoing contributor are
identified by their association with the contributor's committer IDs.
### Previous Contributions
All previous "one-off" contributions from non-IBM sources to ICU are listed on
the code contributions page in ICU's source code repository. The page contains
links to the softcopies of the Joint Copyright Assignment forms. See
<https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/legal/contributions/code_contributions.html>
In addition, the following non-IBM companies are registered as Ongoing
Contributors:
* Apple
* Google
See the repository folder that contains the contributions page for the full set
of softcopies of contributor agreements including one-off contributions,
ongoing-contributor agreements and author-addendum documents to
ongoing-contributor agreements:
<https://github.com/unicode-org/icu-docs/tree/master/legal/contributions>

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Development
Top-level page for topics for ICU developers. See the subpages listed below for
details:
[Coding Guidelines](codingguidelines.md)
[Contributions to the ICU library](contributions.md)
[Synchronization Issues](sync/index.md)

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Custom ICU4C Synchronization
### Build Time User Provided Synchronization
Build time user synchronization provides a mechanism for platforms with special
requirements to provide their own mutex and one-time initialization
implementations to ICU. This facility was introduced in ICU 53. It may change
over time.
The alternative implementations are compiled directly into the ICU libraries.
Alternative implementations cannot be plugged in at run time.
The tables below show the items that must be defined by a custom ICU
synchronization implementation. The list includes both functions that are used
throughout ICU code and additional functions that are used internally by other
ICU synchronization primitives.
**Low Level Atomics**, a set of platform or compiler dependent typedefs and
inlines. Provided in the internal header file
[umutex.h](../../../../icu4c/source/common/umutex.h).
| Type/Function | Description |
|-------------------------------------------------------|-----------------------------------------------------------------------------|
| typedef u_atomic_int32_t | A 32 bit integer that will work with low level atomic operations. (typedef) |
| umtx_loadAcquire(u_atomic_int32_t &var) | Load the value of var with acquire memory ordering. |
| umtx_storeRelease(u_atomic_int32_t &var, int32_t val) | Store val into var with release memory ordering. |
| umtx_atomic_inc(u_atomic_int32_t &var) | Atomically increment var; returns the incremented value. |
| umtx_atomic_dec(u_atomic_int32_t &var) | Atomically decrement var; returns the decremented value. |
**Mutexes**. Type declarations for ICU mutex wrappers. Provided in a header file.
| Type | Description |
|---------------------|---------------------------------------------------------------------------------------------------|
| struct UMutex | An ICU mutex. All instances will be static. Typically just contains an underlying platform mutex. |
| U_MUTEX_INITIALIZER | A C style initializer for a static instance of a UMutex. |
**Mutex and InitOnce implementations**. Out-of-line platform-specific code.
Provided in a .cpp file.
| Function | Description |
|---------------------------------------|------------------------------------------|
| umtx_lock(UMutex *mutex) | Lock a mutex. |
| umtx_unlock(UMutex* mutex) | Unlock a mutex. |
| umtx_initImplPreInit(UInitOnce &uio) | umtx_initOnce() implementation function. |
| umtx_initImplPostInit(UInitOnce &uio) | umtx_initOnce() implementation function. |
`UInitOnce` and `umtx_initOnce()` are used internally by ICU for thread-safe
one-time initialization. Their implementation is split into a
platform-independent part (contained in
[umutex.h](../../../../icu4c/source/common/umutex.h)),
and the pair of platform-dependent implementation functions listed above.
**Build Setup**
Compiler preprocessor variables are used to name the custom files to be included
in the ICU build. If defined, the files are included at the top of the normal
platform `#ifdef` chains in the ICU sources, and effectively define a new
platform.
| Macro | Description |
|------------------|-------------------------------------------------------|
| U_USER_ATOMICS_H | Set to the name of the low level atomics header file. |
| U_USER_MUTEX_H | Mutexes header file. |
| U_USER_MUTEX_CPP | Mutexes and InitOnce implementation file. |
It is possible (and reasonable) to supply only the two mutex files, while
retaining the ICU default implementation for the low level atomics.
Example ICU configure with user mutexes specified:
CPPFLAGS='-DU_USER_ATOMICS_H=atomic_c11.h -DU_USER_MUTEX_H=mutex_c11.h -DU_USER_MUTEX_CPP=mutex_c11.cpp' ./runConfigureICU --enable-debug Linux
**Stability**
This interface may change between ICU releases. The required set of functions
may be extended, or details of the behavior required may be altered.
The types and functions defined by this interface reach deeply into the ICU
implementation, and we need to retain the ability to make changes should the
need arise.
**Examples**
The code below shows a complete set of ICU user synchronization files.
This implementation uses C++11 language mutexes and atomics. These make for a
convenient reference implementation because the C++11 constructs are well
defined and straightforward to use.
Similar implementations for POSIX and Windows can be found in files
`common/umutex.h` and `common/umutex.cpp`, in the platform `#ifdef` chains; these are
part of the standard ICU distribution.
**Mutex Header**
```c++
// Example of an ICU build time customized mutex header.
//
// Must define struct UMutex and an initializer that will work with static instances.
// All UMutex instances in ICU code will be static.
#ifndef ICU_MUTEX_C11_H
#define ICU_MUTEX_C11_H
#include <mutex>
#include <condition_variable>
struct UMutex {
std::mutex fMutex;
};
#define U_MUTEX_INITIALIZER {}
#endif
```
**Atomics Header**
```c++
#include <atomic>
typedef std::atomic<int32_t> u_atomic_int32_t;
#define ATOMIC_INT32_T_INITIALIZER(val) ATOMIC_VAR_INIT(val)
inline int32_t umtx_loadAcquire(u_atomic_int32_t &var) {
return var.load(std::memory_order_acquire);
}
inline void umtx_storeRelease(u_atomic_int32_t &var, int32_t val) {
var.store(val, std::memory_order_release);
}
inline int32_t umtx_atomic_inc(u_atomic_int32_t &var) {
return var.fetch_add(1) + 1;
}
inline int32_t umtx_atomic_dec(u_atomic_int32_t &var) {
return var.fetch_sub(1) - 1;
}
```
**Mutex and InitOnce implementations**
```c++
//
// Example ICU build time custom mutex cpp file.
//
// Must implement these functions:
// umtx_lock(UMutex *mutex);
// umtx_unlock(UMutex *mutex);
// umtx_initImplPreInit(UInitOnce &uio);
// umtx_initImplPostInit(UInitOnce &uio);
U_CAPI void U_EXPORT2
umtx_lock(UMutex *mutex) {
if (mutex == NULL) {
// Note: globalMutex is pre-defined in the platform-independent ICU code.
mutex = &globalMutex;
}
mutex->fMutex.lock();
}
U_CAPI void U_EXPORT2
umtx_unlock(UMutex* mutex) {
if (mutex == NULL) {
mutex = &globalMutex;
}
mutex->fMutex.unlock();
}
// A mutex and a condition variable are used by the implementation of umtx_initOnce()
// The mutex is held only while the state of the InitOnce object is being changed or
// tested. It is not held while initialization functions are running.
// Threads needing to block, waiting for an initialization to complete, will wait
// on the condition variable.
// All InitOnce objects share a common mutex and condition variable. This means that
// all blocked threads will wake if any (possibly unrelated) initialization completes.
// Which does no harm, it should be statistically rare, and any spuriously woken
// threads will check their state and promptly wait again.
static std::mutex initMutex;
static std::condition_variable initCondition;
// This function is called from umtx_initOnce() when an initial test of a UInitOnce::fState flag
// reveals that initialization has not completed, that we either need to call the
// function on this thread, or wait for some other thread to complete the initialization.
//
// The actual call to the init function is made inline by template code
// that knows the C++ types involved. This function returns TRUE if
// the inline code needs to invoke the Init function, or FALSE if the initialization
// has completed on another thread.
//
// UInitOnce::fState values:
// 0: Initialization has not yet begun.
// 1: Initialization is in progress, not yet complete.
// 2: Initialization is complete.
//
UBool umtx_initImplPreInit(UInitOnce &uio) {
std::unique_lock<std::mutex> initLock(initMutex);
int32_t state = uio.fState;
if (state == 0) {
umtx_storeRelease(uio.fState, 1);
return TRUE; // Caller will next call the init function.
} else {
while (uio.fState == 1) {
// Another thread is currently running the initialization.
// Wait until it completes.
initCondition.wait(initLock);
}
U_ASSERT(uio.fState == 2);
return FALSE;
}
}
// This function is called from umtx_initOnce() just after an initialization function completes.
// Its purpose is to set the state of the UInitOnce object to initialized, and to
// unblock any threads that may be waiting on the initialization.
//
// Some threads may be waiting on the condition variable, requiring the notify_all().
// Some threads may be racing to test the fState flag outside of the mutex,
// requiring the use of store-release when changing its value.
void umtx_initImplPostInit(UInitOnce &uio) {
std::unique_lock<std::mutex> initLock(initMutex);
umtx_storeRelease(uio.fState, 2);
initCondition.notify_all();
}
```

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Synchronization Issues
## Overview
ICU is designed for use in multi-threaded environments. Guidelines for
developers using ICU are in the [ICU Design](../../design.md) section of the
user guide.
Within the ICU implementation, access to shared or global data sometimes must be
protected in order to provide the threading model promised by the ICU design.
The information on this page is intended for developers of ICU library code
itself.
ICU4J uses normal JDK synchronization services.
ICU4C faces a more difficult problem, as there is no standard, fully portable
set of C or C++ synchronization primitives. Internally, ICU4C provides a small
set of synchronization operations, and requires that all synchronization needed
within the ICU library code be implemented using them.
The ICU4C synchronization primitives are for internal use only; they are not
exported as API to normal users of ICU.
ICU provides implementations of its synchronization functions for Windows, POSIX
and C++11 platforms, and provides a build-time interface to allow [custom
implementations](custom.md) for other platforms.
## ICU4C Synchronization Primitives
The functions and types listed below are intended for use throughout the ICU
library code, wherever synchronization is required. They are defined in the
internal header
[umutex.h](../../../../icu4c/source/common/umutex.h).
All synchronization within ICU4C implementation code must use these, and avoid
direct use of functions provided by a particular operating system or compiler.
For examples of use, search the ICU library code.
**Low Level Atomics**
| Type/Function | Description |
|----------------------------------------|-----------------------------------------------------------------|
| typedef u_atomic_int32_t | A 32 bit integer type for use with low level atomic operations. |
| umtx_atomic_inc(u_atomic_int32_t &var) | Atomically increment var; returns the incremented value. |
| umtx_atomic_dec(u_atomic_int32_t &var) | Atomically decrement var; returns the decremented value. |
**Mutexes**
| Type/Function | Description |
|----------------------------|--------------------------------------------------------------------|
| struct UMutex | An ICU mutex. All instances must be static. |
| U_MUTEX_INITIALIZER | A C style initializer for a UMutex. |
| umtx_lock(UMutex *mutex) | Lock a mutex. |
| umtx_unlock(UMutex* mutex) | Unlock a mutex. |
| class Mutex | C++ Mutex wrapper with automatic lock & unlock. See header mutex.h. |
**One Time Initialization**
| Type/Function | Description |
|-------------------------------|-----------------------------------------------------------------------------------------|
| struct UInitOnce | Provides an efficient facility for one-time initialization of static or global objects. |
| umtx_initOnce(UInitOnce, ...) | A family of initialization functions. |
All of these functions are for internal ICU implementation use only. They are
not exported, and not intended for external use.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Editing the ICU User Guide
## Overview
> :construction: **TODO**: Adjust this page for use of GitHub Markdown (since 2020)
rather than Google Sites.
See the [migration page](https://docs.google.com/document/d/1uK91cxv1amCrt75TBw1PlCC5wZhJH_w3dW_8unWL9EQ/edit)
for details and tips.
This version of the ICU User Guide is maintained via Google Sites. The Site
address is <http://sites.google.com/site/icuprojectuserguide/>
Editors are also usually ICU committers. Edit rights are granted by other Site
owners and collaborators.
The change from editing of Open Office Writer documents and generating HTML and
PDF to editing a Google Site simplifies the User Guide maintenance and
encourages us to keep it more up to date than before, at the cost of not being
able to easily generate a single PDF document with the entire contents.
## Document Structure
Major chapters have Introduction pages, and further sections in a chapter are
subpages of that main chapter page. The navigation bar is a manually edited
sidebar accessible (if you are logged in and have edit rights) from Site
settings/Change appearance.
Page URLs should use lowercase letters and no hyphens.
See the sitemap linked from the bottom of the navigation bar.
Most pages have an automatic Table of contents. On a new page, after entering
some contents, return to the very top of the page contents, select Insert/Table
of contents, save, then change it to Right-aligned and turn on Wrap.
## Common Styles
We want to use common styles for code samples, notes and such. Since Google
Sites does not offer a site-wide CSS style sheet, please copy special items from
here, paste and modify their text, rather than creating them from scratch.
For headings, and for standard text styles like **bold**, *italic*,
~~strike-through~~, ... please use standard headings styles from Sites.
### Code
**New:** Use the Format menu styles for Code (inline) and Blockquote Code
(multi-line).
**Obsolete:**
For inline class/type/function/constant names and similar use Sites' Courier New
font which is close enough to the Courier font we used to use.
For a block of code, please copy/paste the following and edit its contents:
U16_NEXT(s, i, length, c)
U16_PREV(s, start, i, c)
U16_APPEND(s, i, length, c, isError)
### Notes
*Endianness is not an issue on this level because the interpretation of an
integer is fixed within any given platform.*
## Bookmarks & Links
For internal links, please select the Sites page as a destination rather than
specifying the full URL as a generic web link.
Unfortunately, Sites makes it hard to define an anchor on a page and create a
link to that specific anchor (whether from the same page or another one).
* For links to a specific section on the same page, please remove the link,
underline the former link text, and put "(§)" right after it.
* For links to a specific section on another page, just link to the page and
name the section. Please also put "(§)" right after it.
If and when Sites offers a reasonable way of defining anchors and linking to
them, we can search our pages for "(§)" and fix the links.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Date and Time Formatting Examples
## Format
The ICU DateFormat interface enables you to format a date in milliseconds into a
string representation of the date. Also, the interface enables you to parse the
string back to the internal date representation in milliseconds.
### C++
```cpp
DateFormat* df = DateFormat::createDateInstance();
UnicodeString myString;
UDate myDateArr[] = { 0.0, 100000000.0, 2000000000.0 };
for (int32_t i = 0; i < 3; ++i) {
myString.remove();
cout << df->format( myDateArr[i], myString ) << endl;
}
```
### C
```c
/* 1st example: format the dates in millis 100000000 and 2000000000 */
UErrorCode status=U_ZERO_ERROR;
int32_t i, myStrlen=0;
UChar* myString;
UDate myDateArr[] = { 0.0, 100000000.0, 2000000000.0 }; // test values
UDateFormat* df = udat_open(UDAT_DEFAULT, UDAT_DEFAULT, NULL, u"GMT", -1, NULL, 0, &status);
for (i = 0; i < 3; ++i) {
myStrlen = udat_format(df, myDateArr[i], NULL, myStrlen, NULL, &status);
if(status==U_BUFFER_OVERFLOW_ERROR){
status=U_ZERO_ERROR;
myString=(UChar*)malloc(sizeof(UChar) * (myStrlen+1) );
udat_format(df, myDateArr[i], myString, myStrlen+1, NULL, &status);
printf("%s\n", austrdup(myString) );
/* austrdup( a function used to convert UChar* to char*) */
free(myString);
}
}
```
## Parse
To parse a date for a different locale, specify it in the locale call. This call
creates a formatting object.
### C++
```cpp
DateFormat* df = DateFormat::createDateInstance
( DateFormat::SHORT, Locale::getFrance());
```
### C
```c
/* 2nd example: parse a date with short French date/time formatter */
UErrorCode status = U_ZERO_ERROR;
int32_t parsepos = 0;
UDateFormat* df = udat_open(UDAT_SHORT, UDAT_SHORT, "fr_FR", u"GMT", -1,
                            NULL, 0, &status);
UDate myDate = udat_parse(df, myString, u_strlen(myString), &parsepos,
                          &status);
```
### Java
```java
import java.text.FieldPosition;
import java.text.ParseException;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import com.ibm.icu.text.DateFormat;
public class TestDateTimeFormat {
public void run() {
// Formatting Dates
DateFormat dfUS = DateFormat.getDateInstance(DateFormat.FULL, Locale.US);
DateFormat dfFrance = DateFormat.getDateInstance(DateFormat.FULL, Locale.FRANCE);
StringBuffer sb = new StringBuffer();
Calendar c = Calendar.getInstance();
Date d = c.getTime();
sb = dfUS.format(d, sb, new FieldPosition(0));
System.out.println(sb.toString());
StringBuffer sbf = new StringBuffer();
sbf = dfFrance.format(d, sbf, new FieldPosition(0));
System.out.println(sbf.toString());
StringBuffer sbg = new StringBuffer();
DateFormat dfg = DateFormat.getDateTimeInstance(DateFormat.FULL, DateFormat.SHORT);
FieldPosition pos = new FieldPosition(DateFormat.MINUTE_FIELD);
sbg = dfg.format(d, sbg, pos);
System.out.println(sbg.toString());
System.out.println(sbg.toString().substring(pos.getBeginIndex(), pos.getEndIndex()));
// Parsing Dates
String dateString_US = "Thursday, February 7, 2008";
String dateString_FRANCE = "jeudi 7 février 2008";
try {
Date parsedDate_US = dfUS.parse(dateString_US);
Date parsedDate_FRANCE = dfFrance.parse(dateString_FRANCE);
System.out.println(parsedDate_US.toString());
System.out.println(parsedDate_FRANCE.toString());
} catch (ParseException pe) {
System.out.println("Exception while parsing :" + pe);
}
}
public static void main(String args[]) {
new TestDateTimeFormat().run();
}
}
```
## Getting Specific Date Fields
To get specific fields of a date, you can use the FieldPosition function for C++
or UFieldPosition function for C.
### C++
```cpp
UErrorCode status = U_ZERO_ERROR;
FieldPosition pos(DateFormat::YEAR_FIELD);
UDate myDate = Calendar::getNow();
UnicodeString str;
DateFormat* df = DateFormat::createDateInstance
( DateFormat::LONG, Locale::getFrance());
df->format(myDate, str, pos, status);
cout << pos.getBeginIndex() << "," << pos.getEndIndex() << endl;
```
### C
```c
UErrorCode status = U_ZERO_ERROR;
UFieldPosition pos;
UChar *myString;
int32_t myStrlen = 0;
char buffer[1024];
pos.field = 1; /* Same as the DateFormat::EField enum */
UDateFormat* dfmt = udat_open(UDAT_DEFAULT, UDAT_DEFAULT, NULL, u"PST", -1,
                              NULL, 0, &status);
myStrlen = udat_format(dfmt, myDate, NULL, myStrlen, &pos, &status);
if (status==U_BUFFER_OVERFLOW_ERROR){
status=U_ZERO_ERROR;
myString=(UChar*)malloc(sizeof(UChar) * (myStrlen+1) );
udat_format(dfmt, myDate, myString, myStrlen+1, &pos, &status);
}
printf("date format: %s\n", u_austrcpy(buffer, myString));
buffer[pos.endIndex] = 0; // NULL terminate the string.
printf("UFieldPosition position equals %s\n", &buffer[pos.beginIndex]);
```
## DateTimePatternGenerator
This class lets you get a different variety of patterns, such as month+day. The
following illustrates this in Java, C++ and C.
### Java
```java
// set up the generator
DateTimePatternGenerator generator
= DateTimePatternGenerator.getInstance(locale);
// get a pattern for an abbreviated month and day
final String pattern = generator.getBestPattern("MMMd");
SimpleDateFormat formatter = new SimpleDateFormat(pattern, locale);
// use it to format (or parse)
String formatted = formatter.format(new Date());
// for French, the result is "13 sept."
```
### C++
```cpp
// set up the generator
status = U_ZERO_ERROR;
DateTimePatternGenerator *generator = DateTimePatternGenerator::createInstance( locale, status);
if (U_FAILURE(status)) {
return;
}
// get a pattern for an abbreviated month and day
UnicodeString pattern = generator->getBestPattern(UnicodeString("MMMd"), status);
SimpleDateFormat *formatter = new SimpleDateFormat(pattern, locale, status);
// use it to format (or parse)
UnicodeString formatted;
formatted = formatter->format(Calendar::getNow(), formatted, status);
// for French, the result is "13 sept."
```
### C
```c
const UChar skeleton[]= {'M', 'M', 'M', 'd', 0};
status=U_ZERO_ERROR;
generator=udatpg_open(locale, &status);
if(U_FAILURE(status)) {
return;
}
/* get a pattern for an abbreviated month and day */
length = udatpg_getBestPattern(generator, skeleton, 4,
pattern, patternCapacity, &status);
formatter = udat_open(UDAT_IGNORE, UDAT_DEFAULT, locale, NULL, -1,
pattern, length, &status);
/* use it to format (or parse) */
formattedCapacity = (int32_t)(sizeof(formatted)/sizeof((formatted)[0]));
resultLen=udat_format(formatter, ucal_getNow(), formatted, formattedCapacity,
NULL, &status);
/* for French, the result is "13 sept." */
```
## Changing the TimeZone Formatting Style
DateTimePatternGenerator also contains some helper functions for parsing
patterns. Here's an example of replacing the kind of time zone used in a pattern.
### Java
```java
/**
* Replace the zone string with a different type, eg v's for z's, etc.
* <p>Called with a pattern, such as one gotten from
* <pre>
* String pattern = ((SimpleDateFormat)
* DateFormat.getTimeInstance(style, locale)).toPattern();
* </pre>
* @param pattern original pattern to change, such as "HH:mm zzzz"
* @param newZone Must be: z, zzzz, Z, ZZZZ, v, vvvv, V, or VVVV
* @return
*/
public String replaceZoneString(String pattern, String newZone) {
DateTimePatternGenerator.FormatParser formatParser =
new DateTimePatternGenerator.FormatParser();
final List itemList = formatParser.set(pattern).getItems();
boolean found = false;
for (int i = 0; i < itemList.size(); ++i) {
Object item = itemList.get(i);
if (item instanceof VariableField) {
// the first character of the variable field determines the type,
// according to CLDR.
String variableField = item.toString();
switch (variableField.charAt(0)) {
case 'z': case 'Z': case 'v': case 'V':
if (!variableField.equals(newZone)) {
found = true;
itemList.set(i, new VariableField(newZone));
}
break;
}
}
}
return found ? formatParser.toString() : pattern;
}
```

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Formatting Dates and Times
## Formatting Dates and Times Overview
Date and time formatters are used to convert dates and times from their internal
representations to textual form and back again in a language-independent manner.
The date and time formatters use `UDate`, which is the internal representation.
Converting from the internal representation (milliseconds since midnight,
January 1, 1970) to text is known as "formatting," and converting from text to
milliseconds is known as "parsing." These processes involve two mappings:
* A mapping between a point in time (UDate) and a set of calendar fields,
which in turn depends on:
* The rules of a particular calendar system (e.g. Gregorian, Buddhist,
Chinese Lunar)
* The time zone
* A mapping between a set of calendar fields and a formatted textual
representation, which depends on the fields selected for display, their
display style, and the conventions of a particular locale.
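As a small standalone sketch of the first mapping, the example below formats the same instant (`UDate` value 0, the epoch) in two time zones. It uses the JDK's `java.text` classes, which share this model with ICU4J; the class name is illustrative, not part of ICU.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class TwoMappings {
    public static void main(String[] args) {
        Date epoch = new Date(0L); // one fixed point in time

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.US);

        // The time zone determines which calendar fields this instant maps to.
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        System.out.println(fmt.format(epoch));        // 1970-01-01 00:00

        fmt.setTimeZone(TimeZone.getTimeZone("GMT+05:00"));
        System.out.println(fmt.format(epoch));        // 1970-01-01 05:00
    }
}
```

The same point in time yields different field values in each zone; the second mapping (fields to text) is then governed by the pattern and locale.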
## DateFormat
DateFormat helps format and parse dates for any locale. Your code can be
completely independent of the locale conventions for months, days of the week,
or calendar format.
### Formatting Dates
The DateFormat interface in ICU enables you to format a Date in milliseconds
into a string representation of the date. It also parses the string back to the
internal Date representation in milliseconds.
```cpp
DateFormat* df = DateFormat::createDateInstance();
UnicodeString myString;
UDate myDateArr[] = { 0.0, 100000000.0, 2000000000.0 };
for (int32_t i = 0; i < 3; ++i) {
myString.remove();
cout << df->format( myDateArr[i], myString ) << endl;
}
```
To format a date for a different Locale, specify it in the call to:
```cpp
DateFormat* df = DateFormat::createDateInstance
( DateFormat::SHORT, Locale::getFrance());
```
### Parsing Dates
A DateFormat can also be used to parse dates:
```cpp
UErrorCode status = U_ZERO_ERROR;
UDate myDate = df->parse(myString, status);
```
When numeric fields abut one another directly, with no intervening delimiter
characters, they constitute a run of abutting numeric fields. Such runs are
parsed specially. For example, the format "HHmmss" parses the input text
"123456" to 12:34:56, parses the input text "12345" to 1:23:45, and fails to
parse "1234". In other words, the leftmost field of the run is flexible, while
the others keep a fixed width. If the parse fails anywhere in the run, then the
leftmost field is shortened by one character, and the entire run is parsed
again. This is repeated until either the parse succeeds or the leftmost field is
one character in length. If the parse still fails at that point, the parse of
the run fails.
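The abutting-field behavior can be observed directly with the JDK's `java.text.SimpleDateFormat`, which implements the same rule; this is a sketch, and the class name is illustrative:

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class AbuttingFields {
    public static void main(String[] args) throws ParseException {
        SimpleDateFormat hms = new SimpleDateFormat("HHmmss", Locale.US);
        hms.setTimeZone(TimeZone.getTimeZone("GMT"));
        SimpleDateFormat out = new SimpleDateFormat("HH:mm:ss", Locale.US);
        out.setTimeZone(TimeZone.getTimeZone("GMT"));

        // All six digits available: every field keeps its fixed width.
        System.out.println(out.format(hms.parse("123456"))); // 12:34:56

        // Five digits: the leftmost field (HH) is shortened to one digit.
        System.out.println(out.format(hms.parse("12345")));  // 01:23:45

        // Four digits: no assignment satisfies the run, so the parse fails.
        System.out.println(hms.parse("1234", new ParsePosition(0))); // null
    }
}
```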
### Producing Normal Date Formats for a Locale
Use createDateInstance to produce the normal date format for that country. There
are other static factory methods available. Use createTimeInstance to produce
the normal time format for that country. Use createDateTimeInstance to produce a
DateFormat that formats both date and time. You can pass different options to
these factory methods to control the length of the result; from SHORT to MEDIUM
to LONG to FULL. The exact result depends on the locale, but generally:
1. SHORT is numeric, such as 12/13/52 or 3:30pm
2. MEDIUM is longer, such as Jan. 12, 1952
3. LONG is longer, such as January 12, 1952 or 3:30:32pm
4. FULL is completely specified, such as Tuesday, April 12, 1952 AD or
3:30:42pm PST
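A sketch of the four styles for a fixed date, using the JDK's `java.text.DateFormat` (which mirrors the ICU factory methods); the exact strings depend on the locale data in the library version in use, so none are shown:

```java
import java.text.DateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class DateStyles {
    public static void main(String[] args) {
        Date d = new GregorianCalendar(1952, Calendar.JANUARY, 12).getTime();
        int[] styles = { DateFormat.SHORT, DateFormat.MEDIUM,
                         DateFormat.LONG, DateFormat.FULL };
        for (int style : styles) {
            // Each style produces a progressively longer, locale-appropriate form.
            System.out.println(DateFormat.getDateInstance(style, Locale.US).format(d));
        }
    }
}
```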
For more general flexibility, the [DateTimePatternGenerator](index.md) can map a
custom selection of time and date fields, along with various display styles for
those fields, to a locale-appropriate format that can then be set as the format
to use by the DateFormat.
### Producing Relative Date Formats for a Locale
ICU currently provides limited support for formatting dates using a “relative”
style, specified using RELATIVE_SHORT, RELATIVE_MEDIUM, RELATIVE_LONG, or
RELATIVE_FULL. As currently implemented, relative date formatting only affects
the formatting of dates within a limited range of calendar days before or after
the current date, based on the CLDR `<field type="day">`/`<relative>` data: For
example, in English, "Yesterday", "Today", and "Tomorrow". Within this range,
the specific relative style currently makes no difference. Outside of this
range, relative dates are formatted using the corresponding non-relative style
(SHORT, MEDIUM, etc.). Relative time styles are not currently supported, and
behave just like the corresponding non-relative style.
### Setting Time Zones
You can set the time zone on the format. If you want more control over the
format or parsing, cast the DateFormat you get from the factory methods to a
SimpleDateFormat. This works for the majority of locales.
> :point_right: **Note**: *Remember to check getDynamicClassID() before carrying out the cast.*
### Working with Positions
You can also use forms of the parse and format methods with ParsePosition and
FieldPosition to enable you to:
1. Progressively parse through pieces of a string.
2. Align any particular field, or find out where it is for selection on the
screen.
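Progressive parsing with ParsePosition can be sketched as follows (shown with the JDK's `java.text` classes, whose ParsePosition API has the same shape as ICU's; the input string and pattern are invented for the example):

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class ProgressiveParse {
    public static void main(String[] args) {
        String input = "1999-07-10 then 2001-12-31";
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));

        ParsePosition pos = new ParsePosition(0);
        // Parsing stops at the first non-matching character and advances
        // pos past the text it consumed.
        Date first = fmt.parse(input, pos);
        // Skip the literal text between the two dates, then parse again.
        pos.setIndex(pos.getIndex() + " then ".length());
        Date second = fmt.parse(input, pos);
        System.out.println(first + " / " + second);
    }
}
```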
## SimpleDateFormat
SimpleDateFormat is a concrete class used for formatting and parsing dates in a
language-independent manner. It allows for formatting, parsing, and
normalization. It formats or parses a date or time, which is represented as the
standard milliseconds since 00:00:00 GMT, January 1, 1970.
SimpleDateFormat is the only built-in implementation of DateFormat. It provides
a programmable interface that can be used to produce formatted dates and times
in a wide variety of formats. The formats include almost all of the most common
ones.
Create a date-time formatter using the following methods rather than
constructing an instance of SimpleDateFormat. In this way, the program is
guaranteed to get an appropriate formatting pattern of the locale.
1. DateFormat::getInstance()
2. getDateInstance()
3. getDateTimeInstance()
If you need a more unusual pattern, construct a SimpleDateFormat directly and
give it an appropriate pattern.
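For example, constructing a formatter directly with a custom pattern might look like this (shown with the JDK's `java.text.SimpleDateFormat`, whose common pattern letters match ICU's; the pattern and time zone are arbitrary choices for the example):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class CustomPattern {
    public static void main(String[] args) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("yyyy.MM.dd 'at' HH:mm:ss", Locale.US);
        // Pin the zone so the output is deterministic.
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        // The epoch formats as "1970.01.01 at 00:00:00".
        System.out.println(fmt.format(new Date(0L)));
    }
}
```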
### Date/Time Format Syntax
A date pattern is a string of characters, where specific strings of characters
are replaced with date and time data from a calendar when formatting or used to
generate data for a calendar when parsing.
The Date Field Symbol Table below contains the characters used in patterns to
show the appropriate formats for a given locale, such as yyyy for the year.
Characters may be used multiple times. For example, if y is used for the year,
'yy' might produce '99', whereas 'yyyy' produces '1999'. For most numerical
fields, the number of characters specifies the field width. For example, if h is
the hour, 'h' might produce '5', but 'hh' produces '05'. For some characters,
the count specifies whether an abbreviated or full form should be used, but may
have other choices, as given below.
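The effect of repeating a pattern letter can be checked with a quick sketch (shown with `java.text.SimpleDateFormat`, which behaves the same as ICU for these numeric fields; the date and zone are arbitrary):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class FieldWidth {
    static String fmt(String pattern, Date d) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f.format(d);
    }

    public static void main(String[] args) {
        GregorianCalendar cal =
            new GregorianCalendar(TimeZone.getTimeZone("GMT"), Locale.US);
        cal.set(1999, 0, 2, 5, 4, 3); // Jan 2, 1999, 05:04:03 GMT
        Date d = cal.getTime();
        System.out.println(fmt("yy", d));   // 99
        System.out.println(fmt("yyyy", d)); // 1999
        System.out.println(fmt("h", d));    // 5
        System.out.println(fmt("hh", d));   // 05
    }
}
```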
Two single quotes represent a literal single quote, either inside or outside
single quotes. Text within single quotes is not interpreted in any way (except
for two adjacent single quotes). Otherwise, all ASCII letters from a to z and A
to Z are reserved as syntax characters, and require quoting if they are to
represent literal characters. In addition, certain ASCII punctuation characters
may become variable in the future (e.g. ":" being interpreted as the time
separator and '/' as a date separator, and replaced by the respective
locale-sensitive characters in display).
"Stand Alone" values refer to those designed to stand on their own, as opposed
to being with other formatted values. "2nd quarter" would use the stand alone
format (QQQQ), whereas "2nd quarter 2007" would use the regular format (qqqq
yyyy).
The pattern characters used in the Date Field Symbol Table are defined by CLDR;
for more information see [CLDR Date Field Symbol Table](https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table).
Note that the examples may not reflect current CLDR data.
#### Date Field Symbol Table
| Symbol | Meaning | Pattern | Example Output |
| --- | --- | --- | --- |
| G | era designator | G, GG, or GGG<br/>GGGG<br/>GGGGG | AD<br/>Anno Domini<br/>A |
| y | year | yy<br/>y or yyyy | 96<br/>1996 |
| Y | year of "Week of Year" | Y | 1997 |
| u | extended year | u | 4601 |
| U | cyclic year name, as in Chinese lunar calendar | U | 甲子 |
| r | related Gregorian year | r | 1996 |
| Q | quarter | Q<br/>QQ<br/>QQQ<br/>QQQQ<br/>QQQQQ | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
| q | Stand Alone quarter | q<br/>qq<br/>qqq<br/>qqqq<br/>qqqqq | 2<br/>02<br/>Q2<br/>2nd quarter<br/>2 |
| M | month in year | M<br/>MM<br/>MMM<br/>MMMM<br/>MMMMM | 9<br/>09<br/>Sep<br/>September<br/>S |
| L | Stand Alone month in year | L<br/>LL<br/>LLL<br/>LLLL<br/>LLLLL | 9<br/>09<br/>Sep<br/>September<br/>S |
| w | week of year | w<br/>ww | 27<br/>27 |
| W | week of month | W | 2 |
| d | day in month | d<br/>dd | 2<br/>02 |
| D | day of year | D | 189 |
| F | day of week in month | F | 2 (2nd Wed in July) |
| g | modified julian day | g | 2451334 |
| E | day of week | E, EE, or EEE<br/>EEEE<br/>EEEEE<br/>EEEEEE | Tue<br/>Tuesday<br/>T<br/>Tu |
| e | local day of week<br/>(e.g. if Monday is the 1st day of the week, Tuesday is the 2nd) | e or ee<br/>eee<br/>eeee<br/>eeeee<br/>eeeeee | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
| c | Stand Alone local day of week | c or cc<br/>ccc<br/>cccc<br/>ccccc<br/>cccccc | 2<br/>Tue<br/>Tuesday<br/>T<br/>Tu |
| a | am/pm marker | a | pm |
| h | hour in am/pm (1~12) | h<br/>hh | 7<br/>07 |
| H | hour in day (0~23) | H<br/>HH | 0<br/>00 |
| k | hour in day (1~24) | k<br/>kk | 24<br/>24 |
| K | hour in am/pm (0~11) | K<br/>KK | 0<br/>00 |
| m | minute in hour | m<br/>mm | 4<br/>04 |
| s | second in minute | s<br/>ss | 5<br/>05 |
| S | fractional second - truncates (like other time fields)<br/>to the count of letters when formatting. Appends<br/>zeros if more than 3 letters specified. Truncates at<br/>three significant digits when parsing. | S<br/>SS<br/>SSS<br/>SSSS | 2<br/>23<br/>235<br/>2350 |
| A | milliseconds in day | A | 61201235 |
| z | Time Zone: specific non-location | z, zz, or zzz<br/>zzzz | PDT<br/>Pacific Daylight Time |
| Z | Time Zone: ISO8601 basic hms? / RFC 822<br/>Time Zone: long localized GMT (=OOOO)<br/>Time Zone: ISO8601 extended hms? (=XXXXX) | Z, ZZ, or ZZZ<br/>ZZZZ<br/>ZZZZZ | -0800<br/>GMT-08:00<br/>-08:00, -07:52:58, Z |
| O | Time Zone: short localized GMT<br/>Time Zone: long localized GMT (=ZZZZ) | O<br/>OOOO | GMT-8<br/>GMT-08:00 |
| v | Time Zone: generic non-location<br/>(falls back first to VVVV) | v<br/>vvvv | PT<br/>Pacific Time or Los Angeles Time |
| V | Time Zone: short time zone ID<br/>Time Zone: long time zone ID<br/>Time Zone: time zone exemplar city<br/>Time Zone: generic location (falls back to OOOO) | V<br/>VV<br/>VVV<br/>VVVV | uslax<br/>America/Los_Angeles<br/>Los Angeles<br/>Los Angeles Time |
| X | Time Zone: ISO8601 basic hm?, with Z for 0<br/>Time Zone: ISO8601 basic hm, with Z<br/>Time Zone: ISO8601 extended hm, with Z<br/>Time Zone: ISO8601 basic hms?, with Z<br/>Time Zone: ISO8601 extended hms?, with Z | X<br/>XX<br/>XXX<br/>XXXX<br/>XXXXX | -08, +0530, Z<br/>-0800, Z<br/>-08:00, Z<br/>-0800, -075258, Z<br/>-08:00, -07:52:58, Z |
| x | Time Zone: ISO8601 basic hm?, without Z for 0<br/>Time Zone: ISO8601 basic hm, without Z<br/>Time Zone: ISO8601 extended hm, without Z<br/>Time Zone: ISO8601 basic hms?, without Z<br/>Time Zone: ISO8601 extended hms?, without Z | x<br/>xx<br/>xxx<br/>xxxx<br/>xxxxx | -08, +0530<br/>-0800<br/>-08:00<br/>-0800, -075258<br/>-08:00, -07:52:58 |
| ' | escape for text | ' | (nothing) |
| ' ' | two single quotes produce one | ' ' | ' |
> :point_right: **Note**: *Any characters in the pattern that are not in the ranges of
['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance,
characters like ':', '.', ' ', '#' and '@' will appear in the resulting time
text even though they are not enclosed within single quotes. The single quote is
used to 'escape' letters. Two single quotes in a row, whether inside or outside
a quoted sequence, represent a 'real' single quote.*
> :point_right: **Note**: *A pattern containing any invalid pattern letter results in a failing UErrorCode
result during formatting or parsing.*
| Format Pattern | Result |
| --- | --- |
| yyyy.MM.dd G 'at' HH:mm:ss zzz | 1996.07.10 AD at 15:08:56 PDT |
| EEE, MMM d, ''yy | Wed, July 10, '96 |
| h:mm a | 12:08 PM |
| hh 'o''clock' a, zzzz | 12 o'clock PM, Pacific Daylight Time |
| K:mm a, z | 0:00 PM, PST |
| yyyyy.MMMM.dd GGG hh:mm aaa | 01996.July.10 AD 12:08 PM |
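The quoting rules can be verified with a short sketch using a pattern from the table above (shown with `java.text.SimpleDateFormat`; the date is arbitrary):

```java
import java.text.SimpleDateFormat;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class Quoting {
    public static void main(String[] args) {
        GregorianCalendar cal =
            new GregorianCalendar(TimeZone.getTimeZone("GMT"), Locale.US);
        cal.set(1996, 6, 10, 5, 4, 0); // July 10, 1996, 05:04 GMT
        // 'o''clock' is literal text: it is quoted, and the doubled
        // quote inside it produces one real single quote.
        SimpleDateFormat f = new SimpleDateFormat("hh 'o''clock' a", Locale.US);
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        System.out.println(f.format(cal.getTime())); // 05 o'clock AM
    }
}
```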
### Time Zone Display Names
ICU supports time zone display names defined by the LDML ([Unicode Locale Data
Markup Language](http://www.unicode.org/reports/tr35/) ) specification. Since
ICU 3.8, the vast majority of localized time zone names are no longer associated
with individual time zones. Instead, a set of localized time zone names are
associated with a *metazone* and one or more individual time zones are mapped to
the same *metazone*. For example, *metazone* “America_Pacific” has its own
display name data such as “PST” “PDT” “PT” “Pacific Standard Time” “Pacific
Daylight Time” “Pacific Time” and these names are shared by multiple individual
time zones “America/Los_Angeles”, “America/Vancouver”, “America/Tijuana” and so
on. The mapping from an individual time zone to a *metazone* is not a simple
1-to-1 mapping, but changes from time to time. For example, time zone
“America/Indiana/Tell_City” uses name data from *metazone* “America_Eastern”
until April 2, 2006, but changes to *metazone* “America_Central” after that
date. So the display name used for “America/Indiana/Tell_City” before the date
(e.g. “Eastern Time”) differs from the one after the date (e.g. “Central Time”).
> :point_right: **Note**: *Prior to ICU 3.8, a localized time zone name (except GMT format) and a time
zone ID were always in a 1-to-1 relationship. Therefore, a time zone name
produced by DateFormat could be parsed back to the original time zone. This
assumption no longer holds in ICU 3.8 and later releases for all time zone
format types. If your program requires round-tripping a specific time zone ID,
you must use the generic location format (“VVVV”) explained below.*
There are several different display name types available in the LDML
specification.
#### Time Zone Display Name Types
| Type | Description | Examples |
| --- | --- | --- |
| Generic non-location | Reflects wall time, suited for displaying recurring events, meetings or anywhere people do not want to be overly specific. Available in two length options long and short. | Pacific Time<br/>PT |
| Generic partial location | Reflects wall time, used as a fallback format when the generic non-location format is not specific enough. A generic partial location name is constructed from a generic non-location name with a location name. For example, “PT” is shared by multiple time zones via metazone “America_Pacific”. When GMT offset in the time zone at the given time differs from the preferred time zone of the metazone for the locale, location name is appended to generic non-location name to distinguish the time zone from the preferred zone. Available in two length options long and short. | Pacific Time (Canada)<br/>PT (Yellowknife) |
| Generic location | Reflects wall time, suited for populating choice list for time zones. If the time zone is the single time zone available in the region (country), the generic location name is constructed with the region name. Otherwise, the name is constructed from the region name and the city name. Unlike other format types, this name is unique per time zone. | United States (Los Angeles) Time<br/>Italy Time |
| Specific non-location | Reflects a specific standard or daylight time. Available in two length options long and short. | Pacific Standard Time<br/>PDT |
| Localized GMT | A constant, specific offset from GMT in a localized form. | GMT-08:00 |
| RFC822 GMT | A constant, specific offset from GMT in a locale insensitive format. | -0800 |
Each format type in the above table is used as a primary type or a fallback in
SimpleDateFormat. The table below explains how the ICU time zone format patterns
work and their characteristics.
#### Time Zone Pattern Usage
| Pattern | Behavior | Round-trip time at daylight transitions(\*) | Round-trip Time Zone | Suggested Usage |
| --- | --- | --- | --- | --- |
| z, zz, zzz | Short specific non-location format (e.g. “PST”). If the localized data is not available or the short abbreviation is not commonly used for the locale, localized GMT format is used (e.g. GMT-08:00). | yes | no | For displaying a time with a user friendly time zone name. |
| zzzz | Long specific non-location format (e.g. “Pacific Standard Time”). If the localized data is not available, localized GMT format is used (e.g. GMT-08:00). | yes | no | Same as “z”, but longer format. |
| v | Short generic non-location format (e.g. “PT”). If the localized data is not available or the short abbreviation is not commonly used for the locale, generic location format (e.g. “United States (Los Angeles) Time”) is used. If the localized data comes from a metazone and the GMT offset at the given time in the specified time zone differs from the preferred time zone of the metazone for the locale, generic partial location format (e.g. “PT (Canada)”) is used. | no | no | For displaying a recurring wall time (e.g. events, meetings) or anywhere people do not want to be overly specific. |
| vvvv | Long generic non-location format (e.g. “Pacific Time”). If the localized data is not available, generic location format (e.g. “United States (Los Angeles) Time”) is used. | no | no | Same as “v”, but longer format. |
| V | Same as “z”, except using the short abbreviation even if it is not commonly used for the locale. | yes | no | Same as “z”. |
| VVVV | Generic location format (e.g. “United States (Los Angeles) Time”). | no | yes | For populating a choice list for time zones, because it supports 1-to-1 name/zone ID mapping and is more uniform than other text formats. Also, this is the only pattern supporting time zone round-trip. If your program needs to preserve the original time zone information, use this pattern. |
| Z, ZZ, ZZZ | Localized GMT format (e.g. “GMT-08:00”). | yes | no | For displaying a time in UI in a uniformed manner. |
| ZZZZ | RFC822 GMT format (e.g. “-0800”). | yes | no | For formatting a time for non-user-facing data. |
\* At a transition from daylight saving time to standard time, there is a wall
time interval that occurs twice.
## DateTimePatternGenerator
The DateTimePatternGenerator class provides a way to map a request for a set of
date/time fields, along with their width, to a locale-appropriate format
pattern. The request is in the form of a “skeleton” which just contains pattern
letters for the desired fields using the representation for the desired width.
In a skeleton, anything other than a pattern letter is ignored, field order is
insignificant, and there are two special additional pattern letters that may be
used: 'j' requests the preferred hour-cycle type for the locale (it gets mapped
to one of 'H', 'h', 'k', or 'K'); 'J' is similar but requests no AM/PM marker
even if the locale's preferred hour-cycle type is 'h' or 'K'.
For example, a skeleton of “MMMMdjmm” might result in the following format
patterns for different locales:
| locale | format pattern for skeleton “MMMMdjmm” | example |
| ------ | -------------------------------------- | ------------------ |
| en_US | "MMMM d 'at' h:mm a" | April 2 at 5:00 PM |
| es_ES | "d 'de' MMMM, H:mm" | 2 de abril, 17:00 |
| ja_JP | "M月d日 H:mm" | 4月2日 17:00 |
The most important DateTimePatternGenerator methods are the varieties of
getBestPattern.
Note that the fields in the format pattern may be adjusted as appropriate for
the locale and may not exactly match those in the skeleton. For example:
* In Russian (locale "ru"), the skeleton "yMMMM" will produce the format
pattern "LLLL y" (or "LLLL y 'г'.") since a month name without a day number
must be in nominative form, as indicated by LLLL.
* When using the Japanese calendar in the Japanese locale (locale
"ja@calendar=japanese"), the skeleton "yMMMd" will produce the format
pattern "Gy年M月d日" since the era must always be shown with the year in the
Japanese calendar.
## DateFormatSymbols
DateFormatSymbols is a public class for encapsulating localizable date-time
formatting data, including time zone data. DateFormatSymbols is used by
DateFormat and SimpleDateFormat.
DateFormatSymbols specifies the exact character strings to use for various parts
of a date or time. For example, the names of the months and days of the week,
the strings for AM and PM, and the day of the week considered to be the first
day of the week (used in drawing calendar grids) are controlled by
DateFormatSymbols.
Create a date-time formatter using the `createTimeInstance`, `createDateInstance`,
or `createDateTimeInstance` methods in DateFormat. Each of these methods can
return a date/time formatter initialized with a default format pattern, along
with the date-time formatting data for a given or default locale. After a
formatter is created, modify the format pattern using `applyPattern`.
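The create-then-modify flow can be sketched as follows (shown with the JDK's `java.text` classes; ICU's DateFormat factories and SimpleDateFormat expose the same `applyPattern` method):

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class ApplyPattern {
    public static void main(String[] args) {
        DateFormat df = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.US);
        // The factory returns a SimpleDateFormat here, so after checking
        // the concrete type the pattern can be replaced.
        if (df instanceof SimpleDateFormat) {
            SimpleDateFormat sdf = (SimpleDateFormat) df;
            sdf.applyPattern("yyyy-MM-dd");
            sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
            System.out.println(sdf.format(new Date(0L))); // 1970-01-01
        }
    }
}
```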
If you want to create a date-time formatter with a particular format pattern and
locale, use one of the SimpleDateFormat constructors:

```cpp
UErrorCode status = U_ZERO_ERROR;
UnicodeString aPattern("GyyyyMMddHHmmssSSZ", "");
DateFormat* df = new SimpleDateFormat(
    aPattern, new DateFormatSymbols(Locale::getUS(), status), status);
```

This loads the appropriate date-time formatting data from the locale.
## Programming Examples
See [date and time formatting examples](examples.md) .

View file

@ -11,9 +11,9 @@ returned by a number of ICU formatters. APIs for FormattedValue are available
in Java, C++, and C. For more details and a list of all implementing classes,
refer to the API docs:
- [C++ FormattedValue](http://icu-project.org/apiref/icu4c/classicu_1_1FormattedValue.html)
- [C UFormattedValue](http://icu-project.org/apiref/icu4c/globals_u.html) -- search for "resultAsValue"
- [Java FormattedValue](http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/FormattedValue.html)
- [C++ FormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1FormattedValue.html)
- [C UFormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/globals_u.html) -- search for "resultAsValue"
- [Java FormattedValue](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/FormattedValue.html)
## Nested Span Fields

View file

@ -0,0 +1,210 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Formatting and Parsing
## Overview
Formatters translate between binary data and human-readable textual
representations of these values. For example, you cannot display the computer
representation of the number 103. You can only display the numeral 103 as a
textual representation (using three text characters). The result from a
formatter is a string that contains text that the user will recognize as
representing the internal value. A formatter can also parse a string by
converting a textual representation of some value back into its internal
representation. For example, it reads the characters 1, 0 and 3 followed by
something other than a digit, and produces the value 103 as an internal binary
representation.
These classes encapsulate information about the display of localized times,
days, numbers, currencies, and messages. Formatting classes do both formatting
and parsing and allow the separation of the data that the end-user sees from the
code. Separating the program code from the data allows a program to be more
easily localized. Formatting is converting a date, time, number, message or
other object from its internal representation into a string. Parsing is the
reverse operation. It is the process of converting a string to an internal
representation of the date, time, number, message or other object.
Using the formatting classes is an important step in internationalizing your
software because the `format()` and `parse()` methods in each of the classes make
your software language neutral, by replacing implicit conversions with explicit
formatting calls.
## Internationalization Formatting Tips
This section discusses some of the ways you can format and parse numbers,
currencies, dates, times and text messages in your program so that the data is
separate from the code and can be easily localized. This is the information your
users see on their computer screens, so it needs to be in a language and format
that conforms to their local conventions.
Some things you need to keep in mind while you are creating your code are the
following:
* Keep your code and your data separate
* Format the data in a locale-sensitive manner
* Keep your code locale-independent
* Avoid writing special routines to handle specific locales
* String objects formatted by `format()` are parseable by the `parse()` method\*
> :point_right: **Note**: Although parsing is supported in several legacy ICU APIs,
it is generally considered bad practice to parse localized strings.
For more information, read [Why You Should Not Parse
Localized Strings](https://blog.sffc.xyz/post/190943794505/why-you-should-not-parse-localized-strings).
### Numbers and Currencies
Programs store and operate on numbers using a locale-independent binary
representation. When displaying or printing a number it is converted to a
locale-specific string. For example, the number 12345.67 is "12,345.67" in the
US, "12 345,67" in France and "12.345,67" in Germany.
By invoking the methods provided by the `NumberFormat` class, you can format
numbers, currencies, and percentages according to the specified or default
locale. `NumberFormat` is locale-sensitive so you need to create a new
`NumberFormat` for each locale. `NumberFormat` methods format primitive-type
numbers, such as double and output the number as a locale-specific string.
For currencies, you call `getCurrencyInstance` to create a formatter that returns
a string with the formatted number and the appropriate currency sign. Of course,
the `NumberFormat` class is unaware of exchange rates, so the number output is
the same regardless of the specified currency. This means that the same number
has different monetary values depending on the currency locale. If the number is
9988776.65 the results will be:
* 9 988 776,65 € in France
* 9.988.776,65 € in Germany
* $9,988,776.65 in the United States
In order to format percentages, create a locale-specific formatter and call the
`getPercentInstance` method. With this formatter, a decimal fraction such as 0.75
is displayed as 75%.
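The factories just described can be sketched as follows (shown with the JDK's `java.text.NumberFormat`; ICU4J's `com.ibm.icu.text.NumberFormat` has the same factory methods):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class NumberDemo {
    public static void main(String[] args) {
        // Percent: the fraction 0.75 is displayed as "75%" in en_US.
        NumberFormat pct = NumberFormat.getPercentInstance(Locale.US);
        System.out.println(pct.format(0.75));

        // Currency: the formatter adds the symbol and grouping; it does
        // no exchange-rate conversion.
        NumberFormat cur = NumberFormat.getCurrencyInstance(Locale.US);
        System.out.println(cur.format(9988776.65)); // $9,988,776.65
    }
}
```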
#### Customizing Number Formats
If you need to customize a number format, you can use the DecimalFormat (§) and
the DecimalFormatSymbols (§) classes in the [Formatting
Numbers](formatparse/numbers/index.md) chapter. This is not usually necessary
and it makes your code much more complex, but it is available for those rare
instances where you need it. In general, you would do this by explicitly
specifying the number format pattern.
If you need to format or parse spelled-out numbers, you can use the
RuleBasedNumberFormat class (§) (see the [Formatting
Numbers](formatparse/numbers/index.md) chapter). You can instantiate a default
formatter for a locale, or by using the RuleBasedNumberFormat rule syntax,
specify your own.
Using the NumberFormat (§) class methods (see the [Formatting
Numbers](formatparse/numbers/index.md) chapter) with a predefined locale is the
easiest and the most accurate way to format numbers and currencies.
> :point_right: **Note**: *See [Properties and ICU Rule Syntax](strings/properties.md) for
information regarding syntax characters.*
### Dates and Times
You display or print a Date by first converting it to a locale-specific string
that conforms to the conventions of the end user's Locale. For example, Germans
recognize 20.4.98 as a valid date, and Americans recognize 4/20/98.
> :point_right: **Note**: *The appropriate Calendar support is required for different locales. For
example, the Buddhist calendar is the official calendar in Thailand so the
typical assumption of Gregorian Calendar usage should not be used. ICU will pick
the appropriate Calendar based on the locale you supply when opening a Calendar
or DateFormat.*
### Messages
Message format helps make the order of display elements localizable. It helps
address problems of grammatical differences between languages. For example,
consider the sentence, "I go to work by car every day." In Japanese, the
grammatical equivalent may be "Every day, I to work by car go." Another example
is plurals in text, for example, "no space for rent, one room for rent and many
rooms for rent," where "for rent" is the only constant text among the three.
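The "rooms for rent" case can be sketched with `java.text.MessageFormat`'s choice argument (ICU's MessageFormat additionally offers a plural argument, which is preferred for real localization; the pattern here is invented for the example):

```java
import java.text.MessageFormat;

public class Rooms {
    public static void main(String[] args) {
        // The variable part is selected by number; "for rent" stays constant.
        MessageFormat mf = new MessageFormat(
            "{0,choice,0#no rooms|1#one room|1<{0,number,integer} rooms} for rent");
        for (int n : new int[] { 0, 1, 5 }) {
            // 0 -> "no rooms for rent", 1 -> "one room for rent",
            // 5 -> "5 rooms for rent"
            System.out.println(mf.format(new Object[] { n }));
        }
    }
}
```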
## Formatting and Parsing Classes
ICU provides formatting and parsing classes in four major areas: general
formatting, numbers, dates and times, and messages:
### General Formatting
* `Format`:
The abstract superclass of all format classes. It provides the basic methods
for formatting and parsing numbers, dates, strings and other objects.
* `FieldPosition`:
A concrete class for holding the field constant and the begin and end
indices for number and date fields.
* `ParsePosition`:
A concrete class for holding the parse position in a string during parsing.
* `Formattable`:
Formattable objects can be passed to the Format class or its subclasses for
formatting. It encapsulates a polymorphic piece of data to be formatted and
is used with MessageFormat. Formattable is used by some formatting
operations to provide a single "type" that encompasses all formattable
values (e.g., it can hold a number, a date, or a string, and so on).
* `UParseError`:
UParseError is used to return detailed information about parsing errors.
It is used by the ICU parsing engines that parse long rules, patterns, or
programs. This is helpful when the text being parsed is long enough that
more information than a UErrorCode is needed to localize the error.
### Formatting Numbers
* [NumberFormat](formatparse/numbers/index.md) (§)
The abstract superclass that provides the basic fields and methods for
formatting Number objects and number primitives to localized strings and
parsing localized strings to Number objects.
* [DecimalFormat](formatparse/numbers/index.md) (§)
A concrete class for formatting Number objects and number primitives to
localized strings and parsing localized strings to Number objects, in base
10.
* [RuleBasedNumberFormat](formatparse/numbers/index.md) (§)
A concrete class for formatting Number objects and number primitives to
localized text, especially spelled-out format such as found in check writing
(e.g. "two hundred and thirty-four"), and parsing text into Number objects.
* [DecimalFormatSymbols](formatparse/numbers/index.md) (§)
A concrete class for accessing localized number strings, such as the
grouping separators, decimal separator, and percent sign. Used by
DecimalFormat.
### Formatting Dates and Times
* [DateFormat](formatparse/datetime/index.md) (§)
The abstract superclass that provides the basic fields and methods for
formatting Date objects to localized strings and parsing date and time
strings to Date objects.
* [SimpleDateFormat](formatparse/datetime/index.md) (§)
A concrete class for formatting Date objects to localized strings and
parsing date and time strings to Date objects, using a GregorianCalendar.
* [DateFormatSymbols](formatparse/datetime/index.md) (§)
A concrete class for accessing localized date-time formatting strings, such
as names of the months, days of the week and the time zone.
### Formatting Messages
* [MessageFormat](formatparse/messages/index.md) (§)
A concrete class for producing a language-specific user message that
contains numbers, currency, percentages, date, time and string variables.
* [ChoiceFormat](formatparse/messages/index.md) (§)
A concrete class for mapping strings to ranges of numbers and for handling
plurals and name series in user messages.

View file

@ -0,0 +1,381 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Message Formatting Examples
## MessageFormat Class
ICU's MessageFormat class can be used to format messages in a locale-independent
manner to localize the user interface (UI) strings.
### C++
```cpp
/* The strings below can be isolated into a resource bundle
* and retrieved dynamically
*/
#define LANGUAGE_NAMES "{0}<{1}languages {2}>\n"
#define LANG_ATTRIB "{0}<language id=\"{1}\" >{2}</language>\n"
#define MONTH_NAMES "{0}<monthNames>\n"
#define END_MONTH_NAMES "{0}</monthNames>\n"
#define MONTH "{0}<month id=\"{1}\">{2}</month>\n"
#define MONTH_ABBR "{0}<monthAbbr>\n"
#define END_MONTH_ABBR "{0}</monthAbbr>\n"
UnicodeString CXMLGenerator::formatString(UnicodeString& str,UnicodeString&
argument){
Formattable args[] ={ argument};
UnicodeString result;
MessageFormat format(str,mError);
FieldPosition fpos=0;
format.format(args,1, result,fpos,mError);
if(U_FAILURE(mError)) {
return UnicodeString("Illegal argument");
}
return result;
}
void CXMLGenerator::writeLanguage(UnicodeString& xmlString){
UnicodeString *itemTags, *items;
const char* key="Languages";
int32_t numItems;
if(U_FAILURE(mError)) {
return;
}
mRBundle.getTaggedArray(key,itemTags, items, numItems, mError);
if(mError!=U_USING_DEFAULT_ERROR && U_SUCCESS(mError) &&
mError!=U_ERROR_INFO_START){
Formattable args[]={indentOffset,"",""};
xmlString= formatString(UnicodeString(LANGUAGE_NAMES),args,3);
indentOffset.append("\t");
for(int32_t i=0;i<numItems;i++){
args[0] = indentOffset;
args[1] =itemTags[i] ;
args[2] = items[i] ;
xmlString.append(formatString(UnicodeString(LANG_ATTRIB),args,3));
}
chopIndent();
args[0]=indentOffset;
args[1] =(UnicodeString(XML_END_SLASH));
args[2] = "";
xmlString.append(formatString(UnicodeString(LANGUAGE_NAMES),args,3));
return;
}
mError=U_ZERO_ERROR;
xmlString.remove();
}
void CXMLGenerator::writeMonthNames(UnicodeString& xmlString){
int32_t lNum;
const UnicodeString* longMonths=
mRBundle.getStringArray("MonthNames",lNum,mError);
if(mError!=U_USING_DEFAULT_ERROR && mError!=U_ERROR_INFO_START && mError !=
U_MISSING_RESOURCE_ERROR){
xmlString.append(formatString(UnicodeString(MONTH_NAMES),indentOffset));
indentOffset.append("\t");
for(int i=0;i<lNum;i++){
char c;
itoa(i+1,&c,10);
Formattable args[]={indentOffset,UnicodeString(&c),longMonths[i]};
xmlString.append(formatString(UnicodeString(MONTH),args,3));
}
chopIndent();
xmlString.append(formatString(UnicodeString(END_MONTH_NAMES),indentOffset));
mError=U_ZERO_ERROR;
return;
}
xmlString.remove();
mError= U_ZERO_ERROR;
}
```
### C
```c
void msgSample1(){
    UChar *result = NULL, *tzID, *str;
    UChar pattern[100];
    int32_t resultLengthOut, resultlength;
    UCalendar *cal;
    UDate d1;
    UErrorCode status = U_ZERO_ERROR;
    str=(UChar*)malloc(sizeof(UChar) * (strlen("disturbance in force") +1));
    u_uastrcpy(str, "disturbance in force");
    tzID=(UChar*)malloc(sizeof(UChar) * 4);
    u_uastrcpy(tzID, "PST");
    cal=ucal_open(tzID, u_strlen(tzID), "en_US", UCAL_TRADITIONAL, &status);
    ucal_setDateTime(cal, 1999, UCAL_MARCH, 18, 0, 0, 0, &status);
    d1=ucal_getMillis(cal, &status);
    u_uastrcpy(pattern, "On {0, date, long}, there was a {1} on planet {2,number,integer}");
    /* Preflight with a NULL buffer to learn the required length. */
    resultlength=0;
    resultLengthOut=u_formatMessage( "en_US", pattern, u_strlen(pattern), NULL,
                                     resultlength, &status, d1, str, 7);
    if(status==U_BUFFER_OVERFLOW_ERROR){
        status=U_ZERO_ERROR;
        resultlength=resultLengthOut+1;
        result=(UChar*)malloc(sizeof(UChar) * resultlength);
        u_formatMessage( "en_US", pattern, u_strlen(pattern), result,
                         resultlength, &status, d1, str, 7);
    }
    /* austrdup (defined below) converts UChar* to char* for printing. */
    printf("%s\n", austrdup(result) );
    free(tzID);
    free(str);
    free(result);
}
char *austrdup(const UChar* unichars)
{
int length;
char *newString;
length = u_strlen ( unichars );
newString = (char*)malloc ( sizeof( char ) * 4 * ( length + 1 ) );
if ( newString == NULL )
return NULL;
u_austrcpy ( newString, unichars );
return newString;
}
/* This is a more practical sample which retrieves data from a resource
 * bundle and feeds the data to u_formatMessage to produce a formatted
 * string.
 */
void msgSample3(){
const char* key="Languages";
int32_t numItems;
/* This constant string can also be in the resource bundle and retrieved
 * at the time of formatting, e.g.:
 * UResourceBundle* myResB = ures_open("myResources",currentLocale,&err);
 * const UChar* langAttrib = ures_getStringByKey(myResB,"LANG_ATTRIB",&len,&err);
 */
UChar LANG_ATTRIB[64];
UChar indentOffset[8];
UChar *result = NULL;
UResourceBundle* pResB,*pDeltaResB=NULL;
UErrorCode err=U_ZERO_ERROR;
u_uastrcpy(LANG_ATTRIB, "{0}<language id=\"{1}\" >{2}</language>\n");
u_uastrcpy(indentOffset, "\t\t\t");
pResB = ures_open("","en",&err);
if(U_FAILURE(err)) {
return;
}
pDeltaResB = ures_getByKey(pResB, key, pDeltaResB, &err);
if(U_SUCCESS(err)) {
const UChar *value = 0;
const char *key = 0;
int32_t len = 0;
int16_t indexR = -1;
int32_t resultLength=0,resultLengthOut=0;
numItems = ures_getSize(pDeltaResB);
for(;numItems-->0;){
value = ures_getNextString(pDeltaResB, &len, &key, &err); /* advances the iterator and returns the item's key */
resultLength=0;
resultLengthOut=u_formatMessage( "en_US", LANG_ATTRIB,
u_strlen(LANG_ATTRIB),
NULL, resultLength, &err,
indentOffset, value, key);
if(err==U_BUFFER_OVERFLOW_ERROR){
err=U_ZERO_ERROR;
resultLength=resultLengthOut+1;
result=(UChar*)realloc(result, sizeof(UChar) * resultLength);
u_formatMessage("en_US",LANG_ATTRIB,u_strlen(LANG_ATTRIB),
result,resultLength,&err,indentOffset,
value,key);
printf("%s\n", austrdup(result) );
}
}
return;
}
err=U_ZERO_ERROR;
}
```
### Java
```java
import com.ibm.icu.text.*;
import java.util.Date;
import java.text.FieldPosition;
public class TestMessageFormat{
public void runTest() {
String format = "At {1,time,::jmm} on {1,date,::dMMMM}, there was {2} on planet {3,number,integer}.";
MessageFormat mf = new MessageFormat(format);
Object objectsToFormat[] = { new Date(System.currentTimeMillis()), new Date(System.currentTimeMillis()), "a Disturbance in the Force", new Integer(5)};
FieldPosition fp = new FieldPosition(1);
StringBuffer sb = new StringBuffer();
try{
sb = mf.format(objectsToFormat, sb, fp);
System.out.println(sb.toString());
}catch(IllegalArgumentException e){
System.out.println("Exception during formatting of type :" +e);
}
}
public static void main(String args[]){
try{
new TestMessageFormat().runTest();
}catch(Exception e){
System.out.println("Exception of type: "+e);
}
}
}
```
## ChoiceFormat Class
**Important:** The following documentation is outdated. *ChoiceFormat is
probably not what you need. Please use MessageFormat with plural arguments for
proper plural selection, and select arguments for simple selection among a fixed
set of choices!*
ICU's ChoiceFormat class provides more flexibility than the printf() and scanf()
style functions for formatting UI strings. This interface can be useful if you
would like a message to change according to the number of items you are
displaying. Note: Some Asian languages do not have plural words or phrases.
### C++
```cpp
double filelimits[] = {0, 1, 2};
UnicodeString filepart[] = {"are no files", "is one file", "are {2} files"};
ChoiceFormat fileform(filelimits, filepart, 3);
UErrorCode err = U_ZERO_ERROR;
MessageFormat pattform("There {0} on {1}", err);
// Attach the ChoiceFormat to argument {0}; {1} keeps its default (string) format.
pattform.adoptFormat(0, new ChoiceFormat(fileform));
Formattable testArgs[] = {(int32_t)0, UnicodeString("ADisk"), (int32_t)0};
for (int32_t i = 0; i < 4; ++i) {
    testArgs[0] = Formattable((int32_t)i);
    testArgs[2] = testArgs[0];
    FieldPosition fpos(FieldPosition::DONT_CARE);
    UnicodeString result;
    pattform.format(testArgs, 3, result, fpos, err);
}
```
### C
```c
void msgSample2(){
UChar* str;
UErrorCode status = U_ZERO_ERROR;
UChar *result;
UChar pattern[100];
int32_t resultlength,resultLengthOut, i;
double testArgs[3]= { 100.0, 1.0, 0.0};
str=(UChar*)malloc(sizeof(UChar) * 10);
u_uastrcpy(str, "MyDisk");
u_uastrcpy(pattern, "The disk {1} contains {0,choice,0#no files|1#one file|1<{0,number,integer} files}");
for(i=0; i<3; i++){
resultlength=0;
resultLengthOut=u_formatMessage( "en_US", pattern, u_strlen(pattern),
NULL, resultlength, &status, testArgs[i], str);
if(status==U_BUFFER_OVERFLOW_ERROR){
status=U_ZERO_ERROR;
resultlength=resultLengthOut+1;
result=(UChar*)malloc(sizeof(UChar) * resultlength);
u_formatMessage( "en_US", pattern, u_strlen(pattern), result,
resultlength, &status, testArgs[i], str);
}
}
printf("%s\n", austrdup(result)); /* austrdup: converts UChar* to char* */
free(result);
}
```
### Java
```java
import java.text.ChoiceFormat;
import com.ibm.icu.text.*;
import java.text.Format;
public class TestChoiceFormat{
public void run(){
double[] filelimits = {0,1,2};
String[] filepart = {"are no files","is one file","are {2} files"};
ChoiceFormat fileform = new ChoiceFormat(filelimits,filepart);
Format[] testFormats = {fileform,null,NumberFormat.getInstance()};
MessageFormat pattform = new MessageFormat("There {0} on {1}");
pattform.setFormats(testFormats); // attach fileform to {0}; {1} keeps the default format
Object[] testArgs = {null,"ADisk",null};
for(int i=0;i<4;++i) {
testArgs[0] = new Integer(i);
testArgs[2] = testArgs[0];
System.out.println(pattform.format(testArgs));
}
}
public static void main(String args[]){
new TestChoiceFormat().run();
}
}
```
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Formatting Messages
## Overview
Messages are user-visible strings, often with variable elements like names,
numbers and dates. Message strings are typically translated into the different
languages of a UI, and translators move around the variable elements according
to the grammar of the target language.
For this to work in many languages, a message has to be written and translated
as a single unit, typically a string with placeholder syntax for the variable
elements. If the user-visible string were concatenated directly from fragments
and formatted elements, then translators would not be able to rearrange the
pieces, and they would have a hard time translating each of the string
fragments.
## MessageFormat
The ICU **MessageFormat** class uses message "pattern" strings with
variable-element placeholders (called "arguments" in the API docs) enclosed in
{curly braces}. The argument syntax can include formatting details, otherwise a
default format is used. For details about the pattern syntax and the formatting
behavior see the MessageFormat API docs
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessageFormat.html),
[C++](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classMessageFormat.html#_details),
[C](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/umsg_8h.html#_details)).
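As a quick illustration, here is a minimal ICU4J sketch (the pattern and values are invented for this example):

```java
import com.ibm.icu.text.MessageFormat;
import java.util.Locale;

public class BasicMessage {
    public static void main(String[] args) {
        // {0} uses the default string format; {1} and {2} add an argType
        // and argStyle inside the braces.
        MessageFormat mf = new MessageFormat(
                "{0} scored {1,number,percent} on {2,number,integer} tests.",
                Locale.US);
        System.out.println(mf.format(new Object[] { "Ada", 0.95, 12 }));
        // → Ada scored 95% on 12 tests.
    }
}
```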
### Complex Argument Types
Certain types of arguments select among several choices which are nested
MessageFormat pattern strings. Keeping these choices together in one message
pattern string facilitates translation in context, by one single translator.
(Commercial translation systems often distribute different messages to different
translators.)
* Use a "plural" argument to select sub-messages based on a numeric value,
together with the plural rules for the specified language.
* Use a "select" argument to select sub-messages via a fixed set of keywords.
* Use of the old "choice" argument type is discouraged. It cannot handle
plural rules for many languages, and is clumsy for simple selection.
It is tempting to cover only a minimal part of a message string with a complex
argument (e.g., plural). However, this is difficult for translators for two
reasons: 1. They might have trouble understanding how the sentence fragments in
the argument sub-messages interact with the rest of the sentence, and 2. They
will not know whether and how they can shrink or grow the extent of the part of
the sentence that is inside the argument to make the whole message work for
their language.
**Recommendation:** If possible, use complex arguments as the outermost
structure of a message, and write **full sentences** in their sub-messages. If
you have nested select and plural arguments, place the **select** arguments
(with their fixed sets of choices) on the **outside** and nest the plural
arguments (hopefully at most one) inside.
For example:
"{gender_of_host, **select**, "
  "**female** {"
    "{num_guests, **plural**, offset:1 "
      "=0 {{host} does not give a party.}"
      "=1 {{host} invites {guest} to **her** party.}"
      "=2 {{host} invites {guest} and one other person to her party.}"
      "other {{host} invites {guest} and # other people to her party.}}}"
  "**male** {"
    "{num_guests, **plural**, offset:1 "
      "=0 {{host} does not give a party.}"
      "=1 {{host} invites {guest} to **his** party.}"
      "=2 {{host} invites {guest} and one other person to his party.}"
      "other {{host} invites {guest} and # other people to his party.}}}"
  "**other** {"
    "{num_guests, **plural**, offset:1 "
      "=0 {{host} does not give a party.}"
      "=1 {{host} invites {guest} to **their** party.}"
      "=2 {{host} invites {guest} and one other person to their party.}"
      "other {{host} invites {guest} and # other people to their party.}}}}"
**Note:** In a plural argument like in the example above, if the English message
has both `=0` and `=1` (up to `=offset`+1) then it does not need a "`one`"
variant because that would never be selected. It does always need an "`other`"
variant.
**Note:** *The translation system and the translator together need to add
["`one`", "`few`" etc. if and as necessary per target
language](http://cldr.unicode.org/index/cldr-spec/plural-rules).*
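A runnable ICU4J sketch of the party example above, condensed to two variants and formatted with named arguments in a Map:

```java
import com.ibm.icu.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PartyMessage {
    public static void main(String[] args) {
        // Two-variant condensation of the party pattern above.
        String pattern =
            "{gender_of_host, select, "
          +   "female {{num_guests, plural, offset:1 "
          +     "=0 {{host} does not give a party.}"
          +     "=1 {{host} invites {guest} to her party.}"
          +     "other {{host} invites {guest} and # other people to her party.}}} "
          +   "other {{num_guests, plural, offset:1 "
          +     "=0 {{host} does not give a party.}"
          +     "=1 {{host} invites {guest} to their party.}"
          +     "other {{host} invites {guest} and # other people to their party.}}}}";
        MessageFormat mf = new MessageFormat(pattern, Locale.US);
        Map<String, Object> msgArgs = new HashMap<>();
        msgArgs.put("gender_of_host", "female");
        msgArgs.put("num_guests", 3);   // offset:1 makes '#' print 2
        msgArgs.put("host", "Leia");
        msgArgs.put("guest", "Han");
        System.out.println(mf.format(msgArgs));
        // → Leia invites Han and 2 other people to her party.
    }
}
```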
### Quoting/Escaping
If syntax characters occur in the text portions, then they need to be quoted by
enclosing the syntax in pairs of ASCII apostrophes. A pair of ASCII apostrophes
always represents one ASCII apostrophe, similar to %% in printf representing one
%, although this rule still applies inside quoted text. ("This '{isn''t}'
obvious" → "This {isn't} obvious")
* Before ICU 4.8, ASCII apostrophes always started quoted text and had
inconsistent behavior in nested sub-messages, which was a source of problems
with authoring and translating message pattern strings.
* Starting with ICU 4.8, an ASCII apostrophe only starts quoted text if it
immediately precedes a character that requires quoting (that is, "only where
needed"), and works the same in nested messages as on the top level of the
pattern. The new behavior is otherwise compatible; for details see the
MessageFormat and MessagePattern (new in ICU 4.8) API docs.
* Recommendation: Use the real apostrophe (single quote) character (U+2019)
for human-readable text, and use the ASCII apostrophe ' (U+0027) only in
program syntax, like quoting in MessageFormat. See the annotations for
U+0027 Apostrophe in The Unicode Standard.
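The example string from above can be checked directly:

```java
import com.ibm.icu.text.MessageFormat;
import java.util.Locale;

public class QuotingDemo {
    public static void main(String[] args) {
        // The '{' after the first apostrophe requires quoting, so quoted text
        // starts there; the doubled apostrophe inside is one literal apostrophe.
        MessageFormat mf = new MessageFormat("This '{isn''t}' obvious", Locale.US);
        System.out.println(mf.format(new Object[] {}));
        // → This {isn't} obvious
    }
}
```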
### Argument formatting
Arguments are formatted according to their type, using the default ICU
formatters for those types, unless otherwise specified. For unknown types the
Java `MessageFormat` will call `toString()`.
There are also several ways to control the formatting.
#### Predefined styles (recommended)
You can specify the `argStyle` to be one of the predefined values `short`, `medium`,
`long`, `full` (to get one of the standard forms for dates / times) and `integer`,
`currency`, `percent` (for number formatting).
#### Skeletons (recommended)
Numbers, dates, and times can use a skeleton in `argStyle`, prefixed with `::` to
distinguish them from patterns. These are locale-independent ways to specify the
format, and this is the recommended mechanism if the predefined styles are not
appropriate.
Date skeletons:
- **ICU4J:**
<https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/SimpleDateFormat.html>
- **ICU4C:** <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classSimpleDateFormat.html>
Number formatter skeletons:
- **ICU4J:**
<https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberFormatter.html>
- **ICU4C:** <https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1NumberFormat.html>
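For example, a number skeleton in the argStyle (this assumes an ICU version with skeleton support in MessageFormat, as described above; the value is invented):

```java
import com.ibm.icu.text.MessageFormat;
import java.util.Locale;

public class SkeletonStyle {
    public static void main(String[] args) {
        // "::" marks the argStyle as a skeleton; "compact-short" is a
        // number skeleton token.
        MessageFormat mf = new MessageFormat(
                "{0,number,::compact-short} downloads", Locale.US);
        System.out.println(mf.format(new Object[] { 1200000 }));
        // → 1.2M downloads
    }
}
```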
#### Format the parameters separately (recommended)
You can format the parameter as you need **before** calling `MessageFormat`, and
then pass the resulting string as a parameter to `MessageFormat`.
This offers maximum control, and is preferred to using custom format objects
(see below).
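One way to sketch this approach (NumberFormatter is just one possible external formatter; the unit and values are invented):

```java
import com.ibm.icu.number.NumberFormatter;
import com.ibm.icu.text.MessageFormat;
import com.ibm.icu.util.MeasureUnit;
import java.util.Locale;

public class Preformatted {
    public static void main(String[] args) {
        // Format the value first, then hand the finished string to MessageFormat.
        String size = NumberFormatter.withLocale(Locale.US)
                .unit(MeasureUnit.MEGABYTE)
                .format(1.5)
                .toString();
        MessageFormat mf = new MessageFormat("Attachment size: {0}", Locale.US);
        System.out.println(mf.format(new Object[] { size }));
        // → Attachment size: 1.5 MB
    }
}
```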
#### String patterns (discouraged)
These can be used for numbers, dates, and times, but they are locale-sensitive,
and they therefore would need to be localized by your translators, which adds
complexity to the localization, and placeholder details are often not accessible
by translators. If such a pattern is not localized, then users see confusing
formatting. Consider using skeletons instead of patterns in your message
strings.
Allowing translators to localize date patterns is error-prone, as translators
might make mistakes (resulting in invalid ICU date formatter syntax). Also, CLDR
provides curated patterns for many locales, and using your own pattern means
that you don't benefit from that CLDR data and the results will likely be
inconsistent with the rest of the patterns that ICU uses.
It is also a bad internationalization practice, because most companies only
translate into “generic” versions of the languages (French, or Spanish, or
Arabic). So the translated patterns get used in tens of countries. On the other
hand, skeletons are localized according to the MessageFormat locale, which
should include regional variants (e.g., “fr-CA”).
#### Custom Format Objects (discouraged)
The MessageFormat class allows setting custom Format objects to format
arguments, overriding the arguments' pattern specification. This is discouraged:
For custom formatting of some values it should normally suffice to format them
externally and to provide the formatted strings to the MessageFormat.format()
methods.
Only the top-level arguments are accessible and settable via setFormat(),
getFormat() etc. Arguments inside nested sub-messages, inside
choice/plural/select arguments, are "invisible" via these API methods.
Some of these methods (the ones corresponding to the original JDK MessageFormat
API) address the top-level arguments in their order of appearance in the pattern
string, which is usually not useful because it varies with translations. Newer
methods address arguments by argument number ("index") or name.
### Examples
The following code fragment created this output: "At 4:34 PM on March 23, there
was a disturbance in the Force on planet 7."
```cpp
UErrorCode err = U_ZERO_ERROR;
Formattable arguments[] = {
(int32_t)7,
Formattable(Calendar::getNow(), Formattable::kIsDate),
"a disturbance in the Force"
};
UnicodeString result;
result = MessageFormat::format(
"At {1,time,::jmm} on {1,date,::dMMMM}, there was {2} on planet {0,number,integer}.",
arguments,
3,
result,
err);
```
There are several more usage examples for the MessageFormat and ChoiceFormat
classes in [C, C++ and Java](examples.md).
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Formatting Numbers
Since ICU 60, the recommended mechanism for formatting numbers is
[NumberFormatter](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberformatter_8h.html)
([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberFormatter.html)). NumberFormatter supports the formatting of:
- Decimal Formatting
- Currencies
- Measurement Units
- Percentages
- Scientific Notation
- Compact Notation
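For instance, a brief ICU4J sketch of the fluent API (currency and values are arbitrary):

```java
import com.ibm.icu.number.NumberFormatter;
import com.ibm.icu.number.Precision;
import com.ibm.icu.util.Currency;
import java.util.Locale;

public class FluentNumbers {
    public static void main(String[] args) {
        // Currency formatting with exactly two fraction digits.
        String s = NumberFormatter.withLocale(Locale.US)
                .unit(Currency.getInstance("USD"))
                .precision(Precision.fixedFraction(2))
                .format(1234.5)
                .toString();
        System.out.println(s);
        // → $1,234.50
    }
}
```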
For number ranges, including currency and measurement unit ranges, see [NumberRangeFormatter](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/numberrangeformatter_8h.html) ([Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/number/NumberRangeFormatter.html)).
For rule-based number formatting, including spellout rules and support for traditional numbering systems not covered by base-10 decimal digits, see [rbnf.md](rbnf.md).
For the classic NumberFormat class, which also includes legacy parsing support for localized number strings, see [legacy-numberformat.md](legacy-numberformat.md).
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Legacy NumberFormat
Since ICU 60, the recommended way to format numbers is NumberFormatter; see [index.md](index.md). This page is here for reference for the older NumberFormat hierarchy in ICU4C and ICU4J.
## NumberFormat
[NumberFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNumberFormat.html) is
the abstract base class for all number formats. It provides an interface for
formatting and parsing numbers. It also provides methods to determine which
locales have number formats, and what their names are. NumberFormat helps format
and parse numbers for any locale. Your program can be written to be completely
independent of the locale conventions for decimal points or
thousands-separators. It can also be written to be independent of the particular
decimal digits used or whether the number format is a decimal. A normal decimal
number can also be displayed as a currency or as a percentage.
```
1234.5 //Decimal number
$1234.50 //U.S. currency
1.234,57€ //German currency
123457% //Percent
```
### Usage
#### Formatting for a Locale
To format a number for the current Locale, use one of the static factory methods
to create a format, then call a format method to format it. To format a number
for a different Locale, specify the Locale in the call to createInstance(). You
can control the numbering system to be used for number formatting by creating a
Locale that uses the @numbers keyword defined. For example, by default, the Thai
locale "th" uses the western digits 0-9. To create a number format that uses the
native Thai digits instead, first create a locale with "@numbers=thai" defined.
See [the description on Locales](../../locale/index.md) for details.
> :point_right: **Note**: If you are formatting multiple numbers, save processing time by constructing the formatter once and then using it several times.
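A short ICU4J sketch of the `@numbers=thai` example from the paragraph above:

```java
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.util.ULocale;

public class ThaiDigits {
    public static void main(String[] args) {
        // "@numbers=thai" selects native Thai digits, per the text above.
        NumberFormat nf = NumberFormat.getInstance(new ULocale("th@numbers=thai"));
        System.out.println(nf.format(1234.5));
        // → ๑,๒๓๔.๕
    }
}
```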
#### Instantiating a NumberFormat
The following methods are used for instantiating NumberFormat objects:
1. **createInstance()**
Returns the normal number format for the current locale or for a specified
locale.
2. **createCurrencyInstance()**
Returns the currency format for the current locale or for a specified
locale.
3. **createPercentInstance()**
Returns the percentage format for the current locale or for a specified
locale.
4. **createScientificInstance()**
Returns the scientific number format for the current locale or for a
specified locale.
To create a format for spelled-out numbers, use a constructor on
[RuleBasedNumberFormat](rbnf.md).
#### Currency Formatting
Currency formatting, i.e., the formatting of monetary values, combines a number
with a suitable display symbol or name for a currency. By default, the currency
is set from the locale data from when the currency format instance is created,
based on the country code in the locale ID. However, for all but trivial uses,
this is fragile because countries change currencies over time, and the locale
data for a particular country may not be available.
For proper currency formatting, both the number and the currency must be
specified. Aside from achieving reliably correct results, this also allows to
format monetary values in any currency with the format of any locale, like in
exchange rate lists. If the locale data does not contain display symbols or
names for a currency, then the 3-letter ISO code itself is displayed.
The locale ID and the currency code are effectively independent: The locale ID
defines the general format for the numbers, and whether the currency symbol or
name is displayed before or after the number, while the currency code selects
the actual currency with its symbol, name, number of digits, and [rounding
mode](rounding-modes.md).
In ICU and Java, the currency is specified in the form of a 3-letter ISO 4217
code. For example, the code "USD" represents the US Dollar and "EUR" represents
the Euro currency.
In terms of APIs, the currency code is set as an attribute on a number format
object (on a currency instance), while the number value is passed into each
format() call or returned from parse() as usual.
1. ICU4C (C++) NumberFormat.setCurrency() takes a Unicode string (const UChar
\*) with the 3-letter code.
2. ICU4C (C API) allows to set the currency code via unum_setTextAttribute()
using the UNUM_CURRENCY_CODE selector.
3. ICU4J NumberFormat.setCurrency() takes an ICU Currency object which
encapsulates the 3-letter code.
4. The base JDK's NumberFormat.setCurrency() takes a JDK Currency object which
encapsulates the 3-letter code.
The functionality of Currency and setCurrency() is more advanced in ICU than in
the base JDK. When using ICU, setting the currency automatically adjusts the
number format object appropriately, i.e., it sets not only the currency symbol
and display name, but also the correct number of fraction digits and the correct
[rounding mode](rounding-modes.md). This is not the case with the base JDK. See
the API references for more details.
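A brief ICU4J sketch of item 3 above (the amount and currency are arbitrary):

```java
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.util.Currency;
import java.util.Locale;

public class SetCurrencyDemo {
    public static void main(String[] args) {
        // German number conventions, but Japanese Yen as the currency.
        NumberFormat fmt = NumberFormat.getCurrencyInstance(Locale.GERMANY);
        fmt.setCurrency(Currency.getInstance("JPY"));
        // setCurrency() also switches to JPY's zero fraction digits.
        System.out.println(fmt.format(1234.56));
        // → e.g. "1.235 ¥" (exact symbol and spacing depend on the CLDR data)
    }
}
```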
There is ICU4C sample code at
[icu4c/source/samples/numfmt/main.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/numfmt/main.cpp)
which illustrates the use of NumberFormat.setCurrency().
#### Displaying Numbers
You can also control the display of numbers with methods such as
getMinimumFractionDigits. If you want even more control over the format or
parsing, or want to give your users more control, cast the NumberFormat returned
from the factory methods to a DecimalFormat. This works for the vast
majority of countries.
#### Working with Positions
You can also use forms of the parse and format methods with ParsePosition and
UFieldPosition to enable you to:
1. progressively parse through pieces of a string.
2. align the decimal point and other areas.
For example, you can align numbers in two ways:
1. If you are using a mono-spaced font with spacing for alignment, pass the
FieldPosition in your format call with field = INTEGER_FIELD. On output,
getEndIndex is set to the offset between the last character of the integer
and the decimal. Add (desiredSpaceCount - getEndIndex) spaces at the front
of the string. You can also use the space padding feature available in
DecimalFormat.
2. If you are using proportional fonts, instead of padding with spaces, measure
the width of the string in pixels from the start to getEndIndex. Then move
the pen by (desiredPixelWidth - widthToAlignmentPoint) before drawing the
text. It also works where there is no decimal, but additional characters at
the end (that is, with parentheses in negative numbers: "(12)" for -12).
#### Emulating printf
NumberFormat can produce many of the same formats as printf.
| printf | ICU |
|--------|-----|
| Width specifier, e.g., "%5d" has a width of 5. | Use DecimalFormat. Either specify the padding, which can pad with any character, or specify a minimum integer count and a minimum fraction count, which will emit a specific number of digits, with zeros padded to the left and right. |
| Precision specifier for %f and %e, e.g. "%.6f" or "%.6e". This defines the number of digits to the right of the decimal point. | Use DecimalFormat. Specify the maximum fraction digits. |
| General scientific notation, %g. This format uses either %f or %e, depending on the magnitude of the number being displayed. | Use ChoiceFormat with DecimalFormat. For example, for a typical %g, which has 6 significant digits, use a ChoiceFormat with thresholds of 1e-4 and 1e6. For values between the two thresholds, use a fixed DecimalFormat with the pattern "@#####". For values outside the thresholds, use a DecimalFormat with the pattern "@#####E0". |
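A sketch of the %g emulation row (the thresholds and patterns come from the table; the helper name is invented):

```java
import com.ibm.icu.text.DecimalFormat;

public class PrintfG {
    // Rough emulation of printf "%g" with 6 significant digits,
    // using the thresholds from the table above.
    static String formatG(double x) {
        double ax = Math.abs(x);
        DecimalFormat fixed = new DecimalFormat("@#####");
        DecimalFormat sci = new DecimalFormat("@#####E0");
        return (ax >= 1e-4 && ax < 1e6 ? fixed : sci).format(x);
    }

    public static void main(String[] args) {
        System.out.println(formatG(1234.5));      // fixed notation
        System.out.println(formatG(12345678.0));  // scientific notation
    }
}
```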
## DecimalFormat
DecimalFormat is a NumberFormat that converts numbers into strings using the
decimal numbering system. This is the formatter that provides standard number
formatting and parsing services for most usage scenarios in most locales. In
order to access features of DecimalFormat not exposed in the NumberFormat API,
you may need to cast your NumberFormat object to a DecimalFormat. You may also
construct a DecimalFormat directly, but this is not recommended because it can
hinder proper localization.
For a complete description of DecimalFormat, including the pattern syntax,
formatting and parsing behavior, and available API, see the [ICU4J DecimalFormat
API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/DecimalFormat.html) or
[ICU4C DecimalFormat
API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormat.html) documentation.
## DecimalFormatSymbols
[DecimalFormatSymbols](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classDecimalFormatSymbols.html)
specifies the exact characters a DecimalFormat uses for various parts of a
number (such as the characters to use for the digits, the character to use as
the decimal point, or the character to use as the minus sign).
This class represents the set of symbols needed by DecimalFormat to format
numbers. DecimalFormat creates its own instance of DecimalFormatSymbols from its
locale data. The DecimalFormatSymbols can be adopted by a DecimalFormat
instance, or it can be specified when a DecimalFormat is created. If you need to
change any of these symbols, you can get the DecimalFormatSymbols object from your
DecimalFormat and then modify it.
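A short sketch of that flow (the apostrophe grouping separator is an arbitrary choice for the example):

```java
import com.ibm.icu.text.DecimalFormat;
import com.ibm.icu.text.DecimalFormatSymbols;
import com.ibm.icu.text.NumberFormat;
import java.util.Locale;

public class CustomSymbols {
    public static void main(String[] args) {
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        if (nf instanceof DecimalFormat) {
            DecimalFormat df = (DecimalFormat) nf;
            // Copy the symbols, change one, and set them back.
            DecimalFormatSymbols syms = df.getDecimalFormatSymbols();
            syms.setGroupingSeparator('\'');
            df.setDecimalFormatSymbols(syms);
            System.out.println(df.format(1234567.89));
            // → 1'234'567.89
        }
    }
}
```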
## Additional Sample Code
C/C++: See
[icu4c/source/samples/numfmt/](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/numfmt/)
in the ICU source distribution for code samples showing the use of ICU number
formatting.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# RuleBasedNumberFormat Examples
## Annotated RuleBasedNumberFormat Example
The following example provides a quick idea of how the rules work. The
[RuleBasedNumberFormat API
documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
describes the rule syntax in more detail.
This ruleset formats a number using standard decimal place-value notation, but
using words instead of digits, e.g. 123.4 formats as 'one two three point four':
```
"-x: minus >>;\n"
+ "x.x: << point >>;\n"
+ "zero; one; two; three; four; five; six;\n"
+ " seven; eight; nine;\n"
+ "10: << >>;\n"
+ "100: << >>>;\n"
+ "1000: <<, >>>;\n"
+ "1,000,000: <<, >>>;\n"
+ "1,000,000,000: <<, >>>;\n"
+ "1,000,000,000,000: <<, >>>;\n"
+ "1,000,000,000,000,000: =#,##0=;\n";
```
In this example, the rules consist of one (unnamed) ruleset. It lists nineteen
rules, each terminated by a semicolon. It starts with two special rules for
handling negative numbers and non-integers. (This is true of most rulesets.)
Following are rules for increasing integer ranges, up to 10e15. The portion of
the rule before a colon, if any, provides information about the range and some
additional information about how to apply the rule. Most rule bodies (following
the colon) consist of recursion instructions and/or plain text substitutions.
The rules in this example work as follows:
1. **-x: minus >>;**
If the number is negative, output the string 'minus ' and recurse using the
absolute value.
2. **x.x: << point >>;**
If the number is not an integer, recurse using the integral part, emit the
string ' point ', and process the ruleset in 'fractional mode' for the
fractional part. Generally, this emits single digits.
3. **zero; one; ... nine;**
Each of these ten rules applies to a range. By default, the first range
starts at zero, and succeeding ranges start at the previous start + 1. These
ranges all default, so each of these ten rules has a 'range' of a single
integer, 0 to 9. When the current value is in one of these ranges, the rules
emit the corresponding text (e.g. 'one', 'two', and so on).
4. **10: << >>;**
This starts a new range at 10 (not default) and sets the limit of the range
for the previous rule. Divide the number by the divisor (which defaults to
the highest power of 10 lower or equal to range start value, e.g. 10),
recurse using the integral part, emit the string ' ' (space), then recurse
using the remainder.
5. **100: << >>>;**
This starts a new range at 100 (again, limiting the previous rule's range).
It is similar to the previous rule, except for the use of '>>>'. '>>' means
to recurse by matching the value against all the ranges to find the rule,
'>>>' means to recurse using the previous rule. We must force the previous
rule in order to get the rule for 'ten' invoked in order to emit '0' when
processing numbers like 105.
6. **1000: <<, >>>; 1,000,000: ...**
These start new ranges at intervals of 1000. They are all similar to the
rule for 100 except they output ', ' (comma space) to delimit thousands.
Note that the range value can include commas for readability.
7. **1,000... =#,##0=;**
This last rule in the ruleset applies to all values at or over 10e15. The
pattern '==' means to use the current unmodified value, and text within
the pattern (this works for '<<' and similar patterns as well) describes the
ruleset or decimal format to use. If this text starts with '0' or '#', it is
presumed to be a decimal format pattern. So this rule means to format the
unmodified number using a decimal format constructed with the pattern
'#,##0'.
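The walkthrough above can be exercised directly; this sketch feeds the rule string from the top of the page to a RuleBasedNumberFormat (en_US is assumed for the decimal-format symbols):

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import java.util.Locale;

public class DigitSpellout {
    public static void main(String[] args) {
        // The ruleset from the example above, verbatim.
        String rules =
            "-x: minus >>;\n"
          + "x.x: << point >>;\n"
          + "zero; one; two; three; four; five; six;\n"
          + " seven; eight; nine;\n"
          + "10: << >>;\n"
          + "100: << >>>;\n"
          + "1000: <<, >>>;\n"
          + "1,000,000: <<, >>>;\n"
          + "1,000,000,000: <<, >>>;\n"
          + "1,000,000,000,000: <<, >>>;\n"
          + "1,000,000,000,000,000: =#,##0=;\n";
        RuleBasedNumberFormat fmt = new RuleBasedNumberFormat(rules, Locale.US);
        System.out.println(fmt.format(123.4));
        // → one two three point four
    }
}
```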
Rulesets are invoked by first applying negative and fractional rules, then by
finding the rule whose range includes the current value and applying that rule,
recursing as directed by the rule. Again, a complete description of the rule
syntax can be found in the [API
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
.
More rule examples can be found in the RuleBasedNumberFormat [demo
source](https://github.com/unicode-org/icu/blob/master/icu4j/demos/src/com/ibm/icu/dev/demo/rbnf/RbnfSampleRuleSets.java)
.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# RuleBasedNumberFormat
[RuleBasedNumberFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html)
can format and parse numbers in spelled-out format, e.g. "one hundred and
thirty-four". For example:
```
"one hundred and thirty-four" // 134 using en_US spellout
"one hundred and thirty-fourth" // 134 using en_US ordinal
"hundertvierunddreissig" // 134 using de_DE spellout
"MCMLVIII" // custom, 1958 in roman numerals
```
RuleBasedNumberFormat is based on rules describing how to format a number. The
rule syntax is designed primarily for formatting and parsing numbers as
spelled-out text, though other kinds of formatting are possible. As a
convenience, custom API is provided to allow selection from three predefined
rule definitions, when available: SPELLOUT, ORDINAL, and DURATION. Users can
request formatters either by providing a locale and one of these predefined rule
selectors, or by specifying the rule definitions directly.
> :point_right: **Note**: ICU provides number spellout rules for several locales, but not for all of the
locales that ICU supports, and not all of the predefined rule types. Also, as of
release 2.6, some of the provided rules are known to be incomplete.
## Instantiation
Unlike the other standard number formats, there is no corresponding factory
method on NumberFormat. Instead, RuleBasedNumberFormat objects are instantiated
via constructors. Constructors come in two flavors, ones that take rule text,
and ones that take one of the predefined selectors. Constructors that do not
take a Locale parameter use the current default locale.
The following constructors are available:
1. **RuleBasedNumberFormat(int)**
Returns a format using predefined rules of the selected type from the
current locale.
2. **RuleBasedNumberFormat(Locale, int)**
As above, but specifies locale.
3. **RuleBasedNumberFormat(String)**
Returns a format using the provided rules, and symbols (if required) from
the current locale.
4. **RuleBasedNumberFormat(String, Locale)**
As above, but specifies locale.
## Usage
RuleBasedNumberFormat can be used like other NumberFormats. For example, in
Java:
```java
double num = 2718.28;
NumberFormat formatter =
new RuleBasedNumberFormat(RuleBasedNumberFormat.SPELLOUT);
String result = formatter.format(num);
System.out.println(result);
// output (in en_US locale):
// two thousand seven hundred and eighteen point two eight
```
## Rule Sets
Rule descriptions can provide multiple named rule sets. For example, the rules
for en_US spellout provide a '%simplified' rule set that displays text without
commas or the word 'and'. Rule sets can be queried and set on a
RuleBasedNumberFormat. This lets you customize a RuleBasedNumberFormat for use
through its inherited NumberFormat API. For example, in Java:
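The rule set names defined for a locale vary by locale and ICU version (the '%simplified' name mentioned above may not be present in current data), so this sketch queries the names at run time rather than hardcoding one:

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import java.util.Locale;

public class RuleSetQuery {
    public static void main(String[] args) {
        RuleBasedNumberFormat formatter =
                new RuleBasedNumberFormat(Locale.US, RuleBasedNumberFormat.SPELLOUT);
        // Discover which rule sets this locale's spellout rules define.
        String[] names = formatter.getRuleSetNames();
        for (String name : names) {
            System.out.println(name);
        }
        // Make one of them the default used by the inherited NumberFormat API.
        formatter.setDefaultRuleSet(names[0]);
        System.out.println(formatter.format(25));
    }
}
```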
You can also format a number specifying the ruleset directly, using an
additional overload of format provided by RuleBasedNumberFormat. For example, in
Java:
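A sketch of the direct overload; as above, the rule set name is queried rather than assumed:

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import java.util.Locale;

public class RuleSetFormat {
    public static void main(String[] args) {
        RuleBasedNumberFormat formatter =
                new RuleBasedNumberFormat(Locale.US, RuleBasedNumberFormat.SPELLOUT);
        // format(double, String) selects the rule set for this call only;
        // the formatter's default rule set is left unchanged.
        String name = formatter.getRuleSetNames()[0];
        System.out.println(formatter.format(25, name));
    }
}
```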
> :point_right: **Note**: There is no standardization of rule set names, so you must either query the
names, as in the first example above, or know the names that are defined in the
rules for that formatter.
## Rules
The following example provides a quick look at the RuleBasedNumberFormat rule
syntax.
These rules format a number using standard decimal place-value notation, but
using words instead of digits, e.g. 123.4 formats as 'one two three point four':
```
"-x: minus >>;\n"
+ "x.x: << point >>;\n"
+ "zero; one; two; three; four; five; six;\n"
+ " seven; eight; nine;\n"
+ "10: << >>;\n"
+ "100: << >>>;\n"
+ "1000: <<, >>>;\n"
+ "1,000,000: <<, >>>;\n"
+ "1,000,000,000: <<, >>>;\n"
+ "1,000,000,000,000: <<, >>>;\n"
+ "1,000,000,000,000,000: =#,##0=;\n";
```
Rulesets are invoked by first applying negative and fractional rules, and then
using a recursive process. It starts by finding the rule whose range includes
the current value and applying that rule. If the rule so directs, it emits text,
including text obtained by recursing on new values as directed by the rule. As
you can see, the rules are designed to accommodate recursive processing of
numbers, and so are best suited for formatting numbers in ways that are
inherently recursive.
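Putting the rules to work, here is a minimal sketch that builds a formatter from the rule string above and formats 123.4 (the expected output is the one stated in the text; the class name is illustrative):

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import java.util.Locale;

public class DigitSpelloutDemo {
    public static void main(String[] args) {
        // The rule string from the example above.
        String rules =
                "-x: minus >>;\n"
                + "x.x: << point >>;\n"
                + "zero; one; two; three; four; five; six;\n"
                + "    seven; eight; nine;\n"
                + "10: << >>;\n"
                + "100: << >>>;\n"
                + "1000: <<, >>>;\n"
                + "1,000,000: <<, >>>;\n"
                + "1,000,000,000: <<, >>>;\n"
                + "1,000,000,000,000: <<, >>>;\n"
                + "1,000,000,000,000,000: =#,##0=;\n";
        RuleBasedNumberFormat formatter = new RuleBasedNumberFormat(rules, Locale.US);
        System.out.println(formatter.format(123.4)); // "one two three point four"
    }
}
```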
A full explanation of this example can be found in the [RuleBasedNumberFormat
examples](rbnf-examples.md). A complete description of the rule syntax can be
found in the [RuleBasedNumberFormat API
Documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRuleBasedNumberFormat.html).

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Rounding Modes
The following rounding modes are used with ICU's Decimal Formatter. Note that
ICU's use of the terms "Down" and "Up" here is somewhat at odds with other
definitions, but is equivalent to the same modes used in Java's JDK.
## Comparison of Rounding Modes
This chart shows the values -2.0 through 2.0 in increments of 0.1, and shows the
resulting ICU format when formatted with no decimal digits.
| # | CEILING | FLOOR | DOWN | UP | HALFEVEN | HALFDOWN | HALFUP | # |
|------|---------|-------|------|----|----------|----------|--------|------|
| -2.0 | -2 | -2 | -2 | -2 | -2 | -2 | -2 | -2.0 |
| -1.9 | -1 | -2 | -1 | -2 | -2 | -2 | -2 | -1.9 |
| -1.8 | -1 | -2 | -1 | -2 | -2 | -2 | -2 | -1.8 |
| -1.7 | -1 | -2 | -1 | -2 | -2 | -2 | -2 | -1.7 |
| -1.6 | -1 | -2 | -1 | -2 | -2 | -2 | -2 | -1.6 |
| -1.5 | -1 | -2 | -1 | -2 | -2 | -1 | -2 | -1.5 |
| -1.4 | -1 | -2 | -1 | -2 | -1 | -1 | -1 | -1.4 |
| -1.3 | -1 | -2 | -1 | -2 | -1 | -1 | -1 | -1.3 |
| -1.2 | -1 | -2 | -1 | -2 | -1 | -1 | -1 | -1.2 |
| -1.1 | -1 | -2 | -1 | -2 | -1 | -1 | -1 | -1.1 |
| -1.0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1.0 |
| -0.9 | -0 | -1 | -0 | -1 | -1 | -1 | -1 | -0.9 |
| -0.8 | -0 | -1 | -0 | -1 | -1 | -1 | -1 | -0.8 |
| -0.7 | -0 | -1 | -0 | -1 | -1 | -1 | -1 | -0.7 |
| -0.6 | -0 | -1 | -0 | -1 | -1 | -1 | -1 | -0.6 |
| -0.5 | -0 | -1 | -0 | -1 | -0 | -0 | -1 | -0.5 |
| -0.4 | -0 | -1 | -0 | -1 | -0 | -0 | -0 | -0.4 |
| -0.3 | -0 | -1 | -0 | -1 | -0 | -0 | -0 | -0.3 |
| -0.2 | -0 | -1 | -0 | -1 | -0 | -0 | -0 | -0.2 |
| -0.1 | -0 | -1 | -0 | -1 | -0 | -0 | -0 | -0.1 |
| 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 0.1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.1 |
| 0.2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.2 |
| 0.3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.3 |
| 0.4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.4 |
| 0.5 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0.5 |
| 0.6 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0.6 |
| 0.7 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0.7 |
| 0.8 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0.8 |
| 0.9 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0.9 |
| 1.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1.0 |
| 1.1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1.1 |
| 1.2 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1.2 |
| 1.3 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1.3 |
| 1.4 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1.4 |
| 1.5 | 2 | 1 | 1 | 2 | 2 | 1 | 2 | 1.5 |
| 1.6 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1.6 |
| 1.7 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1.7 |
| 1.8 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1.8 |
| 1.9 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1.9 |
| 2.0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2.0 |
| # | CEILING | FLOOR | DOWN | UP | HALFEVEN | HALFDOWN | HALFUP | # |
### Half Even
This is ICU's default rounding mode. Values exactly on the 0.5 (half) mark
(shown dotted in the chart) are rounded to the nearest even digit. This is often
called Banker's Rounding because it is, on average, free of bias. It is the
default mode specified for IEEE 754 floating point operations.
Also known as ties-to-even, round-to-nearest, RN or RNE.
### Half Down
Values exactly on the 0.5 (half) mark are rounded down (next smaller absolute
value, towards zero).
### Half Up
Values exactly on the 0.5 (half) mark are rounded up (next larger absolute
value, away from zero).
### Down
All values are rounded towards the next smaller absolute value (rounded towards
zero, or RZ).
Also known as: truncation, because the insignificant decimal places are simply
removed.
### Up
All values are rounded towards the next greater absolute value (away from zero).
### Ceiling
All values are rounded towards positive infinity (+∞). Also known as RI for
Rounds to Infinity.
### Floor
All values are rounded towards negative infinity (-∞). Also known as RMI for
Rounds to Minus Infinity.
### Unnecessary
The mode "Unnecessary" doesn't perform any rounding, but instead returns an
error if the value cannot be represented exactly without rounding.
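Since these modes match the JDK's, a few rows of the chart can be reproduced with `java.math.BigDecimal` alone (a standard-library sketch, independent of ICU):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingModesDemo {
    public static void main(String[] args) {
        // A few rows from the comparison chart above.
        System.out.println(new BigDecimal("1.5").setScale(0, RoundingMode.HALF_EVEN)); // 2
        System.out.println(new BigDecimal("0.5").setScale(0, RoundingMode.HALF_EVEN)); // 0
        System.out.println(new BigDecimal("1.5").setScale(0, RoundingMode.HALF_DOWN)); // 1
        System.out.println(new BigDecimal("0.5").setScale(0, RoundingMode.HALF_UP));   // 1
        System.out.println(new BigDecimal("1.9").setScale(0, RoundingMode.DOWN));      // 1
        System.out.println(new BigDecimal("1.1").setScale(0, RoundingMode.UP));        // 2
        System.out.println(new BigDecimal("0.1").setScale(0, RoundingMode.CEILING));   // 1
        System.out.println(new BigDecimal("-0.1").setScale(0, RoundingMode.FLOOR));    // -1
        // UNNECESSARY throws if rounding would change the value.
        try {
            new BigDecimal("1.5").setScale(0, RoundingMode.UNNECESSARY);
        } catch (ArithmeticException e) {
            System.out.println("UNNECESSARY: " + e.getMessage());
        }
    }
}
```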
## Other References/Comparison
* Decimal Context docs (used by ICU4C to implement rounding):
<http://speleotrove.com/decimal/decifaq1.html#rounding>
* Java 7 docs:
<http://docs.oracle.com/javase/7/docs/api/java/math/RoundingMode.html>
* IEEE 754 rounding rules:
<http://en.wikipedia.org/wiki/IEEE_754-2008#Rounding_rules>
* Wikipedia article on Rounding:
<http://en.wikipedia.org/wiki/Rounding#Tie-breaking>
* Live rounding mode chart: [Rounding Mode
Chart](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-demos/blob/master/roundmode/round.html)
and [Source
Code](https://github.com/unicode-org/icu-demos/tree/master/roundmode)

The unit width can be specified by the following stems:
- `unit-width-hidden`
For more details, see
[UNumberUnitWidth](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
### Precision
The rounding mode can be specified by the following stems:
- `rounding-mode-half-up`
- `rounding-mode-unnecessary`
For more details, see [Rounding Modes](rounding-modes.md).
### Integer Width
The decimal number should conform to a standard decimal number syntax. In
C++, it is parsed using the decimal number library described in
[LocalizedNumberFormatter::formatDecimal](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1number_1_1LocalizedNumberFormatter.html).
In Java, it is parsed using
[BigDecimal](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html#BigDecimal%28java.lang.String%29).
For maximum compatibility, it is highly recommended that your decimal number
The grouping strategy can be specified by the following stems:
- `group-thousands` (no concise equivalent)
For more details, see
[UNumberGroupingStrategy](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
### Symbols
The following stems specify sign display:
- `sign-accounting-except-zero` or `()?` (concise)
For more details, see
[UNumberSignDisplay](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).
### Decimal Separator Display
The following stems specify decimal separator display:
- `decimal-always`
For more details, see
[UNumberDecimalSeparatorDisplay](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unumberformatter_8h.html).

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Glossary
## ICU-specific Words and Acronyms
For additional Unicode terms, please see the official
[Unicode Standard Glossary](https://www.unicode.org/glossary/).
Term | Definition
------------|------------
**- A -** |
**accent** | A modifying mark on a character to indicate a change in vocal tone for pronunciation. For example, the accent marks in Latin script (acute, tilde, and ogonek) and the tone marks in Thai. Synonymous with diacritic.
**accent character** | A character that has a diacritic attached to it.
**alphabetic language** | A written language in which symbols represent vowels and consonants, and in which syllables and words are formed by a phonetic combination of symbols. Examples of alphabetic languages are English, Greek, and Russian. Contrast with ideographic language.
**Arabic numerals** | Forms of decimal numerals used in most parts of the Arabic world (for instance, U+0660, U+0661, U+0662, U+0663). Although European digits (1, 2, 3...) derive historically from these forms, they are visually distinct and are coded separately. (Arabic digits are sometimes called Indic numerals; however, this nomenclature leads to confusion with the digits currently used with the scripts of India.) Arabic digits are referred to as Arabic-Indic digits in the Unicode Standard. Variant forms of Arabic digits used chiefly in Iran and Pakistan are referred to as Eastern Arabic-Indic digits.
**Arabic script** | A cursive script used in Arabic countries. Other writing systems such as Latin and Japanese also have a cursive handwritten form, but usually are typeset or printed in discrete letter form. Arabic script has only the cursive form. Arabic script is also used for Urdu, (spoken in Pakistan, Bangladesh, and India), Farsi and Persian (spoken in Iran, Iraq, and Afghanistan).
**ASCII** | "American Standard Code for Information Interchange." A standard 7-bit character set used for information interchange. ASCII encodes the basic Latin alphabet and punctuation used in American English, but does not encode the accented characters used in many European languages.
**- B -** |
**base character** | A base character is a Unicode character that does not graphically combine with any preceding character. This does not include control or formatting characters. This is a characteristic of most Unicode characters.
**baseline** | A conceptual line with respect to which successive characters are aligned.
**Basic Multilingual Plane** | As defined by International Standard [ISO/IEC 10646](http://std.dkuug.dk/jtc1/sc2/wg2/), Unicode values `0000` through `FFFF`. This range covers all of the major living languages around the world.
**bidi** | See bidirectional.
**bidirectional** | Text which has a mixture of languages that read and write either left-to-right or right-to-left. Languages such as Arabic, Hebrew, and Yiddish have a general flow of text that proceeds horizontally from right to left, but numbers and Latin based languages like English are written from left to right.
**big-endian** | A computer architecture that stores multiple-byte numerical values with the most significant byte (MSB or big end) values first in a computer's addressable memory. This is the opposite from little-endian.
**BMP** | See Basic Multilingual Plane.
**boundary** | A boundary is a location between user characters, words, or at the start or end of a string. Boundaries break the string into logical groups of characters.
**boundary position** | Each boundary has a boundary position in a string. The boundary position is an integer that is the index of the character that follows it.
**- C -** |
**canonical decomposition** | The decomposition of a character which results from recursively applying the canonical mappings until no characters can be further decomposed and then re-ordering non-spacing marks according to the canonical behavior rules. For instance, an acute accented A will decompose into an A character followed by an acute accent combining character. Canonical mappings do not remove formatting information, which is the opposite of what happens during a compatibility decomposition.
**canonical equivalent** | Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.
**CCSID** | Coded Character Set IDentifier. A number which IBM® uses to refer to the combination of particular code page(s), character set(s), and other information. This is defined formally in the CDRA (Coded Character Representation Architecture) documents from IBM.
**character boundary** | A location between characters.
**character properties** | The given properties of a character. These properties include, but are not limited to, case, numeric meaning, and direction to layout successive characters of the same type.
**character set** | The set of characters represented with reference to the binary codes used for the characters. One character set can be encoded into more than one code page.
**Chinese numerals** | Chinese characters that represent numbers. For example, the Chinese characters for 1, 2, and 3 are written with one, two, and three horizontal brush strokes, respectively. Contrast with Arabic numerals, Hindi numerals, and Roman numerals.
**CJK** | Acronym for Chinese/Japanese/Korean characters.
**code page** | An ordered set of characters in which a numeric index (code point value) is associated with each character. This term can also be called a "character set" or "charset."
**code point value** | The encoding value for a character in the specified character set. For example the code point value of "A" in Unicode 3.0 is `0x0041`.
**code set** | UNIX term equivalent to code page.
**collation** | Text comparison using language-sensitive rules as opposed to bitwise comparison of numeric character codes. This is usually done to sort a list of strings.
**collation element** | A collation element consists of the primary, secondary and tertiary weights of a user character.
**combining character** | A combining character is a Unicode character that graphically combines with any preceding base character. A combining character does not stand alone unless it is being described. Accents are examples of combining characters.
**combining character sequence** | A combining character sequence consists of a Unicode base character and zero or more Unicode combining characters. The base and combining characters are dynamically composed at printout time to a user character.
**compatibility character** | A character that has a compatibility decomposition.
**compatibility decomposition** | The decomposition of a character which results from recursively applying both compatibility mappings and canonical mappings until no characters can be further decomposed then re-ordering non-spacing marks according to the canonical behavior rules. Compatibility decomposition may remove formatting information, which is the opposite of what happens during a canonical decomposition.
**compatibility equivalent** | Two character sequences are said to be compatibility equivalent if their full compatibility decompositions are equivalent.
**core product** | The language independent portion of a software product (as distinct from any particular localized version of that product - including the English language version). Sometimes, however, this term is used to refer to the English product as opposed to other localizations.
**cursive script** | A script whose adjacent characters touch or are connected to each other. For example, Arabic script is cursive.
**- D -** |
**DBCS (double-byte character set)** | A set of characters in which each character is represented by 2 bytes. Scripts such as Japanese, Chinese, and Korean contain more characters than can be represented by 256 code points, thus requiring two bytes to uniquely represent each character. The term DBCS is often used to mean MBCS (multi-byte character set). See multi-byte character set.
**decomposable character** | A character that is equivalent to a sequence of one or more other characters.
**decomposition** | A sequence of one or more characters that is equivalent to a decomposable character.
**diacritic** | A modifying mark on a character. For example, the accent marks in Latin script (acute, tilde, and ogonek) and the tone marks in Thai. Synonymous with accent.
**digit** | A general term for a number character. A digit may or may not be base ten.
**display string** | A display string is a string that may be shown to a user. Normally a display string is visible in GUI. These strings need to be translated for different countries.
**- E -** |
**EBCDIC** | Extended Binary-Coded Decimal Interchange Code. A group of coded character sets that consists of eight-bit coded characters. EBCDIC-coded character sets map specified graphic and control characters onto code points, each consisting of 8 bits. EBCDIC is an extension of BCD (Binary-Coded Decimal), which uses only 7 bits for each character.
**ECMA** | European Computer Manufacturers Association. A nonprofit organization formed by European computer vendors to announce standards applicable to the functional design and use of data processing equipment.
**encoding scheme** | A set of specific definitions that describe the philosophy used to represent character data. Examples of specifications in such a definition are: the number of bits, the number of bytes, the allowable ranges of bytes, maximum number of characters, and meanings assigned to some generic and specific bit patterns.
**European numerals** | A number comprised of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and/or 9.
**expansion** | The process of sorting a character as if it were expanded to two characters.
**- F -** |
**font** | A set of graphic characters that have a characteristic design, or a font designer's concept of how the graphic characters should appear. The characteristic design specifies the characteristics of its graphic characters. Examples of characteristics are shape, graphic pattern, style, size, weight, and increment.
**- G -** |
**globalization** | The process of developing, manufacturing, and marketing software products that are intended for worldwide distribution. This term combines two aspects of the work: internationalization (enabling the product to be used without language or culture barriers) and localization (translating and enabling the product for a specific locale).
**glyph** | The actual shape (bit pattern, outline) of a character image. For example, an italic "A" and a roman "A" are two different glyphs representing the same underlying character. Strictly speaking, any two images that differ in shape constitute different glyphs. In this usage, glyph is a synonym for character image, or simply image.
**graphic character** | A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed.
**Graphical User Interface** | Graphical User Interface is normally written as the acronym GUI. It is the display the end-user sees when running a program. Strings that are visible in the GUI need to be localized to the end-user's language.
**global application** | An application that can be completely translated for use in different locales. All text shown to the user is in the native language, and user expectations are met for dates, times, and other locale conventions.
**GMT** | Greenwich mean time. In the 1840s the standard time kept by the Royal Greenwich Observatory located at Greenwich, England was established for all of England, Scotland, and Wales, replacing many local times in use in those days. Subsequently GMT became the official time reference for the world until 1972 when it was subsumed by the atomic clock-based coordinated universal time (UTC). GMT is also known as universal time.
**GUI** | Acronym for "Graphical User Interface".
**- H -** |
**Han Characters** | Ideographic characters of Chinese origin.
**Hangul** | The Korean alphabet that consists of fourteen consonants and ten vowels. Hangul was created by a team of scholars in the 15th century at the behest of King Sejong. See jamo.
**Hanja** | The Korean term for characters derived from Chinese.
**Hiragana** | A Japanese phonetic syllabary. The symbols are cursive or curvilinear in style. See Kanji and Katakana.
**- I -** |
**i18n** | Synonym for internationalization ("i" + 18 letters + "n"; lower case i is used to distinguish it from the numeral 1 (one)).
**ideographic language** | A written language in which each character (ideogram) represents a thing or an idea (but not necessarily a particular word or phrase). An example of such a language is written Chinese (Zhongen). Contrast with alphabetic language.
**Indic numerals** | A set of numerals used in India and many Arabic countries instead of, or in addition to, the Arabic numerals. Indic numeral shapes correspond to the Arabic numeral shapes. Contrast with Arabic numerals, Chinese numerals, and Roman numerals. See numbers.
**internationalization** | Designing and developing a software product to function in multiple locales. This process involves identifying the locales that must be supported, designing features which support those locales, and writing code that functions equally well in any of the supported locales. Internationalized applications store their text in external resources, and use locale-sensitive utilities for formatting and collation.
**ISO** | International Organization for Standardization. Contrary to popular belief, ISO does NOT stand for International Standards Organization because it is not an acronym. The ISO name is derived from the Greek word isos, which means "equal." ISO is a non-governmental international organization, and it promotes the development of standards on goods and services.
**- J -** |
**jamo** | A set of consonants and vowels used in Korean Hangul. The word jamo is derived from ja, which means consonant, and mo, which means vowel.
**- K -** |
**Kanji** | Chinese characters or ideograms used in Japanese writing. The characters may have different meanings from their Chinese counterparts. See Hiragana and Katakana.
**Katakana** | A Japanese phonetic syllabary used primarily for foreign names and place names and words of foreign origin. The symbols are angular, while those of Hiragana are cursive. Katakana is written left to right, or top to bottom. See Kanji.
**- L -** |
**L10n** | Synonym for "localization" ("L" + 10 letters + "n"; upper case L is used to distinguish it from the numeral 1 (one)).
**L12y** | Acronym for "localizability" ("L" + 12 letters + "y"; upper case L is used to distinguish it from the numeral 1 (one)).
**language** | A set of characters, phonemes, conventions, and rules used for conveying information. The aspects of a language are pragmatics, semantics, syntax, phonology, and morphology.
**legacy** | An inherited obligation. For example, a legacy database might contain strategic data that must be maintained for a long time after the database has become technologically obsolete.
**locale** | A set of conventions affected or determined by human language and customs, as defined within a particular geo-political region. These conventions include (but are not necessarily limited to) the written language, formats for dates, numbers and currency, sorting orders, etc.
**locale-sensitive** | Exhibiting different behavior or returning different data, depending on the locale.
**localizability** | The degree to which a software product can be localized. Localizable products separate data from code, correctly display the target language and function properly after being localized.
**localization** | Modifying or adapting a software product to fit the requirements of a particular locale. This process includes (but may not be limited to) translating the user interface, documentation and packaging, changing dialog box geometries, customizing features (if necessary), and testing the translated product to ensure that it still works (at least as well as the original).
**lowercase** | The small alphabetic characters, whether accented or not, as distinguished from the capital alphabetic characters. The concept of case applies to alphabets such as Latin, Cyrillic, and Greek, but not to Arabic, Hebrew, Thai, Japanese, Chinese, Korean, and many other scripts. Examples of lowercase letters are a, b, and c. Contrast with uppercase.
**- M -** |
**MBCS** | Multi-byte Character Set. A set of characters in which each character is represented by 1 or more bytes. Contrast with DBCS and SBCS.
**modifier characters** | '`@`' (French secondary collation rule)
**multilingual** | An application that can simultaneously display and manipulate text in multiple languages. For example, a word processor that allows Japanese and English in the same document is multilingual.
**- N -** |
**National Standard** | A linguistic rule, measurement, educational guideline, or technology-related convention as defined by a government or an industry standards organization. Examples include character sets, keyboard layouts, and some cultural conventions, such as punctuation.
**NLS** | National Language Support. The features of a product that accommodate a specific region, its language, script, local conventions, and culture. See internationalization and localization.
**non-display string** | A non-display string is a string such as a URL that is used programmatically and is not visible to an end-user. A non-display string does not need to be translated.
**normalization** | The process of converting Unicode text into one of several standardized forms in which precomposed and combining characters are used consistently.
**numbers** | Numbers express either quantity (cardinal) or order (ordinal). Many cultures have different forms for cardinal and ordinal numbers. For example, in French the cardinal number five is cinq, but the ordinal fifth is cinquième or 5eme or 5e. Numbers are written with symbols that are usually referred to as numerals. See Arabic numerals, Chinese numerals, Indic numerals, European numerals, and Roman numerals.
**- P -** |
**pinyin** | A system to phonetically render Chinese ideograms in a Latin alphabet.
**- R -** |
**relation characters** | '`<`' (primary difference collation rule) <br>'`;`' (secondary difference collation rule) <br>'`,`' (tertiary difference collation rule) <br>'`=`' (identical difference collation rule)
**reset character** | '`&`' (reset the collation rules)
**resource** | 1. Any part of a program which can appear to the user or be changed or configured by the user. <br>2. Any piece of the program's data, as opposed to its code.
**resource bundle** | A set of culturally dependent data used by locale-sensitive classes in an internationalized software program to provide Locale specific responses to the end-user.
**Roman numerals** | A system of writing numbers in which the characters I, V, X, L, C, D, and M have the value of 1, 5, 10, 50, 100, 500, and 1000, respectively. Lesser numbers in prefix positions indicate subtraction. For example MCMLXIV is 1964 in decimal because CM is 900, LX is 60, and IV is 4. Contrast with Arabic numerals, European numerals, Chinese numerals, and Indic numerals.
**- S -** |
**SBCS (Single-byte character set)** | A set of characters in which each character is represented by 1 byte.
**script** | A set of characters used to write a particular set of languages. For example, the Latin (or Roman) script is used to write English, French, Spanish, and most other European languages; the Cyrillic script is used to write Russian and Serbian.
**separator** | The thousands separator (or digit grouping separator) is the local symbol used to separate every third digit in large numbers or lengthy decimal fractions. The decimal separator is the local symbol used to indicate the decimal position in a number. It may be a comma, period or some other language specific symbol.
**string** | A set of consecutive characters treated by a computer as a single item.
**- T -** |
**titlecase** | A set of words that usually have the first character of each word in uppercase characters. The rules for titlecase are specific to each locale. Titlecase words usually go on titles of literature and other publications.
**transcoding** | Conversion of character data from one character set to another.
**translation** | The conversion of text from one human language to another. This includes properly converting the grammar, spelling and meaning of the text into the target language.
**transliteration** | Transformation of text from one script to another, usually based on phonetic equivalences and not word meanings. For example, Greek text might be transliterated into the Latin script so that it can be pronounced by English speakers.
**- U -** |
**UCS** | Universal Multiple-Octet Coded Character Set. The Unicode standard is based upon this ISO/IEC 10646 standard. UCS characters look the same as Unicode characters, but they do not have any character properties. See also UTF.
**Unicode** | A character set that encompasses all of the world's living scripts. Unicode is the basis of most modern software internationalization.
**Unicode character** | A Unicode character enables a computer to store, manipulate, and transfer to other computers multilingual text. A Unicode character has the binary range of 0..10FFFF.
**uppercase** | The larger alphabetic characters, whether accented or not, as distinguished from the lowercase alphabetic characters. The concept of case applies to alphabets such as Latin, Cyrillic, and Greek, but not to Arabic, Hebrew, Thai, Japanese, Chinese, Korean, and many other scripts. Examples of uppercase letters are A, B, and C. Contrast with lowercase.
**user character** | A character made up of two or more Unicode characters that are combined to form a more complex character that has its own semantic value. A user character is the smallest component of written language that has a semantic value to a native language user.
**UTC time** | UTC stands for Coordinated Universal Time. This was formerly known as Greenwich Mean Time (GMT). It is used as a time constant that can be transformed to display an accurate date and time in any world calendar and time zone. This is a time scale based on cesium atomic clocks.
**UTF** | Unicode Transformation Format. A binary format of representing a Unicode character. There are several encoding forms for a Unicode character, which include UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE. The numbers in these encoding form names refer to the bit size of each number, and the BE and LE stands for big endian or little endian respectively. The UTF-8 and UTF-16 formats can take multiple units of binary numbers to represent a Unicode character.
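The differing unit sizes of the encoding forms described in this entry can be checked directly. The following Python sketch (illustrative only, not ICU) encodes a few characters in UTF-8 and UTF-16:

```python
# Encode sample characters in different Unicode encoding forms.
# "A" (U+0041) needs one UTF-8 byte; "€" (U+20AC) needs three;
# "😀" (U+1F600) lies outside the BMP, so UTF-16 needs a surrogate
# pair (two 16-bit units, four bytes).
for ch in ("A", "\u20ac", "\U0001f600"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big endian, no byte order mark
    print(f"U+{ord(ch):04X}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")
```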

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# How To Use ICU
ICU builds and installs as relatively standard libraries. For details about
building, installing and porting see the [ICU4C
readme](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html) and the
[ICU4J readme](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4j/readme.html).
In addition, ICU4C installs several scripts and makefile fragments that help
build other code using ICU.
For C++, note that there are [Recommended Build
Options](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#RecBuild)
(both for normal use and for ICU as system-level libraries) which are not
default simply for compatibility with older ICU-using code.
Starting with ICU 49, the ICU4C readme has a short section about
[User-Configurable
Settings](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#UserConfig).
## C++ Makefiles
The recommended way to use ICU in Makefiles is to use the
[pkg-config](http://pkg-config.freedesktop.org/) files which are installed by
ICU upon "`make install`". There are files for various libraries and components.
This is preferred over the deprecated icu-config script.
This table shows the package names used within pkg-config.
|**Package**|**Contents**|
|------|--------------------|
|icu-uc|Common (uc) and Data (dt/data) libraries|
|icu-i18n|Internationalization (in/i18n) library|
|icu-le|[Layout Engine](layoutengine/index.md)|
|icu-lx|Paragraph Layout|
|icu-io|[Ustdio](io/ustdio.md)/[iostream](io/ustream.md) library (icuio)|
For example, to compile a simple application, you could run the following
command. See the [pkg-config](http://pkg-config.freedesktop.org/) manpage for
more details.
    c++ -o test test.c `pkg-config --libs --cflags icu-uc icu-io`
ICU installs the pkg-config (.pc) files in `$(prefix)/lib/pkgconfig` (where
`$(prefix)` is the installation prefix for ICU). Note that you may need to add
`$(prefix)/lib/pkgconfig` to the `PKG_CONFIG_PATH` variable.
### ICU in a small project
For small projects, it may be convenient to take advantage of
ICU's `autoconf`'ed files. ICU `make install` writes
`$(prefix)/lib/icu/Makefile.inc` which defines most of the necessary *make*
variables such as `$(CXX)`, `$(CXXFLAGS)`, `$(ICULIBS)`, `$(INVOKE)`, `$(ICUPKG)`,
`$(datadir)`, etc.
By itself, `Makefile.inc` is incomplete. It assumes that it will be included into another
`Makefile` which will define `$(srcdir)`, `$(DYNAMICCXXFLAGS)` and similar values.
In this case, it is probably best to copy ICU's
`autoconf`'ed top-level `./Makefile` and/or library-target-style `i18n/Makefile` and/or
binary-target-style `tools/icupkg/Makefile`. Then modify them as needed.
### ICU in a medium-sized project
If you use your own `autoconf`/`CMake`/... setup, consider cherry-picking only the
definitions needed, for example paths to specific ICU data and tools.
This is often preferable to taking the entire `Makefile.inc` and
overriding (many) definitions that are different.
For selective ICU definitions, use the installed
`$(prefix)/bin/icu-config` script.
Its contents are synchronized with `$(prefix)/lib/icu/Makefile.inc`.
For example, use `icu-config --invoke=icupkg` to invoke the ICU .dat packaging tool.
### ICU in a large project
In this case, you probably have your own build system. Just use ICU's public header
files, `.so` files, etc. See the next section, "C++ With Your Own Build System".
## Notes on `icu-config`
> :point_right: **Note**: **icu-config is deprecated, and no longer recommended for production
use. Please use pkg-config files or other options.**
As of ICU 63.1, [icu-config has been deprecated
(ICU-10464)](https://unicode-org.atlassian.net/browse/ICU-10464).
`icu-config` may be disabled by default in the future.
As of ICU 63.1, you may enable or disable icu-config with a configure flag:
`--enable-icu-config` or `--disable-icu-config`.
`icu-config` is installed (by ICU's `make install`) into `$(prefix)/bin/icu-config`.
It can be convenient for **trivial, single-file programs** that use ICU. For
example, you could compile and build a small program with this command line:
    icu-config --cxx --cxxflags --cppflags --ldflags -o sample sample.cpp
Detailed usage of `icu-config` script is described in its `man` page.
## C++ With Your Own Build System
If you are not using the standard build system, you will need to construct your
own system. Here are a couple of starting points:
* At least for initial bring-up, use pre-built data files from the ICU
download or from a normally-built ICU. Copy the icudt*XXx*.dat file from
`icu/source/data/in/` or `icu/source/data/out/tmp/` (it may be in either of
these two locations) into `icu/source/data/in/` on your target ICU system.
That way, you won't need to build ICU's data-generation tools.
* Don't compile all files. Look in the `Makefile.in` files for `OBJECTS=`
clauses which will indicate which source files should be compiled. (Some .c
files are #included into others and cannot be compiled by themselves.)
* ICU does not throw or handle exceptions. Consider turning them off via g++'s
`-fno-exceptions` or equivalent.
* Each ICU library needs to be compiled with -DU_COMMON_IMPLEMENTATION,
-DU_I18N_IMPLEMENTATION etc. as appropriate. See unicode/utypes.h for the
set of such macros. If you build one single DLL (shared library) for all of
ICU, also use -DU_COMBINED_IMPLEMENTATION. If you build ICU as
statically-linked libraries, use -DU_STATIC_IMPLEMENTATION.
* Use the [icu-support mailing list](http://site.icu-project.org/contacts).
Ask for help and guidance on your strategy.
* Up until ICU 4.8, there are one or two header files (platform.h, icucfg.h)
that are generated by autoconf/configure and thus differ by platform,
sometimes even by target settings on a single platform (e.g., AIX 32-bit vs.
64-bit, Mac OS X universal binaries PowerPC vs. x86). If you do not use
autoconf, you probably need to configure-generate these header files for
your target platforms and select among them, or merge the generated headers
if they are similar, or simulate their generation by other means.
* Starting with ICU 49, all source code files are fixed (not generated). In
particular, there is one single platform.h file which determines
platform-specific settings via `#if ...`
## C++ Namespace
ICU C++ APIs are normally defined in a versioned namespace, for example
"icu_50". There is a stable "icu" alias which should be used instead. (Entry
point versioning is only to allow for multiple ICU versions linked into one
program. [It is optional and should be off for system
libraries.](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#RecBuild))
By default, and only for backward compatibility, the ICU headers contain a line
`using namespace icu_50;` which makes all ICU APIs visible in/with the global
namespace (and potentially collide with non-ICU APIs there). One of the
[Recommended Build
Options](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#RecBuild)
is to turn this off.
To write forward declarations, use

    U_NAMESPACE_BEGIN
    class UnicodeSet;
    class UnicodeString;
    U_NAMESPACE_END

To qualify an ICU class name, use the "icu" alias:

    static myFunction(const icu::UnicodeString &s) {...}

Frequently used ICU classes can be made easier to use in .cpp files with

    using icu::UnicodeSet;
    using icu::UnicodeString;
## Other Notes
### Helper Install Utilities
ICU installs `$(prefix)/share/icu/$(VERSION)/install-sh` and
`$(prefix)/share/icu/$(VERSION)/mkinstalldirs`. These may be used by ICU tools and
samples. Their paths are given in the installed `Makefile.inc` (see above).
### Data Packaging Settings
The `pkgdata` tool (see [Packaging ICU4C](packaging/index.md)) makes use of the
installed file `$(prefix)/lib/icu/pkgdata.inc` to set parameters for data
packaging operations that require use of platform compilers and linkers (in
`static` or `dll` mode). `pkgdata` uses the icu-config script in order to locate
`pkgdata.inc`. If you are not building ICU using the supplied tools, you may
need to modify this file directly to allow `static` and `dll` modes to function.
### Building and Running Trivial C/C++ Programs with `icurun`
For building and running trivial (one-compilation-unit) programs with an
installed ICU4C, the shell script
[icurun](http://bugs.icu-project.org/trac/browser/trunk/tools/scripts/icurun)
may be used. For detailed help, see the top of that script.
As an example, if ICU is installed to the prefix **/opt/local** and the current
directory contains two sample programs "test1.cpp" and "test2.c", they may be
compiled and run with any of the following commands. The "-i" option specifies
either the installed icu-config script, or the directory containing that script,
or the path to a 'bin' directory.
* `icurun -i /opt/local test1.cpp`
* `icurun -i /opt/local/bin test2.c`
* `icurun -i /opt/local/bin/icu-config test1.cpp`
If "icu-config" is on the PATH, the -i option may be omitted:
* `icurun test1.cpp`
Any additional arguments will be passed to the program.
* `icurun test1.cpp args...`
*This feature is a work in progress. Please give feedback at [Ticket
#8481](https://unicode-org.atlassian.net/browse/ICU-8481).*

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Software Internationalization
## Overview of Software Internationalization
Developing globalized software is a continuous balancing act: software
developers and project managers often underestimate the level of effort
and detail required to create foreign-language software releases.
Software developers must understand the ICU services to design and deploy
successful software releases. The services can save ICU users time in dealing
with the kinds of problems that typically arise during critical stages of the
software life cycle.
In general, the standard process for creating globalized software includes
"internationalization," which covers generic coding and design issues, and
"localization," which involves translating and customizing a product for a
specific market.
Software developers must understand the intricacies of internationalization
since they write the actual underlying code. How well they use established
services to achieve mission objectives determines the overall success of the
project. At a fundamental level, code and feature design affect how a product is
translated and customized. Therefore, software developers need to understand key
localization concepts.
From a geographic perspective, a locale is a place. From a software perspective,
a locale is an ID used to select information associated with a language and/or
a place. ICU locale information includes the name and identifier of the spoken
language, sorting and collating requirements, currency usage, numeric display
preferences, and text direction (left-to-right or right-to-left, horizontal or
vertical).
General locale-sensitive standards include keyboard layouts, default paper and
envelope sizes, common printers and monitor resolutions, character sets or
encoding ranges, and input methods.
## ICU Services Overview
The ICU services support all major locales with language and sub-language pairs.
The sub-language generally corresponds to a country. One way to think of this is
in terms of the phrase "X language as spoken in Y country." The way people speak
or write a particular language might not change dramatically from one country to
the next (for example, German is spoken in Austria, Germany, and Switzerland).
However, cultural conventions and national standards often differ a great deal.
A key advantage to using the ICU services is the net result in reduced time to
market. The display strings are bundled in separate text files for
translation. A team of programmers and translators no longer needs to search
the source code in order to rewrite the software for each country and language.
## Internationalization and Unicode
Unicode enables a program to use a standard encoding scheme for all textual data
within the program's environment. Conversion has to be done with incoming and
outgoing data only. Operations on the text (while it is in the environment) are
simplified since you do not have to keep track of the encoding of a particular
text.
Unicode supports multilingual data since it encodes characters for all world
languages. You do not have to tag pieces of data with their encoding to enable
the right characters, and you can mix languages within a single piece of text.
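For instance, in a Unicode-based environment a single string can mix scripts freely, and conversion matters only at the input/output boundary (a Python sketch, not ICU API):

```python
# One Unicode string holding Latin, Greek, and Japanese text at once;
# no per-piece encoding tags are needed.
text = "Hello, Κόσμε, こんにちは"

# Conversion happens only for incoming and outgoing data; while the
# text is inside the environment it stays in one uniform form.
outgoing = text.encode("utf-8")      # serialize for output
incoming = outgoing.decode("utf-8")  # convert on input
assert incoming == text              # the round trip is lossless
```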
Some of the advantages of using ICU to internationalize your program include the
following:
* It can handle text in any language or combination of languages.
* The source code can be written so that the program can work for many
locales.
* Configurable, pluggable localization is enabled.
* Multiple locales are supported at the same time.
* Non-technical people can be given access to the localizable information
without your having to open the source code to them.
* Software can be developed so that the same code can be ported to various
platforms.
## Project Management Tips for Internationalizing Software
The following two processes are key when managing, developing and designing a
successful internationalization software deliverable:
1. Separate the program's executable code from its UI elements.
2. Avoid making cultural assumptions.
Keep static information (such as pictures, window layouts) separate from the
program code. Also ensure that the text which the program generates on the fly
(such as numbers and dates) comes out in the right language. The text must be
formatted correctly for the targeted user community.
Make sure that the analysis and manipulation of both text and other kinds of
data (such as dates) is done in a manner that can be easily adapted for different
languages and user communities. This includes tasks such as alphabetizing lists
and looking for line-break positions.
Characters must display on the screen correctly (the text's storage format must
be translated to the proper visual images). They must also be accepted as input
(translated from keystrokes, voice input or another kind of input into the
text's storage format). These processes are relatively easy for English, but
quite challenging for other languages.
### Separating Executable Code from UI Elements
Good software design requires that the programming code implementing the user
interface (UI) be kept separate from code implementing the underlying
functionality. The description of the UI must also be kept separate from the
code implementing it.
The description of the UI contains items that the user sees, including the
various messages, buttons, and menu commands. It also contains information about
how dialog boxes are to be laid out, and how icons, colors or other visual
elements are to be used. For example, German words tend to be longer since they
contain grammatical suffixes that English has lost in the last 800 years. The
following table shows how word lengths can differ among languages.
|English|German|Cyrillic-Serbian|
|--------|--------|-------------|
|cut|ausschneiden|исеци|
|copy|kopieren|копирај|
|paste|einfügen|залепи|
The description of the UI, especially user-visible pieces of text, must be kept
together and not embedded in the program's executable code. ICU provides the
ResourceBundle services for this purpose.
### Avoiding Cultural/Hidden Assumptions
Another difficulty encountered when designing and implementing code is to make
it flexible enough to handle different ways of doing things in other countries
and cultures. Most programmers make unconscious assumptions about their user's
language and customs when they design their programs. For example, in Thailand,
the official calendar is the Buddhist calendar and not the Gregorian calendar.
These assumptions make it difficult to translate the user interface portion of
the code for some user communities without rewriting the underlying program. The
ICU libraries provide flexible APIs that can be used to perform the most common
and important tasks. They contain pre-built supporting data that enables them to
work correctly in 75 languages and more than 200 locales. The key is
understanding when, where, why, or how to use the APIs effectively.
The remainder of this section provides an overview of some cultural and hidden
assumptions components. (See the Table of contents for a list of topics.)
#### Numbers and Dates
Numbers and dates are represented differently in different languages. Do not implement
routines for converting numbers into strings, and do not call low-level system
interfaces like sprintf() that do not produce language-sensitive results.
Instead, see how ICU's [NumberFormat](formatparse/numbers/index.md) and
[DateFormat](formatparse/datetime/index.md) services can be used more
effectively.
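The pitfall is easy to reproduce without ICU. This Python sketch hard-codes the U.S. separators, so its output is wrong for, say, a German reader, who expects `1.234.567,89`:

```python
# A naive, locale-insensitive conversion: the grouping and decimal
# separators are fixed to the en_US conventions by the format spec.
value = 1234567.89
naive = f"{value:,.2f}"
print(naive)  # 1,234,567.89 -- incorrect for de_DE, fr_FR, and others
```

A locale-sensitive formatter such as NumberFormat selects the right symbols from the locale instead.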
#### Messages
Be careful when formulating assumptions about how individual pieces of text are
used together to create a complete sentence (for example, when error messages
are generated). The elements might go together in a different order if the
message is translated into a new language. ICU provides
[MessageFormat](formatparse/messages/index.md) and
[ChoiceFormat](formatparse/messages/index.md) to help with these
occurrences.
> :point_right: **Note**: *There might also be situations where one part of a
sentence changes when another part of the sentence changes (selecting between
singular and plural nouns that go after a number is the most common example).*
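The hazard can be sketched in Python with invented message strings: because argument order differs between languages, the translated pattern, not the code, must control where each piece lands (this is the idea behind MessageFormat's named arguments):

```python
# Concatenating fragments would bake English word order into the code.
# A pattern with named placeholders lets each translation reorder the
# arguments freely. The German string is an invented sample.
patterns = {
    "en": "{user} deleted {count} files",
    "de": "{count} Dateien wurden von {user} gelöscht",
}
args = {"user": "Alice", "count": 3}
for lang in ("en", "de"):
    print(patterns[lang].format(**args))
```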
#### Measuring Units
Numerical representations can change with regard to measurement units and
currency values. Currency values can vary by country. A good example of this is
the representation of $1,000 dollars. This amount can represent either U.S. or
Canadian dollar values. US dollars can be displayed as USD while Canadian
dollars can be displayed as CAD, depending on the locale. In this case, the
displayed numerical quantity might change, and the number itself might also
change. [NumberFormat](formatparse/numbers/index.md) provides some support for
this.
#### Alphabetical Order of Characters
Languages, even those using the same alphabet, do not necessarily share the
same concept of alphabetical order. Do not assume that alphabetical order is the
same as the numerical order of the character's code-point values. In practice,
'a' is distinct from 'A' and 'b' is distinct from 'B'. Each has a different code
point. This means that you cannot use a bit-wise lexical comparison (such as
what strcmp() provides) to sort user-visible lists.
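The mismatch is easy to demonstrate (a Python sketch; ICU's Collator services handle this correctly):

```python
# A bit-wise (code point) sort, the same ordering strcmp() would give:
words = ["cherry", "Apple", "banana"]
print(sorted(words))  # happens to look right, but only because every
                      # uppercase letter sorts before every lowercase one
print(ord("Z"), ord("a"))  # 90 97

# 'Ä' (U+00C4) sorts after 'Z' by code point, although a German reader
# expects it next to 'A':
assert sorted(["Z", "\u00c4"]) == ["Z", "\u00c4"]
```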
Not all languages interpret the same characters as equivalent. If a character's
case is changed it is not always a one-to-one mapping. Accent differences, the
presence or absence of certain characters, and even spelling differences might
be insignificant when determining whether two strings are equal. The
[Collator](collation/index.md) services provide significant help in this area.
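German ß shows both problems at once: its uppercase form is two letters, so naive per-character case conversion and comparison fail (a Python sketch; ICU's case mapping and collation services handle this):

```python
# Uppercasing 'ß' yields "SS", so the string grows by one character:
assert "straße".upper() == "STRASSE"
assert len("straße") == 6 and len("STRASSE") == 7

# A comparison that should ignore case differences therefore needs
# full case folding, not a simple lowercasing of both sides:
assert "STRASSE".casefold() == "straße".casefold()
```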
#### Characters
A character does not necessarily correspond to a single code-point position in
the backing store. Not all languages share the same definition of a word, and
a group of characters separated by white space is not always an acceptable
approximation of one. ICU provides the
[BreakIterator](boundaryanalysis/index.md) services to help locate boundaries
and count units of text.
When checking characters for membership in a particular class, do not list the
specific characters you are interested in, and do not assume they come in any
particular order in the encoding scheme. For example, /A-Za-z/ does not mean all
letters in most European languages, and /0-9/ does not mean all digits in many
writing systems. This also holds true when using C interfaces such as isupper()
and islower(). ICU provides a large group of utility functions for testing
character properties, such as u_isupper() and u_islower().
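The same point can be made with regular expressions in Python (a sketch; ICU exposes the equivalent checks through its character property functions):

```python
import re

# [A-Za-z] misses most letters outside the basic Latin range:
assert "é".isalpha() and re.match(r"[A-Za-z]", "é") is None
assert "日".isalpha() and re.match(r"[A-Za-z]", "日") is None

# [0-9] misses the digits of other writing systems, e.g. the
# Arabic-Indic digit three (U+0663):
assert "\u0663".isdigit() and re.match(r"[0-9]", "\u0663") is None
```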
#### Text Input and Layout
Do not assume anything about how a piece of text might be drawn on the screen,
including how much room it takes up, the direction it flows, or where on the
screen it should start. All of these text elements vary according to language.
As a result, there might not be a one-to-one relationship between characters and
keystrokes. One-to-many, many-to-one, and many-to-many relationships between
characters and keystrokes all occur in real text in some languages.
#### Text Manipulation
Do not assume that all textual data, which the program stores and manipulates,
is in any particular language or writing system. ICU provides many methods that
help with text storage. The UnicodeString class and u_strxxx functions are
provided for Unicode-based character manipulation. For example, characters can
be appended to, removed from, or extracted out of an existing Unicode character
buffer.
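One reason such dedicated services matter: a user-perceived character does not always map to a single stored code point, so buffer edits must respect character boundaries (a Python sketch, not ICU API):

```python
import unicodedata

# 'é' can be stored precomposed (one code point) or decomposed into a
# base letter plus a combining accent (two code points):
precomposed = "\u00e9"   # é
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
assert precomposed != decomposed               # different storage
assert len(precomposed) == 1 and len(decomposed) == 2

# Normalization makes the two representations comparable again:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```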
A good example of text manipulation is the Rosetta stone. The same text is
written on it in Hieroglyphic, Greek and Demotic. ICU provides the services
needed to process multilingual text such as this correctly.
#### Date/Time Formatting
Time can be reckoned in many different units: the lengths of months and years,
which day is the first day of the week, and the allowable range of values for
fields like month and year all vary by calendar (and are handled with
DateFormat). So do the time zone you are in (handled with TimeZone) and when
daylight-savings time starts. ICU provides the Calendar services needed to
handle these issues.
#### Distributed Locale Support
In most server applications, do not assume that all clients connected to the
server interact with their users in the same language. Also do not assume that a
session stops and restarts whenever a user speaking one language replaces
another user speaking a different language. ICU provides sufficient flexibility
for a program to handle multiple locales at the same time.
For example, a Web server needs to serve pages to different users, in
different languages and with different date formats, at the same time.
#### LayoutEngine
The ICU LayoutEngine is an Open Source library that provides a uniform,
easy-to-use interface for preparing complex scripts or text for display. The
Latin script, which is the most commonly used script among software developers,
is also the least complex script to display, especially when it is used to
write
English. Using the Latin script, characters can be displayed from left to right
in the order that they are stored in memory. Some scripts require rendering
behavior that is more complicated than the Latin script. We refer to these
scripts as "complex scripts" and to text written in these scripts as "complex
text."

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU4J Locale Service Provider
## Overview
Java SE 6 introduced a new feature which allows Java user code to extend locale
support in the Java runtime environment. JREs shipped by Oracle or IBM come with
decent locale coverage, but some users may want more locale support. Java SE 6
includes abstract classes extending
[java.util.spi.LocaleServiceProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleServiceProvider.html).
Java SE 6 users can create subclasses of these abstract classes to supply their
own locale support for text break, collation, date/number formatting, or for
providing translations of currency, locale and time zone names.
ICU4J has been providing more comprehensive locale coverage than standard JREs.
However, Java programmers have to use ICU4J's own internationalization service
APIs (com.ibm.icu.\*) to utilize the rich locale support. Sometimes, the
migration is not an option for various reasons. For example, your code may
depend on existing Java libraries utilizing JDK internationalization service
APIs, but you have no access to the source code. In this case, it is not
possible to modify the libraries to use ICU4J APIs.
ICU4J Locale Service Provider is a component consisting of classes implementing
the Java SE 6 locale sensitive service provider interfaces. The available
service providers are:
* [BreakIteratorProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/BreakIteratorProvider.html)
* [CollatorProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/CollatorProvider.html)
* [DateFormatProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatProvider.html)
* [DateFormatSymbolsProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DateFormatSymbolsProvider.html)
* [DecimalFormatSymbolsProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/DecimalFormatSymbolsProvider.html)
* [NumberFormatProvider](http://download.oracle.com/javase/6/docs/api/java/text/spi/NumberFormatProvider.html)
* [CurrencyNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/CurrencyNameProvider.html)
* [LocaleNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/LocaleNameProvider.html)
* [TimeZoneNameProvider](http://download.oracle.com/javase/6/docs/api/java/util/spi/TimeZoneNameProvider.html)
ICU4J Locale Service Provider is designed to work as an installed extension in a
JRE. Once the component is configured properly, Java applications running on the
JRE automatically pick up ICU4J's internationalization service implementations
when a requested locale is not available in the JRE.
## Using ICU4J Locale Service Provider
Java SE 6 locale sensitive service providers are using the [Java Extension
Mechanism](http://download.oracle.com/javase/6/docs/technotes/guides/extensions/index.html).
An implementation of a locale sensitive service provider is installed as an
optional package to extend the functionality of the Java core platform. To
install an optional package, its JAR files must be placed in the Java extension
directory. The standard location is *<java-home>/lib/ext*. You can alternatively
use the system property *java.ext.dirs* to specify one or more locations where
optional packages are installed. For example, if the JRE root directory is
JAVA_HOME and you put the ICU4J Locale Service Provider files in ICU_SPI_DIR,
the ICU4J Locale Service Provider is enabled by one of the following commands.

On Microsoft Windows:

    java -Djava.ext.dirs=%JAVA_HOME%\lib\ext;%ICU_SPI_DIR% <your_java_app>

On Linux, Solaris and other Unix-like platforms:

    java -Djava.ext.dirs=$JAVA_HOME/lib/ext:$ICU_SPI_DIR <your_java_app>
The ICU4J's implementations of Java SE 6 locale sensitive service provider
interfaces and configuration files are packaged in a single JAR file
(*icu4j-localespi-<version>.jar*). But the actual implementation of the service
classes and data are in the ICU4J core JAR file (*icu4j-<version>.jar*). So you
need to put the localespi JAR file along with the core JAR file in the Java
extension directory.
Once the ICU4J Locale Service Provider is installed properly, factory methods in
JDK internationalization classes look for the implementation provided by ICU4J
when a requested locale is not supported by the JDK service class. For example,
locale *af_ZA* (Afrikaans - South Africa) is not supported by JDK DateFormat in
Oracle Java SE 6. The following code snippet returns an instance of DateFormat
from ICU4J Locale Service Provider and prints out the current date localized for
af_ZA.
    DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, new Locale("af", "ZA"));
    System.out.println(df.format(new Date()));

Sample output:

    2008 Junie 19    [With ICU4J Locale Service Provider enabled]
    June 19, 2008    [Without ICU4J Locale Service Provider]
## Optional Configuration
### Enabling or disabling individual service
By default, all Java SE 6 locale sensitive service providers are enabled in the
ICU4J Locale Service Provider JAR file. If you want to disable specific
providers supported by ICU4J, you can remove the corresponding provider
configuration files from *META-INF/services* in the localespi JAR file. For
example, if you do not want to use ICU's time zone name service at all, you can
remove the file: *META-INF/services/java.util.spi.TimeZoneNameProvider* from the
JAR file.
**Note:** Disabling DateFormatSymbolsProvider/DecimalFormatSymbolsProvider won't
affect the localized symbols actually used by
DateFormatProvider/NumberFormatProvider by the current implementation. These
services are implemented independently.
### Configuring the behavior of ICU4J Locale Service Provider
*com/ibm/icu/impl/javaspi/ICULocaleServiceProviderConfig.properties* in the
localespi JAR file is used for configuring the behavior of the ICU4J Locale
Service Provider implementation. There are some configuration properties
available. See the table below for details of each configuration.

| Property | Value | Default | Description |
|---|---|---|---|
| `com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants` | "true" or "false" | "true" | Whether Locales with ICU's variant suffix are included in getAvailableLocales. The current Java SE 6 locale sensitive service does not allow user provided provider implementations to override locales supported by the JRE itself. When this property is "true" (default), the ICU4J Locale Service Provider includes Locales with the suffix (com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix) in the variant field. For example, the ICU4J provider includes locales fr_FR and fr_FR_ICU4J in the available locale list, so a JDK API user can still access the internationalization service object created by the ICU4J provider through the special locale fr_FR_ICU4J. |
| `com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.icuVariantSuffix` | *Any String* | "ICU4J" (49 or later)<br>"ICU" (before 49) | Suffix string used in Locale's variant field to specify the ICU implementation. |
| `com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIso3Languages` | "true" or "false" | "true" | Whether 3-letter language Locales are included in getAvailableLocales. Use of 3-letter language codes in java.util.Locale is not supported by the API reference documentation. However, the implementation does not check the length of the language code, so there is no practical problem. |
| `com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.useDecimalFormat` | "true" or "false" | "false" | Whether a java.text.DecimalFormat subclass is used for NumberFormat#getXXXInstance. DecimalFormat#format(Object,StringBuffer,FieldPosition) is declared as final, so ICU cannot override the implementation. As a result, some number types such as BigInteger/BigDecimal are not handled by the ICU implementation. If a client expects NumberFormat#getXXXInstance to return a DecimalFormat (for example, to manipulate decimal format patterns), this property can be set to true. However, in that case, BigInteger/BigDecimal support is not provided by ICU's implementation. |
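These properties are typically supplied on the JVM command line with `-D`, or set programmatically before the provider is first used. A minimal, hypothetical sketch (the property name is taken from the table above; whether a change takes effect at runtime depends on when the provider reads it):

```java
public class ProviderConfigDemo {
    public static void main(String[] args) {
        // Hypothetical example: disable the fr_FR_ICU4J-style variant locales.
        // Equivalent to passing
        // -Dcom.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants=false
        // on the java command line.
        System.setProperty(
                "com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants",
                "false");
        System.out.println(System.getProperty(
                "com.ibm.icu.impl.javaspi.ICULocaleServiceProvider.enableIcuVariants"));
    }
}
```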
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU4J FAQ
This page contains frequently asked questions about the content provided with
the International Components for Unicode for Java as well as basics on
internationalization. It is organized into the following sections:
### Common Questions
#### What version of Java is required for ICU4J?
ICU4J 4.4 or later versions utilize Java 5 language features and only run on JRE
5 or later. The ICU4J Locale SPI module depends on JDK 6 Locale Service Provider
framework, therefore, it requires JRE 6 or later.
#### Comparison between ICU and JDK: What's the difference?
This is one of our most popular questions. Please refer to [our comparison
chart](http://icu-project.org/charts/comparison/).
#### How can I get the version information of ICU4J library on my system?
You can get the ICU4J version information by public API class
[com.ibm.icu.util.VersionInfo](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/VersionInfo.html).
The static field
[VersionInfo.ICU_VERSION](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/VersionInfo.html#ICU_VERSION)
contains the current ICU4J library version information.
Since ICU4J 4.6, the ICU4J jar file includes a Main-Class that prints out the
ICU version information, as shown below:
```
$ java -jar icu4j.jar
International Component for Unicode for Java 4.8
Implementation Version: 4.8
Unicode Data Version: 6.0
CLDR Data Version: 2.0
Time Zone Data Version: 2011g
```
#### I'm using ICU4J X, but planning to upgrade ICU4J version to X+1 soon. What should I do for the migration?
See the user guide section
[Version Numbers in ICU](../design.md#version-numbers-in-icu)
for the details about the meaning of the version number parts and how the ICU
version number changes.
In general, two different reference releases are not binary compatible (i.e. a
drop-in jar file replacement would not work). To use a new reference version of
ICU4J, you should rebuild your application with the new ICU4J library. The ICU
project has an
[API compatibility policy](../design.md#icu-api-compatibility):
as long as you're using ICU APIs marked as @stable in the API reference
documentation, your application should successfully compile with the new
reference version of the ICU4J library without any source code modifications.
(Note: The ICU project team may retract APIs previously marked as @stable
through a well-defined process, but this is a very rare case.) However, you
might still need to review your usage of ICU4J APIs, especially when your
application makes assumptions about the behavior of APIs driven by Unicode or
locale data. For example, a date format pattern used for locale X might not be
exactly the same as the pattern in a new version.
#### How can I see all API changes between two different ICU versions?
For every ICU4J release, we publish
[APIChangeReport.html](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4j/APIChangeReport.html)
which captures all API changes since previous reference release.
However, someone may want to see the changes between the current release and a
much older ICU4J version. For example, suppose you're currently using ICU4J 60
and considering upgrading to ICU4J 64. In this case, you can generate a change
report page with the following steps.
1. Download [ICU4J 64 source package
archive](http://site.icu-project.org/download/64#TOC-ICU4J-Download)
from the ICU 64 download page and extract files to your local system.
2. Set up ICU4J build environment as explained in
[readme.html](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4j/readme.html)
included in the root directory of the ICU4J source package archive.
3. Edit
[build.properties](https://github.com/unicode-org/icu/blob/master/icu4j/build.properties)
in the root directory and change the property value api.report.prev.version
from 63 to 60.
4. Invoke the Ant target "apireport".
5. The output is generated at out/icu4j_compare_60_64.html.
### International Calendars
#### Why do I need these classes?
If your application displays or manipulates dates and times, and if you want
your application to run in countries outside of North America and western
Europe, you need to support the traditional calendar systems that are still in
use in some parts of the world. These classes provide that support while
conforming to the standard Java Calendar API, allowing you to code your
application once and have it work with any international calendar.
#### Which Japanese calendar do you support?
Currently, our JapaneseCalendar is almost identical to the Gregorian calendar,
except that it follows the traditional conventions for year and era names. In
modern times, each emperor's reign is treated as an era, and years are numbered
from the start of that era. Historically each emperor's reign would be divided
up into several eras, or *gengou*. Currently, our era data extends back to
*Taika*, which began in 645 AD. In all other respects (month and date, all of
the time fields, etc.) the JapaneseCalendar class will give results that are
identical to GregorianCalendar.
Lunar calendars similar to the Chinese calendar have also been used in Japan
during various periods in history, but according to our sources they are not in
common use today. If you see a real need for a Japanese lunar calendar, and
especially if you know of any good references on how it differs from the Chinese
calendar, please let us know by posting a note on the [mailing
list](http://icu-project.org/contacts.html).
#### Do you *really* support the true lunar Islamic calendar?
The Islamic calendar is strictly lunar, and a month begins at the moment when
the crescent of the new moon is visible above the horizon at sunset. It is
impossible to calculate this calendar in advance with 100% accuracy, since moon
sightings are dependent on the location of the observer, the weather, the
observer's eyesight, and so on. However, there are fairly commonly-accepted
criteria (the angle between the sun and the moon, the moon's angle above the
horizon, the position of the moon's bright limb, etc.) that let you predict the
start of any given month with a very high degree of accuracy, except of course
for the weather factor. We currently use a fairly crude approximation that is
still relatively accurate, corresponding with the official Saudi calendar for
all but one month in the last 40-odd years. This will be improved in future
versions of the class.
What all this boils down to is that the IslamicCalendar class does a fairly good
job of predicting the Islamic calendar, and it is good enough for most
computational purposes. However, for religious purposes you should, of course,
consult the appropriate mosque or other authority.
### TimeZone
#### Does ICU4J have its own time zone rule data?
Yes. The ICU4J library contains time zone rule data generated from the [tz
database](https://www.iana.org/time-zones).
#### Why does ICU4J carry the time zone rule data while my JRE also has the data?
There are several reasons. Bundling our own time zone data allows us to provide
quick updates to users. The ICU project team usually releases the latest time
zone rule data patch as soon as a new tz database release is published (usually
within 1 to 3 days). Having our own rule data also allows the ICU4J library to
provide some advanced TimeZone features (see the
[com.ibm.icu.util.BasicTimeZone API
documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/BasicTimeZone.html)).
#### How can I get the latest time zone rule data patch?
You can use [ICU4J Time Zone Update
Utility](http://site.icu-project.org/download/icutzu) to update the time zone
rule data to the latest version.
#### I do not want to maintain yet another set of time zone rule data. Is there any way to configure ICU4J to use the JRE's time zone data?
If you do not use the advanced TimeZone features, then you can configure ICU4J
to use the JRE's time zone support by editing ICUConfig.properties (included in
the ICU4J library jar file) or simply by setting a system property. See the
[com.ibm.icu.util.TimeZone API
documentation](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/TimeZone.html)
for the details.
### StringSearch
#### Do I have to know anything about Collators to use StringSearch?
Since StringSearch uses a RuleBasedCollator to handle the language-sensitive
aspects of searching, understanding how collation works certainly helps. But the
only parts of the Collator API that you really need to know about are the
collation strength values, `PRIMARY`, `SECONDARY`, and `TERTIARY`, that
determine whether case and accents are ignored during a search.
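To illustrate the effect of strength, here is a sketch using the JDK's `java.text.Collator` (ICU's `RuleBasedCollator` exposes the same strength constants):

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    public static void main(String[] args) {
        // Case is a tertiary difference, so at PRIMARY strength
        // "abc" and "ABC" compare as equal; at TERTIARY they differ.
        Collator c = Collator.getInstance(Locale.US);

        c.setStrength(Collator.PRIMARY);
        System.out.println(c.compare("abc", "ABC")); // 0

        c.setStrength(Collator.TERTIARY);
        System.out.println(c.compare("abc", "ABC") != 0); // true
    }
}
```

A StringSearch configured with a PRIMARY-strength collator would therefore match "abc" inside "xABCx".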
#### What algorithm are you using to perform the search?
StringSearch uses a version of the Boyer-Moore search algorithm that has been
modified for use with Unicode. Rather than using raw Unicode character values in
its comparisons and shift tables, the algorithm uses collation elements that
have been "hashed" down to a smaller range to make the tables a reasonable size.
### RuleBasedBreakIterator
#### Why did you bother to rewrite BreakIterator? Wasn't the old version working?
It was working, but we were too constrained by the design. The break-data tables
were hard-coded, and there was only one set of them. This meant you couldn't
customize BreakIterator's behavior, nor could we accommodate languages with
mutually-exclusive breaking rules (Japanese and Chinese, for example, have
different word-breaking rules.) The hard-coded tables were also very
complicated, difficult to maintain, and easy to mess up, leading to mysterious
bugs. And in the original version, there was no way to subclass BreakIterator
and get any implementation at all -- if you wanted different behavior, you had to
rewrite the whole thing from scratch. We undertook this project to fix all these
problems and give us a better platform for future development. In addition, we
managed to get some significant performance improvements out of the new version.
#### What do you mean, performance improvements? It seems WAY slower to me!
The one thing that's significantly slower is construction. This is because it
actually builds the tables at runtime by parsing a textual description. In the
old version, the tables were hard-coded, so no initialization was necessary. If
this is causing you trouble, it's likely that you're creating and destroying
BreakIterators too frequently. For example, if you're writing code to word-wrap
a document in a text editor, and you create and destroy a new BreakIterator for
every line you process, performance will be unbelievably slow. If you move the
creation out of the inner loop and create a new BreakIterator only once per
word-wrapping operation, or once per document, you'll find that your performance
improves dramatically. If you still have problems after doing this, let us
know -- there may be bugs we need to fix.
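A minimal sketch of the reuse pattern described above, shown with the JDK's `java.text.BreakIterator` (ICU's BreakIterator is used the same way). Note that BreakIterator instances are not thread-safe, so multi-threaded code should keep one instance per thread:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordWrapDemo {
    // Construct the (relatively expensive) BreakIterator once and reuse it
    // via setText() for every line, instead of creating one per line.
    private static final BreakIterator WORDS =
            BreakIterator.getWordInstance(Locale.US);

    static List<Integer> boundaries(String line) {
        List<Integer> result = new ArrayList<>();
        WORDS.setText(line);
        for (int b = WORDS.first(); b != BreakIterator.DONE; b = WORDS.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world"));
    }
}
```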
#### This still has all the same bugs that the old BreakIterator did! Why would I want to use this one instead?
Because now you can fix it. The resource data in this package was designed to
mimic as closely as possible the behavior of the original BreakIterator class
(as of JDK 1.2). We did this deliberately to minimize our variables when making
sure the new iterator still passed all the old tests. We haven't updated it
since to avoid the bookkeeping hassles of keeping track of which version
includes which fixes. We're hoping to get this added to a future version of the
JDK, at which time we'll fix all the outstanding bugs relating to breaking in
the wrong places. In the meantime, you can customize the resource data to modify
things to work the way you want them to.
#### Why is there no demo?
We haven't had time to write a good demo for this new functionality yet. We'll
add one later.
#### What's this DictionaryBasedBreakIterator thing?
This is a new feature that isn't in the JDK. DictionaryBasedBreakIterator is
intended for use with languages that don't put spaces between words (such as
Thai), or for languages that do put spaces between words, but often combine lots
of words into long compound words (such as German). Instead of looking through
the text for sequences of characters that signal the end of a word, it compares
the text against a list of known words, using this to determine where the
boundaries should go. The algorithm we use for this is fast, accurate, and
error-tolerant.
#### Why do you have a Thai dictionary, but no resource data that actually lets me use it?
We're not quite done doing the necessary research. We don't currently have good
test cases we can use to verify it's working correctly with Thai, nor are we
completely confident in our dictionary. If you can help us with this, we'd like
to hear from you!
#### What's this BreakIteratorRules_en_US_TEST thing?
This is a resource file that, in conjunction with the "english.dict" dictionary,
we used to test the dictionary-based break iterator. It allows you to locate
word boundaries in English text that has had the spaces taken out. (The
SimpleBITest program demonstrates this.) The dictionary isn't
industrial-strength, however: we included enough words to make for a reasonable
test, but it's by no means complete or anywhere near it.
#### How can I create my own dictionary file?
Right now, you can't. We didn't include the tool we used to create dictionary
files because it's very rough and extremely slow. There's also a strong
likelihood that the format of the dictionary files will change in the future. If
you really want to create your own dictionary file, contact us, and we'll see
what we can do.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU FAQs
## Introduction to ICU
#### What is ICU?
ICU is a cross-platform, Unicode-based globalization library. It includes support
for locale-sensitive string comparison, date/time/number/currency/message
formatting, text boundary detection, character set conversion and so on.
#### Where can I get ICU?
You can get ICU4C and ICU4J from <http://www.icu-project.org/download/>
**Why don't you build binaries for my platform?**
There are many versions of compilers on so many platforms that we cannot build
them all and guarantee compatibility between them all even on the same platform.
Due to these restrictions, we only distribute a limited number of binary
versions of ICU, but we will assist in building other versions from source.
**Why don't you provide project files for my MSVC version (MSVC 2008, etc)?**
You can use the Cygwin build environment to build ICU from source against the
MSVC compiler. See the ICU4C Readme.
#### How do I install the binary versions of ICU?
* **Windows**:
* The DLLs you may need for your application are located in
**bin\\icuXX##.dll**, where "XX" are two letters (such as "uc" for the
"common" library, "in" for the "i18n" library, etc.) and ## is the major
and the minor version number (such as **42** for **4.2** / **4.2**.0.1
or **4.2**.4).
* Either place the DLLs in the same directory as your application's .EXE
files, or set the PATH variable to point to the directory containing the
ICU DLLs.
* For compiling applications, add the "include" direcotry (the parent of
the "unicode" and "layout" directories) to the include search path.
* For linking applications, add the "lib" directory to the appropriate
path.
* **Other Platforms**:
* For other platforms, the .tgz file unpacks to a "/usr/local" type
hierarchy. For system-wide installation, you can unpack all of the files
into /usr/local/bin, /usr/local/include, etc.
* The configuration script **/usr/local/bin/icu-config** or the similar
Makefile include fragment **/usr/local/lib/icu/current/Makefile.inc**
can be used in building applications.
#### Can you help me build ICU4C for ...
We can try ... make sure you read the latest "readme" and also the [ICU
Data](../icudata.md) section. You might also try [searching the icu-support
archives](http://site.icu-project.org/contacts), and then posting a question
there. Additionally, sites such as
[StackOverflow](http://stackoverflow.com/search?q=icu) may have helpful tips for
your topic.
* **Android NDK**
* Please try [searching the icu-support
archives](http://site.icu-project.org/contacts) and also see
[StackOverflow](http://stackoverflow.com/search?q=icu+android).
* **iPhone**
* Please try [searching the icu-support
archives](http://site.icu-project.org/contacts) and also see
[StackOverflow](http://stackoverflow.com/search?q=icu+iphone).
#### What is the ICU binary compatibility policy?
Please see the section on
[binary compatibility](../design.md#icu-binary-compatibility-using-icu-as-an-operating-system-level-library)
in the [design chapter](../design.md).
#### How is ICU licensed?
The ICU license is intended to allow ICU to be included both in free software
projects and in proprietary or commercial products.
Since ICU 58, ICU is covered by the
[Unicode license](http://www.unicode.org/copyright.html#License) which is very similar to
the previous ICU license.
ICU 1.8.1 through ICU 57 and ICU4J 1.3.1 through ICU4J 57 are covered by the [ICU
license](https://github.com/unicode-org/icu/blob/release-57-1/icu4c/LICENSE),
a simple, permissive non-copyleft free software license, compatible with the GNU
GPL. The ICU license is identical to the version of the X license that was
formerly available at <http://www.x.org/Downloads_terms.html> . (This site no
longer exists, but can still be retrieved through internet archive services.)
#### Can I use ICU from other languages besides C/C++ and Java?
There are a number of wrappers available, please see the
[Related Projects](http://site.icu-project.org/related) page.
#### How do I upgrade to a new version of ICU? Should I be concerned about API changes, a new Unicode version, or a new CLDR version?
Our goal is for ICU upgrades to go smoothly. Here are some steps you can take to
prepare for an upgrade, or to make sure that your usage of ICU is
upgrade-friendly.
* **API:** ensure that you are not using draft APIs which may have changed in
a future release. See the section on
[API compatibility](../design.md#icu-api-compatibility) in the
[design chapter](../design.md).
* **Unicode:** See the release notes for particular versions of Unicode to
ensure that your code is not affected by property changes or other
specification changes.
* **CLDR:** If your application has test cases which depend on specific
translations, these assumptions may become invalid if the translation of an
item changes, new support is added, or if a country changes its currency.
Try not to depend on specific translations, or be prepared to change test
cases. Also, a newer version may support additional translations,
currencies, or types of calendars.
* **Building/Deploying your Application (ICU4C):** ICU4C usually builds with
symbol renaming (See:
[binary compatibility](../design.md#icu-binary-compatibility-using-icu-as-an-operating-system-level-library)
in the [design chapter](../design.md)). Be sure that you build your
application with the updated ICU header files, so that it will link against
the current ICU. Also, don't hard-code the names of ICU libraries in your
build scripts and projects. Where possible, link against just the
'base name' such as `libicuuc.so` or `icuuc.lib` rather than a name
containing the version number, such as `libicuuc.so.46` or `icuuc46.dll`.
## Building and Testing ICU
#### How do I build ICU?
See the readme.html that is included with ICU.
#### How do I get 32- or 64-bit versions of the ICU libraries?
From ICU version 4.2 on, the configure script will build with the default bit
width of your platform. You can request 64 or 32 bits with the
`--with-library-bits=` option (e.g. `runConfigureICU Linux --with-library-bits=64`
or `runConfigureICU MacOSX --with-library-bits=32`).
(For the behavior of attempting 64 bits if possible, use
`--with-library-bits=64else32`.)
#### How do I build an optimized, non-debug ICU?
On Win32, choose the 'Release' configuration from the drop down menu. On other
platforms, use the runConfigureICU script, which uses the configure script. The
runConfigureICU script uses the safest level of optimization for the ICU
libraries. If your platform is not specified, set the following environment
variables before running configure or runConfigureICU: **CFLAGS=-O CXXFLAGS=-O**
#### Why am I getting so many test failures when I use "gmake check"?
Please view the readme that is included with ICU. It has all the details on how
to build and test ICU, and it usually answers most problems.
If you are using a compiler that hasn't been tested with ICU before, you may
have encountered an optimization bug with the compiler. On Unix platforms you
can specify **--disable-release** when you are using runConfigureICU (e.g.
`runConfigureICU --disable-release LinuxRedHat`). If this fixes your problem, it
is recommended that you report the optimization bug to the compiler
manufacturer.
If neither of these fix your problem, please send an e-mail to the [ICU4C
Support List](http://icu-project.org/contacts.html) .
#### How can I reduce the size of the ICU data library?
Use the [Data Customizer](https://unicode-org.atlassian.net/browse/ICU-12835)
or see
[Customizing ICU's Data Library](../icudata.md#customizing-icus-data-library)
in the [ICU Data Management](../icudata.md) chapter of this User's Guide.
#### Why am I seeing a small (only a few K) instead of a large (several megabytes) data shared library (icudt)?
#### Opening ICU services fails with U_MISSING_RESOURCE_ERROR and u_init() returns failure.
ICU libraries always must link with the ICU data library. However, so that ICU
can bootstrap itself, it first builds a 'stub' data library, in
**icu\\source\\stubdata**, so that the tools can function. You should only use
this in production if you are NOT using DLL-mode data access, in which case you
are accessing ICU data as individual files, as an archive (.dat) file, or some
other means. Normally, you should be using the larger library built from
**icu\\source\\data**. If you see this issue after ICU has completed building,
re-run 'make' in **icu\\source\\data**, or the '**makedata**' project in Visual
Studio.
#### Can I add or remove a converter from ICU?
Yes. Please see [Customizing ICU's Data Library](../icudata.md#customizing-icus-data-library)
in the [ICU Data Management](../icudata.md) chapter of this User's Guide. You can also
get extra converters from <http://www.icu-project.org/charts/charset/> or use
the [ICU Data Customizer](https://unicode-org.atlassian.net/browse/ICU-12835)
tool.
#### Why don't the makefiles work?
You need GNU's make program version 3.8 or later, and you need to run the
runConfigureICU script, which is located in the `icu/source` directory. You may
be using a platform that ICU does not support. If the first two answers do not
apply to you, then you should send an e-mail to the
[ICU4C Support List](http://www.icu-project.org/contacts.html).
Here are some places you can find gmake:
1. GNU: <http://www.gnu.org/software/make/>
2. Sun® Source/Binaries: <http://www.sunfreeware.com>
3. z/OS (OS/390) Source/Binaries:
<http://www.ibm.com/servers/eserver/zseries/zos/unix/bpxa1ty1.html#opensrc>
4. IBM i (OS/400) Source/Binaries:
<http://www.ibm.com/servers/enable/site/porting/iseries/overview/gnu_utilities.html>
Due to differences in every platform's make program, we will not support other
versions of our make files.
#### What version of the C++ iostream is used in ICU4C?
ICU4C uses the latest available version of the iostream on the target platform.
Only the `io` library uses iostream.
#### I only want to use the C APIs, do I need a C++ compiler?
Large portions of ICU4C were always implemented in C++, and over time we are
moving more into that direction. We continue to support and add C APIs, in order
to provide binary-compatible APIs. For the implementation, C++ is much better:
It is generally easier to work with, which reduces bugs and maintenance. It is
closer to Java, which is important for porting between ICU4C and ICU4J. We use
[RAII](http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization)
(e.g., LocalPointer) to reduce opportunities for memory leaks, we use inline
functions and type-safe constants instead of #define, etc. However, we do not
use exceptions, and we do not use the Standard Template Library (STL), so
ICU4C's dependencies on the C++ library are minimal. See the new
[dependencies.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/test/depstest/dependencies.txt)
and search for "group: cplusplus".
As ICU does not use exceptions, the GCC option `-fno-exceptions` will reduce or
remove the dependencies on the standard C++ library. In
[GCC](http://gcc.gnu.org) 4.5 there is an option `-static-libstdc++` which will
remove C++ library dependencies. Visual Studio has the
[/MT option](http://msdn.microsoft.com/en-us/library/2kzt1wy3(v=VS.100).aspx),
and other compilers may have similar options. See the
[How To Use ICU](../howtouseicu.md) page for related information on this topic.
## Features of ICU
#### What computer languages does ICU support?
ICU4C (ICU) is written in C and C++, and ICU4J is written in Java™.
#### How are the APIs documented for deprecation?
Please read the [ICU API compatibility](../design.md#icu-api-compatibility)
section in the [ICU Design](../design.md) chapter.
#### What version of Unicode standard does ICU support?
ICU version 65 supports Unicode version 12.
The Unicode versions for older versions of ICU are listed on the ICU download
page, <http://www.icu-project.org/download/>
#### Does ICU support UTF-16 surrogates and Unicode supplementary characters?
Yes.
#### Does Java support UTF-16 surrogates and Unicode supplementary characters?
Java 5 introduced support for Unicode supplementary characters. Java 1.4 and
earlier do not directly support them.
#### How does ICU relate to Java's java.text.* package?
The International Components for Unicode are available both as a C/C++ library
and a Java class library. ICU provides internationalization utilities for
writing global applications in C, C++ or Java programming languages. ICU was
originally developed by the Unicode group at the IBM Globalization Center of
Competency in Cupertino, and ICU was contributed to Sun for inclusion into the
JDK 1.1. ICU4J includes enhanced versions of some of these contributed classes
plus additional classes that complement the classes in the JDK.
ICU4C started as a C++ port of the original Java Internationalization classes.
These classes are now partially implemented in C, with largely parallel C and
C++ APIs. ICU4C and ICU4J continue to leapfrog each other with features and bug
fixes. Over time, features from ICU4J get added to the JDK as well.
Both versions of ICU have a goal to implement the latest Unicode standard,
maintain a single portable source code base, and to make it easier for software
developers to create global applications.
## Using ICU
#### Can I use any of the features of ICU without Unicode strings?
No. In order to use the collation, text boundary analysis, formatting or other
ICU APIs, you must use Unicode strings. In order to get Unicode strings from
your native codepage, you can use the conversion API.
#### How do I declare a Unicode string in ICU?
Use the `U_STRING_DECL` and `U_STRING_INIT` macros, or use the UnicodeString
class for C++. Strings are represented with `UChar *` as the base string type.
Even though most platforms declare wide strings as `wchar_t *` or `L""` as the
base string type, that declaration is not portable because the `sizeof(wchar_t)`
can be 1, 2 or 4, and the encoding may not even be Unicode. On the platforms
where `sizeof(wchar_t)` is 2 bytes, `UChar` is defined as `wchar_t`. In that
case you can use ICU's strings with 3rd party legacy functions; however, we do
not suggest using Unicode strings without the `U_STRING_DECL` and
`U_STRING_INIT` macros or the UnicodeString class, because they are
platform-independent implementations.
#### How is a Unicode string represented in ICU4C?
A Unicode string is currently represented as UTF-16. The endianness of UTF-16 is
platform dependent. You can guarantee the endianness of UTF-16 by using a
converter. UTF-16 strings can be converted to other Unicode forms by using a
converter or with the UTF conversion macros.
ICU does not use UCS-2. UCS-2 is a subset of UTF-16. UCS-2 does not support
surrogates, and UTF-16 does support surrogates. This means that UCS-2 only
supports UTF-16's Base Multilingual Plane (BMP). The notion of UCS-2 is
deprecated and dead. Unicode 2.0 in 1996 changed its default encoding to UTF-16.
If you need to do a quick and easy conversion between UTF-16 and UTF-8, UTF-32
or an encoding in `wchar_t`, you should take a look at unicode/ustring.h. In
that header file you will find `u_strToWCS`, `u_strFromWCS`, `u_strToUTF8`,
`u_strFromUTF8`, `u_strToUTF32` and `u_strFromUTF32` functions. These
functions are provided for your convenience instead of using the `ucnv_*` API.
You can also take a look at the `UTF_*`, `UTF8_*`, `UTF16_*` and `UTF32_*`
macros, which are defined in
[unicode/utf.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utf.h),
[unicode/utf8.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utf8.h),
[unicode/utf16.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utf16.h)
and [unicode/utf32.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utf32.h).
These macros are helpful for programmers that need to manipulate and process
Unicode strings.
#### How do I index into a UTF-16 string?
Typically, indexes and offsets in strings count string units, not characters
(although in C and Java they have a char type).
For example, in old-fashioned MBCS strings, you would count indexes and offsets
by bytes, not by the variable-width character count. In UTF-16, you do the same,
just count 16-bit units (in ICU: UChar).
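A small illustration in Java, whose `String` type also uses UTF-16 code units:

```java
public class Utf16IndexDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so it takes
        // two UTF-16 code units (a surrogate pair) in the string.
        String s = "a\uD834\uDD1Eb"; // "a" + G clef + "b"

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 characters
        System.out.println(s.indexOf('b'));                  // 3: counted in units
    }
}
```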
#### What is the performance difference between UTF-8 and UTF-16?
Most of the time, the memory throughput of the hard drive and RAM is the main
performance constraint. UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8
is 50% larger than UTF-16 for East and South Asian scripts. There is no memory
difference for Latin extensions, Greek, Cyrillic, Hebrew, and Arabic.
For processing Unicode data, UTF-16 is much easier to handle. You get a choice
between either one or two units per character, not a choice among four lengths.
UTF-16 also does not have illegal 16-bit unit values, while you might want to
check for illegal bytes in UTF-8. Incomplete character sequences in UTF-16 are
less important and more benign. If you want to quickly convert small strings
between the different UTF encodings or get a UChar32 value, you can use the
macros provided in `utf.h` and its siblings `utf8.h` and `utf16.h`. For larger
or partial strings, please use the conversion API.
#### How do the converters work?
The converters act like a data stream. This means that the state of the last
character is saved in the converter after each call to the `ucnv_fromUnicode()`
and `ucnv_toUnicode()` functions. So if the source buffer ends with part of a
surrogate Unicode character pair, the next call to `ucnv_toUnicode()` will
write out the equivalent character to the destination buffer. Please see the
[Conversion](../conversion/index.md) chapter of the User's Guide for details.
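The same streaming principle can be sketched with the JDK's `CharsetDecoder` (not ICU's `ucnv_*` API, but it keeps conversion state across calls in the same way):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class StreamingDecodeDemo {
    public static void main(String[] args) {
        // "é" (U+00E9) in UTF-8 is two bytes: 0xC3 0xA9. Feed them in
        // separate chunks: the decoder remembers the incomplete sequence
        // between calls, just as ucnv_toUnicode() saves converter state.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(8);

        dec.decode(ByteBuffer.wrap(new byte[]{(byte) 0xC3}), out, false); // partial
        dec.decode(ByteBuffer.wrap(new byte[]{(byte) 0xA9}), out, true);  // completes
        dec.flush(out);

        out.flip();
        System.out.println(out.toString()); // é
    }
}
```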
#### What does a locale look like in ICU?
ICU locales are lightweight, and they are represented by just a string.
Lightweight means that there is just a string to represent a locale and nothing
more. Many platforms have numbers and other data structures to represent a
locale, but ICU has one simple platform independent string to represent a
locale.
ICU locales usually contain an ISO-639 language name (2-3 characters), an
ISO-3166 country name (2-3 characters), and a variant name which is user
specified. When a language or country is not represented by these standards, ICU
uses 3 characters to represent that part of the locale. All three parts are
separated by an underscore "_". For example, US English is "en_US", and German
in Germany with the Euro symbol is represented as "de_DE_EURO". Traditionally
the language part of the locale is lowercase, the country is uppercase and the
variant is uppercase. More details are available from the [Locale
Chapter](../locale/index.md) of this User's Guide.
#### How is ICU versioned?
Please read the [ICU Design](../design.md) chapter of the User's Guide.
#### What is the relationship between ICU locale data and system locale data?
There is no relationship. ICU is not dependent on the operating system for the
locale data.
This also means that `uloc_setDefault()` does not affect the operating system.
The function `uloc_setDefault()` only sets ICU's default locale. Normally the
default locale for ICU is whatever the operating system says is the default
locale.
#### How are errors handled in ICU?
Since not all compilers can handle exceptions, we return an error from functions
with a `UErrorCode` parameter. The `UErrorCode` parameter of a function will
return any errors that occurred while it was executing. It's usually a good idea
to check for errors after calling a function by using the `U_SUCCESS` and
`U_FAILURE` macros. `U_SUCCESS` returns true when the function did run properly,
and `U_FAILURE` returns true when the function did NOT run properly. You may
handle specific errors from a function by checking the exact value of error. The
possible values of `UErrorCode` are located in
[utypes.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
of the common project. Before any function is called with a `UErrorCode`, it
must be initialized to `U_ZERO_ERROR`.
Here is an example of `UErrorCode` being used.
```c++
UErrorCode err = U_ZERO_ERROR;
callMyFunction(&err);
if (U_FAILURE(err)) {
puts("callMyFunction() Failed!");
}
```
Please see the [ICU Design](../design.md) chapter for details.
#### With calendar classes, why are months 0-based?
"I have been using ICU for its calendar classes, and have found it to be
excellent. That said, I am wondering why the decision was made to keep months
0-based while almost all the other calendrical units (years, weeks of year,
weeks of month, date, days of year, days of week, days of week in month) are
1-based? This has been the source of several bugs whenever the mind is slightly
less than razor sharp." --Contributor
This was not our choice. We inherited it from the Java Calendar API,
unfortunately.
#### Is there a guideline for COBOL programs that want to use ICU?
There is a COBOL/ICU guideline available since ICU 2.2. For more details, please
refer to the [COBOL section](../usefrom/cobol.md) of this User's Guide.
#### Where can I get more information about using ICU?
Please send an e-mail to the [ICU4C Support
List](http://www.icu-project.org/contacts.html).

docs/userguide/index.md
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Introduction to ICU
As companies integrate e-commerce on a global scale into their fundamental
business processes, their prospective customers, established customers, and
active partners can take advantage of increased revenue and decreased expenses
through software internationalization. They also can improve customer
communications and increase savings.
## Meeting the Challenge of Globalization
In today's business climate of globalization, companies must compete in a new
Internet-enabled business climate of constant change and compressed time frames.
Their customers expect reliable service and support.
## Taking Advantage of Internationalized Software
Companies need to establish a better linkage between their global business
processes and the underlying supportive IT processes. If they want to deliver
this new flexibility and agility, they must depend on the software
internationalization process.
The software internationalization development process uses libraries (such as
the International Components for Unicode (ICU) libraries), to enable one single
program to work with text in any language for any place in the world. For
example, instead of having separate software versions for ten different
countries, the ICU services can create one version that works seamlessly and
transparently in all of them.
The ICU components are an integral part of software development because they
hide the cultural nuances and technical complexities of locale-specific software
requirements. These complexities provide critical functionality for
applications, but the application developer does not need to exert a huge effort
or incur high costs to build them.
## Justifying the Investment
The business case needed to justify the investment in software
internationalization is compelling when the investment is amortized over a
number of projects. In the fast-paced and rapidly-evolving world of traditional
and evolving e-businesses, these international components provide a firm ground
on which companies, partners and suppliers can build their business
transactions. They can share competitive information to help gain a significant
economic advantage.
The ICU services deliver proven value by lowering the cost required to integrate
with disparate applications, systems and data sources on a regional and global
scale. They add value to a company's IT investment by lowering IT complexity,
risk, maintenance costs and training costs. They also enhance organizational
flexibility, leverage existing assets, and improve planning and decision-making,
enabling organizational learning, process-driven synchronization, and
event-driven evaluation and decision-making.
## Background and History of ICU
ICU was originally developed by the Taligent company. The Taligent team later
became the Unicode group at the IBM® Globalization Center of Competency in
Cupertino. The team has received significant input from the open source
community worldwide.
Java™ classes developed at Taligent were incorporated into the Java Development
Kit (JDK) 1.1 developed by Sun® Microsystems. The classes were then ported to
C++ and later some classes were also ported to C. The classes provide
internationalization utilities for writing global applications in C, C++, or
Java programming languages.
ICU for Java (ICU4J) includes enhanced versions of some of these classes, plus
additional classes that complement the classes in the JDK. C and C++ versions of
the same international functionality are available in ICU for C (ICU4C). The
APIs differ slightly due to language differences and new functionality. For
example, ICU4C includes a character converter API.
ICU4J and ICU4C keep the same development goals. They both track additions to
the Java internationalization APIs and implement the latest released Unicode
standard. They also maintain a single, portable source code base.
All of us in the ICU and open source group appreciate the time you are taking to
understand our technology. We have put our best collective effort into these
open components, and look forward to your questions, comments and suggestions.
## Downloading ICU
Download ICU in one of the following ways:
1. From the download page, <http://www.icu-project.org/download/>, for
packaged stable releases of ICU.
2. From the source code repository, <http://www.icu-project.org/repository/>,
for the latest development versions.
After downloading, see the included README file for information on what is
included, building, installing, etc.
## ICU License
Current license: <https://github.com/unicode-org/icu/blob/master/icu4c/LICENSE>
See also <https://github.com/unicode-org/icu/blob/userguide-migration/docs/userguide/icufaq/index.md#how-is-the-icu-licensed>
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU IO
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# C: ustdio
This API provides a `<stdio.h>`-like API wrapper around ICU's other [formatting
and parsing](../formatparse/index.md) APIs. It is meant to ease the transition of adding
Unicode support to preexisting applications that use stdio. The following is a
small list of noticeable differences between stdio and ICU I/O's ustdio
implementation.
* Locale specific formatting and parsing is only done with file IO.
* `u_fstropen` can be used to simulate file IO with strings. This is similar
to the iostream API, and it allows locale specific formatting and parsing to
be used.
* This API provides uniform formatting and parsing behavior between platforms
(unlike the standard stdio implementations found on various platforms).
* This API is better suited for text data handling than binary data handling
when compared to the typical stdio implementation.
* You can specify a [Transliterator](../transforms/index.md) while using the
file IO.
* You can specify a file's [codepage](../conversion/converters.md) separately
from the default codepage.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# C++: ustream
The ustream interface provides a Unicode iostream-like API.
At this time, this API contains `operator<<` and `operator>>` for
[UnicodeString](../strings/index.md) manipulation with the C++ I/O stream API.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Layout Engine
## Line Layout Deprecation
> :warning: ***The ICU Line LayoutEngine has been removed in ICU 58.*** It had not had active
> development for some time, had many open bugs,
> and had been deprecated in ICU 54.
>
> Users of ICU Layout are **strongly** encouraged to consider the HarfBuzz project
> as a replacement for the ICU Layout Engine. An ICU team member responsible for
> the Layout Engine is contributing fixes and features to HarfBuzz, and a drop-in
> wrapper is available to allow use of HarfBuzz as a direct replacement for the
> ICU layout engine.
>
> HarfBuzz has its own active mailing lists; please use those for discussion of
> HarfBuzz and its use as a replacement for the ICU layout engine.
See:
[http://www.freedesktop.org/wiki/Software/HarfBuzz](http://www.freedesktop.org/wiki/Software/HarfBuzz)
> :point_right: **Users of the "layoutex" ParagraphLayout library**: Please see information
> about how to build "layoutex" on the [Paragraph Layout](paragraph.md) page.
## Overview
:warning: **See deletion/deprecation notice, above.**
The Latin script, which is the most commonly used script among software
developers, is also the least complex script to display especially when it is
used to write English. Using the Latin script, characters can be displayed from
left to right in the order that they are stored in memory. Some scripts require
rendering behavior that is more complicated than the Latin script. We refer to
these scripts as "complex scripts" and to text written in these scripts as
"complex text." Examples of complex scripts are the Indic scripts (for example,
Devanagari, Tamil, Telugu, and Gujarati), Thai, and Arabic.
These complex scripts exhibit complications that are not found in the Latin
script. The main complications in complex text are the following:

1. Characters may take different shapes depending on their context (contextual forms).
2. Several characters may combine into a single ligature glyph.
3. The display order of the glyphs may differ from the order of the characters in memory (reordering).
4. A single character may be displayed using glyphs in more than one position (split characters).
The ICU LayoutEngine is designed to handle these complications through a simple,
uniform client interface. Clients supply Unicode code points in reading or
"logical" order, and the LayoutEngine provides a list of what to display,
indicates the correct order, and supplies the positioning information.
Because the ICU LayoutEngine is platform independent and text rendering is
inherently platform dependent, the LayoutEngine cannot directly display text.
Instead, it uses an abstract base class to access font files. This base class
models a TrueType font at a particular point size and device resolution. The
TrueType fonts have the following characteristics:
1. A font is a collection of images, called glyphs. Each glyph in the font is
referred to by a 16-bit glyph id.
2. There is a mapping from Unicode code points to glyph ids. There may be
glyphs in the font for which there is no mapping.
3. The font contains data tables referred to by 4 byte tags. (e.g. `GSUB`,
`cmap`). These tables can be read into memory for processing.
4. There is a method to get the width of a glyph.
5. There is a method to get the position of a control point from a glyph.
Since many of the contextual forms, ligatures, and split characters needed to
display complex text do not have Unicode code points, they can only be referred
to by their glyph indices. Because of this, the LayoutEngine's output is a list
of glyph indices. This means that the output must be displayed using an
interface where the characters are specified by glyph indices rather than code
points.
A concrete instance of this base class must be written for each target platform.
For a simple example that uses the standard C library to access a TrueType
font, look at the PortableFontInstance class in
[icu/source/test/letest](https://github.com/unicode-org/icu/tree/master/icu4c/source/test/letest).
The ICU LayoutEngine supports complex text in the following ways:
1. If the font contains OpenType® tables, the LayoutEngine uses those tables.
2. If the font contains Apple Advanced Typography (AAT) tables, the
LayoutEngine uses those tables.
3. For Arabic and Hebrew text, if OpenType tables are not present, the
LayoutEngine uses Unicode presentation forms.
4. For Thai text, the LayoutEngine uses either the Microsoft or Apple Thai
forms.
OpenType processing requires script-specific processing to be done before the
tables are used. The ICU LayoutEngine performs this processing for Arabic,
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, and
Malayalam text.
The AAT processing in the LayoutEngine is relatively basic as it only applies
the default features in left-to-right text. This processing has been tested for
Devanagari text. Since AAT processing is not script-specific, it might not work
for other scripts.
## Programming with the LayoutEngine
**See deprecation notice, above.**
The ICU LayoutEngine is designed to process a run of text that is in a single
font, a single script, and a single direction (left-to-right or right-to-left).
Clients can use ICU's
[Bidi](../transforms/bidi.md) processing to determine the direction of the text
and use the ScriptRun class in
[icu/source/extra/scrptrun](https://github.com/unicode-org/icu/tree/master/icu4c/source/extra/scrptrun)
to find a run of text in the same script. Since the representation of font
information is application specific, ICU cannot help clients find these runs of
text.
Once the text has been broken into pieces that the LayoutEngine can handle, call
the LayoutEngineFactory method to create an instance of the LayoutEngine class
that is specific to the text. The following demonstrates a call to the
LayoutEngineFactory:
```c++
LEFontInstance *font = <the text's font>;
UScriptCode script = <the text's script>;
LEErrorCode error = LE_NO_ERROR;
LayoutEngine *engine;

engine = LayoutEngine::layoutEngineFactory(font,
                                           script,
                                           0, // language - ignored
                                           error);
```

The following example shows how to use the LayoutEngine to process the text:

```c++
LEUnicode text[] = <the text to process>;
le_int32 offset = <the starting offset of the text to process>;
le_int32 count = <the number of code points to process>;
le_int32 max = <the total number of characters in text>;
le_bool rtl = <true if the text is right-to-left, false otherwise>;
float x, y = <starting x, y position of the text>;
le_int32 glyphCount;

glyphCount = engine->layoutChars(text, offset, count, max, rtl,
                                 x, y, error);
```
This previous example computes three arrays: an array of glyph indices in
display order, an array of x, y position pairs for each glyph, and an array that
maps each output glyph back to the input text array. Use the following get
methods to copy these arrays:
```c++
LEGlyphID *glyphs = new LEGlyphID[glyphCount];
le_int32 *indices = new le_int32[glyphCount];
float *positions = new float[(glyphCount * 2) + 2];
engine->getGlyphs(glyphs, error);
engine->getCharIndices(indices, error);
engine->getGlyphPositions(positions, error);
```
> :point_right: **Note** The positions array contains (glyphCount * 2) + 2 entries. This is because
> there is an x and a y position for each glyph. The extra two positions hold the
> x, y position of the end of the text run.
Once users have the glyph indices and positions, they can use the
platform-specific code to draw the glyphs. For example, on Windows 2000, users
can call `ExtTextOut` with the `ETO_GLYPH_INDEX` option to draw the glyphs and on
Linux, users can call `TT_Load_Glyph` to get the bitmap for each glyph. However,
users must draw the bitmaps themselves.
> :point_right: **Note:** The ICU LayoutEngine was developed separately from the rest of ICU and uses
> different coding conventions and basic types. To use the LayoutEngine with ICU
> coding conventions, users can use the ICULayoutEngine class, which is a thin
> wrapper around the LayoutEngine class that incorporates ICU conventions and
> basic types.
For a more detailed example of how to call the LayoutEngine, look at
[icu/source/test/letest/letest.cpp](https://github.com/unicode-org/icu/tree/master/icu4c/source/test/letest/letest.cpp).
This is a simple test used to verify that the LayoutEngine is working
properly. It does not do any complex text rendering.
For more information, see [ICU](http://icu-project.org/), the [OpenType
Specification](http://www.microsoft.com/typography/tt/tt.htm), and the
[TrueType Font File
Specification](http://developer.apple.com/fonts/TTRefMan/RM06/Chap6.html).
> :warning: **Note:** See deprecation notice, above.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Paragraph Layout
This page is about the Paragraph Layout library that is available in ICU4C/C++.
For information about the deprecated Line Layout Engine, including its deprecation notice,
see: [Layout Engine](index.md).
### About the Paragraph Layout library
* The ICU Line LayoutEngine works on small chunks - unidirectional runs. It does
not lay out text at the paragraph level.
* The **ParagraphLayout** object will analyze the text into runs of text in
the same font, script and direction, and will create a LayoutEngine object
for each run. The LayoutEngine will transform the characters into glyph
codes in visual order. Clients can use this to break a paragraph into lines,
and to display the glyphs in each line.
* Also see the
[ParagraphLayout](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1ParagraphLayout.html)
API Docs
### Building the Paragraph Layout library with HarfBuzz
While the ICU LayoutEngine is deprecated as of ICU 54, the ICU Paragraph
Layout library is not. The Paragraph Layout library must now be built using the HarfBuzz engine instead of the ICU LayoutEngine.
#### UNIX Makefile instructions / Cygwin / Msys / etc. (ICU 54+)
The following steps must be completed in order:
1. Build and install a complete ICU with the **`--disable-layout`
`--disable-layoutex`** switches passed to configure
2. Build and install HarfBuzz - http://harfbuzz.org (HarfBuzz's use of ICU may
be enabled or disabled at your choice)
3. Build and install the [icu-le-hb](https://github.com/harfbuzz/icu-le-hb) library.
4. Now, rerun "configure" on the exact **same** ICU workspace used above:
* with "icu-le-hb" AND the above-mentioned installed ICU available via
pkg-config ( `pkg-config --modversion icu-le-hb` should return a version,
such as "0.0.0" )
* with the --disable-layout **`--enable-layoutex`** switches passed to configure
5. next, run `make install` JUST in the **`source/layoutex`** directory, to install
libiculx and `icu-lx.pc`
The above steps will produce a libiculx library that depends on HarfBuzz.
If pkg-config visible installation is not suitable for step 4, you may also
manually set the following variables when building ICU in step 5:
* set `ICULEHB_CFLAGS` to the appropriate include path for icu-le-hb ( such
as **`-I/usr/local/include/icu-le-hb`** )
* set `ICULEHB_LIBS` to link against icu-le-hb and dependents as needed
(such as **`-L/usr/local/lib -licu-le-hb`** )
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Locale Examples
## Locale Currency Conventions
Application programs should not reset the default locale as a way of requesting
an international object, because resetting default locale affects the other
programs running in the same process. Use one of the factory methods instead,
e.g. `Collator::createInstance(Locale)`.
In general, a locale object or locale string is used for specifying the locale.
Here is an example that specifies the French (Belgium) locale:
**C++**
```c++
Locale loc("fr", "BE");
Locale loc2("fr_BE");
```
**C**
```c
const char *loc = "fr_BE";
```
**Java**
```java
ULocale loc = new ULocale("fr_BE");
```
> :point_right: **Note**: **Java** does **not** support the form `Locale("xx_yy_ZZ")`,
> instead use the form `Locale("xx","yy","ZZ")`.
## Locale Constants
A `Locale` is the mechanism for identifying the kind of object (`NumberFormat`) that
you would like to get. The locale is just a mechanism for identifying objects,
not a container for the objects themselves. For example, the following creates
various number formatters for the "Germany" locale:
**C++**
```c++
UErrorCode status = U_ZERO_ERROR;
NumberFormat *nf;
nf = NumberFormat::createInstance(Locale::getGermany(), status);
delete nf;
nf = NumberFormat::createCurrencyInstance(Locale::getGermany(), status);
delete nf;
nf = NumberFormat::createPercentInstance(Locale::getGermany(), status);
delete nf;
```
**C**
```c
UErrorCode success = U_ZERO_ERROR;
UNumberFormat *nf;

nf = unum_open( UNUM_DEFAULT, NULL, 0, "de_DE", NULL, &success );
unum_close(nf);
nf = unum_open( UNUM_CURRENCY, NULL, 0, "de_DE", NULL, &success );
unum_close(nf);
nf = unum_open( UNUM_PERCENT, NULL, 0, "de_DE", NULL, &success );
unum_close(nf);
```
**Java**
```java
NumberFormat nf = NumberFormat.getInstance(ULocale.GERMANY);
NumberFormat currencyInstance = NumberFormat.getCurrencyInstance(ULocale.GERMANY);
NumberFormat percentInstance = NumberFormat.getPercentInstance(ULocale.GERMANY);
```
## Querying Locale
Each class that performs locale-sensitive operations allows you to get all the
available objects of that type. You can sift through these objects by language,
country, or variant, and use the display names to present a menu to the user.
For example, you can create a menu of all the collation objects suitable for a
given language. For example, the following shows the display name of all
available locales in English (US):
**C++**
```c++
int32_t count;
const Locale* list = NULL;
UnicodeString result;
list = Locale::getAvailableLocales(count);
for (int i = 0; i < count; i++) {
list[i].getDisplayName(Locale::getUS(), result);
/* print result */
}
```
**C**
```c
int32_t count;
UChar result[100];
int i = 0;
UErrorCode status = U_ZERO_ERROR;
count = uloc_countAvailable();
for (i = 0; i < count; i++) {
uloc_getDisplayName(uloc_getAvailable(i), "en_US", result, 100, &status);
/* print result */
}
```
**Java**
```java
import com.ibm.icu.util.*;
public class TestLocale {
public void run() {
ULocale l[] = ULocale.getAvailableLocales();
int n = l.length;
for(int i=0; i<n; ++i) {
ULocale locale = l[i];
System.out.println();
System.out.println("The base name of this locale is: " + locale.getBaseName());
System.out.println("Locale's country name: " + locale.getDisplayCountry());
System.out.println("Locale's script name: " + locale.getDisplayScript());
System.out.println("Locale's language: " + locale.getDisplayLanguage());
System.out.println("Locale's variant: " + locale.getDisplayVariant());
}
}
public static void main(String args[]) {
new TestLocale().run();
}
}
```
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Locale
## Overview
This chapter explains **locales**, a fundamental concept in ICU. ICU services
are parameterized by locale, to allow client code to be written in a
locale-independent way, but to deliver culturally correct results.
## The Locale Concept
A locale identifies a specific user community - a group of users who have
similar culture and language expectations for human-computer interaction (and
the kinds of data they process).
A community is usually understood as the intersection of all users speaking the
same language and living in the same country. Furthermore, a community can use
more specific conventions. For example, an English/United States/Military locale
is separate from the regular English/United States locale since the US military
writes times and dates differently than most of the civilian community.
A program should be localized according to the rules specific for the target
locale. Many ICU services rely on the proper locale identification in their
function.
The locale object in ICU is an identifier that specifies a particular locale and
has fields for language, country, and an optional code to specify further
variants or subdivisions. These fields also can be represented as a string with
the fields separated by an underscore.
In the C++ API, the locale is represented by the `Locale` class, which provides
methods for finding the language, country and variant components. In the C API
the locale is defined simply by a character string. In the Java API, the locale
is represented by `ULocale`, which is analogous to the `Locale` class but
provides additional support for ICU-specific functionality. All the
locale-sensitive ICU services use the locale information to determine language
and other locale-specific parameters of their function.
The list of locale-sensitive services can be found in the Introduction to ICU
section. Other parts of the library use the locale as an indicator to
customize their behavior.
For example, when the locale-sensitive date format service needs to format a
date, it uses the convention appropriate to the current locale. If the locale is
English, it uses the word "Monday" and if it is French, it uses the word
"Lundi".
The locale object also defines the concept of a default locale. The default
locale is the locale that programs use when no locale is explicitly specified;
it usually reflects the user's system settings, controlled through a control
panel window. The locale mechanism does not require a program to know which
locale the user is using, which makes most programming simpler.
Since locale objects can be passed as parameters or stored in variables, the
program does not have to know specifically which locales they identify. Many
applications enable a user to select a locale. The resulting locale object is
passed as a parameter, which then produces the customized behavior for that
locale.
A locale provides a means of identifying a specific region for the purposes of
internationalization and localization.
> :point_right: **Note**: An ICU locale is frequently confused with a Portable
> Operating System Interface (POSIX) locale ID. An ICU locale ID is not a POSIX
> locale ID. ICU locales do not specify the encoding and specify variant locales
> differently.
A locale consists of one or more pieces of ordered information:
### Language code
The languages are specified using a two- or three-letter lowercase code for a
particular language. For example, Spanish is "es", English is "en" and French is
"fr". The two-letter language code uses the
[ISO-639](https://www.loc.gov/standards/iso639-2/) standard.
### Script code
The optional four-letter script code follows the language code. If specified, it
should be a valid script code as listed on the
[Unicode ISO 15924 Registry](https://www.unicode.org/iso15924/iso15924-codes.html).
### Country code
There are often different language conventions within the same language. For
example, Spanish is spoken in many countries in Central and South America but
the currencies are different in each country. To allow for these differences
among specific geographical, political, or cultural regions, locales are
specified by two-letter, uppercase codes. For example, "ES" represents Spain and
"MX" represents Mexico. The two letter country code uses the
[ISO-3166](https://www.iso.org/iso-3166-country-codes.html) standard.
Java supports two-letter country codes that use ISO-3166, as well as UN M.49 area codes.
### Variant code
Differences may also appear in language conventions used within the same
country. For example, the Euro currency is used in several European countries
while the individual country's currency is still in circulation. Variations
inside a language and country pair are handled by adding a third code, the
variant code. The variant code is arbitrary and completely application-specific.
ICU adds "_EURO" to its locale designations for locales that support the Euro
currency. Variants can have any number of underscored key words. For example,
"EURO_WIN" is a variant for the Euro currency on a Windows computer.
Another use of the variant code is to designate the Collation (sorting order) of
a locale. For instance, the "es__TRADITIONAL" locale uses the traditional
sorting order which is different from the default modern sorting of Spanish.
Collation order and currency can be more flexibly specified using keywords
instead of variants; see below.
### Keywords
The final element of a locale is an optional list of keywords together with
their values. Keywords must be unique. Their order is not significant. Unknown
keywords are ignored. The handling of keywords depends on the specific services
that utilize them. Currently, the following keywords are recognized:
Keyword | Possible Values | Description
--------|-----------------|------------
calendar | A calendar specifier such as "gregorian", "islamic", "chinese", "islamic-civil", "hebrew", "japanese", or "buddhist". See the Key/Type Definitions table in the [Locale Data Markup Language](http://www.unicode.org/reports/tr35/) for a list of recognized values. | If present, the calendar keyword specifies the calendar type that the `Calendar` factory methods create. See the calendar locale and keyword handling section (§) of the [Calendar Classes](../datetime/calendar/index.md) chapter for details.
collation | A collation specifier such as "phonebook", "pinyin", "traditional", "stroke", "direct", or "posix". See the Key/Type Definitions table in the [Locale Data Markup Language](http://www.unicode.org/reports/tr35/) for a list of recognized values. | If present, the collation keyword modifies how the collation service searches through the locale data when instantiating a collator. See the collation locale and keyword handling section (§) of the [Collation Services Architecture](../collation/architecture.md) chapter for details.
currency | Any standard three-letter currency code, such as "USD" or "JPY". See the LocaleExplorer [currency list](http://demo.icu-project.org/icu-bin/locexp?_=en&SHOWCurrencies=1#Currencies) for a list of currently recognized currency codes. | If present, the currency keyword is used by `NumberFormat` to determine the currency to use to format a currency value, and by `ucurr_forLocale()` to specify a currency.
numbers | A numbering system specifier such as "latn", "arab", "deva", "hansfin" or "thai". See the Key/Type Definitions table in the [Locale Data Markup Language](http://www.unicode.org/reports/tr35/) for a list of recognized values. | If present, the numbers keyword is used by `NumberFormat` to determine the numbering system to be used for formatting and parsing numbers. The numbering system defines the set of digits used for decimal formatting, such as "latn" for western (ASCII) digits, or "thai" for Thai digits. The numbering system may also define complex algorithms for number formatting, such as "hansfin" for simplified Chinese numerals using financial ideographs.
If any of these keywords is absent, the service requesting it will typically use
the rest of the locale specifier in order to determine the appropriate behavior
for the locale. The keywords allow a locale specifier to override or refine this
default behavior.
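The keyword syntax itself is simple enough to sketch. The helper below is a hypothetical illustration only; real code should use ICU's own accessors, such as `uloc_getKeywordValue()` in C or `ULocale.getKeywordValue()` in Java:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeywordSketch {
    // Hypothetical helper: splits an ICU-format locale ID such as
    // "de@collation=phonebook;currency=EUR" into its keyword/value map.
    static Map<String, String> keywords(String localeID) {
        Map<String, String> result = new LinkedHashMap<>();
        int at = localeID.indexOf('@');
        if (at < 0) {
            return result;              // no keyword list present
        }
        for (String pair : localeID.substring(at + 1).split(";")) {
            String[] kv = pair.split("=", 2);
            result.put(kv[0], kv[1]);   // one keyword=value pair
        }
        return result;
    }
}
```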
### Examples
Locale ID | Language | Script | Country | Variant | Keywords | Definition
----------|----------|--------|---------|---------|----------|-----------
en_US | en | | US | | | English, United States of America. <br>Browse in [LocaleExplorer](http://demo.icu-project.org/icu-bin/locexp?_=en_US)
en_IE_PREEURO | en | | IE | | | English, Ireland. <br>Browse in [LocaleExplorer](http://demo.icu-project.org/icu-bin/locexp?_=en_IE_PREEURO)
en_IE@currency=IEP | en | | IE | | currency=IEP | English, Ireland with Irish Pound. <br>Browse in [LocaleExplorer](http://demo.icu-project.org/icu-bin/locexp?_=en_IE@currency=IEP)
eo | eo | | | | | Esperanto. <br>Browse in [LocaleExplorer](http://demo.icu-project.org/icu-bin/locexp?_=eo)
fr@collation=phonebook;calendar=islamic-civil | fr | | | | collation=phonebook <br>calendar=islamic-civil | French (Calendar=Islamic-Civil Calendar, Collation=Phonebook Order). <br>Browse in [LocaleExplorer](http://demo.icu-project.org/icu-bin/locexp?_=fr@collation=phonebook;calendar=islamic-civil)
sr_Latn_RS_REVISED@currency=USD | sr | Latn | RS | REVISED | currency=USD | Serbian (Latin, Yugoslavia, Revised Orthography, Currency=US Dollar) <br>Browse in [LocaleExplorer](http://demo.icu-project.org/icu-bin/locexp?d_=en&_=sr_Latn_RS_REVISED@currency=USD)
### Default Locales
Default locales are available to all the objects in a program. If you set a new
default locale for one section of code, it can affect the entire program.
Application programs should not set the default locale as a way to request an
international object. The default locale is set to be the system locale on that
platform.
For example, when you set the default locale, the change affects the default
behavior of the `Collator` and `NumberFormat` instances. When the default locale is
not wanted, you can set the desired locale using a factory method supplied with
the classes such as `Collator::createInstance()`.
Using the ICU C functions, `NULL` can be passed for a locale parameter to specify
the default locale.
## Locales and Services
ICU is implemented as a set of services. One example of a service is the
formatting of a numeric value into a string. Another is the sorting of a list of
strings. When client code wants to use a service, the first thing it does is
request a service object for a given locale. The resulting object is then
expected to perform its operations in a way that is culturally correct for
the requested locale.
### Requested Locale
The **requested** locale is the one specified by the client code when the
service object is requested.
### Valid Locale
A **populated** locale is one for which ICU has data, or one in which client
code has registered a service. If the requested locale is not populated, then
ICU will fall back until it reaches a populated locale. The first populated
locale it reaches is the **valid** locale. The
valid locale is reachable from the requested locale via zero or more fallback
steps.
### Fallback
Locale **fallback** proceeds as follows:
1. The variant is removed, if there is one.
2. The country is removed, if there is one.
3. The script is removed, if there is one.
4. The ICU default locale is examined. The same set of steps is performed for
the default locale.
At any point, if the desired data is found, then the fallback procedure stops.
Keywords are not altered during fallback until the default locale is reached, at
which point all keywords are replaced by those assigned to the default locale.
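Steps 1 through 3 amount to repeatedly truncating the base name at its last underscore. A simplified sketch (it ignores keywords and the default-locale step) might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class FallbackSketch {
    // Simplified: produces the fallback chain for a base name by
    // truncating at the last '_' until nothing is left.
    static List<String> fallbackChain(String baseName) {
        List<String> chain = new ArrayList<>();
        for (String id = baseName; !id.isEmpty();
                id = id.contains("_") ? id.substring(0, id.lastIndexOf('_')) : "") {
            chain.add(id);
        }
        return chain;
    }
}
```

For example, `fallbackChain("sr_Latn_RS_REVISED")` yields `sr_Latn_RS_REVISED`, `sr_Latn_RS`, `sr_Latn`, `sr`, in that order.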
### Actual Locale
Services request specific resources within the valid locale. If the valid locale
directly contains the requested resource, then it is the **actual** locale. If
not, then ICU will fall back until it reaches a locale that does directly contain
the requested resource. The first such locale is the actual locale. The actual
locale is reachable from the valid locale via zero or more fallback steps.
### getLocale()
Client code may wish to know what the valid and actual locales are for a given
service object. To support this, ICU services provide the method `getLocale()`.
The `getLocale()` method takes an argument specifying whether the actual or
valid locale is to be returned.
Some service objects will have an empty or null return from `getLocale()`. This
indicates that the given service object was not created from locale data, or
that it has since been modified so that it no longer reflects locale data,
typically through alteration of the pattern (but not localized symbol changes --
such changes do not reset the actual and valid locale settings).
Currently, the `getLocale()` API is supported by a number of ICU service
classes and their subclasses.
### Functional Equivalence
Various services provide the API `getFunctionalEquivalent` to allow callers to
determine the **functionally equivalent locale** for a requested locale. For
example, when instantiating a collator for the locale `en_US_CALIFORNIA`, the
functionally equivalent locale may be `en`.
The purpose of this is to allow applications to do intelligent caching. If an
application opens a service object for locale A with a functional equivalent Q
and caches it, then later when it requires a service object for locale B, it can
first check if locale B has the **same functional equivalent** as locale A; if
so, it can reuse the cached A object for the B locale, and be guaranteed the
same results as if it had instantiated a service object for B. In other words,
```
Service.getFunctionalEquivalent(A) == Service.getFunctionalEquivalent(B)
```
implies that the object returned by `Service.getInstance(A)` will behave
equivalently to the object returned by `Service.getInstance(B)`.
Here is a pseudo-code example:
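A minimal sketch of such a cache, using a hypothetical stand-in for `Service.getFunctionalEquivalent` (the real service APIs differ; here every locale is simply "equivalent" to its language subtag):

```java
import java.util.HashMap;
import java.util.Map;

public class EquivalentCache {
    // Hypothetical stand-in for Service.getFunctionalEquivalent():
    // here, every locale is "equivalent" to its language subtag alone.
    static String getFunctionalEquivalent(String locale) {
        int i = locale.indexOf('_');
        return (i < 0) ? locale : locale.substring(0, i);
    }

    private final Map<String, Object> cache = new HashMap<>();

    // Open (or reuse) a service object; locales sharing a functional
    // equivalent share one cached object.
    Object getInstance(String locale) {
        String equiv = getFunctionalEquivalent(locale);
        // in real code: open the actual service object when absent
        return cache.computeIfAbsent(equiv, e -> new Object());
    }
}
```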
The functional equivalent locale returned by a service has no meaning beyond
what is stated above. For example, if the functional equivalent of Greek is
Hebrew for collation, that makes no statement about the linguistic relation of
the languages -- it only means that the two collators are functionally
equivalent.
While two locales with the same functional equivalent are guaranteed to be
equivalent, the converse is **not** true: If two locales are in fact equivalent,
they may **not** return the same result from `getFunctionalEquivalent`. That is,
if the object returned by `Service.getInstance(A)` behaves equivalently to the
object returned by `Service.getInstance(B)`, `Service.getFunctionalEquivalent(A)`
**may or may not** be equal to `Service.getFunctionalEquivalent(B)`. Take again
the example of Greek and Hebrew, with respect to collation. These locales may
happen to be functional equivalents (since they each just turn on full
normalization), but it may or may not be the case that they return the same
functionally equivalent locale. This depends on how the data is structured
internally.
The functional equivalent for a locale may change over time. Suppose that Greek
were enhanced to change sorting of additional ancient Greek characters. In that
case, it would diverge; the functional equivalent of Greek would no longer be
Hebrew.
## Canonicalization
ICU works with **ICU format locale IDs**. These are strings that obey the
following character set and syntax restrictions:
1. The only permitted characters are ASCII letters, hyphen ('-'), underscore
('_'), at-sign ('@'), equals sign ('='), and semicolon (';').
2. IDs consist of either a base name, keyword list, or both. If a keyword list
is present it must be preceded by an at-sign.
3. The base name must precede the keyword list, if both are present.
4. The base name defines the language, script, country, and variant, and can
contain only ASCII letters, hyphen, or underscore.
5. The keyword list consists of keyword/value pairs. Each keyword or value
consists of one or more ASCII letters, hyphen, or underscore. Keywords and
values are separated by a single equals sign. Multiple keyword/value pairs,
if present, are separated by a single semicolon. A keyword may not appear
without a value. The same keyword may not appear twice.
ICU performs two kinds of canonicalizing operations on 'ICU format' locale IDs.
Level 1 canonicalization is performed routinely and automatically by ICU APIs.
The recommended procedure for client code using locale IDs from outside sources
(e.g., POSIX, user input, etc.) is to pass such "foreign IDs" through level 2
canonicalization before use.
**Level 1 canonicalization**. This operation performs minor, isolated changes,
such as changing "en-us" to "en_US". Level 1 canonicalization is **not**
designed to handle "foreign" locale IDs (POSIX, .NET) but rather IDs that are in
ICU format, but which do not have normalized case and delimiters. Level 1
canonicalization is accomplished by the ICU functions `uloc_getName`,
`Locale::createFromName`, and `Locale::Locale`. The latter two APIs exist in both
C++ and Java.
1. Level 1 canonicalization is defined only on ICU format locale IDs as defined
above. Behavior with any other kind of input is unspecified.
2. Case is normalized. Elements interpreted as **language** strings will be
converted to lowercase. **Country** and **variant** elements will be
converted to uppercase. **Script** elements will be title-cased. **Keywords**
will be converted to lowercase. **Keyword values** will remain unchanged.
3. Hyphens are converted to underscores.
4. All 3-letter country codes are converted to 2-letter equivalents.
5. Any 3-letter language codes are converted to 2-letter equivalents if
possible. 3-letter language codes with no 2-letter equivalent are kept as
3-letter codes.
6. Keywords are sorted.
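As an illustration of the case rules in step 2, here is a much-simplified sketch for base names of the shape language[_script][_country][_VARIANT]. Real level 1 canonicalization (`uloc_getName`) handles far more, including 3-letter code mapping and keyword sorting:

```java
public class CanonSketch {
    // Much-simplified case/delimiter normalization; assumes the ID is a
    // base name of the shape language[_script][_country][_VARIANT].
    static String level1(String id) {
        String[] parts = id.replace('-', '_').split("_", -1);
        StringBuilder out = new StringBuilder(parts[0].toLowerCase());
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i];
            if (p.length() == 4) {
                // script elements are title-cased: "latn" -> "Latn"
                p = Character.toUpperCase(p.charAt(0)) + p.substring(1).toLowerCase();
            } else {
                // country and variant elements are upper-cased
                p = p.toUpperCase();
            }
            out.append('_').append(p);
        }
        return out.toString();
    }
}
```

With this sketch, "en-us" becomes "en_US" and "sr-latn-rs" becomes "sr_Latn_RS".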
**Level 2 canonicalization**. This operation may make major changes to the ID,
possibly replacing entire elements of the ID. An example is changing
"fr-fr@EURO" to "fr_FR@currency=EUR". Level 2 canonicalization is designed to
translate POSIX and .NET IDs, as well as nonstandard ICU locale IDs. Level 2 is
a **superset** of level 1; every operation performed by level 1 is also
performed by level 2. Level 2 canonicalization is performed by `uloc_canonicalize`
and `Locale::createCanonical`. The latter API exists in both C++ and Java.
1. Level 2 canonicalization operates on ICU format locale IDs with the
following additions:
1. The period ('.') is also a valid input character.
2. An at-sign may be followed by text that is not a keyword/value pair. If
present, such text is added to the variant.
2. POSIX variants are normalized, e.g., "en_US@VARIANT" => "en_US_VARIANT".
3. POSIX charset specifiers are **deleted**, e.g. "en_US.utf8" => "en_US".
4. The variant "EURO" is converted to the keyword specifier "currency=EUR".
This conversion applies to both "fr_FR_EURO" and "fr_FR@EURO" style IDs.
5. The variant "PREEURO" is converted to the keyword specifier "currency=K",
   where K is the 3-letter currency code for the country's national currency in
   effect at the time of the euro transition. This conversion applies to both
   "fr_FR_PREEURO" and "fr_FR@PREEURO" style IDs. This mapping is only performed
for the following locales: ca_ES (ESP), de_AT (ATS), de_DE (DEM), de_LU
(EUR), el_GR (GRD), en_BE (BEF), en_IE (IEP), es_ES (ESP), eu_ES (ESP),
fi_FI (FIM), fr_BE (BEF), fr_FR (FRF), fr_LU (LUF), ga_IE (IEP), gl_ES
(ESP), it_IT (ITL), nl_BE (BEF), nl_NL (NLG), pt_PT (PTE).
6. The following IANA-registered RFC 3066 names are remapped: art_LOJBAN =>
jbo, cel_GAULISH => cel__GAULISH, de_1901 => de__1901, de_1906 => de__1906,
en_BOONT => en__BOONT, en_SCOUSE => en__SCOUSE, sl_ROZAJ => sl__ROZAJ,
zh_GAN => zh__GAN, zh_GUOYU => zh, zh_HAKKA => zh__HAKKA, zh_MIN => zh__MIN,
zh_MIN_NAN => zh__MINNAN, zh_WUU => zh__WUU, zh_XIANG => zh__XIANG, zh_YUE
=> zh__YUE.
7. The following .NET identifiers are remapped: "" (empty string) =>
en_US_POSIX, az_AZ_CYRL => az_Cyrl_AZ, az_AZ_LATN => az_Latn_AZ, sr_SP_CYRL
=> sr_Cyrl_SP, sr_SP_LATN => sr_Latn_SP, uz_UZ_CYRL => uz_Cyrl_UZ,
uz_UZ_LATN => uz_Latn_UZ, zh_CHS => zh_Hans, zh_CHT => zh_Hant. The empty
string is not remapped if a keyword list is present.
8. Variants specifying collation are remapped to collation keyword specifiers,
as follows: de__PHONEBOOK => de@collation=phonebook, es__TRADITIONAL =>
es@collation=traditional, hi__DIRECT => hi@collation=direct, zh_TW_STROKE =>
zh_TW@collation=stroke, zh__PINYIN => zh@collation=pinyin.
9. Variants specifying a calendar are remapped to calendar keyword specifiers,
as follows: ja_JP_TRADITIONAL => ja_JP@calendar=japanese, th_TH_TRADITIONAL
=> th_TH@calendar=buddhist.
10. Special case: C => en_US_POSIX.
Certain other operations are not performed by either level 1 or level 2
canonicalization. These are listed here for completeness.
1. Language identifiers that have been superseded will not be remapped. In
particular, the following transformations are not performed:
1. no => nb
2. iw => he
3. id => in
4. nb_no_NY => nn_NO
2. The behavior of level 2 canonicalization when presented with a remapped ID
combined together with keywords is not defined. For example,
fr_FR_EURO@currency=FRF has an undefined level 2 canonicalization.
All APIs (with a few exceptions) in ICU4C that take a `const char* locale`
parameter can be assumed to automatically perform level 1 canonicalization before
using the locale ID to do resource lookup, keyword interpretation, etc.
Specifically, the static API `getLanguage`, `getScript`, `getCountry`, and `getVariant`
behave exactly like their non-static counterparts in the class `Locale`. That is,
for any locale ID `loc`, `new Locale(loc).getFoo() == Locale::getFoo(loc)`, where
Foo is one of Language, Script, Country, or Variant.
The `Locale` constructor (in C++ and Java) taking multiple strings behaves exactly
as if those strings were concatenated, with the '_' separator inserted between
two adjacent non-empty strings, and the result passed to `uloc_getName`.
> :point_right: **Note**: Throughout this discussion `Locale` refers to both the
> C++ `Locale` class and the ICU4J `com.ibm.icu.util.ULocale` class. Although C++
> notation is used, all statements made regarding `Locale` apply equally to
> `com.ibm.icu.util.ULocale`.
## Usage: Creating Locales
If you are localizing an application to a locale that is not already supported,
you need to create your own `Locale` object. New `Locale` objects are created using
one of the three constructors in this class:
```c++
Locale( const char * language);
Locale( const char * language, const char * country);
Locale( const char * language, const char * country, const char * variant);
```
Because a locale object is just an identifier for a region, no validity check is
performed. If you want to verify that the particular resources are available for
the locale you construct, you must query those resources. For example, you can
query the `NumberFormat` object for the locales it supports using its
`getAvailableLocales()` method.
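The same pattern can be shown with the JDK's `java.util.Locale` and `java.text.NumberFormat` as stand-ins for the ICU classes: constructing the locale always succeeds, and support is a separate query:

```java
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

public class CheckSupport {
    // Constructing a Locale never fails; whether NumberFormat has data
    // for it is a separate question answered by getAvailableLocales().
    static boolean numberFormatSupports(Locale loc) {
        return Arrays.asList(NumberFormat.getAvailableLocales()).contains(loc);
    }

    public static void main(String[] args) {
        System.out.println(numberFormatSupports(Locale.US)); // true
    }
}
```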
New `ULocale` objects in Java are created using one of the following three
constructors:
```java
ULocale( String localeID)
ULocale( String a, String b)
ULocale( String a, String b, String c)
```
The locale ID passed to the constructor consists of optional language, script,
country, and variant fields in that order, separated by underscores, followed by
optional keywords. For example, "en_US", "sy_Cyrl_YU", "zh__pinyin",
"es_ES@currency=EUR,collation=traditional". The fields a, b, c in the other two
constructors are the components of the locale ID. For example, the following two
locale objects are the same:
```java
ULocale ul = new ULocale("sy_Cyrl_YU");
ULocale ul = new ULocale("sy", "Cyrl", "YU");
```
In C++, the `Locale` class provides a number of convenient constants that you can
use to create locales. For example, the following refers to a `NumberFormat` object
for the United States:
```c++
Locale::getUS()
```
In C, a locale is described by a string in which the language, country, and
variant are concatenated with underscores ('_'). For example, "en_US" is a locale
based on the English language in the United States. The following can be used as
equivalents to the locale constants:
```c
ULOC_US
```
In Java, the `ULocale` provides a number of convenient constants that can be used
to create locales.
```java
ULocale.US;
```
## Usage: Retrieving Locales
Locale-sensitive classes have a `getAvailableLocales()` method that returns all of
the locales supported by that class. For example, the following shows the three
convenience methods that the `NumberFormat` class provides for creating a default
`NumberFormat` object:
```c++
NumberFormat::createInstance();
NumberFormat::createCurrencyInstance();
NumberFormat::createPercentInstance();
```
Locale-sensitive classes in Java also have a `getAvailableULocales()` method that
returns all of the locales supported by that class.
### Displayable Names
Once you have created a `Locale` in C++ or a `ULocale` in Java, you can query
the locale for information about itself. The following table shows the
information you can retrieve from a locale:
Method | Description
-------|------------
`getCountry()` | Retrieves the ISO Country Code
`getLanguage()` | Retrieves the ISO Language
`getDisplayCountry()` | Shows the name of the country suitable for displaying information to the user
`getDisplayLanguage()` | Shows the name of the language suitable for displaying to the user
> :point_right: **Note**: The `getDisplayXXX` methods are themselves locale-sensitive
> and have two versions in C++: one that uses the default locale and one that takes a
> locale as an argument and displays the name or country in a language appropriate to
> that locale.
> :point_right: **Note**: In Java, the `getDisplayXXX` methods have three versions:
> one that uses the default locale, one that takes a locale as an argument, and a
> third that takes a locale ID as an argument.
Each class that performs locale-sensitive operations allows you to get all the
available objects of that type. You can sift through these objects by language,
country, or variant, and use the display names to present a menu to the user.
For example, you can create a menu of all the collation objects suitable for a
given language.
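For illustration, using the JDK's `java.util.Locale` as a stand-in (the ICU `getDisplayXXX` methods behave analogously):

```java
import java.util.Locale;

public class DisplayNameDemo {
    public static void main(String[] args) {
        Locale fr = new Locale("fr", "FR");
        // The display locale controls the language of the returned name:
        System.out.println(fr.getDisplayLanguage(Locale.ENGLISH)); // French
        System.out.println(fr.getDisplayCountry(Locale.ENGLISH));  // France
        // ... and in the locale's own language:
        System.out.println(fr.getDisplayCountry(fr));
    }
}
```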
### HTTP Accept-Language
ICU provides functions to negotiate the best locale to use for an operation,
given a user's list of acceptable locales, and the application's list of
available locales. For example, a browser sends the web server the HTTP
"`Accept-Language`" header indicating which locales, with a ranking, are
acceptable to the user. The server must determine which locale to use when
returning content to the user.
Here is an example of selecting an acceptable locale within a CGI application:
C:
```c
char resultLocale[200];
UAcceptResult outResult;
UErrorCode status = U_ZERO_ERROR;

UEnumeration *available = ures_openAvailableLocales("myBundle", &status);
int32_t len = uloc_acceptLanguageFromHTTP(resultLocale, sizeof(resultLocale),
                                          &outResult,
                                          getenv("HTTP_ACCEPT_LANGUAGE"),
                                          available, &status);
if (U_SUCCESS(status)) {
    printf("Using locale %s\n", resultLocale);
}
uenum_close(available);
```
Here is an example of selecting an acceptable locale within a Java application:
Java:
```java
ULocale[] availableLocales = ULocale.getAvailableLocales();
boolean[] fallback = { false };
ULocale result = ULocale.acceptLanguage(availableLocales, fallback);
System.out.println("Using locale " + result);
```
> :point_right: **Note**: As of this writing, this functionality is available in
> both C and Java. Please read the following two linked documents for important
> considerations and recommendations when using this header in a web application.
> *For further information about the Accept-Language HTTP header:* <br>
> https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4 <br>
> *Notes and cautions about the use of this header:* <br>
> https://www.w3.org/International/questions/qa-accept-lang-locales
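As a runnable stand-in for the ICU APIs above, the JDK provides a similar negotiation via RFC 4647 language ranges (this is not the ICU API, but the shape of the problem is the same):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AcceptLanguageDemo {
    public static void main(String[] args) {
        // Parse an Accept-Language header into a priority list, then
        // look up the best match among the available locales.
        List<Locale.LanguageRange> ranges =
            Locale.LanguageRange.parse("da, en-GB;q=0.8, en;q=0.7");
        List<Locale> available = Arrays.asList(Locale.US, Locale.UK);
        Locale best = Locale.lookup(ranges, available);
        System.out.println("Using locale " + best); // Using locale en_GB
    }
}
```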
## Programming in C vs. C++ vs. Java
See Programming for Locale in [C, C++ and Java](examples.md) for more information.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Localizing with ICU
## Overview
There are many different formats for software localization, i.e., for resource
bundles. The most important file format feature for translation of text elements
is to represent key-value pairs where the values are strings.
Each format was designed for a certain purpose. Many but not all formats are
recognized by translation tools. For localization it is best to use a source
format that is optimized for translation, and to convert from it to the
platform-specific formats at build time.
This overview concentrates on the formats that are relevant for working with
ICU. The examples below show only lists of strings, which is the lowest common
denominator for resource bundles.
## Recommendation
The most promising long-term approach is to author localizable data in XLIFF
format (see the [XLIFF](#xliff) (§) section below) and to convert it to native,
platform/tool-specific formats at build time.
Short-term, due to the lack of ICU tools for XLIFF, either custom tools must be
used to convert from some authoring/translation format to Java/ICU formats, or
one of the Java/ICU formats needs to be used for authoring and translation.
## Java and ICU4J
### .properties files
Java `PropertyResourceBundle` uses runtime-parsed .properties files. They contain
key-value pairs where both keys and values are Unicode strings. No other native
data types (e.g., integers or binaries) are supported. There is no way to
specify a charset, therefore .properties files must be in ISO 8859-1 with \u
escape sequences (see the Java `native2ascii` tool).
Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/PropertyResourceBundle.html
Example: (`example_de.properties`)
```properties
key1=Deutsche Sprache schwere Sprache
key2=Düsseldorf
```
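Such a bundle is normally loaded through `ResourceBundle.getBundle()`; the sketch below feeds the same syntax to `PropertyResourceBundle` directly from a string, to show that \u escapes are decoded exactly as in a .properties file on disk:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.PropertyResourceBundle;

public class PropsDemo {
    // Builds a bundle from a Reader and returns one value.
    static String lookup(String properties, String key) {
        try {
            return new PropertyResourceBundle(new StringReader(properties))
                    .getString(key);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // "D\u00fcsseldorf" in the source decodes to "Düsseldorf"
        System.out.println(lookup("key2=D\\u00fcsseldorf\n", "key2"));
    }
}
```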
### .java ListResourceBundle files
Java `ListResourceBundle` files provide implementation subclasses of the
`ListResourceBundle` abstract base class. **They are Java code!** Source files are
.java files that are compiled as usual with the javac compiler. Syntactic rules
of Java apply. As Java source code, they can contain arbitrary Java objects and
can be nested.
Although the Java compiler allows specifying a charset on the command line, this
is uncommon, and .java resource bundle files are therefore usually encoded in
ISO 8859-1 with \u escapes like .properties files.
Defined at: http://java.sun.com/j2se/1.4/docs/api/java/util/ListResourceBundle.html
Example: (`example_de.java`)
```java
public class example_de extends ListResourceBundle {
public Object[][] getContents() {
return contents;
}
static final Object[][] contents={
{ "key1", "Deutsche Sprache " +
"schwere Sprache" },
{ "key2", "Düsseldorf" }
};
}
```
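A bundle like `example_de` can also be defined and queried inline, which shows that lookup goes through the ordinary `ResourceBundle` API:

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

public class ListBundleDemo {
    // Inline equivalent of the example_de bundle above.
    static class ExampleDe extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] {
                { "key1", "Deutsche Sprache schwere Sprache" },
                { "key2", "Düsseldorf" },
            };
        }
    }

    public static void main(String[] args) {
        ResourceBundle rb = new ExampleDe();
        System.out.println(rb.getString("key2")); // Düsseldorf
    }
}
```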
ICU4J can also access the ICU4C resource bundles described in the next section,
using the API described in the [UResourceBundle](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/util/UResourceBundle.html) documentation.
## ICU4C
### .txt resource bundles
ICU4C natively uses a plain text source format with a nested structure that was
derived from Java `ListResourceBundle` .java files when the original ICU Java
class files were ported to C++. The ICU4C bundle format can of course contain
only data, not code, unlike .java files. Resource bundle source files are
compiled with the `genrb` tool into a binary runtime form (`.res` files) that is
portable among platforms with the same charset family (ASCII vs. EBCDIC) and
endianness.
Features:
1. Key-value pairs. Keys are strings of "invariant characters" - a portable subset of the ASCII graphic character repertoire. About "invariant characters" see the definition of the .txt file format (URL below) or [icu/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html)
2. Values can be Unicode strings, integers, binaries (BLOBs), integer array (vectors), and nested structures. Nested structures are either arrays (position-indexed vectors) of values or "tables" of key-value pairs.
3. Values inside nested structures can be all of the ones as on the top level, arbitrarily deeply nested via arrays and tables.
4. Long strings can be split across lines: Adjacent strings separated only by whitespace (including line breaks) are automatically concatenated at build time.
5. At runtime, when a top-level item is not found, then ICU looks up the same key in the parent bundle as determined by the locale ID.
6. A value can also be an "alias", which is simply a reference to another bundle's item. This is to save space by storing large data pieces only once when they cannot be inherited along the locale ID hierarchy (e.g., collation data in ICU shared among zh_HK and zh_TW).
7. Source files can be in any charset. Unicode signature byte sequences are recognized automatically (UTF-8/16, SCSU, ...), otherwise the tool takes a charset name on the command line.
Defined at: [icu-docs/master/design/bnf_rb.txt](https://raw.githubusercontent.com/unicode-org/icu-docs/master/design/bnf_rb.txt)
To use with ICU4C, see the [Resource Bundle APIs](resources.md#resource-bundle-apis) section of this userguide.
Example: (`de.txt`)
```
de {
key1 { "Deutsche Sprache "
"schwere Sprache" }
key2 { "Düsseldorf" }
}
```
### ICU4C XML resource bundles
The ICU4C XML resource bundle format was defined simply to express the same
capabilities of the .txt and binary ICU4C resource bundles in XML form. However,
we have decided to drop the format for lack of use and instead adopt standard
XLIFF format for localization. For more information on XLIFF format, see the
following section. For examples on using ICU tools to produce and read XLIFF
format see the XLIFF Usage (§) section in the [resource management chapter](resources.md#using-xliff-for-localization).
## XLIFF
The XML Localization Interchange File Format (XLIFF) is an emerging industry
standard "for the interchange of localization information". Version 1.1 is
available (2003-Oct-31), and 1.2 is almost complete (2007-Jan-20).
This is the result of a quick review of XLIFF and may need to be improved.
Features:
1. Multiple resource bundles per XLIFF file are supported.
2. Multiple languages per XLIFF file are supported.
3. XLIFF provides a rich set of ways to communicate intent, types of items,
etc. all the way from content creation to all stages and phases of
translation.
4. Nesting of values appears to not be supported.
5. XLIFF is independent of actual build-time or runtime resource bundle
formats. .xlf files must be converted to native formats at build time.
Defined at: http://www.oasis-open.org/committees/xliff/
Example: (`example.xlf`)
```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version = "1.1" xmlns='urn:oasis:names:tc:xliff:document:1.1'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='urn:oasis:names:tc:xliff:document:1.1
http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
<file xml:space = "preserve" source-language = "en" target-language = "sh"
datatype = "x-icu-resource-bundle" original = "root.txt"
date = "2007-08-17T21:17:08Z">
<header>
<tool tool-id = "genrb-3.3-icu-3.8" tool-name = "genrb"/>
</header>
<body>
<group id = "root" restype = "x-icu-table">
<trans-unit id = "optionMessage" resname = "optionMessage">
<source>unrecognized command line option:</source>
<target>nepoznata opcija na komandnoj liniji:</target>
</trans-unit>
<trans-unit id = "usage" resname = "usage">
<source>usage: ufortune [-v] [-l locale]</source>
<target>upotreba: ufortune [-v] [-l lokal]</target>
</trans-unit>
</group>
</body>
</file>
</xliff>
```
For examples on using ICU tools to produce and read XLIFF format see the XLIFF
Usage (§) section in the [resource management chapter](resources.md#using-xliff-for-localization).
## DITA
The Darwin Information Typing Architecture (DITA) is "IBM's XML architecture for
topic-oriented information". It is a family of XML formats for several types of
publications including manuals and resource bundles. It is extensible. For
example, subformats can be defined by refining DTDs. One design feature is to
provide cross-document references for reuse of existing contents. For more
information see http://www.ibm.com/developerworks/xml/library/x-dita4/index.html
While it is certainly possible to define resource bundle formats via DTDs in the
DITA framework, there currently (2002-Nov-27) do not appear to be resource
bundle formats actually defined, or tools available specifically for them.
## Linux/gettext
The OpenI18N specification requires support for message handling functions
(mostly variants of `gettext()`) as defined in `libintl.h`. See Tables 3-5 and 3-6
and Annex C in http://www.openi18n.org/docs/html/LI18NUX-2000-amd4.htm
Resource bundles ("portable object files", extension .po) are plain text files
with key-value pairs for string values. The format and functions support a
simple selection of plural forms by associating integer values (via C language
expressions) with indexes of strings.
The `msgfmt` utility compiles .po files into "message object files" (extension
.mo). The charset is determined from the locale ID in `LC_CTYPE`. There are
additional supporting tools for .po files.
*Note: The OpenI18N specification also requires POSIX `gencat`/`catgets` support. See the [POSIX](#posixcatsgets) (§) section below.*
Defined at: Annex C of the Li18nux-2000 specification, see above.
Example: (`example.po`)
```
domain "example_domain"
msgid "key1"
msgstr "Deutsche Sprache schwere Sprache"
msgid "key2"
msgstr "Düsseldorf"
```
## POSIX/catgets
POSIX (The Open Group specification) defines message catalogs with the `catgets()`
C function and the gencat build-time tool. Message catalogs contain key-value
pairs where the keys are integers `1`..`NL_MSGMAX` (see `limits.h`), and the values
are strings. Strings can span multiple lines. The charset is determined from the
locale ID in `LC_CTYPE`.
Defined at:
https://pubs.opengroup.org/onlinepubs/009695399/utilities/gencat.html and
https://pubs.opengroup.org/onlinepubs/009695399/functions/catgets.html
Example: (`example.txt`)
```
1 Deutsche Sprache \
schwere Sprache
2 Düsseldorf
```
## Windows
Windows uses a number of file formats depending on the language environment --
MSVC 6, Visual Basic, or Visual Studio .NET. The most well-known source formats
are the [.rc Resource](https://docs.microsoft.com/windows/win32/menurc/about-resource-files)
and [.mc Message](https://docs.microsoft.com/en-us/windows/win32/eventlog/message-files)
file formats. They both get compiled into .res files that are linked into
special sections of executables. Source formats can be UTF-16, while compiled
strings are (almost) always UTF-16 from .rc files (except for predefined
ComboBox strings) and can optionally be UTF-16 from .mc files.
.rc files carry key-value pairs where the keys are usually numeric but can be
strings. Values can be strings, string tables, or one of many Windows
GUI-specific structured types that compile directly into binary formats that the
GUI system interprets at runtime. .rc files can include C #include files for
#defined numeric keys. .mc files contain string values preceded by per-message
headers similar to the Linux/gettext() format. There is a special format of
messages with positional arguments, with printf-style formatting per argument.
In both .rc and .mc formats, Windows LCID values are defined to be set on the
compiled resources.
Developers and translators usually overlook the fact that binary resources are
included, and duplicate them in each translation, even though Windows, like
Java and ICU, uses locale ID fallback at runtime.
.rc and .mc files are tightly integrated with Microsoft C/C++, Visual Studio and
the Windows platform, but are not used on any other platforms.
A [sample Windows .rc file](#sample-windows-rc-file) (§) is at the end of this document.
## ICU tools
ICU 2.4 provides tools for conversion between resource bundle formats:
1. ICU4C .txt -> ICU4C .res: Default operation of genrb (ICU 2.0 and before).
2. ICU4C .txt -> ICU4C .xml: Option with genrb (ICU 2.4).
3. ICU4C .txt -> Java ListResourceBundle .java format: Option with genrb (ICU
2.2).
Generates subclasses of ICUListResourceBundle to support non-string types.
4. Java ListResourceBundle .java format -> ICU4C .txt: Use ICU4J 2.4's
src/com/ibm/icu/dev/tools/localeconverter
5. ICU4C .xml -> ICU4C .txt: There is a tool for this conversion, but it is not
fully tested or documented. Please see the
[XLIFF2ICUConverter](https://icu-project.org/download/xliff2icuconverter.html)
tool.
There are currently no ICU tools for XLIFF.
### Converting de.txt to a ListResourceBundle
The following genrb invocation generates a ListResourceBundle from `de.txt` (see
the example file `de.txt` above):
`genrb -j -b TestName -p com.example de.txt`
The -j option causes .java output, -b is an arbitrary bundle name prefix, and -p
is an arbitrary package name. "Arbitrary" means "depends on your product" and
may be truly arbitrary if the generated .java files are not actually used in a
Java application. genrb auto-detects .txt files encoded in Unicode charsets like
UTF-8 or UTF-16 if they have a signature byte sequence ("BOM"). The .java output
file is in native2ascii format, i.e., it is encoded in US-ASCII with \u
escapes.
The output of the above genrb invocation is `TestName_de.java`:
```java
package com.example;
import java.util.ListResourceBundle;
import com.ibm.icu.impl.ICUListResourceBundle;
public class TestName_de extends ICUListResourceBundle {
public TestName_de () {
super.contents = data;
}
static final Object[][] data = new Object[][] {
{
"key1",
"Deutsche Sprache schwere Sprache",
},
{
"key2",
"D\u00FCsseldorf",
},
};
}
```
### Converting a ListResourceBundle back to .txt
An ICUListResourceBundle .java file as generated in the previous example can be
converted to an ICU4C .txt file with the following steps:
1. Compile the .java file, e.g. with `javac -d . TestName_de.java`. ICU4J needs
to be on the classpath (or use the -classpath option). If the .java file is
not in `native2ascii` format, then use the -encoding option (e.g. -encoding
UTF-8). The -d option (specifying an output directory, in this example the
current folder) is required. Without it, the Java compiler would not
generate the com/example folder hierarchy that is required in the next step.
2. You now have a .class file `com/example/TestName_de.class`.
3. Invoke the ICU4J locale converter tool to generate ICU4C .txt format output for
this .class file:
`java -cp ;(folder to ICU4J)/icu4j.jar;(working folder for the previous steps); com.ibm.icu.dev.tool.localeconverter.ConvertICUListResourceBundle -icu -package com.example -bundle-name TestName de > de.txt`
Note that the classpath must include the working folder for the previous
steps (the folder that contains "com"). The package name (com.example),
bundle name (TestName) and locale ID (de) must match the .java/.class files.
Note also that the locale converter writes to the standard output; the
command line above includes a redirection to de.txt.
The last step generates a new de.txt in `native2ascii` format:
```
de {
key2{"D\u00FCsseldorf"}
key1{"Deutsche Sprache schwere Sprache"}
}
```
## Further information
1. TMX: "The purpose of TMX is to allow easier exchange of translation memory
data between tools and/or translation vendors with little or no loss of
critical data during the process."
http://www.lisa.org/tmx/
2. LISA: Localisation Industry Standards Association
http://www.lisa.org/
## Sample Windows .rc file
This file (`winrc.rc`) was generated with MSVC 6, using the New Project wizard to
generate a simple "Hello World!" application, changing the LCIDs to German, then
adding the two example strings as above.
```
//Microsoft Developer Studio generated resource script.
//
#include "resource.h"
#define APSTUDIO_READONLY_SYMBOLS
/////////////////////////////////////////////////////////////////////////////
//
// Generated from the TEXTINCLUDE 2 resource.
//
#define APSTUDIO_HIDDEN_SYMBOLS
#include "windows.h"
#undef APSTUDIO_HIDDEN_SYMBOLS
#include "resource.h"
/////////////////////////////////////////////////////////////////////////////
#undef APSTUDIO_READONLY_SYMBOLS
/////////////////////////////////////////////////////////////////////////////
// German (Germany) resources
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_DEU)
#ifdef _WIN32
LANGUAGE LANG_GERMAN, SUBLANG_GERMAN
#pragma code_page(1252)
#endif //_WIN32
/////////////////////////////////////////////////////////////////////////////
//
// Icon
//
// Icon with lowest ID value placed first to ensure application icon
// remains consistent on all systems.
IDI_WINRC ICON DISCARDABLE "winrc.ICO"
IDI_SMALL ICON DISCARDABLE "SMALL.ICO"
/////////////////////////////////////////////////////////////////////////////
//
// Menu
//
IDC_WINRC MENU DISCARDABLE
BEGIN
POPUP "&File"
BEGIN
MENUITEM "E&xit", IDM_EXIT
END
POPUP "&Help"
BEGIN
MENUITEM "&About ...", IDM_ABOUT
END
END
/////////////////////////////////////////////////////////////////////////////
//
// Accelerator
//
IDC_WINRC ACCELERATORS MOVEABLE PURE
BEGIN
"?", IDM_ABOUT, ASCII, ALT
"/", IDM_ABOUT, ASCII, ALT
END
/////////////////////////////////////////////////////////////////////////////
//
// Dialog
//
IDD_ABOUTBOX DIALOG DISCARDABLE 22, 17, 230, 75
STYLE DS_MODALFRAME | WS_CAPTION | WS_SYSMENU
CAPTION "About"
FONT 8, "System"
BEGIN
ICON IDI_WINRC,IDC_MYICON,14,9,16,16
LTEXT "winrc Version 1.0",IDC_STATIC,49,10,119,8,SS_NOPREFIX
LTEXT "Copyright (C) 2002",IDC_STATIC,49,20,119,8
DEFPUSHBUTTON "OK",IDOK,195,6,30,11,WS_GROUP
END
/////////////////////////////////////////////////////////////////////////////
//
// String Table
//
STRINGTABLE DISCARDABLE
BEGIN
IDS_APP_TITLE "winrc"
IDS_HELLO "Hello World!"
IDC_WINRC "WINRC"
IDS_SENTENCE "Deutsche Sprache schwere Sprache"
IDS_CITY "Düsseldorf"
END
#endif // German (Germany) resources
/////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////////////
// English (U.S.) resources
#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_ENU)
#ifdef _WIN32
LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_US
#pragma code_page(1252)
#endif //_WIN32
#ifdef APSTUDIO_INVOKED
/////////////////////////////////////////////////////////////////////////////
//
// TEXTINCLUDE
//
2 TEXTINCLUDE DISCARDABLE
BEGIN
"#define APSTUDIO_HIDDEN_SYMBOLS\r\n"
"#include ""windows.h""\r\n"
"#undef APSTUDIO_HIDDEN_SYMBOLS\r\n"
"#include ""resource.h""\r\n"
"\0"
END
3 TEXTINCLUDE DISCARDABLE
BEGIN
"\r\n"
"\0"
END
1 TEXTINCLUDE DISCARDABLE
BEGIN
"resource.h\0"
END
#endif // APSTUDIO_INVOKED
#endif // English (U.S.) resources
/////////////////////////////////////////////////////////////////////////////
#ifndef APSTUDIO_INVOKED
/////////////////////////////////////////////////////////////////////////////
//
// Generated from the TEXTINCLUDE 3 resource.
//
/////////////////////////////////////////////////////////////////////////////
#endif // not APSTUDIO_INVOKED
```
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Resource Management
> :point_right: **Note**: This page describes the use of ICU4C Resource
> Management techniques and APIs. For an overview of the message localization
> process using ICU, see the related page [Localizing with ICU](localizing.md).
## Overview
A software product that needs to be localized wins or loses depending on how
easy it is to change the data or "resources" which affect users. From the simplest
point of view, that data is the information presented to the user (such as a
translated message) as well as the region-specific ways of doing things (such as
sorting). The process of localization will eventually involve translators, and it
would be very convenient if localizing could be done entirely by translators and
experts in the target culture. There are several points to keep in mind when
designing such a localizable software product.
### Keeping Data Separate
Obviously, one does not want to make translators wade through the source code
and make changes there. That would be a recipe for disaster. Instead, the
translatable data should be kept separately, in a format that allows translators
easy access. A separate resource management mechanism is hence required.
Applications access the data through API calls, which pick the appropriate
entries from the resources. Resources are kept in a human readable/editable
format, with optional tools for content editing.

The data should contain all the elements to be localized, including, but not
limited to, GUI messages, icons, formatting patterns, and collation rules. A
convenient way of keeping binary data should also be provided, since, for
example, icons for different cultures will often differ.
### Keeping Data Small
It is not unlikely that the data will be the same for several regions. Take, for
example, Spanish speaking countries: the names of the days and months will be
the same in both Mexico and Spain. It would be very beneficial if we could
prevent the duplication of data. This can be achieved by structuring resources
in such a way that an unsuccessful query into a more specific resource triggers
the same query in a more general resource. A convenient way to do this is to use
a tree-like structure.

Another way to reduce the data size is to allow linking of resources that are
the same for regions that are not in a general-specific relation.
### Find the Best Available Data
Sometimes, the exact data for a region is still not available. However, if the
data is structured correctly, the user can be presented with similar data. For
example, a Spanish speaking user in Mexico would probably be happier with
Spanish than with English captions, even if some of the details for Mexico are
not there.
If the data is grouped correctly, the program can automatically find the most
suitable data for the situation.
The previous points all lead to a separate mechanism that stores data apart
from the code. Software is able to access the data through API calls. Data
is structured in a tree-like fashion, with the most general region in the root
(most commonly, the root region is the native language of the development team).
Branches lead to more specialized regions, usually through languages, countries
and country regions. Data that is already the same on a more general level is
not repeated.
> :point_right: **Note**: The path through languages, countries and country
> regions could be different. One may decide to go through countries first and
> then through the languages spoken in each country. In either case, some data
> must be duplicated: if you go through languages, the currency data for
> different speaking parts of the same country will be duplicated (consider the
> French and English languages in Canada); on the other hand, when you go
> through countries, you will need to duplicate day names and similar
> information.
Here is an example of such a resource tree structure:
```
root Root
|
+-------+---+---+----+----+
| | | | |
en de ja ru zh Language
| | | | |
+---+ +---+ | | +------+
| | | | | | | |
| | | | | | Hans Hant Script
| | | | | | | |
| | | | | | | +----+
| | | | | | | | |
US IE DE AT JP RU CN HK TW Country or Region
|
POSIX Variant
```
Let us assume that the root resource contains data written by the original
implementors and that this data is in English and conforms to the conventions
used in the United States. Therefore, the resources for English and for English
in the United States would be empty and would take their data from the root
resource. If a version for Ireland is required, appropriate overriding changes
can be made to the data for English in Ireland. Special variant information
could be put into en_US_POSIX if specific legacy formatting or specific
sub-region information were required. When making the version for the German
speaking region, all the German data would be in that resource, with the
differences in the Germany and Austria resources.
It is important to note that some locales have the optional script tag. This is
important for multi-script locales, like Uzbek, Azerbaijani, Serbian or Chinese.
Even though Chinese uses Han characters, the characters are usually identified
as either traditional Chinese (Hant) characters, or simplified Chinese (Hans).
Even if all the data that would go into a certain resource comes from the more
general resources, it should be made clear that the particular region is
supported by the application. This can be done by having completely empty
resources.
## The ICU Model
ICU bases its resource management model on the ideas presented above. All the
resource APIs are concentrated in the resource bundle framework. This framework
is closely tied in its functioning to the ICU [Locale](index.md) naming scheme.
ICU provides and relies on a set of locale specific data in the resource bundle
format. If we think that we have correct data for a requested locale, even if
all its data comes from more general locales, we will provide an empty
resource bundle. This is reflected in our return informational codes (see the
section on APIs). Many ICU frameworks (collation, formatting, etc.) rely on
the data stored in resource bundles.
Resource bundles rely on the ICU data framework. For more information on the
functioning of ICU data, see the appropriate [section](../icudata.md).
Users of the ICU library can also use the resource bundle framework to store and
retrieve localizable data in their projects.
Resource bundles are collections of resources. Individual resources can contain
data or other resources.
> :point_right: **Note**: ICU4J relies on the resource bundle mechanism already
> provided by JDK for its functioning. Therefore, most of the discussion here
> pertains only to ICU4C.
### Fallback Mechanism
An essential part of ICU's resource management framework is the fallback
mechanism. It ensures that if the data for the requested locale is missing, an
effort will be made to obtain the most usable data. Fallback can happen in two
situations:
1. When a resource bundle for a locale is requested. If it doesn't exist, a
more general resource bundle will be used. If there are no such resource
bundles, a resource bundle for default locale will be used. If this fails,
the root resource bundle will be used. When using ICU locale data, not
finding the requested resource bundle means that we don't know what the data
should be for that particular locale, so you might want to consider this
situation an error. Custom packages of resource bundles may or may not
adhere to this contract. Special care should be taken in remote server
situations, where the data from the default locale might not mean anything to
the remote user (imagine a situation where a server in Japan responds to a
Spanish speaking client by using default Japanese data).
2. When a resource inside a resource bundle is requested. If the resource is
not present, it will be sought after in more general resources. If at
initial opening of a resource bundle we went through the default locale, the
search for a resource will also go through it. For example, if a resource
bundle for zh_Hans_CN is opened, a missing resource will be looked for in
zh_Hans, zh and finally root. This is usually harmless, except when a
resource is only located in the default locale or in the root resource
bundle.
### Data Packaging
ICU allows and requires that application-specific data be stored apart from
the ICU internal data (locale, converter, transformation data, etc.).
Application data should be stored in packages. ICU uses the default package
(NULL) for its data. All of ICU's build tools provide means to specify the
package for your data. More about how to package application data can be found
below.
## Resource Bundle APIs
ICU4C provides both C and C++ APIs for using resource bundles. The core
implementation is in C, while the C++ APIs are only a thin wrapper around it.
Therefore, the code using C APIs will generally be faster.
Resource bundles use ICU's "open-use-close" paradigm. In C, all the resource
bundle operations are done using the `UResourceBundle*` handle. A `UResourceBundle*`
allows access to both resource bundles and individual resources. In C++, class
`ResourceBundle` should be used for both resource bundles and individual
resources.
To use the resource bundle framework, you need to include the appropriate header
file, `unicode/ures.h` for C and `unicode/resbund.h` for C++.
### Error Checking
If an operation with a resource bundle fails, an error code will be set. It is
important to check the value of the error code. In C you should frequently
use the following construct:
```c
if (U_SUCCESS(status)) {
/* everything is fine */
} else {
/* there was an error */
}
```
### Opening of Resource Bundles
The most common C resource bundle opening API is:
```c
UResourceBundle* ures_open(const char* package, const char* locale, UErrorCode* status)
```
The first argument specifies the package name or `NULL` for the default ICU package.
The second argument is the locale for which you want the resource bundle.
Special values for the locale are `NULL` for the default locale and `""` (empty
string) for the root locale. The third argument should be set to `U_ZERO_ERROR`
before calling the function. It will return the status of operation. Apart from
returning regular errors, it can return two informational/warning codes:
`U_USING_FALLBACK_WARNING` and `U_USING_DEFAULT_WARNING`. The first informational
code means that the requested resource bundle was not found and that a more
general bundle was returned. If you are opening ICU resource bundles, note
that this means we do not guarantee that the contents of the opened resource
bundle will be correct for the requested locale. The situation might be
different for application packages. However, the warning `U_USING_DEFAULT_WARNING`
means that there were no more general resource bundles found and that you were
returned either a resource bundle that is the default for the system, or the root
resource bundle. This will almost certainly contain wrong data.
There are a couple of other opening APIs: `ures_openDirect` takes the same
arguments as `ures_open` but will fail if the requested locale is not found.
Also, if opening is successful, no fallback will be performed if an individual
resource is not found. The second one, `ures_openU`, takes a `UChar*` for the
package name instead of a `char*`.
In C++, opening is done through a constructor. There are several constructors.
The most notable difference from the C APIs is that the package should be given
as a `UnicodeString` and the locale is passed as a `Locale` object. There is also a copy
constructor and a constructor that takes a C `UResourceBundle*` handle. The
result is a `ResourceBundle` object. Remarks about informational codes are also
valid for the C++ APIs.
> :point_right: **Note**: All the data accessing examples in the following
> sections use ICU's
> [root](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/root.txt)
> resource bundle.
```c
UErrorCode status = U_ZERO_ERROR;
UResourceBundle* icuRoot = ures_open(NULL, "root", &status);
if (U_SUCCESS(status)) {
/* everything is fine */
...
/* do some interesting stuff here - see below */
...
/* and close the bundle afterwards */
ures_close(icuRoot); /* discussed later */
} else {
/* there was an error */
/* report and exit */
}
```
In C++, opening would look like this:
```c++
UErrorCode status = U_ZERO_ERROR;
// we rely on automatic construction of Locale object from a char*
ResourceBundle myResource("myPackage", "de_AT", status);
if (U_SUCCESS(status)) {
/* everything is fine */
...
/* do some interesting stuff here */
...
/* the bundle will be closed when going out of scope */
} else {
/* there was an error */
/* report and exit */
}
```
### Closing of Resource Bundles
After use, resource bundles need to be closed to prevent memory leaks. In C,
you should call the `void ures_close(UResourceBundle* resB)` API. In C++, if you
have just used `ResourceBundle` objects, going out of scope will close the
bundles. When using allocated objects, make sure that you call the appropriate
delete function.

As already mentioned, resource bundles and resources share the same type. You
can close bundles and resources in any order you like. You can invoke `ures_close`
on a `NULL` resource bundle. Therefore, you can always call this API regardless
of the success of previous operations.
### Accessing Resources
Once you are in possession of a valid resource bundle, you can access the
resources and data that it holds. The result of an accessing operation is a
new resource bundle object. In C, `UResourceBundle*` handles can be reused by
using the fill-in parameter. That saves you from frequent closing and
reallocating of resource bundle structures, which can dramatically improve
performance. The C++ APIs do not provide means for object reuse. All the C
examples in the following sections will use a fill-in parameter.
#### Types of Resources
Resource bundles can contain two main types of resources: complex and simple
resources. Complex resources store other resources and can have named or unnamed
elements. **Tables** store named elements, while **arrays** store unnamed ones.
Simple resources contain data which can be **string**, **binary**, **integer
array** or a single **integer**.
There are several ways for accessing data stored in the complex resources.
Tables can be accessed using keys, indexes and by iteration. Arrays can be
accessed using indexes and by iteration.
In order to be able to distinguish between resources, one needs to know the type
of the resource at hand. To find this out, use the
`UResType ures_getType(UResourceBundle* resourceBundle)` API, or the C++ analog
`UResType getType(void)`. The `UResType` is an enumeration defined in the
[unicode/ures.h](../../../icu4c/source/common/unicode/ures.h) header file.
> :point_right: **Note**: Indexes of resources in tables do not necessarily
> correspond to the order of items in a table. Due to the way the binary
> structure is organized, items in a table are sorted according to the binary
> ordering of the keys; therefore, the index of an item in a table will be the
> index of its key string in the binary order. Furthermore, the ordering of the
> keys is different on ASCII and EBCDIC platforms.
> <br>
> Starting with ICU 4.4, the order of table items is the ASCII string order on
> all platforms.
> <br>
> The iteration order of table items might change from release to release.
#### Accessing by Key
To access resources using a key, you can use the `UResourceBundle*
ures_getByKey(const UResourceBundle* resourceBundle, const char* key,
UResourceBundle* fillIn, UErrorCode* status)` API. The first argument is the
parent resource bundle, which can be either a resource bundle opened using
`ures_open` or similar APIs, or a table resource. The key is always specified
using invariant characters. The fill-in parameter can be either `NULL` or a
valid resource bundle handle. If it is `NULL`, a new resource bundle will be
constructed. If you pass an already existing resource bundle, it will be closed
and the memory will be reused for the new resource bundle. The status indicator
can return `U_MISSING_RESOURCE_ERROR`, which indicates that no resource with
that key exists, or one of the above mentioned informational codes
(`U_USING_FALLBACK_WARNING` and `U_USING_DEFAULT_WARNING`) which do not affect
the validity of the data in the case of resource retrieval.
```c
...
/* we already have the icuRoot bundle from the opening example */
UResourceBundle *zones = ures_getByKey(icuRoot, "zoneStrings", NULL, &status);
if (U_SUCCESS(status)) {
/* ... do interesting stuff - see below ... */
}
ures_close(zones);
/* clean up the rest */
...
```
In C++, the analogous API is `ResourceBundle get(const char* key, UErrorCode& status) const`.
Trying to retrieve resources by key on any other type of resource than tables
will produce a `U_RESOURCE_TYPE_MISMATCH` error.
#### Accessing by Index
Accessing by index requires you to supply an index of the resource that you want
to retrieve. The appropriate API is `UResourceBundle* ures_getByIndex(const
UResourceBundle* resourceBundle, int32_t indexR, UResourceBundle* fillIn,
UErrorCode* status)`. The arguments have the same semantics as for the
`ures_getByKey` API. The only difference is the second argument, which is the
index of the resource that you want to retrieve. Indexes start at zero. If an
index out of range is specified, `U_MISSING_RESOURCE_ERROR` is returned. To find
the size of a resource, you can use `int32_t ures_getSize(UResourceBundle*
resourceBundle)`. The maximum index is the result of this API minus 1.
```c
...
/* we already got zones resource from the accessing by key example */
UResourceBundle *currentZone = NULL;
int32_t index = 0;
for (index = 0; index < ures_getSize(zones); index++) {
currentZone = ures_getByIndex(zones, index, currentZone, &status);
/* ... do interesting stuff here ... */
}
ures_close(currentZone);
/* cleanup the rest */
...
```
Accessing a simple resource with index 0 will return the resource itself. This
is useful for iterating over all the resources regardless of type.
C++ overloads the get API with `ResourceBundle get(int32_t index, UErrorCode& status) const`.
#### Iterating Over Resources
If you don't care about the order of the resources and want simple code, you can
use the iteration mechanism. To set up iteration over a complex resource, you
can simply start iterating using the `UResourceBundle*
ures_getNextResource(UResourceBundle* resourceBundle, UResourceBundle* fillIn,
UErrorCode* status)`. It is advisable, though, to reset the iterator for a
resource before starting, in order to ensure that the iteration will indeed
start from the beginning, just in case somebody else has already been playing
with this resource. To reset the iterator, use the `void
ures_resetIterator(UResourceBundle* resourceBundle)` API. To check whether there
are more resources, call `UBool ures_hasNext(UResourceBundle* resourceBundle)`.
If you have iterated through the whole resource, `NULL` will be returned.
```c
...
/* we already got zones resource from the accessing by key example */
UResourceBundle *currentZone = NULL;
ures_resetIterator(zones);
while (ures_hasNext(zones)) {
currentZone = ures_getNextResource(zones, currentZone, &status);
/* ... do interesting stuff here ... */
}
ures_close(currentZone);
/* cleanup the rest */
...
```
C++ provides analogous APIs: `ResourceBundle getNext(UErrorCode& status)`, `void resetIterator(void)`
and `UBool hasNext(void)`.
#### Accessing Data in the Simple Resources
In order to get to the data in the simple resources, you need to use the
appropriate API according to the type of the simple resource. The APIs are
summarized in the tables below. All the pointers returned should be considered
pointers to read-only data. Using an API on a resource of the wrong type will
result in an error.
Strings:
| Language | API |
| -------- | ------------------------------------------------------------------------------------------------------ |
| C | `const UChar* ures_getString(const UResourceBundle* resourceBundle, int32_t* len, UErrorCode* status)` |
| C++ | `UnicodeString getString(UErrorCode& status) const` |
Example:
```c
...
UResourceBundle* version = ures_getByKey(icuRoot, "Version", NULL, &status);
if (U_SUCCESS(status)) {
int32_t versionStringLen = 0;
const UChar* versionString = ures_getString(version, &versionStringLen, &status);
}
ures_close(version);
...
```
Binaries:
| Language | API |
| -------- | -------------------------------------------------------------------------------------------------------- |
| C | `const uint8_t* ures_getBinary(const UResourceBundle* resourceBundle, int32_t* len, UErrorCode* status)` |
| C++ | `const uint8_t* getBinary(int32_t& len, UErrorCode& status) const` |
Integers, signed and unsigned:
| Language | API |
| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| C | `int32_t ures_getInt(const UResourceBundle* resourceBundle, UErrorCode* status)` `uint32_t ures_getUInt(const UResourceBundle* resourceBundle, UErrorCode* status)` |
| C++ | `int32_t getInt(UErrorCode& status) const` <br> `uint32_t getUInt(UErrorCode& status) const` |
Integer Arrays:
| Language | API |
| -------- | ----------------------------------------------------------------------------------------------------------- |
| C | `const int32_t* ures_getIntVector(const UResourceBundle* resourceBundle, int32_t* len, UErrorCode* status)` |
| C++ | `const int32_t* getIntVector(int32_t& len, UErrorCode& status) const` |
#### Convenience APIs
Since the vast majority of the data stored in resource bundles is strings, ICU's
resource bundle framework provides a number of convenience APIs that directly
access strings stored in resources. They are analogous to the APIs already
discussed, with the difference that they return `const UChar*` or `UnicodeString`
objects.
> :point_right: **Note**: The C APIs that allow returning of `UnicodeStrings` only
> work if used in a C++ file. Trying to use them in a C file will produce a
> compiler error.
APIs that allow retrieving strings by specifying a key:
| Language (Return Type) | API |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------ |
| C (UChar*) | `const UChar* ures_getStringByKey(const UResourceBundle* resB, const char* key, int32_t* len, UErrorCode* status)` |
| C (UnicodeString) | `UnicodeString ures_getUnicodeStringByKey(const UResourceBundle* resB, const char* key, UErrorCode* status)` |
| C++ | `UnicodeString getStringEx(const char* key, UErrorCode& status) const` |
APIs that allow retrieving strings by specifying an index:
| Language (Return Type) | API |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------- |
| C (UChar*) | `const UChar* ures_getStringByIndex(const UResourceBundle* resB, int32_t indexS, int32_t* len, UErrorCode* status)` |
| C (UnicodeString) | `UnicodeString ures_getUnicodeStringByIndex(const UResourceBundle* resB, int32_t indexS, UErrorCode* status)` |
| C++ | `UnicodeString getStringEx(int32_t index, UErrorCode& status) const` |
APIs for retrieving strings through iteration:
| Language (Return Type) | API |
| ---------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| C (UChar*) | `const UChar* ures_getNextString(UResourceBundle* resourceBundle, int32_t* len, const char** key, UErrorCode* status)` |
| C (UnicodeString) | `UnicodeString ures_getNextUnicodeString(UResourceBundle* resB, const char** key, UErrorCode* status)` |
| C++ | `UnicodeString getNextString(UErrorCode& status)` |
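The three access styles in the tables above (by key, by index, and by iteration) can be pictured with an ordinary `std::map`. The sketch below is not ICU code, only an illustration of the access patterns using hypothetical data; a real bundle comes from `ures_open()`, and the sketch assumes sorted-key order for index access:

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Sketch of the three access styles a table resource supports.
// The data below is hypothetical; this struct stands in for a UResourceBundle.
struct FakeTableResource {
    // Named sub-resources, like a "table" resource.
    std::map<std::string, std::string> items;

    // By key: analogous to ures_getStringByKey() / getStringEx(key, status).
    std::string getByKey(const std::string& key) const {
        return items.at(key);
    }

    // By index: analogous to ures_getStringByIndex(), here in sorted key order.
    std::string getByIndex(int index) const {
        auto it = items.begin();
        std::advance(it, index);
        return it->second;
    }

    // By iteration: analogous to calling ures_getNextString() in a loop.
    std::vector<std::string> getAll() const {
        std::vector<std::string> all;
        for (const auto& kv : items) all.push_back(kv.second);
        return all;
    }
};
```

The real APIs additionally report errors through `UErrorCode` and apply locale fallback, which this sketch omits.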
#### Other APIs
The resource bundle framework provides a number of additional APIs that allow
you to get more information on the resources you are using. They are summarized
in the following tables.
| Language | API |
| -------- | ------------------------------------------------------- |
| C | `int32_t ures_getSize(UResourceBundle* resourceBundle)` |
| C++ | `int32_t getSize(void) const` |
Gets the number of items in a resource. Simple resources always return size 1.
| Language | API |
| -------- | -------------------------------------------------------- |
| C | `UResType ures_getType(UResourceBundle* resourceBundle)` |
| C++ | `UResType getType(void)` |
Gets the type of the resource. For a list of resource types, see:
[unicode/ures.h](../../../icu4c/source/common/unicode/ures.h)
| Language | API |
| -------- | ------------------------------------------------ |
| C | `const char* ures_getKey(UResourceBundle* resB)` |
| C++ | `const char* getKey(void)` |
Gets the key of a named resource or `NULL` if this resource is a member of an
array.
| Language | API |
| -------- | ----------------------------------------------------------------------------- |
| C | `void ures_getVersion(const UResourceBundle* resB, UVersionInfo versionInfo)` |
| C++ | `void getVersion(UVersionInfo versionInfo) const` |
Fills out the version structure for this resource.
| Language | API |
| -------- | --------------------------------------------------------------------------------------- |
| C | `const char* ures_getLocale(const UResourceBundle* resourceBundle, UErrorCode* status)` |
| C++ | `const Locale& getLocale(void) const` |
Returns the locale this resource is from. This API is going to change, so stay
tuned.
### Format of Resource Bundles
Resource bundles are written in a plain-text source format. Before they can be
used, they must be compiled to the binary format using the `genrb` utility.
The source format is defined in a [formal definition
file](https://github.com/unicode-org/icu-docs/blob/master/design/bnf_rb.txt).
This is an example of a resource bundle source file:
```
// Comments start with a '//' and extend to the end of the line
// First, a locale name for the bundle is defined. The whole bundle is a table.
// Every resource, including the bundle itself, has a name.
// The name consists of invariant characters, digits and the symbols -, _.
root {
menu {
id { "mainmenu" }
items {
{
id { "file" }
name { "&File" }
items {
{
id { "open" }
name { "&Open" }
}
{
id { "save" }
name { "&Save" }
}
{
id { "exit" }
name { "&Exit" }
}
}
}
{
id { "edit" }
name { "&Edit" }
items {
{
id { "copy" }
name { "&Copy" }
}
{
id { "cut" }
name { "&Cut" }
}
{
id { "paste" }
name { "&Paste" }
}
}
}
...
}
}
// This resource is an array, thus accessible only through iteration and indexes...
errors {
"Invalid Command",
"Bad Value",
// Add more strings here...
"Read the Manual"
}
splash:import { "splash_root.gif" } // This is a binary imported file
pgpkey:bin { a1b2c3d4e5f67890 } // a binary value
versionInfo { // a table
major:int { 1 } // of integers
minor:int { 4 }
patch:int { 7 }
}
buttonSize:intvector { 10, 20, 10, 20 } // an array of 32-bit integers
// will pick up data from zoneStrings resource in en bundle in the ICU package
simpleAlias:alias { "/ICUDATA/en/zoneStrings" }
// will pick up data from CollationElements resource in en bundle
// in the ICU package
CollationElements:alias { "/ICUDATA/en" }
}
```
Binary format is described in the [uresdata.h](../../../icu4c/source/common/uresdata.h)
header file.
### Resources Syntax
Syntax of the resources that can be stored in resource bundles is specified in
the following table:
| Data Type | Format | Description |
| --------------- | ---------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Tables | `[name][:table] { subname1 { subresource1 } ... subnameN { subresourceN } }` | Tables are a complex resource that holds named resources. If it is a part of an array, it does not have a name. At this point, a resource bundle is a table. Access is allowed by key, index, and iteration. |
| Arrays | `[name][:array] {subresource1, ... subresourceN }` | Arrays are a complex resource that holds unnamed resources. If it is a part of an array, it does not have a name. Arrays require less memory than tables (since they don't store the name of sub-resources) but the index and iteration access are as fast as with tables. |
| Strings | `[name][:string] { ["]UnicodeText["] }` | Strings are simple resources that hold a chunk of Unicode encoded data. If it is a part of an array, it does not have a name. |
| Binaries | `name:bin { binarydata } name:import{ "fileNameToImport" }` | Binaries are used for storing binary information (processed data, images etc). Information is stored on a byte level. |
| Integers | `name:int { integervalue }` | Integers are used for storing a 32 bit integer value. |
| Integer Vectors | `name:intvector { integervalue, ... integervalueN }` | Integer vectors are used for storing 32 bit integer values. |
| Aliases | `name:alias { locale and path to aliased resource }` | Aliases point to other resources. They are useful for preventing duplication of data in resources that are not on the same branch of the fallback chain. Alias can also have an empty path. In that case the position of the alias resource is used to find the aliased resource. |
Although the type of some resources can be omitted for backward compatibility
reasons, you are strongly encouraged to always specify the type of a resource.
As the structure gets more complicated, some combinations of untyped resources
might produce unexpected results.
### Escape Sequences
String values can contain C/Java-style escape sequences like `\t`, `\r`, `\n`,
`\xhh`, `\uhhhh` and `\U00hhhhhh`, consistent with the `u_unescape()` C API; see the
[ustring.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ustring_8h.html)
API documentation.
A literal backslash (\\) in a string value must be doubled (\\\\) or escaped
with `\x5C` or `\u005C`.
A literal ASCII double quote (") in a double-quoted string must be escaped with
\\" or `\x22` or `\u0022`.
You should also escape carriage return (`\r`) and line feed (`\n`) as well as
control codes, non-characters, unassigned code points and other default-invisible
characters (see the Unicode [UAX #44](https://www.unicode.org/reports/tr44/)
`Default_Ignorable_Code_Point` property).
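To make the `\uhhhh` form concrete, here is a minimal decoder for just that one escape. It is only a sketch (BMP code points, well-formed input assumed, no error handling), not ICU's `u_unescape()`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal sketch: expand \uhhhh escapes (BMP only) to UTF-8 output.
// Not ICU's u_unescape(); error handling is omitted for brevity.
std::string expandUEscapes(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        if (s[i] == '\\' && i + 5 < s.size() && s[i + 1] == 'u') {
            unsigned cp = std::stoul(s.substr(i + 2, 4), nullptr, 16);
            // Encode the code point as UTF-8 (1..3 bytes for the BMP).
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
            i += 6;
        } else {
            out += s[i++];
        }
    }
    return out;
}
```

For example, `\u0041` expands to the letter A, while `\u00E9` expands to the two UTF-8 bytes of é.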
### Examples
The way to write your resource is to start with a table that has your locale
name. The contents of a table are between the curly brackets:
```
root:table {
}
```
Then you can start adding resources to your bundle. Resources on the first level
must be named and we suggest that you specify the type:
```
root:table {
usage:string { "Usage: genrb [Options] files" }
version:int { 122 }
errorcodes:array {
:string { "Invalid argument" }
:string { "File not found" }
}
}
```
The resource bundle format doesn't care about indentation and line breaks. You
can continue one string over many lines; the line break must be outside of the
string:
```
aVeryLongString:string {
"This string is quite long "
"and therefore should be "
"broken into several lines."
}
```
For more examples on syntax, take a look at our resource files for
[locales](../../../icu4c/source/data/locales) and [test data](../../../icu4c/source/test/testdata),
especially at the [testtypes resource bundle](../../../icu4c/source/test/testdata/testtypes.txt).
### Making Your Own Resource Bundles
In order to make your own resource bundle package, you need to perform several
steps:
1. Create your root resource bundle. This bundle should contain all the data
for your program. You are probably best off if you fill it with data in your
native language.
2. Create a chain of empty resource bundles for your native language and
region. For example, if your region is sr_CS, create all the entries in root
in Serbian and leave bundles for sr and sr_CS locales empty. This way, users
of your package will know whether you support a certain locale or not.
3. If you already have some data to localize, create more bundles with
localized data.
4. Decide on the name of your package. You will use the package name to access
your resources.
5. Compile the resource bundles using the `genrb` tool. The command line format
   is `genrb [options] list-of-input-files`. `genrb` expects source files to be
   in the invariant encoding with `\uXXXX` escapes, or in UTF-8/UTF-16 with BOM.
If you need to use a different encoding, specify it using the `--encoding`
option. You also need to specify the destination directory name for your
resources using the `--destdir` option. This destination name needs to be the
same as the package name. Full list of options can be retrieved by invoking
`genrb --help`.
   You can also output Java class files. You will need to specify the
   `--write-java` option, followed by an optional encoding for the resulting
   `.java` file. Default encoding is ASCII + `\uXXXX`. You will also have to
   specify the resource bundle name using the `--bundle-name` argument.
After using `genrb`, you will end up with files of name
`packagename_localename.res`. For example, if you had `root.txt`, `en.txt`,
`en_US.txt`, `es.txt` and you invoked `genrb` using the following command line:
`genrb -d myapplication root.txt en.txt en_US.txt es.txt`, you will end up
with `myapplication/root.res`, `myapplication/en.res`, etc. The forward slash can
be a back slash on some platforms, like Windows. These files are now ready
to use and you can open them using `ures_open("myapplication", "en_US", err);`.
6. However, you might want to have only one file containing all the data. In
that case you need to use the package data tool. It can produce either a
memory mapped file or a dynamically linked library. For more information on
how to use package data tool, see the appropriate [section](../icudata.md).
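The fallback behavior that step 2 relies on (empty `sr` and `sr_CS` bundles falling back to root) can be sketched as a truncation chain. This is an illustration with a hypothetical helper, not ICU's actual lookup, which involves additional steps such as consulting the default locale:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of resource fallback: "sr_CS" -> "sr" -> "root".
// Real ICU lookup is more involved; this only shows the truncation chain.
std::vector<std::string> fallbackChain(std::string locale) {
    std::vector<std::string> chain;
    while (!locale.empty()) {
        chain.push_back(locale);
        auto us = locale.rfind('_');
        locale = (us == std::string::npos) ? "" : locale.substr(0, us);
    }
    chain.push_back("root");  // root terminates every chain
    return chain;
}
```

Because every lookup ends at root, putting all strings in root guarantees that every locale finds *some* value, while the empty `sr` and `sr_CS` bundles advertise which locales are supported.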
Rolling out your own data takes some practice, especially if you want to package
it all together. You might want to take a look at how we package data. Good
places to start (except of course ICU's own [data](../../../icu4c/source/data/))
are [source/test/testdata/](../../../icu4c/source/test/testdata/) and
[source/samples/ufortune/resources/](../../../icu4c/source/samples/ufortune/resources/)
directories.
Also, here is a sample Windows batch file that does compiling and packing of
several resources:
```bat
genrb -d myapplication root.txt en.txt en_GB.txt fr.txt es.txt es_ES.txt
echo root.res en.res en_GB.res fr.res es.res es_ES.res > packagelist.txt
mkdir tmpdir
pkgdata -p myapplication -T tmpdir -m common packagelist.txt
```
It is also possible to use the `icupkg` tool instead of `pkgdata` to generate .dat
data archives. The `icupkg` tool became available in ICU4C 3.6. If you need the
data in a shared or static library, you still need to use the `pkgdata` tool. For
easier maintenance, packaging, installation and application patching, it's
recommended that you use .dat data archives.
### Using XLIFF for Localization
ICU provides tools that allow converting resource bundles to and from the XLIFF
format. Files in XLIFF format can contain translations of resources. In that
case, more than one resulting resource bundle will be constructed.
To produce an XLIFF file from a resource bundle, use the `-x` option of the
`genrb` tool from ICU4C. Assume that we want to convert a simple resource
bundle to the XLIFF format:
```
root {
usage {"usage: ufortune [-v] [-l locale]"}
optionMessage {"unrecognized command line option:"}
}
```
To get an XLIFF file, we need to call `genrb` like this: `genrb -x -l en root.txt`.
Option `-x` tells `genrb` to produce an XLIFF file, and option `-l` specifies the
language of the resource. If the language is not specified, `genrb` will try to
deduce the language from the resource name (en, zh, sh). If the resource name is
not an ISO language code (root), the default language for the platform will be
used. The language will be the source attribute for all the translation units.
The XLIFF file produced from the resource above will be named `root.xlf` and
will look like this:
```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version = "1.1" xmlns = 'urn:oasis:names:tc:xliff:document:1.1'
xmlns:xsi = 'http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='urn:oasis:names:tc:xliff:document:1.1
http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
<file xml:space = "preserve" source-language = "en"
datatype = "x-icu-resource-bundle" original = "root.txt"
date = "2007-08-17T21:17:08Z">
<header>
<tool tool-id = "genrb-3.3-icu-3.8" tool-name = "genrb"/>
</header>
<body>
<group id = "root" restype = "x-icu-table">
<trans-unit id = "optionMessage" resname = "optionMessage">
<source>unrecognized command line option:</source>
</trans-unit>
<trans-unit id = "usage" resname = "usage">
<source>usage: ufortune [-v] [-l locale]</source>
</trans-unit>
</group>
</body>
</file>
</xliff>
```
This file can be sent to translators. Using translation tools that support
XLIFF, translators will produce one or more translations for this resource.
A processed file might look like this:
```xml
<?xml version="1.0" encoding="utf-8"?>
<xliff version = "1.1" xmlns='urn:oasis:names:tc:xliff:document:1.1'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='urn:oasis:names:tc:xliff:document:1.1
http://www.oasis-open.org/committees/xliff/documents/xliff-core-1.1.xsd'>
<file xml:space = "preserve" source-language = "en" target-language = "sh"
datatype = "x-icu-resource-bundle" original = "root.txt"
date = "2007-08-17T21:17:08Z">
<header>
<tool tool-id = "genrb-3.3-icu-3.8" tool-name = "genrb"/>
</header>
<body>
<group id = "root" restype = "x-icu-table">
<trans-unit id = "optionMessage" resname = "optionMessage">
<source>unrecognized command line option:</source>
<target>nepoznata opcija na komandnoj liniji:</target>
</trans-unit>
<trans-unit id = "usage" resname = "usage">
<source>usage: ufortune [-v] [-l locale]</source>
<target>upotreba: ufortune [-v] [-l lokal]</target>
</trans-unit>
</group>
</body>
</file>
</xliff>
```
In order to convert this file to a set of resource bundle files, we need to use
ICU4J's `com.ibm.icu.dev.tool.localeconverter.XLIFF2ICUConverter` class.
> :point_right: **Note**: The XLIFF2ICUConverter class relies on an XML parser
> being available. JDK 1.4 and newer provide an XML parser out of the box. For
> earlier versions, you will need to install Xerces.
The command line for running XLIFF2ICUConverter should specify the file that
needs to be converted, `sh.xlf` in this case. Optionally, you can specify input
and output directories as well as the package name. After running this tool,
two files will be produced: `en.txt` and `sh.txt`. This is how they would look:
```
// ***************************************************************************
// *
// * Tool: com.ibm.icu.dev.tool.localeconverter.XLIFF2ICUConverter.java
// * Date & Time: 08/17/2007 11:33:54 AM HST
// * Source File: C:\trunk\icuhtml\userguide\xliff\sh.xlf
// *
// ***************************************************************************
en:table{
optionMessage:string{"unrecognized command line option:"}
usage:string{"usage: ufortune [-v] [-l locale]"}
}
```
and
```
// ***************************************************************************
// *
// * Tool: com.ibm.icu.dev.tool.localeconverter.XLIFF2ICUConverter.java
// * Date & Time: 08/17/2007 11:33:54 AM HST
// * Source File: C:\trunk\icuhtml\userguide\xliff\sh.xlf
// *
// ***************************************************************************
sh:table{
optionMessage:string{"nepoznata opcija na komandnoj liniji:"}
usage:string{"upotreba: ufortune [-v] [-l lokal]"}
}
```
These files can be then used as all the other resource bundle files.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Packaging ICU4C
## Overview
This chapter describes, for the advanced user, how to package ICU4C for
distribution, whether alone, as part of an application, or as part of the
operating system.
## Making ICU Smaller
The ICU project is intended to provide everything an application might need in
order to process Unicode. However, in doing so, the results may become quite
large on disk. A default build of ICU normally results in over 16 MB of data,
and a substantial amount of object code. This section describes some techniques
to reduce the size of ICU to only the items which are required for your
application.
### Link to ICU statically
If you add the `--enable-static` option to the ICU command line build (Makefile
or cygwin), ICU will also build a static library version which you can link to
only the exact functions your application needs. Users of your ICU must compile
with `-DU_STATIC_IMPLEMENTATION`. Also see [How To Use ICU](../howtouseicu.md).
### Reduce the number of libraries used
ICU consists of a number of different libraries. The library dependency chart in the [Design](../design.md#Library_Dependencies_C)
chapter can be used to understand and
determine the exact set of libraries needed.
### Disable ICU features
Certain features of ICU may be turned on and off through preprocessor defines.
These switches are located in the file "uconfig.h", and disable the code for
certain features from being built.
All of these switches are defined to '0' by default, unless overridden by the
build environment, or by modifying uconfig.h itself.
| Switch Name | Library | Effect if #defined to '1' |
|--------------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| UCONFIG_ONLY_COLLATION | common & i18n | Turn off all other modules named here except collation and legacy conversion |
| UCONFIG_NO_LEGACY_CONVERSION | common | Turn off conversion apart from UTF, CESU-8, SCSU, BOCU-1, US-ASCII, and ISO-8859-1. Not possible to turn off legacy conversion on EBCDIC platforms. |
| UCONFIG_NO_BREAK_ITERATION | common | Turn off break iteration |
| UCONFIG_NO_COLLATION | i18n | Turn off collation and collation-based string search. |
| UCONFIG_NO_FORMATTING | i18n | Turn off all formatting (date, time, number, etc), and calendar/timezone services. |
| UCONFIG_NO_TRANSLITERATION | i18n | Turn off script-to-script transliteration |
| UCONFIG_NO_REGULAR_EXPRESSIONS | i18n | Turn off the regular expression functionality |
> :point_right: **NOTE**: *These switches do not necessarily disable data generation. For example, disabling formatting does not prevent formatting data from being built into the resource bundles. See the section on ICU data, for information on changing data packaging.*
*However, some ICU data builders will not function with these switches set, such
as UCONFIG_NO_FILE_IO or UCONFIG_NO_REGULAR_EXPRESSIONS. If using these
switches, it is best to use pre-built ICU data, such as is the default for ICU
source downloads, as opposed to data builds "from scratch" out of SVN.*
#### Using UCONFIG switches with Environment Variables
This method involves setting an environment variable when ICU is built. For
example, on a POSIX-like platform, settings may be chosen at the point
runConfigureICU is run:
```shell
env CPPFLAGS="-DUCONFIG_NO_COLLATION=1 -DUCONFIG_NO_FORMATTING=1" \
runConfigureICU SOLARISCC ...
```
> :point_right: **Note**: When end-user code is compiled,
> it must also have the same CPPFLAGS
> set, or else calling some functions may result in a link failure.
#### Using UCONFIG switches by changing uconfig.h
This method involves modifying the source file
icu/source/common/unicode/uconfig.h directly, before ICU is built. It has the
advantage that the configuration change is propagated to all clients who compile
against this build of ICU, however the altered file must be tracked when the
next version of ICU is installed.
Modify `uconfig.h` to add the following lines before the first `#ifndef
UCONFIG_...` section:
```c
#ifndef UCONFIG_NO_COLLATION
#define UCONFIG_NO_COLLATION 1
#endif
#ifndef UCONFIG_NO_FORMATTING
#define UCONFIG_NO_FORMATTING 1
#endif
```
### Reduce ICU Data used
There are many ways in which ICU data may be reduced. If only certain locales or
converters will be used, others may be removed. Additionally, data may be
packaged as individual files or interchangeable archives (.dat files), allowing
data to be installed and removed without rebuilding ICU. For details, see the
[ICU Data](../icudata.md) chapter.
## ICU Versions
(This section assumes the reader is familiar with ICU version numbers, as
covered in the [Design](../design.md) chapter, and with the filename
conventions for libraries described in the
[ReadMe](../../../icu4c/readme.html#HowToPackage).)
### POSIX Library Names
The following table gives an example of the dynamically linked library and
symbolic links built by ICU for the common ('uc') library, version 5.4.3, on
Linux:
| File | Links to | Purpose |
|------------------|------------------|------------------------------------------------------------------------------------|
| `libicuuc.so` | `libicuuc.so.54.3` | Required for link: Applications compiled with '-licuuc' will follow this symlink. |
| `libicuuc.so.54` | `libicuuc.so.54.3` | Required for runtime: This name is what applications actually link against. |
| `libicuuc.so.54.3` | Actual library | Required for runtime and link. Contains the name `libicuuc.so.54`. |
> :point_right: **Note**: This discussion gives
> Linux as an example, but it is typical for most platforms,
> of which AIX and 390 (zOS) are exceptions.
An application compiled with '-licuuc' will follow the symlink from `libicuuc.so`
to `libicuuc.so.54.3`, and will actually read the file `libicuuc.so.54.3` (fully
qualified). This library file has an embedded name (SONAME) of `libicuuc.so.54`,
that is, with only the major and minor number. The linker will write **this**
name into the client application, because binary compatibility is for versions
that share the same major+minor number.
If ICU version 5.4.**7** is subsequently installed, the following files may be
updated.
| File | Links to | Purpose |
|------------------|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| `libicuuc.so` | `libicuuc.so.54.7` | Required for link: Newly linked applications will follow this link, which should not cause any functional difference at link time. |
| `libicuuc.so.54` | `libicuuc.so.54.7` | Required for runtime: Because it now links to version .7, existing applications linked to version 5.4.3 will follow this link and use the 5.4.7 code. |
| `libicuuc.so.54.7` | Actual library | Required for runtime and link. Contains the name `libicuuc.so.54`. |
If ICU version 5.6.3 or 3.2.9 were installed, they would not affect
already-linked applications, because the major+minor numbers are different - 56
and 32, respectively, as opposed to 54. They would, however, replace the link
`libicuuc.so`, which controls which version of ICU newly-linked applications
use.
In summary, what files should an application distribute in order to include a
functional runtime copy of ICU 5.4.3? The above application should distribute
`libicuuc.so.54.3` and the symbolic link `libicuuc.so.54`. (If symbolic links pose
difficulty, `libicuuc.so.54.3` may be renamed to `libicuuc.so.54`, and only
`libicuuc.so.54` distributed. This is less informative, but functional.)
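As an illustration of the naming and compatibility rules above, here is a small sketch. The helper names are hypothetical, and the filename layout follows the Linux convention described in this section:

```cpp
#include <cassert>
#include <string>

// Sketch of the Linux naming scheme described above.
// Binary compatibility: two builds are compatible iff they share
// the same major+minor number (which is what the SONAME encodes).
std::string soname(const std::string& stub, int major, int minor) {
    // e.g. soname("libicuuc", 5, 4) -> "libicuuc.so.54"
    return stub + ".so." + std::to_string(major) + std::to_string(minor);
}

std::string realName(const std::string& stub, int major, int minor, int micro) {
    // e.g. "libicuuc.so.54.3" -- the actual library file on disk
    return soname(stub, major, minor) + "." + std::to_string(micro);
}

bool binaryCompatible(int maj1, int min1, int maj2, int min2) {
    // 5.4.3 vs 5.4.7: compatible; 5.4.x vs 5.6.x: not compatible
    return maj1 == maj2 && min1 == min2;
}
```

This is why upgrading from 5.4.3 to 5.4.7 silently benefits already-linked applications, while 5.6.3 leaves them untouched.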
### POSIX Library suffix
The `--with-library-suffix` option may be used with `runConfigureICU` or
`configure` to distinguish specially modified versions of ICU on disk. For
example, the option `--with-library-suffix=myapp` will produce libraries with
names such as libicuuc**myapp**.so.54.3, thus preventing another ICU user from
using myapp's custom ICU libraries.
While two or more versions of ICU may be linked into the same application as
long as the major and minor numbers are different, changing the library suffix
is not sufficient to allow the same version of ICU to be linked. In other words,
linking ICU 5.4.3, 5.6.3, and 3.2.9 together is allowed, but 5.4.3 and 5.4.7 may
not be linked together, nor may 5.4.3 and 5.4.3-myapp be linked together.
### Windows library names
Assuming ICU version 5.4.3, Windows library names will follow this pattern:
| File | Purpose |
|---------------|--------------------------------------------------------------------------------------------|
| `icu`**uc**`.lib` | Release Link-time library. Needed for development. Contains `icuuc54.dll` name internally. |
| `icuuc54.dll` | Release runtime library. Needed for runtime. |
| `icuuc`**d**`.lib` | Debug link-time library (The `d` suffix indicates debug) |
| `icuuc54`**d**`.dll` | Debug runtime library. |
Debug applications must be linked with debug libraries, and release applications
with release libraries.
When a new version of ICU is installed, the .lib files will be replaced so as to
keep new compiles in sync with the newly installed header files, and the latest
DLL. As well, if the new ICU version has the same major+minor version (such as
5.4.7), then DLLs will be replaced, as they are binary compatible. However, if
an ICU with a different major+minor version is installed, such as 5.5, then new
DLLs will be copied with names such as 'icuuc55.dll'.
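The Windows pattern from the table can be sketched the same way; `dllName` is a hypothetical helper, not an ICU API:

```cpp
#include <cassert>
#include <string>

// Sketch of the Windows DLL naming pattern from the table above.
// The trailing "d" marks the debug runtime library.
std::string dllName(const std::string& stub, int major, int minor, bool debug) {
    return stub + std::to_string(major) + std::to_string(minor)
         + (debug ? "d" : "") + ".dll";
}
```

As on Linux, the major+minor pair embedded in the name is what keeps binary-compatible upgrades (5.4.3 to 5.4.7) transparent, while 5.5 gets a new `icuuc55.dll` name.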
## Packaging ICU4C as Part of the Operating System
The services which are now known as ICU were written to provide operating
system-level and application environment-level services. Several operating
systems include ICU as a standard or optional package.
See [ICU Binary Compatibility](../design.md#ICU_Binary_Compatibility) for
more details.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Plug-ins
This page documents the ICU4C DLL Plug-in capability.
This feature is a Technology Preview which first appeared in ICU4C version
4.3.4. It may be altered or removed in subsequent releases, and feedback is
appreciated.
## Off by default
As per ticket [ICU-11763](https://unicode-org.atlassian.net/browse/ICU-11763), the plugin
mechanism discussed here is disabled by default as of ICU 56. Use
**--enable-plugins** and/or define **UCONFIG_ENABLE_PLUGINS=1** to enable the
mechanism.
## Background
ICU4C has functionality for registering services, setting
mutex/allocation handlers, etc. But they must be installed 'before any
ICU services are used'
The ICU plugin mechanism allows small code modules, called plugins, to be loaded
automatically when ICU starts.
## How it works
At `u_init` time, ICU will read from a list of DLLs and entrypoints, and
attempt to load the plugins found in the list. Plugins are called and can
perform any ICU related function, such as registering or unregistering
service objects. At `u_cleanup` time, plugins have the opportunity to
uninstall themselves before they are removed from memory and unloaded.
## Plugin API
The current plugin API is documented in
[icuplug.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/icuplug_8h.html).
Some sample plugins are available at:
[testplug.c](../../../icu4c/source/tools/icuinfo/testplug.c)
Here is a simple, trivial plugin:
```c
U_CAPI
UPlugTokenReturn U_EXPORT2
myPlugin(UPlugData *data, UPlugReason reason, UErrorCode *status) {
    if(reason==UPLUG_REASON_QUERY) {
        uplug_setPlugName(data, "Simple Plugin"); /* optional */
        uplug_setPlugLevel(data, UPLUG_LEVEL_HIGH); /* Mandatory */
    } else if(reason==UPLUG_REASON_LOAD) {
        /* ... load ... */
        /* Set up some ICU things here. */
    } else if(reason==UPLUG_REASON_UNLOAD) {
        /* ... unload ... */
    }
    return UPLUG_TOKEN; /* Mandatory. */
}
```
The `UPlugData*` is an opaque pointer to the plugin-specific data,
and is used in all other API calls.
The API contract is:
1. The plugin MUST always return `UPLUG_TOKEN` as a return value, to
indicate that it is a valid plugin.
2. When the 'reason' parameter is set to `UPLUG_REASON_QUERY`, the
plugin MUST call `uplug_setPlugLevel()` to indicate whether it is a high
level or low level plugin.
3. When the 'reason' parameter is `UPLUG_REASON_QUERY`, the plugin
SHOULD call `uplug_setPlugName()` to indicate a human readable plugin name.
## Configuration
You can see a sample configuration file here:
[icuplugins_windows_sample.txt](../../../icu4c/source/tools/icuinfo/icuplugins_windows_sample.txt)
At ICU startup time, the environment variable "ICU_PLUGINS" will be
queried for a directory name. If it is not set, the #define
`DEFAULT_ICU_PLUGINS` will be checked for a default value.
`DEFAULT_ICU_PLUGINS` will be set, on autoconf'ed and installed ICU
versions, to "$(prefix)/lib/icu" if not set otherwise by the build
environment.
Within the above-named directory, the file "icuplugins##.txt" will be
opened, if present, where _##_ is the major+minor number of the currently
running ICU (such as, 44 for ICU 4.4).
So, for example, by default, ICU 4.4 would attempt to open
`$(prefix)/lib/icu/icuplugins44.txt`
The configuration file has the following format:
1. Hash (#) begins a comment line
2. Non-comment lines have two or three components:
> `LIBRARYNAME ENTRYPOINT [ CONFIGURATION .. ]`
3. Tabs or spaces separate the three items.
4. _LIBRARYNAME_ is the name of a shared library, either a short name if
it is on the loader path, or a full pathname.
5. _ENTRYPOINT_ is the short (undecorated) symbol name of the plugin's
entrypoint, as above.
6. _CONFIGURATION_ is the entire rest of the line. It's passed as-is to
the plugin.
An example configuration file is, in its entirety:
```
# this is icuplugins44.txt
testplug.dll myPlugin hello=world
```
The DLL testplug.dll is opened, and searched for the entrypoint
"myPlugin", which must meet the API contract above.
The string "hello=world" is passed to the plugin verbatim.
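The configuration line format described above (comment lines, then LIBRARYNAME, ENTRYPOINT, and a verbatim CONFIGURATION remainder) can be sketched as a small parser. This is an illustration of the format only, not ICU's actual loader code:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Sketch of the plugin configuration line format described above.
// Returns false for comment/blank lines; otherwise splits the line into
// LIBRARYNAME, ENTRYPOINT, and the verbatim CONFIGURATION remainder.
bool parsePluginLine(const std::string& line, std::string& lib,
                     std::string& entry, std::string& config) {
    if (line.empty() || line[0] == '#') return false;  // comment line
    std::istringstream in(line);
    if (!(in >> lib >> entry)) return false;           // need two fields
    std::getline(in, config);                          // rest of line, as-is
    // Trim the separating whitespace before the configuration string.
    auto start = config.find_first_not_of(" \t");
    config = (start == std::string::npos) ? "" : config.substr(start);
    return true;
}
```

Applied to the sample file above, the line `testplug.dll myPlugin hello=world` splits into the library, the entrypoint, and the verbatim string `hello=world`.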
## Load Order
Plugins are categorized as "high" or "low" level. Low level are those
which must be run BEFORE high level plugins, and before any operations
which cause ICU to be 'initialized'. If a plugin is low level but
causes ICU to allocate memory or become initialized, that plugin is said
to cause a 'level change'.
At load time, ICU first queries all plugins to determine their level,
then loads all 'low' plugins first, and then loads all 'high' plugins.
Plugins are otherwise loaded in the order listed in the configuration file.
## User interface and troubleshooting
The new command line utility, `icuinfo`, will not only print out ICU
version information, but will also give information on the load status
of plugins, with the "-L" option. It will list all loaded or
possibly-loaded plugins, give their level, and list any errors
encountered which prevented them from loading. Thus, the end user can
validate their plugin configuration file to determine if plugins are
missing, unloadable, or loaded in the wrong order.
For example, the following run shows that the plugin named
"myPluginFailQuery" did not call `uplug_setPlugLevel()` and thus failed to
load.
```
$ icuinfo -v -L
Compiled against ICU 4.3.4, currently running ICU 4.3.4
ICUDATA is icudt43l
plugin file is: /lib/plugins/icuplugins43.txt
Plugins:
# Level Name
Library:Symbol
config| (configuration string)
>>> Error | Explanation
-----------------------------------
#1 HIGH Just a Test High-Level Plugin
plugin| /lib/plugins/libplugin.dylib:myPlugin
config| x=4
#2 HIGH High Plugin
plugin| /lib/plugins/libplugin.dylib:myPluginHigh
config| x=4
#3 INVALID this plugin did not call uplug_setPlugName()
plugin| /lib/plugins/libplugin.dylib:myPluginFailQuery
config| uery
\\\ status| U_PLUGIN_DIDNT_SET_LEVEL
/// Error: This plugin did not call uplug_setPlugLevel during QUERY.
#4 LOW Low Plugin
plugin| /lib/plugins/libplugin.dylib:myPluginLow
config| x=4
Default locale is en_US
Default converter is UTF-8.
```

docs/userguide/posix.md Normal file
@@ -0,0 +1,246 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# C/POSIX Migration
## Migration from Standard C and POSIX APIs
The ISO C and POSIX standards define a number of APIs for string handling and
internationalization in C. They do not support Unicode well because they were
initially designed before Unicode/ISO 10646 were developed, and the POSIX APIs
are also problematic for other internationalization aspects.
This chapter discusses C/POSIX APIs with their problems, and shows which ICU
APIs to use instead.
> :point_right: **Note**: *We use the term "POSIX" to mean the POSIX.1 standard (IEEE Std 1003.1) which
defines system interfaces and headers with relevance for string handling and
internationalization. The XPG3, XPG4, Single Unix Specification (SUS) and other
standards include POSIX.1 as a subset, adding other specifications that are
irrelevant for this topic.*
> :construction: This chapter is not complete yet; more POSIX APIs are expected to be discussed
in the future.
## Strings and Characters
### Character Sets and Encodings
#### ISO C
The ISO C standard provides two basic character types (char and wchar_t) and
defines strings as arrays of units of these types. The standard allows nearly
arbitrary character and string character sets and encodings, which was necessary
when there was no single character set that worked everywhere.
For portable C programs, characters and strings are opaque, i.e., a program
cannot assume that any particular character is represented by any particular
code or sequence of codes. Programs use standard library functions to handle
characters and strings. Only a small set of characters — usually the set of
graphic characters available in US-ASCII — can be reliably accessed via
character and string literals.
#### Problems
1. Many different encodings are used on each platform, making it difficult for
multiple programs and libraries to process the same text.
2. Programs often need to know the codes of special characters. For example,
code that parses a filename needs to know how the path and file separators
are encoded; this is commonly possible because filenames deliberately use
US-ASCII characters, but any software that uses non-ASCII characters becomes
platform-dependent. It is practically impossible to provide sophisticated
text processing without knowledge of the character set, its string encoding,
and other detailed features.
3. The C/POSIX standards only provide a very limited set of useful functions
for character and string handling; many functions that are provided do not
work for non-trivial cases.
4. While the size of the char type is in practice fixed to 8 bits in modern
compilers, and its common encodings are reasonably well documented, the size
of wchar_t varies between 8/16/32 bits depending on the compiler, and only a
few of the string encodings used with it are documented.
5. See also [What size wchar_t do I need for
Unicode?](http://icu-project.org/docs/papers/unicode_wchar_t.html) .
6. A program based on this model must be recompiled for each platform. Usually,
it must be recompiled for each supported language or family of languages.
7. The ISO C standard basically requires, by how its standard functions are
defined, that the data type for a single character code in a large character
set is the same as the string base unit type (wchar_t). This has led to C
standard library implementations using Unicode encodings which are either
limited for single-character functions to only part of Unicode, or suffer
from reduced interoperability with most Unicode-aware software.
#### ICU
ICU always processes Unicode text. Unicode covers all languages and allows safe
hard coding of character codes, in addition to providing many standard or
recommended algorithms and a lot of useful character property data. See the
chapters about [Unicode Basics](unicode.md) and [Strings](strings/index.md) and
others.
ICU uses the 16-bit encoding form of Unicode (UTF-16) for processing, making it
fully interoperable with most Unicode-aware software. (See [UTF-16 for
Processing](http://www.unicode.org/notes/tn12/) .) In the case of ICU4J, this is
naturally the case because the Java language and the JDK use UTF-16.
ICU uses and/or provides direct access to all of the [Unicode
properties](strings/properties.md) which provide a much finer-grained
classification of characters than [C/POSIX character
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html)
.
In C/C++ source code character and string literals, ICU uses only "invariant"
characters. They are the subset of graphic ASCII characters that are almost
always encoded with the same byte values on all systems. (One set of byte values
for ASCII-based systems, and another such set of byte values for EBCDIC
systems.) See
[utypes.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utypes.h)
for the set of "invariant" characters.
With the use of Unicode, the implementation of many of the Unicode standard
algorithms, and its cross-platform availability, ICU provides for consistent,
portable, and reliable text processing.
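The UTF-16 processing model is easy to observe from Java, whose `String` class also stores UTF-16 code units (ICU4J builds directly on it). A minimal JDK-only sketch of the code unit vs. code point distinction:

```java
public class Utf16Demo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP, so UTF-16
        // stores it as a surrogate pair of two 16-bit code units.
        String clef = new StringBuilder().appendCodePoint(0x1D11E).toString();
        System.out.println(clef.length());                         // 2 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
    }
}
```

Code that walks strings by code point rather than by code unit stays correct for the full Unicode repertoire.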
### Case Mappings
#### ISO C
The standard C functions tolower(), toupper(), towlower(), towupper(), etc. take and return one
character code each.
#### Problems
1. This does not work for German, where the character "ß" (sharp s) uppercases
to the two characters "SS". (It "expands".)
2. It does not work for Greek, where the character "Σ" (capital sigma)
lowercases to either "ς" (small final sigma) or "σ" (small sigma) depending
on whether the capital sigma is the last letter in a word. (It is
context-dependent.)
3. It does not work for Lithuanian and Turkic languages where a "combining dot
above" character may need to be removed in certain cases. (It "contracts"
and is language- and context-dependent.)
4. There are a number of other such cases.
5. There are no standard functions for title-casing strings.
6. There are no standard functions for case-folding strings. (Case-folding is
used for case-insensitive comparisons; there are C/POSIX functions for
direct, case-insensitive comparisons of pairs of strings. Case-folding is
useful when one string is compared to many others, or as part of a chain of
transformations of a string.)
#### ICU
Case mappings are operations taking and returning strings, to support length
changes and context dependencies. Unicode provides algorithms and data for
proper case mappings, and ICU provides APIs for them. (See the API references
for various string functions and for Transforms/Transliteration.)
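The three problem cases above are easy to demonstrate. The sketch below uses the JDK's `String.toUpperCase`/`toLowerCase`, which implement the same Unicode case-mapping algorithms as ICU's string case APIs, so it runs without ICU on the classpath:

```java
import java.util.Locale;

public class CaseDemo {
    public static void main(String[] args) {
        // Expansion: German sharp s (U+00DF) uppercases to two characters.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));      // SS

        // Context dependence: capital sigma lowercases to final sigma
        // (U+03C2) at the end of a word, ordinary sigma (U+03C3) elsewhere.
        String sigma = "\u03A3\u039F\u03A6\u039F\u03A3";            // "ΣΟΦΟΣ"
        System.out.println(sigma.toLowerCase(Locale.ROOT));         // last sigma becomes ς

        // Language dependence: Turkish maps "I" to dotless "ı" (U+0131).
        System.out.println("I".toLowerCase(new Locale("tr")));
    }
}
```

None of these results can be produced by a function that maps one character code to one character code.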
### Character Classes
#### ISO C
The standard C functions isalpha(), isdigit(), etc. take a character code each
and return boolean values for whether the character belongs to the current
locale's respective character class.
#### Problems
1. Character classes are bound to locales, instead of providing consistent
classifications for characters.
2. The same character may have different classifications depending on the
locale and the platform.
3. There are only a few POSIX character classes, and they are not well
defined. For example, there is a class for punctuation characters but not
one for symbols.
4. For example, the dollar symbol (“$”) may or may not belong to the punct
class depending on the locale, even on the same system.
5. The standard allows at most two sets of decimal digits: The digits of the
“portable character set” (i.e., those in the ASCII repertoire) and one more.
Some implementations only recognize ASCII digits in the isdigit() function.
However, there are many sets of decimal digits in a multilingual character
set like Unicode.
6. The POSIX standard assumes that each locale definition file carries the
character class data for all relevant characters. With many locales using
overlapping character repertoires, this can lead to a lot of duplication.
For efficiency, many UTF-8 locales define character classes only for very
few characters instead of for all of Unicode. For example, some de_DE.utf-8
locales only define character classes for characters used in German, or for
the repertoire of ISO 8859-1; in other words, for only a tiny fraction of
the representable Unicode repertoire. Processing of text using more than
this repertoire is not possible with such an implementation.
7. For more about the problems with POSIX character classes in a Unicode
context see [Annex C: Compatibility Properties in Unicode Technical Standard
#18: Unicode Regular
Expressions](http://www.unicode.org/reports/tr18/#Compatibility_Properties)
and see the mailing list archives for the unicode list (on unicode.org). See
also the ICU design document about [C/POSIX character
classes](https://htmlpreview.github.io/?https://github.com/unicode-org/icu-docs/blob/master/design/posix_classes.html).
#### ICU
ICU provides locale-independent access to all [Unicode
properties](strings/properties.md) (except Unihan.txt properties), as well as to
the POSIX character classes, via functions defined in uchar.h and in ICU4J's
UCharacter class (see API references) as well as via UnicodeSet. The POSIX
character classes are implemented according to the recommendations in UTS #18.
The Unicode Character Database defines more than 70 character properties, their
values are designed for the large character set as well as for real text
processing, and they are updated with each version of Unicode. The UCD is
available online, facilitating industry-wide consistency in the implementation
of Unicode properties.
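The flavor of such locale-independent property APIs can be seen in the JDK's `Character` class, which mirrors a subset of the functionality in ICU's uchar.h and UCharacter:

```java
public class CharClassDemo {
    public static void main(String[] args) {
        // Classification is locale-independent and covers all of Unicode.
        System.out.println(Character.isLetter('\u00E9'));  // é: true
        System.out.println(Character.getType('$') == Character.CURRENCY_SYMBOL); // true

        // Unicode has many sets of decimal digits, not just ASCII 0-9;
        // U+0663 is ARABIC-INDIC DIGIT THREE.
        System.out.println(Character.isDigit('\u0663'));   // true
        System.out.println(Character.digit('\u0663', 10)); // 3
    }
}
```

Unlike the POSIX isdigit()/isalpha() functions, these answers do not change with the locale or with which locale definition files happen to be installed.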
## Formatting and Parsing
### Currency Formatting
#### POSIX
The strfmon() function is used to format monetary values. The default format and
the currency display symbol or display name are selected by the LC_MONETARY
locale ID. The number formatting can also be controlled with a formatting string
resembling what printf() uses.
#### Problems
1. Selection of the currency via a locale ID is unreliable: Countries change
currencies over time, and the locale data for a particular country may not
be available. This results in using the wrong currency. For example, an
application may assume that a country has switched from a previous currency
to the Euro, but it may run on an OS that predates the switch.
2. Using a single locale ID for the whole format makes it very difficult to
format values for multiple currencies with the same number format (for
example, for an exchange rate list or for showing the price of an item
adjusted for several currencies). strfmon() allows the number format to be
specified fully, but then the application cannot use a country's default
number format.
3. The set of formattable currencies is limited to those that are available via
locale IDs on a particular system.
4. There does not appear to be a function to parse currency values.
#### ICU
ICU number formatting APIs have separate, orthogonal settings for the number
format, which can be selected with a locale ID, and the currency, which is
specified with an ISO code. See the [Formatting
Numbers](formatparse/numbers/index.md) chapter for details.
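The separation of number format and currency can be sketched with the JDK's `java.text.NumberFormat`, which is modeled on ICU's: the number conventions come from the locale ID while the currency is set independently by ISO 4217 code.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

public class CurrencyDemo {
    public static void main(String[] args) {
        // German number conventions, but the currency is Japanese yen;
        // neither setting constrains the other.
        NumberFormat fmt = NumberFormat.getCurrencyInstance(Locale.GERMANY);
        fmt.setCurrency(Currency.getInstance("JPY"));
        System.out.println(fmt.format(1234.5));
    }
}
```

The exact output string depends on the locale data shipped with the runtime, which is why no expected value is shown above.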

docs/userguide/services.md Normal file
@@ -0,0 +1,352 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU Services
## Overview of the ICU Services
ICU enables you to write language-independent C/C++ and Java code that works
with separate, localized resources to produce language-specific results. ICU
supports many features, including language-sensitive text, dates, times,
numbers, currency, messages, sorting, and searching. ICU provides
language-specific results for a broad range of languages.
### Strings, Properties and CharacterIterator
ICU provides basic Unicode support for the following:
* [Unicode strings](strings/index.md)
ICU includes type definitions for UTF-16 strings and code points. It also
contains many C u_string functions and the C++ UnicodeString class with many
additional string functions.
* [Unicode properties](strings/properties.md)
ICU includes the C definitions and functions found in uchar.h as well as
some macros found in utf.h. It also includes the C++ Unicode class.
* [Unicode string iteration](strings/characteriterator.md)
In C, ICU uses the macros in utf.h for the iteration of strings. In C++, ICU
uses the CharacterIterator class and its subclasses.
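The iteration model is the same one the JDK exposes in `java.text.CharacterIterator` and `StringCharacterIterator`; a minimal forward loop looks like:

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class IterDemo {
    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("abc");
        StringBuilder out = new StringBuilder();
        // Walk forward from the first character until DONE is returned.
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            out.append(c);
        }
        System.out.println(out); // abc
    }
}
```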
### Conversion Basics
A converter is used to transform text from one encoding type to another. In the
case of Unicode, ICU transforms text from one encoding codepage to Unicode and
back. An encoding is a mapping from a given character set definition to the
actual bits used to represent the data.
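In Java, whether using ICU4J converters or the JDK, such conversions go through charsets; a small sketch with the JDK's `java.nio.charset` API shows how the same characters map to different byte sequences in different encodings:

```java
import java.nio.charset.StandardCharsets;

public class ConvDemo {
    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // é -> 0xE9
        byte[] utf8   = text.getBytes(StandardCharsets.UTF_8);      // é -> 0xC3 0xA9
        System.out.println(latin1.length); // 4
        System.out.println(utf8.length);   // 5
        // Converting back through the same charset recovers the string.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(text)); // true
    }
}
```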
### Locale and Resources
The ICU package contains the locale and resource bundles as well as the classes
that implement them. Also, the ICU package contains the locale data (plain text
resource bundles) and provides APIs to access and make use of that data in
various services. Users need to understand these terms and the relationship
between them.
A locale identifies a group of users who have similar cultural and linguistic
expectations for how their computers interact with them and process data. This
abstract concept is typically expressed as a locale ID or as a locale object.

A locale ID specifies a language and region, enabling the software to support
culturally and linguistically appropriate information for each user. A locale
object represents a specific geographical, political, or cultural region. As a
programmatic expression of locale IDs, ICU provides the C++ Locale class. In C,
Application Programming Interfaces (APIs) use simple C strings for locale IDs.
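In Java, the analogous programmatic expression is `java.util.Locale` (ICU4J additionally provides `com.ibm.icu.util.ULocale`); a quick JDK-only sketch:

```java
import java.util.Locale;

public class LocaleDemo {
    public static void main(String[] args) {
        // Language "de" (German), region "DE" (Germany).
        Locale de = new Locale("de", "DE");
        System.out.println(de);                                    // de_DE
        System.out.println(de.getDisplayLanguage(Locale.ENGLISH)); // German
    }
}
```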
ICU stores locale-specific data in resource bundles, which provide a general
mechanism to access strings and other objects for ICU services to perform
according to locale conventions. ICU contains data for its services to support
many locales. Resource bundles contain the locale data of applications that use
ICU. In C++, the **ResourceBundle** class provides access to the locale data.
In C, this feature is provided by the **ures_** interface.
In addition to storing system-level data in ICU's resource bundles, applications
typically also need to use resource bundles of their own to store
locale-dependent application data. ICU provides the generic resource bundle APIs
to access these bundles and also provides the tools to build them.
> :point_right: **Note**: *Display strings, which are displayed to a user of a program, are bundled in a
separate file instead of being embedded in the lines of the program.*
### Locales and Services
The interaction between locales and services is fundamental to ICU. Please refer
to [Locales and Services](./locale/index.md#Locales_and_Services).
### Transliteration
Transliteration was originally designed to convert characters from one script to
another (for example, from Greek to Latin, or Japanese Katakana to Latin). Now,
transliteration is a more flexible mechanism that has pre-built transformations
for case conversions, normalization conversions, the removal of given
characters, and also for a variety of language and script transliterations.
Transliterations can be chained together to perform a series of operations and
each step of the process can use a UnicodeSet to restrict the characters that
are affected. There are two basic types of transliterators:
Most natural language transliterators (such as Greek-Latin) are written as
rule-based transliterators. Rule-based transliterators can be expressed in text
files, using a simple language that is similar to regular expression syntax.
### Date and Time Classes
Date and time routines manage independent date and time functions in
milliseconds since January 1, 1970 (0:00:00.000 UTC). Points in time before then
are represented as negative numbers.
ICU provides the following [classes](datetime/index.md) to support calendars and
time zones:
* [Calendar](datetime/calendar/index.md#calendar)
The abstract superclass for extracting calendar-related attributes from a
Date value.
* [Gregorian Calendar](datetime/calendar/index.md#gregorian-calendar)
A concrete class for representing a Gregorian calendar.
* [TimeZone](datetime/timezone/index.md)
An abstract superclass for representing a time zone.
* [SimpleTimeZone](datetime/timezone/index.md)
A concrete class for representing a time zone for use with a Gregorian
calendar.
> :point_right: **Note**: *C classes provide the same functionality as the C++ classes with the exception
of subclassing.*
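The epoch-based model is easy to verify with the JDK's `GregorianCalendar` and `TimeZone` classes, which share their design with the ICU classes listed above:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class EpochDemo {
    public static void main(String[] args) {
        // Millisecond 0 is January 1, 1970, 0:00:00.000 UTC.
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(0L);
        System.out.println(cal.get(Calendar.YEAR));                      // 1970
        System.out.println(cal.get(Calendar.MONTH) == Calendar.JANUARY); // true

        // Points in time before the epoch are negative millisecond values.
        cal.setTimeInMillis(-1L);
        System.out.println(cal.get(Calendar.YEAR));                      // 1969
    }
}
```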
### Format and Parse
Formatters translate between non-text data values and textual representations of
those values. The result is a string of text that represents the internal value.
A formatter can parse a string and convert a textual representation of some
value (if it finds one it understands) back into its internal representation.
For example, when the formatter reads the characters 1, 0, and 3 followed by
something other than a digit, it produces the value 103 in its internal binary
representation.
A formatter takes a value and produces a user-readable string that represents
that value or takes a string and parses it to produce a value.
ICU provides the following areas and classes for general formatting, formatting
numbers, formatting dates and times, and formatting messages:
#### General Formatting
See [Formatting and Parsing Classes](formatparse/index.md#formatting-and-parsing-classes) for an introduction to the following:
* Format
* FieldPosition
* ParsePosition
* Formattable
#### Formatting Numbers
* [NumberFormat](formatparse/numbers/index.md#numberformat)
NumberFormat provides the basic fields and methods to format number objects
and number primitives into localized strings and parse localized strings to
number objects.
* [DecimalFormat](formatparse/numbers/index.md#decimalformat)
DecimalFormat provides the methods used to format number objects and number
primitives into localized strings and parse localized strings into number
objects in base 10.
* [DecimalFormatSymbols](formatparse/numbers/index.md#decimalformatsymbols)
DecimalFormatSymbols is a concrete class used by DecimalFormat to access
localized number strings such as the grouping separators, the decimal
separator, and the percent sign.
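A sketch of DecimalFormat and DecimalFormatSymbols working together, using the JDK's `java.text` versions (the ICU4J classes have the same shape). The symbols are pinned explicitly so the output does not depend on the locale data shipped with a particular runtime:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class NumDemo {
    public static void main(String[] args) throws ParseException {
        // European-style symbols: '.' groups digits, ',' marks the decimal.
        DecimalFormatSymbols syms = new DecimalFormatSymbols(Locale.ROOT);
        syms.setGroupingSeparator('.');
        syms.setDecimalSeparator(',');
        DecimalFormat fmt = new DecimalFormat("#,##0.#", syms);

        System.out.println(fmt.format(1234.5));                 // 1.234,5
        System.out.println(fmt.parse("1.234,5").doubleValue()); // 1234.5
    }
}
```

Formatting and parsing are symmetric: the same object converts between the binary value and its localized text form.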
#### Formatting Dates and Times
* [DateFormat](formatparse/datetime/index.md)
DateFormat provides the basic fields and methods for formatting date objects
to localized strings and parsing date and time strings to date objects.
* [SimpleDateFormat](formatparse/datetime/index.md)
SimpleDateFormat is a concrete class used to format date objects to
localized strings and to parse date and time strings to date objects using a
GregorianCalendar.
* [DateFormatSymbols](formatparse/datetime/index.md)
DateFormatSymbols is a concrete class used to access localized date and time
formatting strings, such as names of the months, days of the week, and the
time zone.
#### Formatting Messages
* [MessageFormat](formatparse/messages/index.md)
MessageFormat is a concrete class used to produce a language-specific user
message that contains numbers, currency, percentages, date, time, and string
variables.
* [ChoiceFormat](formatparse/messages/index.md)
ChoiceFormat is a concrete class used to map strings to ranges of numbers
and to handle plural words and name series in user messages.
> :point_right: **Note**: *C classes provide the same functionality as the C++ classes with the exception
of subclassing.*
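A small sketch of MessageFormat and ChoiceFormat using the JDK's `java.text` versions, whose pattern syntax the ICU classes extend:

```java
import java.text.ChoiceFormat;
import java.text.MessageFormat;
import java.util.Locale;

public class MsgDemo {
    public static void main(String[] args) {
        // Arguments are substituted and formatted per the locale.
        MessageFormat mf =
            new MessageFormat("{0} has {1,number,integer} files", Locale.US);
        System.out.println(mf.format(new Object[] {"disk", 1234})); // disk has 1,234 files

        // ChoiceFormat maps number ranges to strings, e.g. for plurals.
        ChoiceFormat cf = new ChoiceFormat("0#no files|1#one file|1<many files");
        System.out.println(cf.format(0)); // no files
        System.out.println(cf.format(3)); // many files
    }
}
```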
### Searching and Sorting
Sorting and searching non-English text presents a number of challenges that many
English speakers are unaware of. The primary source of difficulty is accents,
which have very different meanings in different languages, and sometimes even
within the same language:
* Many accented letters, such as the é in café, are treated as minor variants
on the letter that is accented.
* Sometimes the accented form of a letter is treated as a distinct letter for
the purposes of comparison. For example, Å in Danish is treated as a
separate letter that sorts just after Z.
* In some cases, an accented letter is treated as if it were two letters. In
traditional German, for example, ä is compared as if it were ae.
Searching and sorting is done through collation using the Collator class and its
sub-classes RuleBasedCollator and CollationElementIterator as well as the
CollationKey object. Collation determines the proper sort sequence for two or
more natural language strings. It also can determine if two strings are
equivalent for the purpose of searching.
The Collator class and its sub-class RuleBasedCollator perform locale-sensitive
string comparisons to create sorting and searching routines for natural language
text. Collator and RuleBasedCollator can distinguish between base characters
(such as 'a' and 'b'), accent marks (such as 'ò' and 'ó'), and uppercase or
lowercase forms (such as 'a' and 'A').
ICU provides the following collation classes for sorting and searching natural
language text according to locale-specific rules:
* [Collator](collation/architecture.md) is the abstract base class of all classes that compare strings.
* [CollationElementIterator](collation/architecture.md) is a concrete iterator class that provides an
iterator for stepping through each character of a locale-specific string
according to the rules of a specific collator object.
* [RuleBasedCollator](collation/architecture.md) is the only built-in
implementation of the collator. It
provides a sophisticated mechanism for comparing strings in a
language-specific manner, and an interface that allows the user to
specifically customize the sorting order.
* [CollationKey](collation/architecture.md) is an object that enables the fast sorting of strings by
representing a string as a sort key under the rules of a specific collator
object.
> :point_right: **Note**: *C classes provide the same functionality as the C++ classes with the exception
of subclassing.*
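A sketch of strength-based comparison and sort keys using the JDK's `java.text.Collator` (ICU4J's `com.ibm.icu.text.Collator` offers a very similar API):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollDemo {
    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.FRENCH);

        // At PRIMARY strength, accent differences are ignored.
        coll.setStrength(Collator.PRIMARY);
        System.out.println(coll.equals("caf\u00E9", "cafe")); // true

        // At TERTIARY (the default) they are significant.
        coll.setStrength(Collator.TERTIARY);
        System.out.println(coll.equals("caf\u00E9", "cafe")); // false

        // Sort keys let one string be compared to many others quickly.
        CollationKey a = coll.getCollationKey("abc");
        CollationKey b = coll.getCollationKey("abd");
        System.out.println(a.compareTo(b) < 0); // true
    }
}
```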
### Text Analysis
The BreakIterator services can be used for formatting and handling text;
locating the beginning and ending points of a word; counting words, sentences,
and paragraphs; and listing unique words. Specifically, text operations can be
done to locate the following linguistic boundaries:
* Display text on the screen and locate places in the text where the
BreakIterator can perform word-wrapping to fit the text within the margins
* Locate the beginning and end of a word that the user has selected
* Count graphemes (or characters), words, sentences, or paragraphs
* Determine how far to move in the text store when the user hits an arrow key
to move forward or backward one grapheme
* Make a list of all the unique words in a document
* Figure out whether or not a range of text contains only whole words
* Capitalize the first letter of each word
* Extract a particular unit from the text such as "find me the third grapheme
in this document"
The BreakIterator services were designed and developed around an "iterator" or
"cursor" style of interface. The object points to a particular place in the
text. You can move the pointer forward or backward to search the text for
boundaries.
The BreakIterator class makes it possible to iterate over user characters. A
BreakIterator can find the location of a character, word, sentence or potential
line-break boundary. This makes it possible for a software program to properly
select characters for text operations such as highlighting a character, cutting
a word, moving to the next sentence, or wrapping words at a line ending.
BreakIterator performs these operations in a locale-sensitive manner, meaning
that it recognizes text boundaries according to the particular locale ID.
ICU provides the following classes for iterating over locale-specific text:
* [BreakIterator](boundaryanalysis/index.md)
The abstract base class that defines the operations for finding and getting
the positions of logical breaks in a string of text: characters, words,
sentences, and potential line breaks.
* [CharacterIterator](strings/characteriterator.md)
The abstract base class for forward and backward iteration over a string of
Unicode characters.
* [StringCharacterIterator](strings/index.md)
A concrete class for forward and backward iteration over a string of Unicode
characters. StringCharacterIterator inherits from CharacterIterator.
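A word-boundary sketch using the JDK's `java.text.BreakIterator` (ICU4J provides `com.ibm.icu.text.BreakIterator` with a similar API), illustrating the "iterator over boundary positions" model described above:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakDemo {
    public static void main(String[] args) {
        String text = "Hello, world. How are you?";
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);

        // Walk the boundary positions; each adjacent pair delimits a token.
        List<String> words = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE;
                start = end, end = wb.next()) {
            String token = text.substring(start, end);
            if (Character.isLetter(token.charAt(0))) { // keep word tokens only
                words.add(token);
            }
        }
        System.out.println(words); // [Hello, world, How, are, you]
    }
}
```

The same loop shape works for the character, sentence, and line-break instances; only the factory method changes.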
### Paragraph Layout
See [Paragraph Layout](./layoutengine/paragraph.md) for more details.
## Locale-Dependent Operations
Many of the ICU classes are locale-sensitive, meaning that you have to create a
different one for each locale.
| C API | C++ Class | Description |
|----------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ubrk_ | BreakIterator | The BreakIterator class implements methods to find the location of boundaries in the text. |
| ucal_ | Calendar | The Calendar class is an abstract base class that converts between a UDate object and a set of integer fields such as YEAR, MONTH, DAY, HOUR, and so on. |
| umsg.h | ChoiceFormat | A ChoiceFormat class enables you to attach a format to a range of numbers. |
| ucol_ | CollationElementIterator | The CollationElementIterator class is used as an iterator to walk through each character of an international string. |
| ucol_ | CollationKey | The Collator class generates the Collation keys. |
| ucol_ | Collator | The Collator class performs locale-sensitive string comparison. |
| udat_ | DateFormat | DateFormat is an abstract class for a family of classes. DateFormat converts dates and times from their internal representations to a textual form that is language-independent, and then back to their internal representations. |
| udat_ | DateFormatSymbols | DateFormatSymbols is a public class that encapsulates localized date and time formatting data. This information includes time zone information. |
| unum_ | DecimalFormatSymbols | This class represents the set of symbols needed by DecimalFormat to format numbers. |
| umsg.h | Format | The Format class is the base class for all formats. |
| ucal_ | GregorianCalendar | GregorianCalendar is a concrete class that provides the standard calendar used in many locations. |
| uloc_ | Locale | A Locale object represents a specific geographical, political, or cultural region. |
| umsg.h | MessageFormat | MessageFormat provides a means to produce concatenated messages in a language-neutral way. |
| unum_ | NumberFormat | NumberFormat is an abstract base class for all number formats. |
| ures_ | ResourceBundle | ResourceBundle provides a means to access a collection of locale-specific information. |
| ucol_ | RuleBasedCollator | The RuleBasedCollator provides the implementation of the Collator class using data-driven tables. |
| udat_ | SimpleDateFormat | SimpleDateFormat is a concrete class used to format and parse dates in a language-independent way. |
| ucal_ | SimpleTimeZone | SimpleTimeZone is a concrete subclass of TimeZone that represents a time zone for use with a Gregorian calendar. |
| usearch_ | StringSearch | StringSearch provides a way to search text in a locale-sensitive manner. |
| ucal_ | TimeZone | TimeZone represents a time zone offset, and also determines daylight saving time settings. |
## Locale-Independent Operations
The following ICU services can be used in all locales as they provide
locale-independent services and users do not need to specify a locale ID:
| C API | C++ Class | Description |
|-----------|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ubidi_ | | UBiDi is used for implementing the Unicode BiDi algorithm. |
| utf.h | CharacterIterator | CharacterIterator is an abstract class that defines an API for iteration on text objects. It is an interface for forward and backward iteration and for the random access of a text object. Also, it provides backward compatibility to the Java and older ICU CharacterIterator classes. |
| n/a | Formattable | Formattable is a thin wrapper class that converts between the primitive numeric types (double, long, and so on) and the UDate and UnicodeString classes. Formattable objects can be passed to the Format class or its subclasses for formatting. |
| unorm_ | Normalizer | Normalizer transforms Unicode text into an equivalent composed or decomposed form to allow for easier sorting and searching of text. |
| n/a | ParsePosition | ParsePosition is a simple class used by the Format class and its subclasses to keep track of the current position during parsing. |
| uidna_ | | An implementation of the IDNA protocol as defined in RFC 3490. |
| utf.h | StringCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UnicodeString. |
| utf.h | UCharCharacterIterator | A concrete subclass of CharacterIterator that iterates over the characters (code units or code points) in a UChar array. |
| uchar.h | | The Unicode character properties API allows you to query the properties associated with individual Unicode character values. |
| uregex_ | RegexMatcher | RegexMatcher is a regular expressions implementation. This allows you to perform string matching based upon a pattern. |
| utrans_ | Transliterator | Transliterator is an abstract class that transliterates text from one format to another. The most common type of transliterator is a script, or an alphabet. |
| uset_ | UnicodeSet | Objects of the UnicodeSet class represent character classes used in regular expressions. These classes specify a subset of the set of all Unicode characters. This is a mutable set of Unicode characters. |
| ustring.h | UnicodeString | UnicodeString is a string class that stores Unicode characters directly. This class is a concrete implementation of the abstract class Replaceable. |
| ushape.h | | Provides operations to transform (shape) between Arabic characters and their presentation forms. |
| ucnv_ | | The Unicode conversion API allows you to convert data written in one codepage/encoding to and from UTF-16. |

docs/userguide/sitemap.md Normal file
@@ -0,0 +1,83 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
* [Boundary Analysis](boundaryanalysis/index.md)
* [Break Rules](boundaryanalysis/break-rules.md)
* [Collation](collation/index.md)
* [Collation API Details](collation/api.md)
* [ICU Collation Service Architecture](collation/architecture.md)
* [Collation Concepts](collation/concepts.md)
* [Collation Customization](collation/customization/index.md)
* [“Ignore Punctuation” Options](collation/customization/ignorepunct.md)
* [Collation Examples](collation/examples.md)
* [Collation FAQ](collation/faq.md)
* [ICU String Search Service](collation/icu-string-search-service.md)
* [Conversion](conversion/index.md)
* [Compression](conversion/compression.md)
* [Using Converters](conversion/converters.md)
* [Conversion Data](conversion/data.md)
* [Character Set Detection](conversion/detection.md)
* [Date/Time Services](datetime/index.md)
* [Calendar Classes](datetime/calendar/index.md)
* [Calendar Examples](datetime/calendar/examples.md)
* [ICU TimeZone Classes](datetime/timezone/index.md)
* [Date and Time Zone Examples](datetime/timezone/examples.md)
* [Universal Time Scale](datetime/universaltimescale.md)
* [ICU Architectural Design](design.md)
* [Development](dev/index.md)
* [Coding Guidelines](dev/codingguidelines.md)
* [Contributions to the ICU library](dev/contributions.md)
* [Synchronization Issues](dev/sync/index.md)
* [Custom ICU4C Synchronization](dev/sync/custom.md)
* [Editing the ICU User Guide](editing.md)
* [Formatting and Parsing](formatparse/index.md)
* [Formatting Dates and Times](formatparse/datetime/index.md)
* [Date and Time Formatting Examples](formatparse/datetime/examples.md)
* [Formatting Messages](formatparse/messages/index.md)
* [Message Formatting Examples](formatparse/messages/examples.md)
* [Formatting Numbers](formatparse/numbers/index.md)
* [RuleBasedNumberFormat Examples](formatparse/numbers/rbnf-examples.md)
* [Rounding Modes](formatparse/numbers/rounding-modes.md)
* [Glossary](glossary.md)
* [How To Use ICU](howtouseicu.md)
* [Software Internationalization](i18n.md)
* [ICU4J Locale Service Provider](icu4j-locale-service-provider.md)
* [ICU Data](icudata.md)
* [ICU FAQs](icufaq/index.md)
* [ICU4J FAQ](icufaq/icu4j-faq.md)
* [Introduction to ICU](intro.md)
* [ICU IO](io/index.md)
* [C: ustdio](io/ustdio.md)
* [C++: ustream](io/ustream.md)
* [Layout Engine](layoutengine/index.md)
* [Paragraph Layout](layoutengine/paragraph.md)
* [Locale](locale/index.md)
* [Locale Examples](locale/examples.md)
* [Localizing with ICU](locale/localizing.md)
* [Resource Management](locale/resources.md)
* [Packaging ICU4C](packaging/index.md)
* [Packaging ICU4J](packaging-icu4j.md)
* [Plug-ins](packaging/plug-ins.md)
* [C/POSIX Migration](posix.md)
* [ICU Services](services.md)
* [Strings](strings/index.md)
* [CharacterIterator Class](strings/characteriterator.md)
* [Properties](strings/properties.md)
* [Regular Expressions](strings/regexp.md)
* [StringPrep](strings/stringprep.md)
* [UnicodeSet](strings/unicodeset.md)
* [UText](strings/utext.md)
* [UTF-8](strings/utf-8.md)
* [Transforms](transforms/index.md)
* [BiDi Algorithm](transforms/bidi.md)
* [Case Mappings](transforms/casemappings.md)
* [General Transforms](transforms/general/index.md)
* [Transform Rule Tutorial](transforms/general/rules.md)
* [Normalization](transforms/normalization/index.md)
* [Normalization Examples (Obsolete)](transforms/normalization/examples.md)
* [Unicode Basics](unicode.md)
* [Use From...](usefrom/index.md)
* [How To Use ICU4C From COBOL](usefrom/cobol.md)
* [Java Native Interface (JNI)](usefrom/jni.md)

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# CharacterIterator Class
## Overview
CharacterIterator is the abstract base class that defines a protocol for
accessing characters in a text-storage object. This class has methods for
iterating forward and backward over Unicode characters to return either the
individual Unicode characters or their corresponding index values.
Using CharacterIterator, ICU can iterate over text independently of how it is
stored. The text can be stored locally or remotely in a string, file, database,
or other storage method. The CharacterIterator methods make the text appear as
if it were local.
The CharacterIterator keeps track of its current position and index in the text
and can do the following:
1. Move forward or backward one Unicode character at a time
2. Jump to a new location using absolute or relative positioning
3. Move to the beginning or end of its range
4. Return a character or the index to a character
The information can be restricted to a sub-range of characters, can contain a
large block of text that can be iterated as a whole, or can be broken into
smaller blocks for the purpose of iteration.
> :point_right: **Note**: *CharacterIterator is different from
[Normalizer](../transforms/normalization/index.md) in that CharacterIterator
walks through the Unicode characters without interpretation.*
Prior to ICU release 1.6, the CharacterIterator class allowed access to only a
single UChar at a time and did not support variable-width encodings. Single-UChar
access makes it difficult to support supplementary code points in the UTF-16
encoding. Beginning with ICU release 1.6, the CharacterIterator class
efficiently supports UTF-16 and provides new APIs for UTF-32 return values. The
API names for UTF-16 and UTF-32 access differ in that the UTF-32 APIs include
"32" in their names. For example, CharacterIterator::current() returns a code
unit while CharacterIterator::current32() returns a code point.
## Base class inherited by CharacterIterator
The
[ForwardCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classForwardCharacterIterator.html)
class is a superclass of the CharacterIterator class. This superclass provides
methods for forward iteration only, for both UTF-16 and UTF-32 access, and is
based on an efficient forward iteration mechanism. In situations where you need
to iterate over text that does not allow random access, the
ForwardCharacterIterator superclass is the most efficient method. For example,
you can iterate over a UChar string through a character converter with the
[ucnv_getNextUChar()](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html)
function.
## Subclasses of CharacterIterator provided by ICU
ICU provides the following concrete subclasses of the CharacterIterator class:
1. [UCharCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUCharCharacterIterator.html)
subclass iterates over a `UChar[]` array.
2. [StringCharacterIterator](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classStringCharacterIterator.html)
subclass extends from `UCharCharacterIterator` and iterates over the contents
of a `UnicodeString`.
## Usage
To use the methods specified in CharacterIterator class, do one of the
following:
1. Make a subclass that inherits from the CharacterIterator class
2. Use the StringCharacterIterator subclass
3. Use the UCharCharacterIterator subclass
A CharacterIterator object keeps track of its current position within the text
that is iterated over. It works like a cursor that is initialized to the
beginning of the text and advances according to the operations that are used on
the object. The current index can move between two positions (a start and a
limit) that are set with the text. The limit position is one greater than the
index of the last UChar character that is used.
### Forward iteration
For efficiency, ICU can iterate over text using post-increment semantics or
Forward Iteration. Forward Iteration is an access method that reads a character
from the current index position and moves the index forward. It leaves the index
behind the character it read and returns the character read. ICU can use
nextPostInc() or next32PostInc() calls with hasNext() to perform Forward
Iteration. These calls are the only character access methods provided by the
ForwardCharacterIterator. An iteration loop can be started with the
setToStart(), firstPostInc(), or first32PostInc() calls. (The setToStart() call
is implied after instantiating the iterator or setting the text.)
The less efficient forward iteration mechanism that is available for
compatibility with Java™ provides pre-increment semantics. With these methods,
the current character is skipped, and then the following character is read and
returned. This is a less efficient method for a variable-width encoding because
the width of each character is determined twice; once to read it and once to
skip it the next time ICU calls the method. The methods used for this
pre-increment iteration are the next() and next32() calls. An iteration loop
must start with the first() or first32() calls to get the first character.
### Backward iteration
Backward Iteration has pre-decrement semantics, which are the exact opposite of
the post-increment Forward Iteration. The current index reads the character that
precedes the index, the character is returned, and the index is left at the
beginning of this character. The methods used for Backward Iteration are the
previous() or previous32() calls, together with the hasPrevious() call. An
iteration loop can be started with the setToEnd(), last(), or last32() calls.
### Direct index manipulation
The index can be set and moved directly without iteration to start iterating at
an arbitrary position, skip some characters, or reset the index to an earlier
position. It is possible to set the index to one after the last text code unit
for backward iteration.
The setIndex() and setIndex32() calls set the index to a new position and return
the character at that new position. The setIndex32() call ensures that the new
position is at the beginning of the character (on its first code unit). Since
the character at the new position is returned, these functions can be used for
both pre-increment and post-increment iteration semantics.
Similarly, the current() and current32() calls return the character at the
current index without modifying the index. The current32() call retrieves the
complete character whether the index is on the first code unit or not.
The index and the iteration boundaries can be retrieved using separate
functions. These values always satisfy `startIndex() <= getIndex() <=
endIndex()`.
Without accessing the text, the setToStart() and setToEnd() calls set the index
to the start or to the end of the text. Therefore, these calls are efficient in
starting a forward (post-increment) or backward iteration.
The most general functions for manipulating the index position are the move()
and move32() calls. These calls allow you to move the index forward or backward
relative to its current position, to the start, or to the end of the iteration
range. The move() and move32() calls do not access the text and are best used
for skipping over part of it. The move32() call skips complete code points, like
the next32PostInc() call and other UChar32-access methods.
### Access to the iteration text
The CharacterIterator class provides the following access methods for the entire
text under iteration:
1. getText() sets a UnicodeString with the text
2. getLength() returns just the length of the text.
This text (and the length) may include more than the actual iteration area
because the start and end indexes may not be the start and end of the entire
text. The text and the iteration range are set in the implementing subclasses.
## Additional Sample Code
C/C++: See
[icu4c/source/samples/citer/](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/citer/)
in the ICU source distribution for code samples.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Strings
## Overview
This section explains how to handle Unicode strings with ICU in C and C++.
Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ustring/ustring.cpp)
.
## Text Access Overview
Strings are the most common and fundamental form of handling text in software.
Logically, and often physically, they contain contiguous arrays (vectors) of
basic units. Most of the ICU API functions work directly with simple strings,
and where possible, this is preferred.
Sometimes, text needs to be accessed via more powerful and complicated methods.
For example, text may be stored in discontiguous chunks in order to deal with
frequent modification (like typing) and large amounts, or it may not be stored
in the internal encoding, or it may have associated attributes like bold or
italic styles.
### Guidance
ICU provides multiple text access interfaces which were added over time. If
simple strings cannot be used, then consider the following:
1. [UText](utext.md): Added in ICU4C 3.4 as a technology preview. Intended to
be the strategic text access API for use with ICU. C API, high performance,
writable, supports native indexes for efficient non-UTF-16 text storage. So
far (3.4) only supported in BreakIterator. Some API changes are anticipated
for ICU 3.6.
2. Replaceable (Java & C++) and UReplaceable (C): Writable, designed for use
with Transliterator.
3. CharacterIterator (Java JDK & C++): Read-only, used in many APIs. Large
differences between the JDK and C++ versions.
4. UCharacterIterator (Java): Back-port of the C++ CharacterIterator to ICU4J
for support of supplementary code points and post-increment iteration.
5. UCharIterator (C): Read-only, C interface used mostly in incremental
normalization and collation.
The following provides some historical perspective and comparison between the
interfaces.
### CharacterIterator
ICU has long provided the CharacterIterator interface for some services. It
allows for abstract text access, but has limitations:
1. It has a per-character function call overhead.
2. Originally, it was designed for UCS-2 operation and did not support direct
handling of supplementary Unicode code points. Such support was later added.
3. Its pre-increment iteration semantics are uncommon, and are inefficient when
used with a variable-width encoding form (UTF-16). Functions for
post-increment iteration were added later.
4. The C++ version added iteration start/limit boundaries only because the C++
UnicodeString copies string contents during substringing; the Java
CharacterIterator does not have these extra boundaries because substringing is
more efficient in Java.
5. CharacterIterator is not available for use in C.
6. CharacterIterator is a read-only interface.
7. It uses UTF-16 indexes into the text, which is not efficient for other
encoding forms.
8. With the additions to the API over time, the number of methods that have to
be overridden by subclasses has become rather large.
The Java core libraries adopted an early version of CharacterIterator; later
functionality, like support for supplementary code points, was back-ported from
ICU4C to ICU4J to form the UCharacterIterator class.
The UCharIterator C interface was added to allow for incremental normalization
and collation in C. It is entirely code unit (UChar)-oriented, uses only
post-increment iteration and has a smaller number of overridable methods.
### Replaceable
The Replaceable (Java & C++) and UReplaceable (C) interfaces are designed for,
and used in, Transliterator. They are random-access interfaces, not iterators.
### UText
The [UText](utext.md) text access interface was designed as a possible
replacement for all previous interfaces listed above, with additional
functionality. It allows for high-performance operation through the use of
storage-native indexes (for efficient use of non-UTF-16 text) and through
accessing multiple characters per function call. Code point iteration is
available with functions as well as with C macros, for maximum performance.
UText is also writable, mostly patterned after Replaceable. For details see the
UText chapter.
## Strings in ICU
### Strings in Java
In Java, ICU uses the standard String and StringBuffer classes, `char[]`, etc.
See the Java documentation for details.
### Strings in C/C++
Strings in C and C++ are, at the lowest level, arrays of some particular base
type. In most cases, the base type is a char, which is an 8-bit byte in modern
compilers. Some APIs use a "wide character" type wchar_t that is typically 8,
16, or 32 bits wide and upwards compatible with char. C code passes `char *` or
wchar_t pointers to the first element of an array. C++ enables you to create a
class for encapsulating these kinds of character arrays in handy and safe
objects.
The interpretation of the byte or wchar_t values depends on the platform, the
compiler, the signed state of both char and wchar_t, and the width of wchar_t.
These characteristics are not specified in the language standards. When using
internationalized text, the encoding often uses multiple chars for most
characters and a wchar_t that is wide enough to hold exactly one character code
point value each. Some APIs, especially in the standard library (stdlib), assume
that wchar_t strings use a fixed-width encoding with exactly one character code
point per wchar_t.
### ICU: 16-bit Unicode strings
In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics. The Unicode standard defines a default encoding based on 16-bit code
units. This is supported in ICU by the definition of the UChar to be an unsigned
16-bit integer type. This is the base type for character arrays for strings in
ICU.
> :point_right: **Note**: *Endianness is not an issue on this level because the interpretation of an
integer is fixed within any given platform.*
With the UTF-16 encoding form, a single Unicode code point is encoded with
either one or two 16-bit UChar code units (unambiguously). "Supplementary" code
points, which are encoded with pairs of code units, are rare in most texts. The
two code units are called "surrogates", and their unit value ranges are distinct
from each other and from single-unit value ranges. Code should be generally
optimized for the common, single-unit case.
16-bit Unicode strings in internal processing contain sequences of 16-bit code
units that may not always be well-formed UTF-16. ICU treats single, unpaired
surrogates as surrogate code points, i.e., they are returned in per-code point
iteration, they are included in the number of code points of a string, and they
are generally treated much like normal, unassigned code points in most APIs.
Surrogate code points have Unicode properties although they cannot be assigned
an actual character.
ICU string handling functions (including append, substring, etc.) do not
automatically protect against producing malformed UTF-16 strings. Most of the
time, indexes into strings are naturally at code point boundaries because they
result from other functions that always produce such indexes. If necessary, the
user can test for proper boundaries by checking the code unit values, or adjust
arbitrary indexes to code point boundaries by using the C macros
U16_SET_CP_START() and U16_SET_CP_LIMIT() (see utf.h) and the UnicodeString
functions getChar32Start() and getChar32Limit().
UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and
convenience functions (ustring.h), but only a subset of APIs works with UTF-8
directly as string encoding form.
See the [UTF-8](utf-8.md) subpage for details about working with UTF-8. Some of
the following sections apply to UTF-8 APIs as well, for example the sections
about handling lengths and overflows.
### Separate type for single code points
A Unicode code point is an integer with a value from 0 to 0x10FFFF. ICU 2.4 and
later defines the UChar32 type for single code point values as a 32-bit-wide
signed integer (int32_t). This allows the use of easily testable negative values
as sentinels, to indicate errors, exceptions or "done" conditions. All negative
values and positive values greater than 0x10FFFF are illegal as Unicode code
points.
ICU 2.2 and earlier defined UChar32 depending on the platform: If the compiler's
wchar_t was 32 bits wide, then UChar32 was defined to be the same as wchar_t.
Otherwise, it was defined to be an unsigned 32-bit integer. This means that
UChar32 was either a signed or unsigned integer type depending on the compiler.
This was meant for better interoperability with existing libraries, but was of
little use because ICU does not process 32-bit strings — UChar32 is only used
for single code points. The platform dependence of UChar32 could cause problems
with C++ function overloading.
### Compiler-dependent definitions
The compiler's and the runtime character set's codepage encodings are not
specified by the C/C++ language standards and are usually not a Unicode encoding
form. They typically depend on the settings of the individual system, process,
or thread. Therefore, it is not possible to instantiate a Unicode character or
string variable directly with C/C++ character or string literals. The only safe
way is to use numeric values. It is not an issue for User Interface (UI) strings
that are translated. These UI strings are loaded from a resource bundle, which
is generated from a text file that can be in Unicode or in any other
ICU-provided codepage. The genrb tool compiles these into a binary form with
UTF-16 strings that are ready for direct use.
There is a useful exception to this for program-internal strings and test
strings. Within each "family" of character encodings, there is a set of
characters, the invariant characters, that have the same numeric code values.
They include the Latin letters, the basic digits, the space, and some
punctuation. Most of the ASCII graphic characters are invariant characters. The
same set, with different but again consistent numeric values, is invariant
among almost all EBCDIC codepages.
For details, see
[icu4c/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html)
. With strings that contain only these invariant characters, it is possible to
use efficient ICU constructs to write a C/C++ string literal and use it to
initialize Unicode strings.
In some APIs, ICU uses `char *` strings. This is either for file system paths or
for strings that contain invariant characters only (such as locale identifiers).
These strings are in the platform-specific encoding of either ASCII or EBCDIC.
All other codepage differences do not matter for invariant characters and are
manipulated by the C stdlib functions like strcpy().
In some APIs where identifiers are used, ICU uses `char *` strings with invariant
characters. Such strings do not require the full Unicode repertoire and are
easier to handle in C and C++ with `char *` string literals and standard C
library functions. Their useful character repertoire is actually smaller than
the set of graphic ASCII characters; for details, see
[utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html) . Examples of
`char *` identifier uses are converter names, locale IDs, and resource bundle
table keys.
There is another, less efficient way to have human-readable Unicode string
literals in C and C++ code. ICU provides a small number of functions that allow
any Unicode characters to be inserted into a string with escape sequences
similar to the one that is used in the C and C++ language. In addition to the
familiar \\n and \\xhh etc., ICU also provides the \\uhhhh syntax with four hex
digits and the \\Uhhhhhhhh syntax with eight hex digits for hexadecimal Unicode
code point values. This is very similar to the newer escape sequences used in
Java and defined in the latest C and C++ standards. Since ICU is not a compiler
extension, the "unescaping" is done at runtime and the backslash itself must be
escaped (duplicated) so that the compiler does not attempt to "unescape" the
sequence itself.
## Handling Lengths, Indexes, and Offsets in Strings
The length of a string and all indexes and offsets related to the string are
always counted in terms of UChar code units, not in terms of UChar32 code
points. (This is the same as in common C library functions that use `char *`
strings with multi-byte encodings.)
Often, a user thinks of a "character" as a complete unit in a language, like an
'Ä', while it may be represented with multiple Unicode code points including a
base character and combining marks. (See the Unicode standard for details.) This
often requires users to index and pass strings (UnicodeString or `UChar *`) with
multiple code units or code points. It cannot be done with single-integer
character types. Indexing of such "characters" is done with the BreakIterator
class (in C: ubrk_ functions).
Even with such "higher-level" indexing functions, the actual index values will
be expressed in terms of UChar code units. When more than one code unit is used
at a time, the index value changes by more than one at a time.
ICU uses signed 32-bit integers (int32_t) for lengths and offsets. Because of
internal computations, strings (and arrays in general) are limited to 1G base
units or 2G bytes, whichever is smaller.
## Using C Strings: NUL-Terminated vs. Length Parameters
Strings are either terminated with a NUL character (code point 0, U+0000) or
their length is specified. In the latter case, it is possible to have one or
more NUL characters inside the string.
**Input string** arguments are typically passed with two parameters: the
(const) `UChar *` pointer and an int32_t length argument. If the length is -1,
then the string must be NUL-terminated and the ICU function will call the
u_strlen() method or treat it equivalently. If the input string contains
embedded NUL characters, then the length must be specified.
**Output string** arguments are typically passed with a destination `UChar *`
pointer and an int32_t capacity argument, and the function returns the length of
the output as an int32_t. There is also almost always a UErrorCode argument.
Essentially, a `UChar[]` array is passed in with its start and the number of
available UChars. The array is filled with the output and if space permits the
output will be NUL-terminated. The length of the output string is returned. In
all cases the length of the output string does not include the terminating NUL.
This is the same behavior found in most ICU and non-ICU string APIs, for example
u_strlen(). The output string may **contain** NUL characters as part of its
actual contents, depending on the input and the operation. Note that the
UErrorCode parameter is used to indicate both errors and warnings (non-errors).
The following describes some of the situations in which the UErrorCode will be
set to a non-zero value:
1. If the output length is greater than the output array capacity, then the
UErrorCode will be set to U_BUFFER_OVERFLOW_ERROR and the contents of the
output array is undefined.
2. If the output length is equal to the capacity, then the output has been
completely written minus the terminating NUL. This is also indicated by
setting the UErrorCode to U_STRING_NOT_TERMINATED_WARNING.
Note that U_STRING_NOT_TERMINATED_WARNING does not indicate failure (it
passes the U_SUCCESS() macro).
Note also that it is more reliable to check the output length against the
capacity, rather than checking for the warning code, because warning codes
do not cause the early termination of a function and may subsequently be
overwritten.
3. If neither of these two conditions apply, the error code will indicate
success and not a U_STRING_NOT_TERMINATED_WARNING. (If a
U_STRING_NOT_TERMINATED_WARNING code had been set in the UErrorCode
parameter before the function call, then it is reset to a U_ZERO_ERROR.)
**Preflighting:** The returned length is always the full output length, even if
the output buffer is too small. It is possible to pass in a capacity of 0 (and
an output array pointer of NULL) for "pure preflighting" to determine the
necessary output buffer size. Add one to make the output string NUL-terminated.
Note that — whether the caller intends to "preflight" or not — if the output
length is equal to or greater than the capacity, then the UErrorCode is set to
U_STRING_NOT_TERMINATED_WARNING or U_BUFFER_OVERFLOW_ERROR respectively, as
described above.
However, "pure preflighting" is very expensive because the operation has to be
processed twice — once for calculating the output length, and a second time to
actually generate the output. It is much more efficient to always provide an
output buffer that is expected to be large enough for most cases, and to
reallocate and repeat the operation only when an overflow occurred. (Remember to
reset the UErrorCode to U_ZERO_ERROR before calling the function again.) In
C/C++, the initial output buffer can be a stack buffer. In case of a
reallocation, it may be possible and useful to cache and reuse the new, larger
buffer.
> :point_right: **Note**: *The exceptions to these rules are the ANSI-C-style
functions like u_strcpy(), which generally require NUL-terminated strings,
forbid embedded NULs, and do not take capacity arguments for buffer overflow
checking.*
## Using Unicode Strings in C
In C, Unicode strings are similar to standard `char *` strings. Unicode strings
are arrays of UChar and most APIs take a `UChar *` pointer to the first element
and an input length and/or output capacity, see above. ICU has a number of
functions that provide the Unicode equivalent of the stdlib functions such as
strcpy(), strstr(), etc. Their names match their C standard counterparts but
begin with the u_ prefix; otherwise, their semantics are equivalent. These
functions are defined in icu/source/common/unicode/ustring.h.
### Code Point Access
Sometimes, Unicode code points need to be accessed in C for iteration, movement
forward, or movement backward in a string. A string might also need to be
written from code point values. ICU provides a number of macros that are
defined in the icu/source/common/unicode/utf.h and utf8.h/utf16.h headers that
it includes (utf.h is in turn included with utypes.h).
Macros for 16-bit Unicode strings have a U16_ prefix. For example:

```c
U16_NEXT(s, i, length, c)
U16_PREV(s, start, i, c)
U16_APPEND(s, i, length, c, isError)
```
There are also macros with a U_ prefix for code point range checks (e.g., test
for non-character code point), and U8_ macros for 8-bit (UTF-8) strings. See the
header files and the API References for more details.
#### UTF Macros before ICU 2.4
In ICU 2.4, the utf\*.h macros have been revamped, improved, simplified, and
renamed. The old macros continue to be available. They are in utf_old.h,
together with an explanation of the change. utf.h, utf8.h and utf16.h contain
the new macros instead. The new macros are intended to be more consistent, more
useful, and less confusing. Some macros were simply renamed for consistency with
a new naming scheme.
The documentation of the old macros has been removed. If you need it, see a User
Guide version from ICU 4.2 or earlier (see the [download
page](http://site.icu-project.org/download)).
### C Unicode String Literals
There is a pair of macros that together enable users to instantiate a Unicode
string in C (a `UChar []` array) from a C string literal:

```c
/*
 * In C, we need two macros: one to declare the UChar[] array, and
 * one to populate it; the second one is a no-op on platforms where
 * wchar_t is compatible with UChar and ASCII-based.
 * The length of the string literal must be counted for both macros.
 */
/* declare the invString array for the string */
U_STRING_DECL(invString, "such characters are safe 123 %-.", 32);
/* populate it with the characters */
U_STRING_INIT(invString, "such characters are safe 123 %-.", 32);
```
With invariant characters, it is also possible to efficiently convert `char *`
strings to and from UChar strings:

```c
static const char *cs1="such characters are safe 123 %-.";
static UChar us1[40];
static char cs2[40];

u_charsToUChars(cs1, us1, 33); /* include the terminating NUL */
u_UCharsToChars(us1, cs2, 33);
```
## Testing for well-formed UTF-16 strings
It is sometimes useful to test if a 16-bit Unicode string is well-formed UTF-16,
that is, that it does not contain unpaired surrogate code units. For a boolean
test, call a function like u_strToUTF8() which sets an error code if the input
string is malformed. (Provide a zero-capacity destination buffer and treat the
buffer overflow error as "is well-formed".) If you need to know the position of
the unpaired surrogate, you can iterate through the string with U16_NEXT() and
U_IS_SURROGATE().
## Using Unicode Strings in C++
[UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUnicodeString.html) is
a C++ string class that wraps a UChar array and associated bookkeeping. It
provides a rich set of string handling functions.
UnicodeString combines elements of both the Java String and StringBuffer
classes. Many UnicodeString functions are named like Java String methods and
work similarly, but they modify the object (UnicodeString is "mutable").
UnicodeString provides functions for random access and use (insert/append/find
etc.) of both code units and code points. For each non-iterative string/code
point macro in utf.h there is at least one UnicodeString member function. The
names of most of these functions contain "32" to indicate the use of a UChar32.
Code point and code unit iteration is provided by the
[CharacterIterator](characteriterator.md) abstract class and its subclasses.
There are concrete iterator implementations for UnicodeString objects and plain
`UChar []` arrays.
Most UnicodeString constructors and functions do not have a UErrorCode
parameter. Instead, if the construction of a UnicodeString fails, for example
when it is constructed from a NULL `UChar *` pointer, then the UnicodeString
object becomes "bogus". This can be tested with the isBogus() function. A
UnicodeString can be put into the "bogus" state explicitly with the setToBogus()
function. This is different from an empty string (although a "bogus" string also
returns TRUE from isEmpty()) and may be used equivalently to NULL in `UChar *` C
APIs (or null references in Java, or NULL values in SQL). A string remains
"bogus" until a non-bogus string value is assigned to it. For complete details
of the behavior of "bogus" strings see the description of the setToBogus()
function.
Some APIs work with the
[Replaceable](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classReplaceable.html)
abstract class. It defines a simple interface for random access and text
modification and is useful for operations on text that may have associated
meta-data (e.g., styled text), especially in the Transliterator API.
UnicodeString implements Replaceable.
### C++ Unicode String Literals
Like in C, there are macros that enable users to instantiate a UnicodeString
from a C string literal. One macro requires the length of the string as in the C
macros, the other one implies a strlen().

```c++
UnicodeString s1=UNICODE_STRING("such characters are safe 123 %-.", 32);
UnicodeString s2=UNICODE_STRING_SIMPLE("such characters are safe 123 %-.");
```
It is possible to efficiently convert between invariant-character strings and
UnicodeStrings by using constructor, setTo() or extract() overloads that take
codepage data (`const char *`) and specifying an empty string ("") as the
codepage name.
## Using C++ Strings in C APIs
The internal buffer of UnicodeString objects is available for direct handling in
C (or C-style) APIs that take `UChar *` arguments. It is possible but usually not
necessary to copy the string contents with one of the extract functions. The
following describes several direct buffer access methods.
The UnicodeString function getBuffer() const returns a readonly const `UChar *`.
The length of the string is indicated by UnicodeString's length() function.
Generally, UnicodeString does not NUL-terminate the contents of its internal
buffer. However, it is possible to check for a NUL character if the length of
the string is less than the capacity of the buffer. The following code is an
example of how to check for such a NUL character:
`(s.length()<s.getCapacity() && buffer[s.length()]==0)`
An easier way to NUL-terminate the buffer and get a `const UChar *` pointer to it
is the getTerminatedBuffer() function. Unlike getBuffer() const,
getTerminatedBuffer() is not a const function because it may have to (reallocate
and) modify the buffer to append a terminating NUL. Therefore, use getBuffer()
const if you do not need a NUL-terminated buffer.
There is also a pair of functions that allow controlled write access to the
buffer of a UnicodeString: `UChar *getBuffer(int32_t minCapacity)` and
`releaseBuffer(int32_t newLength)`. `UChar *getBuffer(int32_t minCapacity)`
provides a writeable buffer of at least the requested capacity and returns a
pointer to it. The actual capacity of the buffer after the
`getBuffer(minCapacity)` call may be larger than the requested capacity and can be
determined with `getCapacity()`.
Once the buffer contents are modified, the buffer must be released with the
`releaseBuffer(int32_t newLength)` function, which sets the new length of the
UnicodeString (newLength=-1 can be passed to determine the length of
NUL-terminated contents like `u_strlen()`).
Between the `getBuffer(minCapacity)` and `releaseBuffer(newLength)` function calls,
the contents of the UnicodeString are unknown and the object behaves as if it
contained an empty string. A nested `getBuffer(minCapacity)`, `getBuffer() const` or
`getTerminatedBuffer()` will fail (return NULL) and modifications of the string
via UnicodeString member functions will have no effect. Copying a string with an
"open buffer" yields an empty copy. The move constructor, move assignment
operator and Return Value Optimization (RVO) transfer the state, including the
open buffer.
See the UnicodeString API documentation for more information.
## Using C Strings in C++ APIs
There are efficient ways to wrap C-style strings in C++ UnicodeString objects
without copying the string contents. In order to use C strings in C++ APIs, the
`UChar *` pointer and length need to be wrapped into a UnicodeString. This can be
done efficiently in two ways: With a readonly alias and a writable alias. The
UnicodeString object that is constructed actually uses the `UChar *` pointer as
its internal buffer pointer instead of allocating a new buffer and copying the
string contents.
If the original string is a readonly `const UChar *`, then the UnicodeString must
be constructed with a read only alias. If the original string is a writable
(non-const) `UChar *` and is to be modified (e.g., if the `UChar *` buffer is an
output buffer) then the UnicodeString should be constructed with a writeable
alias. For more details see the section "Maximizing Performance with the
UnicodeString Storage Model" and search the unistr.h header file for "alias".
## Maximizing Performance with the UnicodeString Storage Model
UnicodeString uses four storage methods to maximize performance and minimize
memory consumption:
1. Short strings are normally stored inside the UnicodeString object. The
object has fields for the "bookkeeping" and a small UChar array. When the
object is copied, the internal characters are copied into the destination
object.
2. Longer strings are normally stored in allocated memory. The allocated UChar
array is preceded by a reference counter. When the string object is copied,
the allocated buffer is shared by incrementing the reference counter. If any
of the objects that share the same string buffer are modified, they receive
their own copy of the buffer and decrement the reference counter of the
previously co-used buffer.
3. A UnicodeString can be constructed (or set with a setTo() function) so that
it aliases a readonly buffer instead of copying the characters. In this
case, the string object uses this aliased buffer for as long as the object
is not modified and it will never attempt to modify or release the buffer.
This model has copy-on-write semantics. For example, when the string object
is modified, the buffer contents are first copied into writable memory
(inside the object for short strings or the allocated buffer for longer
strings). When a UnicodeString with a readonly setting is copied to another
UnicodeString using the fastCopyFrom() function, then both string objects
share the same readonly setting and point to the same storage. Copying a
string with the normal assignment operator or copy constructor will copy the
buffer. This prevents accidental misuse of readonly-aliased strings. (This
is new in ICU 2.4; earlier, the assignment operator and copy constructor
behaved like the new fastCopyFrom() does now.)
**Important:**
1. The aliased buffer must remain valid for as long as any UnicodeString
object aliases it. This includes unmodified fastCopyFrom() and
`movedFrom()` copies of the object (including moves via the move
constructor and move assignment operator), and when the compiler uses
Return Value Optimization (RVO) where a function returns a UnicodeString
by value.
2. Be prepared for return-by-value to either make a copy (which does not
preserve aliasing), or move the value or use RVO (which do preserve
aliasing).
3. It is an error to readonly-alias temporary buffers and then pass the
resulting UnicodeString objects (or references/pointers to them) to APIs
that store them for longer than the buffers are valid.
4. If it is necessary to make sure that a string is not a readonly alias,
then use any modifying function without actually changing the contents
(for example, s.setCharAt(0, s.charAt(0))).
5. In ICU 2.4 and later, a simple assignment or copy construction will also
copy the buffer.
4. A UnicodeString can be constructed (or set with a setTo() function) so that
it aliases a writable buffer instead of copying the characters. The
difference from the above is that the string object writes through to this
aliased buffer for write operations. A new buffer is allocated and the
contents are copied only when the capacity of the buffer is not sufficient.
An efficient way to get the string contents into the original buffer is to
use the `extract(..., UChar *dst, ...)` function.
The `extract(..., UChar *dst, ...)` function copies the string contents only if the dst buffer is
different from the buffer of the string object itself. If a string grows and
shrinks during a sequence of operations, then it will not use the same
buffer, even if the string would fit. When a UnicodeString with a writeable
alias is assigned to another UnicodeString, the contents are always copied.
The destination string will not point to the buffer that the source string
aliases point to. However, a move constructor, move assignment operator, and
Return Value Optimization (RVO) do preserve aliasing.
In general, UnicodeString objects have "copy-on-write" semantics. Several
objects may share the same string buffer, but a modification only affects the
object that is modified itself. This is achieved by copying the string contents
if it is not owned exclusively by this one object. Only after that is the object
modified.
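The sharing and copy-on-write behavior of storage method 2 can be illustrated with a toy reference-counted string class. This is a char-based, greatly simplified model of the idea only; UnicodeString's actual implementation differs in many details (and, unlike this sketch, is designed with thread safety in mind):

```cpp
#include <cstddef>
#include <cstring>

// Toy copy-on-write string: copies share one reference-counted buffer,
// and a writer clones the buffer only when it does not own it exclusively.
class CowString {
 public:
    explicit CowString(const char *s) : rep_(new Rep(s)) {}
    CowString(const CowString &other) : rep_(other.rep_) { ++rep_->refs; }
    ~CowString() { release(); }
    CowString &operator=(const CowString &other) {
        if (rep_ != other.rep_) {
            release();
            rep_ = other.rep_;
            ++rep_->refs;
        }
        return *this;
    }
    void setCharAt(std::size_t i, char c) {
        if (rep_->refs > 1) {          // shared buffer: copy before writing
            Rep *own = new Rep(rep_->data);
            --rep_->refs;
            rep_ = own;
        }
        rep_->data[i] = c;
    }
    const char *c_str() const { return rep_->data; }
    int refCount() const { return rep_->refs; }

 private:
    struct Rep {                       // allocated buffer + reference counter
        explicit Rep(const char *s)
            : refs(1), data(new char[std::strlen(s) + 1]) {
            std::strcpy(data, s);
        }
        ~Rep() { delete[] data; }
        int refs;
        char *data;
    };
    void release() {
        if (--rep_->refs == 0) delete rep_;
    }
    Rep *rep_;
};
```

After a copy, both objects report a reference count of 2; the first modification through either object triggers the clone, leaving the other object's contents untouched.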
Even though it is fairly efficient to copy UnicodeString objects, it is even
more efficient, if possible, to work with references or pointers. Functions that
output strings can be faster by appending their results to a UnicodeString that
is passed in by reference, compared with returning a UnicodeString object or
just setting the local results alone into a string reference.
> :point_right: **Note**: *UnicodeStrings can be copied in a thread-safe manner by just using their
standard copy constructors and assignment operators. fastCopyFrom() is also
thread-safe, but if the original string is a readonly alias, then the copy
shares the same aliased buffer.*
## Using UTF-8 strings with ICU
As mentioned in the overview of this chapter, ICU and most other
Unicode-supporting software uses 16-bit Unicode for internal processing.
However, there are circumstances where UTF-8 is used instead. This is usually
the case for software that does little or no processing of non-ASCII characters,
and/or for APIs that predate Unicode, use byte-based strings, and cannot be
changed or replaced for various reasons.
A common perception is that UTF-8 has an advantage because it was designed for
compatibility with byte-based, ASCII-based systems, although it was designed for
string storage (of Unicode characters in Unix file names) rather than for
processing performance.
While ICU mostly does not natively use UTF-8 strings, there are many ways to
work with UTF-8 strings and ICU. For more information see the newer
[UTF-8](utf-8.md) subpage.
## Using UTF-32 strings with ICU
It is even rarer to use UTF-32 for string processing than UTF-8. While 32-bit
Unicode is convenient because it is the only fixed-width UTF, there are few or
no legacy systems with 32-bit string processing that would benefit from a
compatible format, and the memory bandwidth requirements of UTF-32 diminish the
performance and handling advantage of the fixed-width format.
Over time, the wchar_t type of some C/C++ compilers became a 32-bit integer, and
some C libraries do use it for Unicode processing. However, application software
with good Unicode support tends to have little use for the rudimentary Unicode
and Internationalization support of the standard C/C++ libraries and often uses
custom types (like ICU's) and UTF-16 or UTF-8.
For those systems where 32-bit Unicode strings are used, ICU offers some
convenience functions.
1. Conversion of whole strings: u_strFromUTF32() and u_strToUTF32() in
ustring.h.
2. Access to code points is trivial and does not require any macros.
3. Using a UTF-32 converter with all of the ICU conversion APIs in ucnv.h,
including ones with an "Algorithmic" suffix.
4. UnicodeString has `fromUTF32()` and `toUTF32()` methods.
5. For conversion directly between UTF-32 and another charset use
ucnv_convertEx(). However, since ICU converters work with byte streams in
external charsets on the non-"Unicode" side, the UTF-32 string will be
treated as a byte stream (UTF-32 Character Encoding *Scheme*) rather than a
sequence of 32-bit code units (UTF-32 Character Encoding *Form*). The
correct converter must be used: UTF-32BE or UTF-32LE according to the
platform endianness (U_IS_BIG_ENDIAN). Treating the string like a byte
stream also makes a difference in data types (`char *`), lengths and indexes
(counting bytes), and NUL-termination handling (input NUL-termination not
possible, output writes only a NUL byte, not a NUL 32-bit code unit). For
the difference between internal encoding forms and external encoding schemes
see the Unicode Standard.
6. Some ICU APIs work with a CharacterIterator, a UText or a UCharIterator
instead of directly with a C/C++ string parameter. There is currently no ICU
instance of any of these interfaces that reads UTF-32, although an
application could provide one.
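For illustration, the core of the whole-string conversion in item 1 is a short loop. This is a sketch only: unlike u_strFromUTF32(), it performs no validation of surrogate or out-of-range code points and no capacity or error checking:

```cpp
#include <stdint.h>

/* Convert UTF-32 code points to UTF-16 code units. Returns the number
 * of 16-bit units written; the caller must provide room for the worst
 * case of 2 units per code point. Sketch only: no input validation,
 * unlike u_strFromUTF32(). */
int32_t utf32ToUtf16(const uint32_t *src, int32_t srcLength, uint16_t *dst) {
    int32_t j = 0;
    for (int32_t i = 0; i < srcLength; ++i) {
        uint32_t c = src[i];
        if (c <= 0xFFFF) {
            dst[j++] = (uint16_t)c;                      /* BMP: one unit */
        } else {                                         /* supplementary */
            c -= 0x10000;
            dst[j++] = (uint16_t)(0xD800 + (c >> 10));   /* lead surrogate */
            dst[j++] = (uint16_t)(0xDC00 + (c & 0x3FF)); /* trail surrogate */
        }
    }
    return j;
}
```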
## Changes in ICU 2.0
Beginning with ICU release 2.0, there are a few changes to the ICU string
facilities compared with earlier ICU releases.
Some of the NUL-termination behavior was inconsistent across the ICU API
functions. In particular, the following functions used to count the terminating
NUL character in their output length (counted one more before ICU 2.0 than now):
ucnv_toUChars, ucnv_fromUChars, uloc_getLanguage, uloc_getCountry,
uloc_getVariant, uloc_getName, uloc_getDisplayLanguage, uloc_getDisplayCountry,
uloc_getDisplayVariant, uloc_getDisplayName
Some functions used to set an overflow error code even when only the terminating
NUL did not fit into the output buffer. These functions now set UErrorCode to
U_STRING_NOT_TERMINATED_WARNING rather than to U_BUFFER_OVERFLOW_ERROR.
The aliasing UnicodeString constructors and most extract functions have existed
for several releases prior to ICU 2.0. There is now an additional extract
function with a UErrorCode parameter. Also, the getBuffer, releaseBuffer and
getCapacity functions are new to ICU 2.0.
For more information about these changes, please consult the old and new API
documentation.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Properties
## Overview
Text processing requires that a program treat text appropriately. If text is
exchanged between several systems, it is important for them to process the text
consistently. This is done by assigning each character, or a range of
characters, attributes or properties used for text processing, and by defining
standard algorithms for at least the basic text operations.
Traditionally, such attributes and algorithms have not been well-defined for
most character sets, and text processing had to rely on ad-hoc solutions. Over
time, standards were created for querying properties of the system codepage.
However, the set of these properties was limited. Their data was not coordinated
among implementations, and standard algorithms were not available.
It is one of the strengths of Unicode that it not only defines a very large
character set, but also assigns a comprehensive set of properties and usage
notes to all characters. It defines standard algorithms for critical text
processing, and the data is publicly provided and kept up-to-date. See
https://www.unicode.org/ and https://www.unicode.org/main.html for more information.
Sample code is available in the ICU source code library at
[icu4c/source/samples/props/props.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/props/props.cpp).
See also the source code for the [Unicode
browser](https://github.com/unicode-org/icu-demos/tree/master/ubrowse) demo
application, which can be used
[online](http://demo.icu-project.org/icu-bin/ubrowse) to browse Unicode
characters with their properties.
## Unicode Character Database properties in ICU APIs
The following table shows all Unicode Character Database properties (except for
purely "extracted" ones and Unihan properties) and the corresponding ICU APIs.
Most of the time, ICU4C provides functions in
icu4c/source/common/unicode/uchar.h and ICU4J provides parallel functions in the
com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are
accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in
Java).
[Surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
mostly have default property values, except for the General_Category (gc=Cs).
For integer values outside the Unicode code point range (negative or ≥
0x110000), most API functions return null values (false, 0, etc.). API functions
that map a code point to another (e.g., u_foldCase()/UCharacter.foldCase())
normally return out-of-range values (i.e., map them to themselves), just like
for unassigned code points or generally code points that have no specific
mappings. In particular, -1 (=U_SENTINEL in ICU4C) is mapped to -1.
Most properties are also available via UnicodeSet APIs and patterns. See the
Lookup section below.
See [UAX #44, Unicode Character
Database](https://www.unicode.org/reports/tr44/#Properties) itself for
comparison. The UCD files
[PropertyAliases.txt](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
and
[PropertyValueAliases.txt](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
list all properties and their values by name and type.
UAX #44 also shows which UCD files have data for which properties,
and many other useful details.
Most properties that use binary, integer, or enumerated values are available via
functions u_hasBinaryProperty and u_getIntPropertyValue which take UProperty
enum constants to select the property. (ICU4J UCharacter member functions do not
have the "u_" prefix.) The constant names include the long property name
according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property
value enum constant names often contain the short property name and the long
value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the
enumeration result type is also listed here.
Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and
UnicodeSet and regular expression patterns use the long or short property
aliases and property value aliases (see PropertyAliases.txt and
PropertyValueAliases.txt).
There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK for which the APIs do
not use a single value but a bit-set (a mask) of zero or more values, with each
bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to
represent property value aliases for multiple general categories, like "Letters"
(which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other
words, there are two ICU properties for the same Unicode property, one
delivering single values (for per-code point lookup) and the other delivering
sets of values (for use with value aliases and UnicodeSet).
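As a sketch of how such a mask works: each general category value owns one bit, and a multi-category alias is the OR of several bits. The constant names below are illustrative stand-ins, not ICU's actual UCharCategory/U_GC_*_MASK constants, although those follow the same value-to-bit pattern:

```cpp
#include <stdint.h>

/* Illustrative category values; ICU's UCharCategory enum and
 * U_GC_*_MASK constants follow the same pattern. */
enum {
    CAT_UPPERCASE_LETTER = 1,
    CAT_LOWERCASE_LETTER = 2,
    CAT_DECIMAL_NUMBER   = 9
};
#define CAT_MASK(cat) (UINT32_C(1) << (cat))
#define LETTER_MASK \
    (CAT_MASK(CAT_UPPERCASE_LETTER) | CAT_MASK(CAT_LOWERCASE_LETTER))

/* Per-code point lookup yields one category value; membership in a
 * set of categories is then a single bit test against the mask. */
bool isInCategorySet(int category, uint32_t mask) {
    return (CAT_MASK(category) & mask) != 0;
}
```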
| UCD Name | Type | | ICU4C uchar.h / ICU4J UCharacter |
|--------------|--------|-----|------------------------------|
| Age | Unicode version | (U) | C: u_charAge fills in UVersionInfo<br>Java: getAge returns a VersionInfo reference |
| Alphabetic | binary | (U) | u_isUAlphabetic, UCHAR_ALPHABETIC |
| ASCII_Hex_Digit | binary | (U) | UCHAR_ASCII_HEX_DIGIT |
| Bidi_Class | enum | (U) | u_charDirection, UCHAR_BIDI_CLASS<br>returns enum UCharDirection |
| Bidi_Control | binary | (U) | UCHAR_BIDI_CONTROL |
| Bidi_Mirrored | binary | (U) | u_isMirrored, UCHAR_BIDI_MIRRORED |
| Bidi_Mirroring_Glyph | code point | | u_charMirror |
| Block | enum | (U) | ublock_getCode, UCHAR_BLOCK<br>returns enum UBlockCode |
| Canonical_Combining_Class | 0..255 | (U) | u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS |
| Case_Folding | Unicode string | | u_strFoldCase (ustring.h) |
| Case_Ignorable | binary | (U) | UCHAR_CASE_IGNORABLE |
| Cased | binary | (U) | UCHAR_CASED |
| Changes_When_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_CASEFOLDED |
| Changes_When_Casemapped | binary | (U) | UCHAR_CHANGES_WHEN_CASEMAPPED |
| Changes_When_NFKC_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED |
| Changes_When_Lowercased | binary | (U) | UCHAR_CHANGES_WHEN_LOWERCASED |
| Changes_When_Titlecased | binary | (U) | UCHAR_CHANGES_WHEN_TITLECASED |
| Changes_When_Uppercased | binary | (U) | UCHAR_CHANGES_WHEN_UPPERCASED |
| Composition_Exclusion | binary | (c) | contributes to Full_Composition_Exclusion |
| Dash | binary | (U) | UCHAR_DASH |
| Decomposition_Mapping | Unicode string | | NFKC Normalizer2::getRawDecomposition() |
| Decomposition_Type | enum | (U) | UCHAR_DECOMPOSITION_TYPE<br>returns enum UDecompositionType |
| Default_Ignorable_Code_Point | binary | (U) | UCHAR_DEFAULT_IGNORABLE_CODE_POINT |
| Deprecated | binary | (U) | UCHAR_DEPRECATED |
| Diacritic | binary | (U) | UCHAR_DIACRITIC |
| East_Asian_Width | enum | (U) | UCHAR_EAST_ASIAN_WIDTH<br>returns enum UEastAsianWidth |
| Expands_On_NF* | binary | | available via normalization API (normalizer2.h) |
| Extender | binary | (U) | UCHAR_EXTENDER |
| FC_NFKC_Closure | Unicode string | | u_getFC_NFKC_Closure |
| Full_Composition_Exclusion | binary | (U) | UCHAR_FULL_COMPOSITION_EXCLUSION |
| General_Category | enum | (U) | u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK<br>returns enum UCharCategory |
| Grapheme_Base | binary | (U) | UCHAR_GRAPHEME_BASE |
| Grapheme_Cluster_Break | enum | (U) | UCHAR_GRAPHEME_CLUSTER_BREAK<br>returns enum UGraphemeClusterBreak |
| Grapheme_Extend | binary | (U) | UCHAR_GRAPHEME_EXTEND |
| Grapheme_Link | binary | (U) | UCHAR_GRAPHEME_LINK |
| Hangul_Syllable_Type | enum | (U) | UCHAR_HANGUL_SYLLABLE_TYPE<br>returns enum UHangulSyllableType |
| Hex_Digit | binary | (U) | UCHAR_HEX_DIGIT |
| Hyphen | binary | (U) | UCHAR_HYPHEN |
| ID_Continue | binary | (U) | UCHAR_ID_CONTINUE |
| ID_Start | binary | (U) | UCHAR_ID_START |
| Ideographic | binary | (U) | UCHAR_IDEOGRAPHIC |
| IDS_Binary_Operator | binary | (U) | UCHAR_IDS_BINARY_OPERATOR |
| IDS_Trinary_Operator | binary | (U) | UCHAR_IDS_TRINARY_OPERATOR |
| Indic_Positional_Category | enum | (U) | UCHAR_INDIC_POSITIONAL_CATEGORY<br>returns enum UIndicPositionalCategory |
| Indic_Syllabic_Category | enum | (U) | UCHAR_INDIC_SYLLABIC_CATEGORY<br>returns enum UIndicSyllabicCategory |
| ISO_Comment | ASCII string | | u_getISOComment |
| Jamo_Short_Name | ASCII string | (c) | contributes to Name |
| Join_Control | binary | (U) | UCHAR_JOIN_CONTROL |
| Joining_Group | enum | (U) | UCHAR_JOINING_GROUP<br>returns enum UJoiningGroup |
| Joining_Type | enum | (U) | UCHAR_JOINING_TYPE<br>returns enum UJoiningType |
| Line_Break | enum | (U) | UCHAR_LINE_BREAK<br>returns enum ULineBreak |
| Logical_Order_Exception | binary | (U) | UCHAR_LOGICAL_ORDER_EXCEPTION |
| Lowercase | binary | (U) | u_isULowercase, UCHAR_LOWERCASE |
| Lowercase_Mapping | Unicode string | | available via u_strToLower (ustring.h) |
| Math | binary | (U) | UCHAR_MATH |
| Name | ASCII string | (U) | u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) |
| Name_Alias | ASCII string | | u_charName(U_CHAR_NAME_ALIAS) |
| NF*_QuickCheck | enum | (U) | UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h)<br>returns UNormalizationCheckResult (no/maybe/yes) |
| NFKC_Casefold | Unicode string | | available via normalization API (normalizer2.h "nfkc_cf") |
| Noncharacter_Code_Point | binary | (U) | UCHAR_NONCHARACTER_CODE_POINT, <br /> U_IS_UNICODE_NONCHAR (utf.h) |
| Numeric_Type | enum | (U) | UCHAR_NUMERIC_TYPE<br>returns enum UNumericType |
| Numeric_Value | double | (U) | u_getNumericValue<br>Java/UnicodeSet: only non-negative integers, no fractions |
| Other_Alphabetic | binary | (c) | contributes to Alphabetic |
| Other_Default_Ignorable_Code_Point | binary | (c) | contributes to Default_Ignorable_Code_Point |
| Other_Grapheme_Extend | binary | (c) | contributes to Grapheme_Extend |
| Other_Lowercase | binary | (c) | contributes to Lowercase |
| Other_Math | binary | (c) | contributes to Math |
| Other_Uppercase | binary | (c) | contributes to Uppercase |
| Pattern_Syntax | binary | (U) | UCHAR_PATTERN_SYNTAX |
| Pattern_White_Space | binary | (U) | UCHAR_PATTERN_WHITE_SPACE |
| Quotation_Mark | binary | (U) | UCHAR_QUOTATION_MARK |
| Radical | binary | (U) | UCHAR_RADICAL |
| Script | enum | (U) | uscript_getCode (uscript.h), UCHAR_SCRIPT<br>returns enum UScriptCode |
| Script_Extensions | list | (U) | uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS<br>returns a list of enum UScriptCode values |
| Sentence_Break | enum | (U) | UCHAR_SENTENCE_BREAK<br>returns enum USentenceBreak |
| Simple_Case_Folding | code point | | u_foldCase |
| Simple_Lowercase_Mapping | code point | | u_tolower |
| Simple_Titlecase_Mapping | code point | | u_totitle |
| Simple_Uppercase_Mapping | code point | | u_toupper |
| Soft_Dotted | binary | (U) | UCHAR_SOFT_DOTTED |
| STerm | binary | (U) | UCHAR_S_TERM |
| Terminal_Punctuation | binary | (U) | UCHAR_TERMINAL_PUNCTUATION |
| Titlecase_Mapping | Unicode string | | u_strToTitle (ustring.h) |
| Unicode_1_Name | ASCII string | (U) | u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) |
| Unified_Ideograph | binary | (U) | UCHAR_UNIFIED_IDEOGRAPH |
| Uppercase | binary | (U) | u_isUUppercase, UCHAR_UPPERCASE |
| Uppercase_Mapping | Unicode string | | u_strToUpper (ustring.h) |
| Vertical_Orientation | enum | (U) | UCHAR_VERTICAL_ORIENTATION<br>returns enum UVerticalOrientation |
| White_Space | binary | (U) | u_isUWhiteSpace, UCHAR_WHITE_SPACE |
| Word_Break | enum | (U) | UCHAR_WORD_BREAK<br>returns enum UWordBreakValues |
| XID_Continue | binary | (U) | UCHAR_XID_CONTINUE |
| XID_Start | binary | (U) | UCHAR_XID_START |
Notes:
1. (c) - This property only **contributes** to "real" properties (mostly
"Other_..." properties), so there is no direct support for this property in
ICU.
2. (U) - This property is available via the UnicodeSet APIs and patterns. Any
property available in UnicodeSet is also available in regular expressions.
Properties which are not available in UnicodeSet are generally those that
are not available through a UProperty selector.
3. UnicodeSet `[:scx=Arab:]` is a superset of `[:sc=Arab:]`;
see https://www.unicode.org/reports/tr18/#Script_Property
4. Full case mapping properties (e.g., Lowercase_Mapping) are complex.
The string case mapping functions that implement them handle language-specific
and/or context-sensitive mappings.
The output may have more code points or fewer code points than the input.
## Customization
ICU does not provide the means to modify properties at runtime. The properties
are provided exactly as specified by a recent version of the Unicode Standard
(as published in the [Character
Database](http://www.unicode.org/unicode/onlinedat/online.html)).
For custom sets and maps, it is easiest to make UnicodeSet or
UCPTrie/CodePointTrie objects with the desired values.
However, if an application requires custom properties (for example, for [Private
Use](http://www.unicode.org/glossary/) characters), then it is possible to
change or add them at build-time. This is doable but not easy.
It is done by modifying the Character Database files copied into the ICU source
tree at
[icu4c/source/data/unidata](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/unidata).
Since ICU 49, most of the properties have been combined into one file,
unidata/ppucd.txt (see the [Preparsed
UCD](http://site.icu-project.org/design/props/ppucd) design doc). Some of the
remaining UCD files are still inputs, others are only used for unit tests.
To add a character to such a file, a line must be inserted into the file with
the format used in that file (see the online documentation on the [Unicode
site](http://www.unicode.org/reports/tr44/) for more information). After
modifying one or more of these files, the ICU data needs to be rebuilt, and the
resulting files need to be checked into the ICU source tree. The files are
processed by special ICU tools outside of the normal ICU build. The
[unidata/changes.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/changes.txt)
file documents the process that has been used for the last several Unicode
version updates; skip the file preparation and API update steps.
Any available Unicode code point (0 to 10FFFF<sub>16</sub>) can be used.
Code point values
should be written with either 4, 5, or 6 hex digits. The minimum number of
digits possible should be used (but no fewer than 4). Note that the Unicode
Standard specifies that the 32 code points U+FDD0..U+FDEF and the 34 code points
U+...xFFFE and U+...xFFFF (where x=0, 1, 2, ..., F, 10) are not characters,
therefore they should not be added to any of the character database files.
## Lookup
For lookup by code point, iterate through the string, fetch code points, and
either call the unicode/uchar.h / UCharacter or similar functions, or use
dedicated sets and maps. For binary properties, and sets in general, there are
also more efficient methods for iterating over substrings.
### Binary property from code point
Call one of the binary-property functions. Alternatively, make a UnicodeSet for
the property (remember to freeze() it) or for a custom set of characters, and
call contains().
### Binary property over string
It is often useful to partition a string into substrings where every character
has the property, and substrings where every character does not have the
property. For example, to split the string at separator characters, remove
certain types of characters, trim white space, etc. Use a UnicodeSet with its
span() and spanBack() methods (in C++ these are also available in UTF-8
versions). In Java, you can also use a UnicodeSetSpanner.
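The span() operation reduces to "length of the longest prefix whose characters are all in the set". A byte-based toy version, with a predicate standing in for a frozen UnicodeSet (the real methods work on code points; the function names here are ours):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Length of the longest run starting at pos whose characters are all
// "in the set" -- the core idea of UnicodeSet::span(). Byte-based toy.
template <typename Pred>
std::size_t spanWhile(const std::string &s, std::size_t pos, Pred inSet) {
    std::size_t i = pos;
    while (i < s.size() && inSet(s[i])) ++i;
    return i - pos;
}

// Example use: trim leading white space by spanning over it.
std::string trimLeading(const std::string &s) {
    std::size_t n = spanWhile(s, 0, [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    });
    return s.substr(n);
}
```

Splitting at separators, or trimming trailing characters with a spanBack()-style scan from the end, follows the same pattern.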
### Enumerated property from code point
Call one of the int-property functions. Alternatively, build a UCPTrie /
CodePointTrie (new in ICU 63) via its mutable version and build method, then use
that to get the int value for each code point.
### Enumerated property over string
Easiest is to iterate over code points of the string and call per-code point
lookup methods (or use a code point trie).
The UCPTrie / CodePointTrie (new in ICU 63) also offers C macros and a Java
String iterator class where the iteration and data lookup are integrated to
avoid redundancies in validation and range checks.
The UTF-16 code point macros and the Java String iterator also provide the code
point as output, because it has to be fetched or assembled anyway.
The UTF-8 macros do not assemble the code point because that would be some
amount of extra work, but often only the lookup value is used and the code point
is not needed. When it is needed after all, it is possible to take advantage of
the macros having validated the byte sequence: If the sequence was ill-formed,
then the trie's error value is set. Therefore, if a value other than the trie
error value was returned, then the sequence was well-formed, and the code point
can be fetched without revalidating the sequence (e.g., via U8_NEXT_UNSAFE()).
Since the length of the sequence (1..4 bytes) is also known from the iteration
(string index before/after next() call), an even simpler piece of code can be
used. (See for example the ICU-internal function codePointFromValidUTF8() in
normalizer2impl.cpp.)
### Code point trie most-optimized UTF-16 access
UTF-16 text processing can be further optimized by detecting surrogate pairs and
assembling supplementary code points only when there is non-trivial data
available.
At build time, iterate over all supplementary code points
(umutablecptrie_getRange() / MutableCodePointTrie.getRange() starting from
U+10000) to see if there is non-trivial data for any of the supplementary code
points associated with a lead surrogate. If so, then set a special
(application-specific) value for the lead surrogate.
At runtime, use UCPTRIE_FAST_BMP_GET() per code *unit*. If there is non-trivial
data and the code unit is a lead surrogate, then check if a trail surrogate
follows. If so, assemble the supplementary code point with
U16_GET_SUPPLEMENTARY() and look up its value with UCPTRIE_FAST_SUPP_GET();
otherwise deal with the unpaired surrogate in some way. (Java CodePointTrie.Fast
and java.lang.Character have equivalent methods.)
If there is only trivial data for lead and trail surrogates, then processing can
often skip them. (In this case, there will be two data lookups, one for the lead
surrogate and one for the trail surrogate, but they are fast, and this
optimization speeds up the more common BMP characters by not checking for
surrogates each time.)
For example, in normalization or case mapping all characters that do not have
any mappings are simply copied as is.
## Properties in ICU Rule Syntax
ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic
"spaces" to allow for the usage of white space characters outside of the normal
ASCII range while still maintaining backward compatibility. See
<https://www.unicode.org/reports/tr31/#Pattern_Syntax> for more information.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Regular Expressions
## Overview
ICU's Regular Expressions package provides applications with the ability to
apply regular expression matching to Unicode string data. The regular expression
patterns and behavior are based on Perl's regular expressions. The C++
programming API for using ICU regular expressions is loosely based on the JDK
1.4 package java.util.regex, with some extensions to adapt it for use in a C++
environment. A plain C API is also provided.
The ICU Regular expression API supports operations including testing for a
pattern match, searching for a pattern match, and replacing matched text.
Capture groups allow subranges within an overall match to be identified, and to
appear within replacement text.
A Perl-inspired split() function that breaks a string into fields based on a
delimiter pattern is also included.
ICU Regular Expressions conform to version 19 of the
[Unicode Technical Standard \#18](http://www.unicode.org/reports/tr18/),
Unicode Regular Expressions, level 1, and in addition include Default Word
boundaries and Name Properties from level 2.
A detailed description of regular expression patterns and pattern matching
behavior is not included in this user guide. The best reference for this topic
is the book "Mastering Regular Expressions, 3rd Edition" by Jeffrey E. F.
Friedl, O'Reilly Media; 3rd edition (August 2006). Matching behavior can
sometimes be surprising, and this book is highly recommended for anyone doing
significant work with regular expressions.
## Using ICU Regular Expressions
The ICU C++ Regular Expression API includes two classes, `RegexPattern` and
`RegexMatcher`, that parallel the classes from the Java JDK package
java.util.regex. A `RegexPattern` represents a compiled regular expression while
`RegexMatcher` associates a `RegexPattern` and an input string to be matched, and
provides API for the various find, match and replace operations. In most cases,
however, only the class `RegexMatcher` is needed, and the existence of class
RegexPattern can safely be ignored.
The first step in using a regular expression is typically the creation of a
`RegexMatcher` object from the source (string) form of the regular expression.
`RegexMatcher` holds a pre-processed (compiled) pattern and a reference to an
input string to be matched, and provides API for the various find, match and
replace operations. `RegexMatchers` can be reset and reused with new input, thus
avoiding object creation overhead when performing the same matching operation
repeatedly on different strings.
The following code will create a `RegexMatcher` from a string containing a regular
expression, and then perform a simple `find()` operation.
#include <unicode/regex.h>
UErrorCode status = U_ZERO_ERROR;
...
RegexMatcher *matcher = new RegexMatcher("abc+", 0, status);
if (U_FAILURE(status)) {
// Handle any syntax errors in the regular expression here
...
}
UnicodeString stringToTest = "Find the abc in this string";
matcher->reset(stringToTest);
if (matcher->find()) {
// We found a match.
int startOfMatch = matcher->start(status); // string index of start of match.
...
}
Several types of matching tests are available
| Function | Description |
|:--------------|:---------------------------------------------------------------|
| `matches()` | True if the pattern matches the entire string, from the start through to the last character.
| `lookingAt()` | True if the pattern matches at the start of the string. The match need not include the entire string.
| `find()` | True if the pattern matches somewhere within the string. Successive calls to find() will find additional matches, until the string is exhausted.
If additional text is to be checked for a match with the same pattern, there is
no need to create a new matcher object; just reuse the existing one.
myMatcher->reset(anotherString);
if (myMatcher->matches(status)) {
// We have a match with the new string.
}
Note that matching happens directly in the string supplied by the application.
This reduces the overhead of resetting a matcher to an absolute minimum (the
matcher need only store a reference to the new string), but it does mean that
the application must be careful not to modify or delete the string while the
matcher is holding a reference to it.
After finding a match, additional information is available about the range of
the input matched, and the contents of any capture groups. Note that, for
simplicity, any error parameters have been omitted. See the [API
reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classRegexMatcher.html) for
a complete description of the API.
| Function | Description |
|:----------------|:---------------------------------------------------------------|
| `start()` | Return the index of the start of the matched region in the input string.
| `end()` | Return the index of the first character following the match.
| `group()` | Return a UnicodeString containing the text that was matched.
| `start(n)` | Return the index of the start of the text matched by the nth capture group.
| `end(n)` | Return the index of the first character following the text matched by the nth capture group.
| `group(n)` | Return a UnicodeString containing the text that was matched by the nth capture group.
## Regular Expression Metacharacters
| Character | outside of sets | \[inside sets\] | Description |
|:----------|:----------------|:----------------|:-------------|
| \\a | ✓ | ✓ | Match a BELL, \\u0007.
| \\A | ✓ | | Match at the beginning of the input. Differs from ^ in that \\A will not match after a new line within the input.
| \\b | ✓ | | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\\w) and non-word (\\W) characters, with combining marks ignored. For better word boundaries, see [ICU Boundary Analysis](../boundaryanalysis/index.md).
| \\B | ✓ | | Match if the current position is not a word boundary.
| \\cX | ✓ | ✓ | Match a control-X character.
| \\d | ✓ | ✓ | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
| \\D | ✓ | ✓ | Match any character that is not a decimal digit.
| \\e | ✓ | ✓ | Match an ESCAPE, \\u001B.
| \\E | ✓ | ✓ | Terminates a \\Q ... \\E quoted sequence.
| \\f | ✓ | ✓ | Match a FORM FEED, \\u000C.
| \\G | ✓ | ✓ | Match if the current position is at the end of the previous match.
| \\h | ✓ | ✓ | Match a Horizontal White Space character. They are characters with Unicode General Category of Space_Separator plus the ASCII tab (\\u0009).
| \\H | ✓ | ✓ | Match a non-Horizontal White Space character.
| \\k<name> | ✓ | | Named Capture Back Reference.
| \\n | ✓ | ✓ | Match a LINE FEED, \\u000A.
| \\N{UNICODE CHARACTER NAME} | ✓ | ✓ | Match the named character.
| \\p{UNICODE PROPERTY NAME} | ✓ | ✓ | Match any character with the specified Unicode Property.
| \\P{UNICODE PROPERTY NAME} | ✓ | ✓ | Match any character not having the specified Unicode Property.
| \\Q | ✓ | ✓ | Quotes all following characters until \\E.
| \\r | ✓ | ✓ | Match a CARRIAGE RETURN, \\u000D.
| \\R | ✓ | | Match a new line character, or the sequence CR LF. The new line characters are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029.
| \\s | ✓ | ✓ | Match a white space character. White space is defined as \[\\t\\n\\f\\r\\p{Z}\].
| \\S | ✓ | ✓ | Match a non-white space character.
| \\t | ✓ | ✓ | Match a HORIZONTAL TABULATION, \\u0009.
| \\uhhhh | ✓ | ✓ | Match the character with the hex value hhhh.
| \\Uhhhhhhhh | ✓ | ✓ | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \\U0010ffff.
| \\v | ✓ | ✓ | Match a new line character. The new line characters are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029. Does not match the new line sequence CR LF.
| \\V | ✓ | ✓ | Match a non-new line character.
| \\w | ✓ | ✓ | Match a word character. Word characters are \[\\p{Alphabetic}\\p{Mark}\\p{Decimal_Number}\\p{Connector_Punctuation}\\u200c\\u200d\].
| \\W | ✓ | ✓ | Match a non-word character.
| \\x{hhhh} | ✓ | ✓ | Match the character with hex value hhhh. From one to six hex digits may be supplied.
| \\xhh | ✓ | ✓ | Match the character with two digit hex value hh.
| \\X | ✓ | | Match a [Grapheme Cluster](http://www.unicode.org/unicode/reports/tr29/#Grapheme_Cluster_Boundaries).
| \\Z | ✓ | | Match if the current position is at the end of input, but before the final line terminator, if one exists.
| \\z | ✓ | | Match if the current position is at the end of input.
| \\*n* | ✓ | | Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern.
| \\0ooo | ✓ | ✓ | Match an Octal character. 'ooo' is from one to three octal digits. 0377 is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references.
| \[pattern\] | ✓ | ✓ | Match any one character from the set.
| . | ✓ | | Match any character.
| ^ | ✓ | | Match at the beginning of a line.
| $ | ✓ | | Match at the end of a line. Line terminating characters are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029 and the sequence \\u000d \\u000a.
| \\ | ✓ | | Quotes the following character. Characters that must be quoted to be treated as literals are \* ? + \[ ( ) { } ^ $ | \\ .
| \\ | | ✓ | Quotes the following character. Characters that must be quoted to be treated as literals are \[ \] \\ Characters that may need to be quoted, depending on the context are - &
## Regular Expression Operators
| Operator | Description
|:--------------|:---------------------------------------------------------------|
| `\|` | Alternation. A\|B matches either A or B.
| `*` | Match 0 or more times. Match as many times as possible.
| `+` | Match 1 or more times. Match as many times as possible.
| `?` | Match zero or one times. Prefer one.
| `{n}` | Match exactly n times
| `{n,}` | Match at least n times. Match as many times as possible.
| `{n,m}` | Match between n and m times. Match as many times as possible, but not more than m.
| `*?` | Match 0 or more times. Match as few times as possible.
| `+?` | Match 1 or more times. Match as few times as possible.
| `??` | Match zero or one times. Prefer zero.
| `{n}?` | Match exactly n times.
| `{n,}?` | Match at least n times, but no more than required for an overall pattern match.
| `{n,m}?` | Match between n and m times. Match as few times as possible, but not less than n.
| `*+` | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match).
| `++` | Match 1 or more times. Possessive match.
| `?+` | Match zero or one times. Possessive match.
| `{n}+` | Match exactly n times.
| `{n,}+` | Match at least n times. Possessive Match.
| `{n,m}+` | Match between n and m times. Possessive Match.
| `( ...)` | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
| `(?: ...)` | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
| `(?> ...)` | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>".
| `(?# ...)` | Free-format comment (?# comment ).
| `(?= ...)` | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
| `(?! ...)` | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
| `(?<= ...)` | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no \* or + operators.)
| `(?<! ...)` | Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no \* or + operators.)
| `(?<name>...)` | Named capture group. The <angle brackets> are literal - they appear in the pattern.
| `(?ismwx-ismwx:...)` | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
| `(?ismwx-ismwx)` | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
## Set Expressions (Character Classes)
| Example | Description
|:--------------|:---------------------------------------------------------------|
| `[abc]` | Match any of the characters a, b or c.
| `[^abc]` | Negation - match any character except a, b or c.
| `[A-M]` | Range - match any character from A to M. The characters to include are determined by Unicode code point ordering.
| `[\u0000-\U0010ffff]` | Range - match all characters.
| `[\p{L}] [\p{Letter}] [\p{General_Category=Letter}]` | Characters with Unicode Category = Letter. All forms shown are equivalent.
| `[\P{Letter}]` | Negated property. (Upper case \P) Match everything except Letters.
| `[\p{numeric_value=9}]` | Match all numbers with a numeric value of 9. Any Unicode Property may be used in set expressions.
| `[\p{Letter}&&\p{script=cyrillic}]` | Logical AND or intersection. Match the set of all Cyrillic letters.
| `[\p{Letter}--\p{script=latin}]` | Subtraction. Match all non-Latin letters.
| `[[a-z][A-Z][0-9]]` `[a-zA-Z0-9]` | Implicit Logical OR or Union of Sets. The examples match ASCII letters and digits. The two forms are equivalent.
| `[:script=Greek:]` | Alternate POSIX-like syntax for properties. Equivalent to \\p{script=Greek}.
## Case Insensitive Matching
Case insensitive matching is specified by the UREGEX_CASE_INSENSITIVE flag
during pattern compilation, or by the (?i) flag within a pattern itself. Unicode
case insensitive matching is complicated by the fact that changing the case of a
string may change its length. See <http://unicode.org/faq/casemap_charprop.html>
for more information on Unicode casing operations.
Full case-insensitive matching handles situations where the number of characters
in equal strings may differ. "fußball" compares equal to "FUSSBALL", for example.
Simple case insensitive matching operates one character at a time on the strings
being compared. "fußball" does not compare equal to "FUSSBALL".
For ICU regular expression matching,
* Anything from a regular expression pattern that looks like a literal string
(even of one character) will be matched against the text using full case
folding. The pattern string and the matched text may be of different
lengths.
* Any sequence that is composed by the matching engine from originally
separate parts of the pattern will not match with the composition boundary
within a case folding expansion of the text being matched.
* Matching of \[set expressions\] uses simple matching. A \[set\] will match
exactly one code point from the text.
Examples:
* pattern "fussball" will match "fußball" or "fussball".
* pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL"
but not "fußball".
* pattern "ß" will find occurrences of "ss" or "ß".
* pattern "s+" will not find "ß".
With these rules, a match or capturing sub-match can never begin or end in the
interior of an input text character that expanded when case folded.
## Flag Options
The following flags control various aspects of regular expression matching. The
flag values may be specified at the time that an expression is compiled into a
RegexPattern object, or they may be specified within the pattern itself using
the `(?ismwx-ismwx)` pattern options.
> :point_right: **Note**: The UREGEX_CANON_EQ option is not yet available.
| Flag (pattern) | Flag (API Constant) | Description
|:---------------|:--------------------|:-----------------|
| | UREGEX_CANON_EQ | If set, matching will take the canonical equivalence of characters into account. NOTE: this flag is not yet implemented.
| i | UREGEX_CASE_INSENSITIVE | If set, matching will take place in a case-insensitive manner.
| x | UREGEX_COMMENTS | If set, allow use of white space and #comments within patterns.
| s | UREGEX_DOTALL | If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behave as a single line terminator, and will match a single "." in a RE pattern. Line terminators are \\u000a, \\u000b, \\u000c, \\u000d, \\u0085, \\u2028, \\u2029 and the sequence \\u000d \\u000a.
| m | UREGEX_MULTILINE | Control the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text.
| w | UREGEX_UWORD | Controls the behavior of \\b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
## Using split()
ICU's split() function is similar in concept to Perl's: it will split a string
into fields, with a regular expression match defining the field delimiters and
the text between the delimiters being the field content itself.
Suppose you have a string of words separated by spaces:
UnicodeString s = "dog cat giraffe";
This code will extract the individual words from the string:
UErrorCode status = U_ZERO_ERROR;
RegexMatcher m("\\s+", 0, status);
const int maxWords = 10;
UnicodeString words[maxWords];
int numWords = m.split(s, words, maxWords, status);
After the split():
| Variable | value |
|:----------------|:-------------|
| `numWords` | `3`
| `words[0]` | `"dog"`
| `words[1]` | `"cat"`
| `words[2]` | `"giraffe"`
| `words[3 to 9]` | `""`
The field delimiters, the spaces from the original string, do not appear in the
output strings.
Note that, in this example, `words` is a local, or stack array of actual
UnicodeString objects. No heap allocation is involved in initializing this array
of empty strings (C++ is not Java!). Local UnicodeString arrays like this are a
very good fit for use with split(); after extracting the fields, any values that
need to be kept in some more permanent way can be copied to their ultimate
destination.
If the number of fields in a string being split exceeds the capacity of the
destination array, the last destination string will contain all of the input
string data that could not be split, including any embedded field delimiters.
This is similar to split() in Perl.
If the pattern expression contains capturing parentheses, the captured data ($1,
$2, etc.) will also be saved in the destination array, interspersed with the
fields themselves.
If, in the "dog cat giraffe" example, the pattern had been `"(\s+)"` instead of
`"\s+"`, `split()` would have produced five output strings instead of three.
`words[1]` and `words[3]` would have been the spaces.
## Find and Replace
Find and Replace operations are provided with the following functions.
| Function | Description |
|:------------|:--------------|
| `replaceFirst()` | Replace the first matching substring with the replacement text. Performs the complete operation, including the `find()`.
| `replaceAll()` | Replace all matching substrings with the replacement text. Performs the complete operation, including all `find()`s.
| `appendReplacement()` | Incremental replace operation, intended to be used in a loop with `find()`.
| `appendTail()` | Final step in an incremental find & replace; appends any remaining text following the last replacement.
The replacement text for find-and-replace operations may contain references to
capture-group text from the find.
| Character | Descriptions |
|:----------|:--------------|
| `$n` | The text of capture group 'n' will be substituted for `$n`. n must be >= 0 and not greater than the number of capture groups. An unescaped $ in replacement text that is not followed by a capture group specification, either a number or name, is an error.
| `${name}` | The text of the named capture group will be substituted. The name must appear in the pattern.
| `\` | Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\\', but may be used on any other character without bad effects.
**Sample code showing the use of appendReplacement()**
#include <stdio.h>
#include "unicode/regex.h"
int main() {
UErrorCode status = U_ZERO_ERROR;
RegexMatcher m(UnicodeString(" +"), 0, status);
UnicodeString text("Here is some text.");
m.reset(text);
UnicodeString result;
UnicodeString replacement("_");
int replacement_count = 0;
while (m.find(status) && U_SUCCESS(status)) {
m.appendReplacement(result, replacement, status);
replacement_count++;
}
m.appendTail(result);
char result_buf[100];
result.extract(0, result.length(), result_buf, sizeof(result_buf));
printf("The result of find & replace is \"%s\"\n", result_buf);
printf("The number of replacements is %d\n", replacement_count);
}
Running this sample produces the following:
The result of find & replace is "Here_is_some_text."
The number of replacements is 3
## Performance Tips
Some regular expression patterns can result in very slow match operations,
sometimes so slow that it will appear as though the match has gone into an
infinite loop. The problem is not unique to ICU - it affects any regular
expression implementation using a conventional nondeterministic finite automaton
(NFA) style matching engine. This section gives some suggestion on how to avoid
problems.
The performance problems tend to show up most commonly on failing matches - when
an input string does not match the regexp pattern. With a complex pattern
containing multiple \* or + (or similar) operators, the match engine will
tediously redistribute the input text between the different pattern terms, in a
doomed effort to find some combination that leads to a match (that doesn't
exist).
The running time for troublesome patterns is exponential with the length of the
input string. Every added character in the input doubles the (non)matching time.
It doesn't take a particularly long string for the projected running time to
exceed the age of the universe.
A simple pattern showing the problem is
`(A+)+B`
matching against the string
`AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC`
The expression can't match - there is no 'B' in the input - but the engine is
too dumb to realize that, and will try all possible permutations of rearranging
the input between the terms of the expression before failing.
Some suggestions:
* Avoid, or examine carefully, any expressions with nested repeating
quantifiers, like in the example above. They can often be recast in some
other way. Any ambiguity in how input text could be distributed between the
terms of the expression will cause problems.
* Narrow every term in a pattern to match as small a set of characters as
possible at each point. Fail as early as possible with bad input, rather
than letting broad `.*` style terms eat intermediate input and relying on
later terms in the expression to produce a failure.
* Use possessive quantifiers when possible - `*+` instead of `*`, `++`
instead of `+`
These operators prevent backtracking; the initial match of a `*+` qualified
pattern is either used in its entirety as part of the complete match, or it
is not used at all.
* Follow or surround `*` or `+` expressions with terms that the repeated
expression can not match. The idea is to have only one possible way to match
the input, with no possibility of redistributing the input between adjacent
terms of the pattern.
* Avoid overly long and complex regular expressions. Just because it's
possible to do something completely in one large expression doesn't mean
that you should. Long expressions are difficult to understand and can be
almost impossible to debug when they go wrong. It is no sin to break a
parsing problem into pieces and to have some code involved involved in the
process.
* Set a time limit. ICU includes the ability to limit the time spent on a
regular expression match. This is a good idea when running untested
expressions from users of your application, or as a fail safe for servers or
other processes that cannot afford to be hung.
Examples from actual bug reports:
The pattern
(?:[A-Za-z0-9]+[._]?){1,}[A-Za-z0-9]+\@(?:(?:[A-Za-z0-9]+[-]?){1,}[A-Za-z0-9]+\.){1,}
^^^^^^^^^^^^^^^^^^^^^^^^^
and the text
abcdefghijklmnopq
cause an infinite loop.
The problem is in the region marked with `^^^^^`. The `[._]?` term can be
ignored, because it need not match anything. `{1,}` is the same as `+`. So we
effectively have `(?:[A-Za-z0-9]+)+`, which is trouble.
The initial part of the expression can be recast as
`[A-Za-z0-9]+([._][A-Za-z0-9]+)*`
which matches the same thing. The nested `+` and `*` qualifiers do not cause a
problem because the `[._]` term is not optional and contains no characters that
overlap with `[A-Za-z0-9]`, leaving no ambiguity in how input characters can be
distributed among terms in the match.
A further note: this expression was intended to parse email addresses, and has a
number of other flaws. For common tasks like this there are libraries of freely
available regular expressions that have been well debugged. It's worth making a
quick search before writing a new expression.
> :construction: **TODO**: add more examples.
### Heap and Stack Usage
ICU keeps its match backtracking state on the heap. Because badly designed or
malicious patterns can result in matches that require large amounts of storage,
ICU sets a limit on heap usage by matches. The default is 8 MB; it can be
changed or removed via an API.
Because ICU does not use program recursion to maintain its backtracking state,
stack usage during matching operations is minimal, and does not increase with
complex patterns or large amounts of backtracking state. This is worth
mentioning only because excessive stack usage, resulting in blown off threads or
processes, can be a problem with some regular expression packages.
## Differences with Java Regular Expressions
* ICU does not support UREGEX_CANON_EQ. See
<https://unicode-org.atlassian.net/browse/ICU-9111>.
* The behavior of \\cx (Control-X) differs from Java when x is outside the
range A-Z. See <https://unicode-org.atlassian.net/browse/ICU-6068>.
* Java allows quantifiers (\*, +, etc) on zero length tests. ICU does not.
Occurrences of these in patterns are most likely unintended user errors, but
it is an incompatibility with Java.
<https://unicode-org.atlassian.net/browse/ICU-6080>
* ICU recognizes all Unicode properties known to ICU, which is all of them.
Java is restricted to just a few.
* ICU case insensitive matching works with all Unicode characters, and, within
string literals, does full Unicode matching (where matching strings may be
different lengths.) Java does ASCII only by default, with Unicode aware case
folding available as an option.
* ICU has an extended syntax for set \[bracket\] expressions, including
additional operators. Added for improved compatibility with the original ICU
implementation, which was based on ICU UnicodeSet pattern syntax.
* The property expression `\p{punct}` differs in what it matches. Java matches
any of ```!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~```. From that list, ICU omits
```$+<=>^`|~```.
ICU follows the recommendations from Unicode UTS-18,
<http://unicode.org/reports/tr18/#Compatibility_Properties>. See also
<https://unicode-org.atlassian.net/browse/ICU-20095>.

# StringPrep
## Overview
Comparing strings in a consistent manner becomes imperative when a large
repertoire of characters such as Unicode is used in network protocols.
StringPrep provides sets of rules for use of Unicode and syntax for prevention
of spoofing. The implementation of StringPrep and IDNA services and their usage
in ICU is described below.
## StringPrep
StringPrep, the process of preparing Unicode strings for use in network
protocols, is defined in RFC 3454 (<http://www.rfc-editor.org/rfc/rfc3454.txt>).
The RFC defines a broad framework and rules for processing the strings.
Protocols that prescribe use of StringPrep must define a profile of StringPrep,
whose applicability is limited to the protocol. Profiles are sets of rules and
data tables which describe how the strings should be prepared. The profiles
can choose to turn normalization and the checking of bidirectional characters
on or off. They can also choose to add or remove mappings, unassigned and
prohibited code points from the tables provided.
StringPrep uses Unicode Version 3.2 and defines a set of tables for use by the
profiles. The profiles can choose to include or exclude tables or code points
from the tables defined by the RFC.
StringPrep defines tables that can be broadly classified into
1. *Unassigned Table*: Contains code points that are unassigned in Unicode
   Version 3.2. Unassigned code points may be allowed or disallowed in the
   output string depending on the application. The table in Appendix A.1 of the
   RFC contains these code points.
2. *Mapping Tables*: Contain code points that are commonly deleted from the
   output and code points that are case mapped. There are two mapping tables in
   the Appendix, namely B.1 and B.2.
3. *Prohibited Tables*: Contain code points that are prohibited from the
   output string. Control codes, private use area code points, non-character
   code points, surrogate code points, tagging and deprecated code points are
   included in these tables. There are nine tables in the Appendix which
   contain the prohibited code points, namely C.1, C.2, C.3, C.4, C.5, C.6,
   C.7, C.8 and C.9.
The procedure for preparing strings for use can be described in the following
steps:
1. *Map*: For each code point in the input check if it has a mapping defined in
the mapping table, if so, replace it with the mapping in the output.
2. *Normalize*: Normalize the output of step 1 using Unicode Normalization Form
   NFKC, if the option is set. The normalization algorithm must conform to
   UAX #15.
3. *Prohibit*: For each code point in the output of step 2 check if the code
point is present in the prohibited table, if so, fail returning an error.
4. *Check BiDi*: Check for code points with strong right-to-left directionality
in the output of step 3. If present, check if the string satisfies the rules
for bidirectional strings as specified.
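Stripped to its essentials, this pipeline can be sketched in Java, using `java.text.Normalizer` for the NFKC step. The tiny mapping and prohibited tables below are illustrative stand-ins for the RFC's Appendix B and C data, not the real tables, and the BiDi check is omitted:

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.Set;

public class MiniStringPrep {
    // Illustrative stand-ins for the RFC 3454 tables (not the real data).
    static final Map<Integer, String> MAPPINGS = Map.of(
            0x00AD, "",    // SOFT HYPHEN: commonly mapped to nothing
            0x0041, "a");  // 'A': a case mapping
    static final Set<Integer> PROHIBITED = Set.of(0x0000, 0xFFFD);

    static String prepare(String input) {
        // 1. Map: replace each code point that has a mapping.
        StringBuilder mapped = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String m = MAPPINGS.get(cp);
            if (m != null) mapped.append(m); else mapped.appendCodePoint(cp);
        });
        // 2. Normalize the mapped output with NFKC.
        String normalized = Normalizer.normalize(mapped, Normalizer.Form.NFKC);
        // 3. Prohibit: fail if any prohibited code point remains.
        normalized.codePoints().forEach(cp -> {
            if (PROHIBITED.contains(cp))
                throw new IllegalArgumentException(
                        "prohibited: U+" + Integer.toHexString(cp));
        });
        // 4. Check BiDi: omitted in this sketch.
        return normalized;
    }
}
```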
## NamePrep
NamePrep is a profile of StringPrep for use in IDNA. This profile is defined in
RFC 3491 (<http://www.rfc-editor.org/rfc/rfc3491.txt>).
The profile specifies the following rules:
1. *Map*: Include all code point mappings specified in StringPrep.
2. *Normalize*: Normalize the output of step 1 according to NFKC.
3. *Prohibit*: Prohibit all code points specified as prohibited in StringPrep
except for the space ( U+0020) code point from the output of step 2.
4. *Check BiDi*: Check for bidirectional code points and process according to
the rules specified in StringPrep.
## Punycode
Punycode is an encoding scheme for Unicode for use in IDNA. Punycode converts
Unicode text to a unique sequence of ASCII text and back to Unicode. It is an
ASCII Compatible Encoding (ACE). Punycode is described in RFC 3492
(<http://www.rfc-editor.org/rfc/rfc3492.txt>).
The Punycode algorithm is a form of the general Bootstring algorithm, which
allows strings composed of a smaller set of code points to uniquely represent
any string of code points from a larger set. Punycode represents Unicode code
points from U+0000 to U+10FFFF by using the smaller ASCII set U+0000 to U+007F.
The algorithm can also preserve case information of the code points in the
larger set while encoding and decoding. This feature, however, is not used in
IDNA.
## Internationalizing Domain Names in Applications (IDNA)
The Domain Name Service (DNS) protocol defines the procedure for matching of
ASCII strings case insensitively to the names in the lookup tables containing
mapping of IP (Internet Protocol) addresses to server names. When Unicode is
used instead of ASCII in server names, two requirements arise, which must be
dealt with differently. When the server name is displayed to the user,
Unicode text should be displayed. When Unicode text is stored in lookup tables,
for compatibility with the older DNS protocol and the resolver libraries, the text
should be the ASCII equivalent. The IDNA protocol, defined by RFC 3490
(<http://www.rfc-editor.org/rfc/rfc3490.txt> ), satisfies the above
requirements.
Server names stored in the DNS lookup tables are usually formed by concatenating
domain labels with a label separator.
The protocol defines operations to be performed on domain labels before the
names are stored in the lookup tables and before the names fetched from lookup
tables are displayed to the user. The operations are:
1. ToASCII: This operation is performed on domain labels before sending the
name to a resolver and before storing the name in the DNS lookup table. The
domain labels are processed by StringPrep algorithm by using the rules
specified by NamePrep profile. The output of this step is then encoded by
using Punycode and an ACE prefix is added to denote that the text is encoded
using Punycode. IDNA uses “xn--” before the encoded label.
2. ToUnicode: This operation is performed on domain labels before displaying
   the names to users. If the domain label is prefixed with the ACE prefix
for IDNA, then the label excluding the prefix is decoded using Punycode. The
output of Punycode decoder is verified by applying ToASCII operation and
comparing the output with the input to the ToUnicode operation.
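As a point of comparison, the JDK ships its own implementation of these RFC 3490 operations in `java.net.IDN`, which makes the round trip easy to see (the `xn--` form shown is the standard Punycode encoding of this label):

```java
import java.net.IDN;

public class IdnDemo {
    public static void main(String[] args) {
        // ToASCII: Nameprep, then Punycode, then the "xn--" ACE prefix.
        String ascii = IDN.toASCII("b\u00FCcher");   // "bücher"
        System.out.println(ascii);                    // xn--bcher-kva
        // ToUnicode: decode the ACE form back to Unicode.
        System.out.println(IDN.toUnicode(ascii));     // bücher
        // For query-style operations, unassigned code points can be
        // permitted with the IDN.ALLOW_UNASSIGNED flag.
    }
}
```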
Unicode contains code points that are glyphically similar to the ASCII Full Stop
(U+002E). These code points must be treated as label separators when performing
the ToASCII operation. These code points are:
1. Ideographic Full Stop (U+3002)
2. Full Width Full Stop (U+FF0E)
3. Half Width Ideographic Full Stop (U+FF61)
Unassigned code points in Unicode Version 3.2 as given in StringPrep tables are
treated differently depending on how the processed string is used. For query
operations, where a registrar is requested for information regarding
availability of a certain domain name, unassigned code points are allowed to be
present in the string. For storing the string in DNS lookup tables, unassigned
code points are prohibited from the input.
IDNA specifies that the ToUnicode and ToASCII operations have options to check for
Letter-Digit-Hyphen code points and adhere to the STD3 ASCII Rules.
IDNA specifies that domain labels are equivalent if and only if the outputs of
the ToASCII operation on the labels match, using case-insensitive ASCII
comparison.
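In terms of the JDK's `java.net.IDN`, that equivalence rule amounts to something like the following sketch:

```java
import java.net.IDN;

public class IdnCompare {
    // Two labels are equivalent iff their ToASCII forms match
    // case-insensitively (a sketch of the RFC 3490 rule).
    static boolean sameLabel(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    public static void main(String[] args) {
        // Nameprep case-folds, so case differences disappear.
        System.out.println(sameLabel("B\u00DCCHER", "b\u00FCcher")); // true
        System.out.println(sameLabel("abc", "abd"));                 // false
    }
}
```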
## StringPrep Service in ICU
The StringPrep service in ICU is data driven. The service is based on an
Open-Use-Close pattern: a StringPrep profile is opened, the strings are
processed according to the rules specified in the profile, and the profile is
closed when it is no longer needed.
Tools for filtering RFC 3454 and producing a rule file that can be compiled into
a binary format containing all the information required by the service are
provided.
The procedure for producing a StringPrep profile data file is as given below:
1. Run the filterRFC3454.pl Perl tool to filter the RFC file and produce a rule
file. The text file produced can be edited by the clients to add/delete
mappings or add/delete prohibited code points.
2. Run the gensprep tool to compile the rule file into a binary format. The
options to turn on normalization of strings and checking of bidirectional
code points are passed as command line options to the tool. This tool
produces a binary profile file with the extension “spp”.
3. Open the StringPrep profile with path to the binary and name of the binary
profile file as the options to the open call. The profile data files are
memory mapped and cached for optimum performance.
### Code Snippets
> :point_right: **Note**: *The code snippets demonstrate the usage of the APIs.
> Applications should keep the profile object around for reuse, instead of
> opening and closing the profile each time.*
#### C++
```c
UErrorCode status = U_ZERO_ERROR;
UParseError parseError;

/* open the StringPrep profile */
UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata",
                                           "nfscsi", &status);
if (U_FAILURE(status)) {
    /* handle the error */
}

/* prepare the string for use according
 * to the rules specified in the profile
 */
int32_t retLen = usprep_prepare(src, srcLength, dest,
                                destCapacity, USPREP_ALLOW_UNASSIGNED,
                                nameprep, &parseError, &status);

/* close the profile */
usprep_close(nameprep);
```
#### Java
```java
// singleton instance
private static StringPrep nfscsi = null;
private static final NFSCSIStringPrep prep = new NFSCSIStringPrep();

private NFSCSIStringPrep() {
    try {
        InputStream nfscsiFile = TestUtil.getDataStream("nfscsi.spp");
        nfscsi = new StringPrep(nfscsiFile);
        nfscsiFile.close();
    } catch (IOException e) {
        throw new RuntimeException(e.toString());
    }
}

private static byte[] prepare(byte[] src, StringPrep prep)
        throws StringPrepParseException, UnsupportedEncodingException {
    String s = new String(src, "UTF-8");
    UCharacterIterator iter = UCharacterIterator.getInstance(s);
    StringBuffer out = prep.prepare(iter, StringPrep.DEFAULT);
    return out.toString().getBytes("UTF-8");
}
```
## IDNA API in ICU
ICU provides APIs for performing the ToASCII, ToUnicode and compare operations
as defined by the RFC 3490. Convenience methods for comparing IDNs are also
provided. These APIs follow ICU policies for string manipulation and coding
guidelines.
### Code Snippets
> :point_right: **Note**: *The code snippets demonstrate the usage of the APIs.
> Applications should keep the profile object around for reuse, instead of
> opening and closing the profile each time.*
### ToASCII operation
***C***
```c
UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                        UIDNA_DEFAULT, &parseError, &status);
if (status == U_BUFFER_OVERFLOW_ERROR) {
    status = U_ZERO_ERROR;
    destCapacity = destLen + 1; /* for the terminating Nul */
    free(dest); /* free the memory */
    dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toASCII(src, srcLen, dest, destCapacity,
                            UIDNA_DEFAULT, &parseError, &status);
}
if (U_FAILURE(status)) {
    /* handle the error */
}
/* do interesting stuff with the output */
```
***Java***
```java
try {
    StringBuffer out = IDNA.convertToASCII(inBuf, IDNA.DEFAULT);
} catch (StringPrepParseException ex) {
    // handle the exception
}
```
### toUnicode operation
***C***
```c
UChar* dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                          UIDNA_DEFAULT, &parseError, &status);
if (status == U_BUFFER_OVERFLOW_ERROR) {
    status = U_ZERO_ERROR;
    destCapacity = destLen + 1; /* for the terminating Nul */
    free(dest); /* free the memory */
    dest = (UChar*) malloc(destCapacity * U_SIZEOF_UCHAR);
    destLen = uidna_toUnicode(src, srcLen, dest, destCapacity,
                              UIDNA_DEFAULT, &parseError, &status);
}
if (U_FAILURE(status)) {
    /* handle the error */
}
/* do interesting stuff with the output */
```
***Java***
```java
try {
    StringBuffer out = IDNA.convertToUnicode(inBuf, IDNA.DEFAULT);
} catch (StringPrepParseException ex) {
    // handle the exception
}
```
### compare operation
***C***
```c
int32_t rc = uidna_compare(source1, length1,
                           source2, length2,
                           UIDNA_DEFAULT,
                           &status);
if (rc == 0) {
    /* the IDNs are the same ... do something interesting */
} else {
    /* the IDNs are different ... do something */
}
```
***Java***
```java
try {
    int retVal = IDNA.compare(s1, s2, IDNA.DEFAULT);
    // do something interesting with retVal
} catch (StringPrepParseException e) {
    // handle the exception
}
```
## Design Considerations
StringPrep profiles exhibit the following characteristics:
1. The profiles contain information about code points. StringPrep allows
profiles to add/delete code points or mappings.
2. Options such as turning normalization and checking for bidirectional code
   points on or off are properties of the profiles.
3. The StringPrep algorithm is not overridden by the profile.
4. Once defined, the profiles do not change.
The StringPrep profiles are used in network protocols so runtime performance is
important.
Many profiles have been and are being defined, so applications should be able to
plug in arbitrary profiles and get the desired result out of the framework.
ICU is designed for this usage by providing build-time tools for arbitrary
StringPrep profile definitions, and loading them from application-supplied data
in binary form with data structures optimized for runtime use.
## Demo
A web application at <http://demo.icu-project.org/icu-bin/idnbrowser>
illustrates the use of the IDNA API. The source code for the application is
available at <https://github.com/unicode-org/icu-demos/tree/master/idnbrowser>.
## Appendix
#### NFS Version 4 Profiles
Network File System Version 4, defined by RFC 3530
(<http://www.rfc-editor.org/rfc/rfc3530.txt>), defines the use of Unicode text
in the protocol. ICU provides the requisite profiles as part of its test suite,
and code for processing strings according to the profiles as part of its
samples.
The RFC defines three profiles:
1. *nfs4_cs_prep Profile*: This profile is used for preparing file and path
name strings. Normalization of code points and checking for bidirectional
code points are turned off. Case mappings are included if the NFS
implementation supports case insensitive file and path names.
2. *nfs4_cis_prep Profile*: This profile is used for preparing NFS server
names. Normalization of code points and checking for bidirectional code
points are turned on. This profile is equivalent to NamePrep profile.
3. *nfs4_mixed_prep Profile*: This profile is used for preparing strings in the
Access Control Entries of NFS servers. These strings consist of two parts,
prefix and suffix, separated by '@' (U+0040). The prefix is processed with
case mappings turned off and the suffix is processed with case mappings
turned on. Normalization of code points and checking for bidirectional code
points are turned on.
#### XMPP Profiles
Extensible Messaging and Presence Protocol (XMPP) is an XML based protocol for
near real-time extensible messaging and presence. This protocol defines use of
two StringPrep profiles:
1. *ResourcePrep Profile*: This profile is used for processing the resource
identifiers within XMPP. Normalization of code points and checking of
bidirectional code points are turned on. Case mappings are excluded. The
space code point (U+0020) is excluded from the prohibited code points set.
2. *NodePrep Profile*: This profile is used for processing the node identifiers
within XMPP. Normalization of code points and checking of bidirectional code
points are turned on. Case mappings are included. All code points specified
as prohibited in StringPrep are prohibited. Additional code points are added
to the prohibited set.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# UnicodeSet
## Overview
A UnicodeSet is an object that represents a set of Unicode characters or
character strings. The contents of that object can be specified either by
patterns or by building them programmatically.
Here are a few examples of sets:
| Pattern | Description |
|--------------|-------------------------------------------------------------|
| `[a-z]` | The lower case letters a through z |
| `[abc123]` | The six characters a,b,c,1,2 and 3 |
| `[\p{Letter}]` | All characters with the Unicode General Category of Letter. |
In addition to being a set of characters (of Unicode code points), a UnicodeSet
may also contain string values. Conceptually, the UnicodeSet is always a set of
strings, not a set of characters, although in many common use cases the strings
are all of length one, which reduces to being a set of characters.
This concept can be confusing when first encountered, probably because similar
set constructs from other environments (regular expressions) can only contain
characters.
## UnicodeSet Patterns
Patterns are a series of characters bounded by square brackets that contain
lists of characters and Unicode property sets. Lists are a sequence of
characters that may have ranges indicated by a '-' between two characters, as in
"a-z". The sequence specifies the range of all characters from the left to the
right, in Unicode order. For example, `[a c d-f m]` is equivalent to `[a c d e f m]`.
Whitespace can be freely used for clarity as `[a c d-f m]` means the same
as `[acd-fm]`.
Unicode property sets are specified by a Unicode property, such as `[:Letter:]`.
For a list of supported properties, see the [Properties](properties.md) chapter.
For details on the use of short vs. long property and property value names, see
the end of this section. The syntax for specifying the property names is an
extension of either POSIX or Perl syntax with the addition of "=value". For
example, you can match letters by using the POSIX syntax `[:Letter:]`, or by
using the Perl-style syntax `\p{Letter}`. The type can be omitted for the
Category and Script properties, but is required for other properties.
The table below shows the two kinds of syntax: POSIX and Perl style. Also, the
table shows the "Negative", which is a property that excludes all characters of
a given kind. For example, `[:^Letter:]` matches all characters that are not
`[:Letter:]`.
| | Positive | Negative |
|--------------------|------------------|-------------------|
| POSIX-style Syntax | `[:type=value:]` | `[:^type=value:]` |
| Perl-style Syntax | `\p{type=value}` | `\P{type=value}` |
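The Perl-style form will look familiar from regular expressions; for instance, Java's own `java.util.regex` accepts it (with Java's matching rules, which differ in detail from ICU UnicodeSet):

```java
import java.util.regex.Pattern;

public class PropDemo {
    public static void main(String[] args) {
        // \p{Lu}: uppercase letters; \P{Lu}: everything that is not.
        System.out.println(Pattern.matches("\\p{Lu}+", "ABC"));  // true
        System.out.println(Pattern.matches("\\p{Lu}+", "AbC"));  // false
        System.out.println(Pattern.matches("\\P{Lu}+", "abc"));  // true
    }
}
```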
These low-level lists or properties can then be freely combined with the
normal set operations (union, inverse, difference, and intersection):
| | Example | Corresponding Method | Meaning |
|-------|-------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A B | `[[:letter:] [:number:]]` | `A.addAll(B)` | To union two sets A and B, simply concatenate them |
| A & B | `[[:letter:] & [a-z]]` | `A.retainAll(B)` | To intersect two sets A and B, use the '&' operator. |
| A - B | `[[:letter:] - [a-z]]` | `A.removeAll(B)` | To take the set-difference of two sets A and B, use the '-' operator. |
| [^A] | `[^a-z]` | `A.complement()` | To invert a set A, place a '^' immediately after the opening '['. Note that the complement only affects code points, not string values. In any other location, the '^' does not have a special meaning. |
### Precedence
The binary operators of union, intersection, and set-difference have equal
precedence and bind left-to-right. Thus the following are equivalent:
* `[[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]]`
* `[[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]]`
Another example is that the set `[[ace][bdf] - [abc][def]]` is **not**
the empty set, but instead the set `[def]`. That is because the syntax
corresponds to the following UnicodeSet operations:
1. start with `[ace]`
2. addAll `[bdf]` *-- we now have `[abcdef]`*
3. removeAll `[abc]` *-- we now have `[def]`*
4. addAll `[def]` *-- no effect, we still have `[def]`*
This only really matters where there are the difference and intersection
operations, as the union operation is commutative. To make sure that the - is
the main operator, add brackets to group the operations as desired, such as
`[[ace][bdf] - [[abc][def]]]`.
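The step-by-step evaluation above can be mimicked with plain `java.util` sets standing in for UnicodeSet's `addAll`/`removeAll`:

```java
import java.util.Set;
import java.util.TreeSet;

public class PrecedenceDemo {
    public static void main(String[] args) {
        Set<Character> s = new TreeSet<>(Set.of('a', 'c', 'e')); // [ace]
        s.addAll(Set.of('b', 'd', 'f'));     // union [bdf]  -> [abcdef]
        s.removeAll(Set.of('a', 'b', 'c'));  // minus [abc]  -> [def]
        s.addAll(Set.of('d', 'e', 'f'));     // union [def]  -> [def]
        System.out.println(s);               // [d, e, f]
    }
}
```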
Another caveat with the '&' and '-' operators is that they operate between
**sets**. That is, they must be immediately preceded and immediately followed by
a set. For example, the pattern `[[:Lu:]-A]` is illegal, since it is
interpreted as the set `[:Lu:]` followed by the incomplete range `-A`. To specify
the set of uppercase letters except for 'A', enclose the 'A' in a set:
`[[:Lu:]-[A]]`.
### Examples
| Pattern | Description |
|---------|-------------|
| `[a]` | The set containing 'a' |
| `[a-z]` | The set containing 'a' through 'z' and all letters in between, in Unicode order |
| `[^a-z]` | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+FFFF |
| `[[pat1][pat2]]` | The union of sets specified by pat1 and pat2 |
| `[[pat1] & [pat2]]` | The intersection of sets specified by pat1 and pat2 |
| `[[pat1] - [pat2]]` | The asymmetric difference of sets specified by pat1 and pat2 |
| `[:Lu:]` | The set of characters belonging to the given Unicode category, as defined by `Character.getType()`; in this case, Unicode uppercase letters. The long form for this is `[:UppercaseLetter:]`. |
| `[:L:]` | The set of characters belonging to all Unicode categories starting with 'L', that is, `[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]`. The long form for this is `[:Letter:]`. |
### String Values in Sets
String values are enclosed in {curly brackets}.
| Set expression | Description |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `[abc{def}]` | A set containing four members, the single characters a, b and c, and the string “def” |
| `[{abc}{def}]` | A set containing two members, the string “abc” and the string “def”. |
| `[{a}{b}{c}]` `[abc]` | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. |
### Character Quoting and Escaping in Unicode Set Patterns
#### Single Quote
Two single quotes represent a single quote, either inside or outside single
quotes.
Text within single quotes is not interpreted in any way (except for two adjacent
single quotes). It is taken as literal text (special characters become
non-special).
These quoting conventions for ICU UnicodeSets differ from those of regular
expression character set expressions. In regular expressions, single quotes have
no special meaning and are treated like any other literal character.
#### Backslash Escapes
Outside of single quotes, certain backslashed characters have special meaning:
| Escape | Meaning |
|--------|---------|
| `\uhhhh` | Exactly 4 hex digits; h in [0-9A-Fa-f] |
| `\Uhhhhhhhh` | Exactly 8 hex digits |
| `\xhh` | 1-2 hex digits |
| `\ooo` | 1-3 octal digits; o in [0-7] |
| `\a` | U+0007 (BELL) |
| `\b` | U+0008 (BACKSPACE) |
| `\t` | U+0009 (HORIZONTAL TAB) |
| `\n` | U+000A (LINE FEED) |
| `\v` | U+000B (VERTICAL TAB) |
| `\f` | U+000C (FORM FEED) |
| `\r` | U+000D (CARRIAGE RETURN) |
| `\\` | U+005C (BACKSLASH) |
Anything else following a backslash is mapped to itself, except in an
environment where it is defined to have some special meaning. For example,
`\p{Lu}` is the set of uppercase letters in UnicodeSet.
Any character formed as the result of a backslash escape loses any special
meaning and is treated as a literal. In particular, note that `\u` and `\U`
escapes create literal characters. (In contrast, the Java compiler treats
Unicode escapes as just a way to represent arbitrary characters in an ASCII
source file, and any resulting characters are **not** tagged as literals.)
#### Whitespace
Whitespace (as defined by our API) is ignored unless it is quoted or
backslashed.
> :point_right: **Note**: *The rules for quoting and white space handling are common to most ICU APIs that
process rule or expression strings, including UnicodeSet, Transliteration and
Break Iterators.*
> :point_right: **Note**: *ICU Regular Expression set expressions have a different (but similar) syntax,
and a different set of recognized backslash escapes. \[Sets\] in ICU Regular
Expressions follow the conventions from Perl and Java regular expressions rather
than the pattern syntax from ICU UnicodeSet.*
## Using a UnicodeSet
For best performance, once the set contents are complete, freeze() the set to
make it immutable and to speed up contains() and span() operations (for which it
builds a small additional data structure).
The most basic operation is contains(code point) or, if relevant,
contains(string).
For splitting and partitioning strings, it is simpler and faster to use span()
and spanBack() rather than iterating over code points and calling contains(). In
Java, there is also a class UnicodeSetSpanner for somewhat higher-level
operations. See also the “Lookup” section of the [Properties](properties.md)
chapter.
## Programmatically Building UnicodeSets
ICU users can programmatically build a UnicodeSet by adding or removing ranges
of characters or by using the retain (intersection), remove (difference), and
add (union) operations.
## Property Values
The following property value variants are recognized:
| Format | Description | Example |
|--------|-----------------------------------------------------------------------------------------------------|-----------------------------------|
| short | omits the type (used to prevent ambiguity and only allowed with the Category and Script properties) | Lu |
| medium | uses an abbreviated type and value | gc=Lu |
| long | uses a full type and value | General_Category=Uppercase_Letter |
If the type or value is omitted, then the equals sign is also omitted. The short
style is only
used for Category and Script properties because these properties are very common
and their omission is unambiguous.
In actual practice, you can mix type names and values that are omitted,
abbreviated, or full. For example, if Category=Unassigned you could use what is
in the table explicitly, `\p{gc=Unassigned}`, `\p{Category=Cn}`, or
`\p{Unassigned}`.
When these are processed, case and whitespace are ignored so you may use them
for clarity, if desired. For example, `\p{Category = Uppercase Letter}` or
`\p{Category = uppercase letter}`.
For a list of supported properties, see the [Properties](properties.md) chapter.
## Getting UnicodeSet from Script
ICU provides the functionality of getting a UnicodeSet from a script. Here is an
example of generating a pattern from all the scripts associated with a locale,
and then building a UnicodeSet from the generated pattern.
**In C++:**

```cpp
UErrorCode err = U_ZERO_ERROR;
const int32_t capacity = 10;
const char* shortname = NULL;
int32_t num, j;
int32_t strLength = 4;
UChar32 c = 0x3096;
UScriptCode script[10] = {USCRIPT_INVALID_CODE};

num = uscript_getCode("ja", script, capacity, &err);
printf("%s %d\n", "Number of script codes associated:", num);

UnicodeString temp = UnicodeString("[", 1, US_INV);
UnicodeString pattern;
for (j = 0; j < num; j++) {
    shortname = uscript_getShortName(script[j]);
    UnicodeString str(shortname, strLength, US_INV);
    temp.append("[:");
    temp.append(str);
    temp.append(":]+");
}
pattern = temp.remove(temp.length() - 1, 1);
pattern.append("]");
UnicodeSet cnvSet(pattern, err);
printf("%d\n", cnvSet.size());
printf("%d\n", cnvSet.contains(c));
```
**In Java:**

```java
ULocale ul = new ULocale("ja");
int script[] = UScript.getCode(ul);
String str = "[";
for (int i = 0; i < script.length; i++) {
    str = str + "[:" + UScript.getShortName(script[i]) + ":]+";
}
String pattern = str.substring(0, str.length() - 1);
pattern = pattern + "]";
System.out.println(pattern);
UnicodeSet ucs = new UnicodeSet(pattern);
System.out.println(ucs.size());
System.out.println(ucs.contains(0x3096));
```
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# UText
## Overview
UText is a text abstraction facility for ICU.
The intent is to make it possible to extend ICU to work with text data that is
in formats above and beyond those that are native to ICU.
UText directly supports text in these formats:
1. UTF-8 (`char*`) strings
2. UTF-16 (`UChar*` or `UnicodeString`) strings
3. `Replaceable`
The ICU services that can accept UText based input are:
1. Regular Expressions
2. Break Iteration
Examples of text formats that UText could be extended to support:
1. UTF-32 format.
2. Text that is stored in discontiguous chunks in memory, or in application-specific representations.
3. Text that is in a non-Unicode code page
If ICU does not directly support a desired text format, it is possible for
application developers themselves to extend UText, and in that way gain the
ability to use their text with ICU.
## Using UText
There are three fairly distinct classes of use of UText. These are:
1. **Simple wrapping of existing text.** Application text data exists in a
format that is already supported by UText (such as UTF-8). The application
opens a UText on the data, and then passes the UText to an ICU service for
analysis/processing. Most use of UText from applications will follow this
simple pattern. Only a very few UText APIs and only a few lines of code are
required.
2. **Accessing the underlying text.** UText provides APIs for iterating over
the text in various ways, and for fetching individual code points from the
text. These functions will probably be used primarily from within ICU, in
the implementation of services that can accept input in the form of a UText.
While applications are certainly free to use these text access functions if
necessary, there may often be no need.
3. **UText support for new text storage formats.** If an application has text
data stored in a format that is not directly supported by ICU, extending
UText to support that format will provide the ability to conveniently use
those ICU services that support UText.
Extending UText to a new format is accomplished by implementing a well
defined set of *Text Provider Functions* for that format.
## UText compared with CharacterIterator
CharacterIterator is an abstract base class that defines a protocol for
accessing characters in a text-storage object. This class has methods for
iterating forward and backward over Unicode characters to return either the
individual Unicode characters or their corresponding index values.
UText and CharacterIterator both provide an abstraction for accessing text while
hiding details of the actual storage format. UText is the more flexible of the
two, however, with these advantages:
1. UText can conveniently operate on text stored in formats other than UTF-16.
2. UText includes functions for modifying or editing the text.
3. UText is more efficient. When iterating over a range of text using the
   CharacterIterator API, a function call is required for every character. With
   UText, iterating to the next character is usually done with a small amount
   of inline code.
At this time, more ICU services support CharacterIterator than UText. The ICU
services that can operate on text represented by a CharacterIterator are:
1. Normalizer
2. Break Iteration
3. String Search
4. Collation Element Iteration
## Example: Counting the Words in a UTF-8 String
Here is a function that uses UText and an ICU break iterator to count the number
of words in a nul-terminated UTF-8 string. The use of UText only adds two lines
of code over what a similar function operating on normal UTF-16 strings would
require.
```c
#include <assert.h>
#include "unicode/utypes.h"
#include "unicode/ubrk.h"
#include "unicode/utext.h"
int countWords(const char *utf8String) {
UText *ut = NULL;
UBreakIterator *bi = NULL;
int wordCount = 0;
UErrorCode status = U_ZERO_ERROR;
ut = utext_openUTF8(ut, utf8String, -1, &status);
bi = ubrk_open(UBRK_WORD, "en_us", NULL, 0, &status);
ubrk_setUText(bi, ut, &status);
while (ubrk_next(bi) != UBRK_DONE) {
if (ubrk_getRuleStatus(bi) != UBRK_WORD_NONE) {
/* Count only words and numbers, not spaces or punctuation */
wordCount++;
}
}
utext_close(ut);
ubrk_close(bi);
assert(U_SUCCESS(status));
return wordCount;
}
```
## UText API Functions
The UText API is declared in the ICU header file
[utext.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utext.h)
### Opening and Closing
Normal usage of UText by an application consists of opening a UText to wrap some
existing text, then passing the UText to ICU functions for processing. For this
kind of usage, all that is needed is the appropriate UText open and close
functions.
| Function | Description |
|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| `uext_openUChars` | Open a UText over a standard ICU (`UChar *`) string. The string consists of a UTF-16 array in memory, either nul terminated or with an explicit length. |
| `utext_openUnicodeString` | Open a UText over an instance of an ICU C++ `UnicodeString`. |
| `utext_openConstUnicodeString` | Open a UText over a read-only `UnicodeString`. Disallows UText APIs that modify the text. |
| `utext_openReplaceable` | Open a UText over an instance of an ICU C++ `Replaceable`. |
| `utext_openUTF8` | Open a UText over a UTF-8 encoded C string. May be either nul-terminated or have an explicit length. |
| `utext_close` | Close an open UText. Frees any allocated memory; required to prevent memory leaks. |
Here are some suggestions and techniques for efficient use of UText.
#### Minimizing Heap Usage
UText's open functions include features that allow applications to minimize the
number of heap memory allocations that will be needed. Specifically,
1. UText structs may be declared as local variables, that is, they may be stack
allocated rather than heap allocated.
2. Existing UText structs may be reused to refer to new text, avoiding the need
to allocate and initialize a new UText instance.
Minimizing heap allocations is important in code that has critical performance
requirements, and is doubly important for code that must scale well in
multithreaded, multiprocessor environments.
#### Stack Allocation
Here is code for stack-allocating a UText:
```c
UText myText = UTEXT_INITIALIZER;
utext_openUChars(&myText, ...
```
The first parameter to all `utext_open` functions is a pointer to a UText. If it
is non-null, the supplied UText will be used; if it is null, a new UText will be
heap allocated.
Stack allocated UText objects *must* be initialized with `UTEXT_INITIALIZER`. An
uninitialized instance will fail to open.
#### Heap Allocation
Here is code for creating a heap allocated UText:
```c
UText *mytext = utext_openUChars(NULL, ...
```
This is slightly smaller and more convenient to write than the stack allocated
code, and there is no reason not to use heap allocated UText objects in the vast
majority of code that does not have extreme performance constraints.
#### Reuse
To reuse an existing UText, simply pass it as the first parameter to any of the
UText open functions. There is no need to close the UText first, and it may
actually be more efficient not to close it first.
Here is an example of a function that iterates over an array of UTF-8 strings,
wrapping each in a UText and passing it off to another function. On the first
time through the loop the UText open function will heap allocate a UText. On
each subsequent iteration the existing UText will be reused.
```c
#include "unicode/utypes.h"
#include "unicode/utext.h"
void f(char **strings, int numStrings) {
UText *ut = NULL;
UErrorCode status;
int i;
for (i=0; i<numStrings; i++) {
status = U_ZERO_ERROR;
ut = utext_openUTF8(ut, strings[i], -1, &status);
assert(U_SUCCESS(status));
do_something(ut);
}
utext_close(ut);
}
```
#### Close
Closing a UText with `utext_close()` frees any storage associated with it, including the UText itself
for those that are heap allocated. Stack allocated UTexts should also be closed
because in some cases there may be additional heap allocated storage associated
with them, depending on the type of the underlying text storage.
## Accessing the Text
For accessing the underlying text, UText provides functions both for iterating
over the characters, and for direct random access by index. Here are the
conventions that apply for all of the access functions:
1. Access to individual characters is always by code point; that is, 32-bit
Unicode values are always returned. UTF-16 surrogate values from a surrogate
pair, like bytes from a UTF-8 sequence, are not separately visible.
2. Indexing always uses the index values from the original underlying text
storage, in whatever form it has. If the underlying storage is UTF-8, the
indexes will be UTF-8 byte indexes, not UTF-16 offsets.
3. Indexes always refer to the first position of a character. This is
equivalent to saying that indexes always lie at the boundary between
characters. If an index supplied to a UText function refers to the 2<sup>nd</sup>
through the N<sup>th</sup> positions of a multi-byte or multi-code-unit character, the
index will be normalized back to the first or lowest index.
4. An input index that is greater than the length of the text will be set to
refer to the end of the string, and will not generate an out-of-bounds error.
This is similar to the indexing behavior in the UnicodeString class.
5. Iteration uses post-increment and pre-decrement conventions. That is,
`utext_next32()` fetches the code point at the current index, then leaves the
index pointing at the next character.
Here are the functions for accessing the actual text data represented by a
UText. The primary use of these functions will be in the implementation of ICU
services that accept input in the form of a UText, although application code may
also use them if the need arises.
For more detailed descriptions of each, see the API reference.
| Function | Description |
|-------------------------|------------------------------------------------------------------------------------------------------------|
| `utext_nativeLength` | Get the length of the text string in terms of the underlying native storage units (bytes for UTF-8, for example). |
| `utext_isLengthExpensive` | Indicate whether determining the length of the string would require scanning the string. |
| `utext_char32At` | Get the code point at the specified index. |
| `utext_current32` | Get the code point at the current iteration position. Does not advance the position. |
| `utext_next32` | Get the next code point, iterating forwards. |
| `utext_previous32` | Get the previous code point, iterating backwards. |
| `utext_next32From` | Begin a forwards iteration at a specified index. |
| `utext_previous32From` | Begin a reverse iteration at a specified index. |
| `utext_getNativeIndex` | Get the current iteration index. |
| `utext_setNativeIndex` | Set the iteration index. |
| `utext_moveIndex32` | Move the current index forwards or backwards by the specified number of code points. |
| `utext_extract` | Retrieve a range of text, placing it into a UTF-16 buffer. |
| `UTEXT_NEXT32` | Inline (high-performance) version of `utext_next32`. |
| `UTEXT_PREVIOUS32` | Inline (high-performance) version of `utext_previous32`. |
## Modifying the Text
UText provides API for modifying or editing the text.
| Function | Description |
|---------------------|----------------------------------------------------------------------------------------------------|
| `utext_replace` | Replace a range of the original text with a replacement string. |
| `utext_copy` | Copy or Move a range of the text to a new position. |
| `utext_isWritable` | Test whether a UText supports writing operations. |
| `utext_hasMetaData` | Test whether the text includes metadata. See the class `Replaceable` for more information on metadata. |
Certain conventions must be followed when modifying text using these functions:
1. Not all types of UText can support modifying the data. Code working with
UText instances of unknown origin should check `utext_isWritable()` first, and
be prepared to deal with failures.
2. There must be only one UText open onto the underlying string that is being
modified. (Strings that are not being modified can be the target of any
number of UTexts at the same time.) The existence of a second UText that
refers to a string that is being modified is not a situation that is
detected by the implementation. The application code must be structured to
avoid the situation.
#### Cloning
UText instances may be cloned. The clone function,
```c
UText * utext_clone(UText *dest,
const UText *src,
UBool deep,
UBool readOnly,
UErrorCode *status)
```
behaves very much like the UText open functions, with the source of the text
being another UText rather than some other form of a string.
A *shallow* clone creates a new UText that maintains its own iteration state,
but does not clone the underlying text itself.
A *deep* clone copies the underlying text in addition to the UText state. This
would be appropriate if you wished to modify the text without the changes being
reflected back to the original source string. Not all text providers support
deep clone, so checking for error status returns from `utext_clone()` is
important.
#### Thread Safety
UText follows the usual ICU conventions for thread safety: concurrent calls to
functions accessing the same non-const UText are not supported. If concurrent
access to the text is required, the UText can be cloned, allowing each thread
access via a separate UText. So long as the underlying text is not being
modified, a shallow clone is sufficient.
## Text Providers
A *text provider* is a set of functions that let UText support a specific text
storage format.
ICU includes several UText text provider implementations, and applications can
provide additional ones if needed.
To implement a new UText text provider, it is necessary to have an understanding
of how UText is designed.
Underneath the covers, UText is a struct that includes:
1. A pointer to a *Text Chunk*, which is a UTF-16 buffer containing a section
(or all) of the text being referenced.
For text sources whose native format
is UTF-16, the chunk description can refer directly to the original text
data. For non-UTF-16 sources, the chunk will refer to a side buffer
containing some range of the text that has been converted to UTF-16 format.
2. The iteration position, as a UTF-16 offset within the chunk.
If a text access function (one of those described in the previous section) can
complete its operation based on the information maintained in the UText struct,
it will. If not, it calls out to one of the provider functions (below) to do
the work, or to update the UText.
The best way to really understand what is required of a UText provider is to
study the implementations that are included with ICU, and to borrow as much as
possible.
Here is the list of text provider functions.
| Function | Description |
|----------------------------|----------------------------------------------------------------------------------------------------|
| `UTextAccess` | Set up the Text Chunk associated with this UText so that it includes a requested index position. |
| `UTextNativeLength` | Return the full length of the text. |
| `UTextClone` | Clone the UText. |
| `UTextExtract` | Extract a range of text into a caller-supplied buffer |
| `UTextReplace` | Replace a range of text with a caller-supplied replacement. May expand or shrink the overall text. |
| `UTextCopy` | Move or copy a range of text to a new position. |
| `UTextMapOffsetToNative` | Within the current text chunk, translate a UTF-16 buffer offset to an absolute native index. |
| `UTextMapNativeIndexToUTF16` | Translate an absolute native index to a UTF-16 buffer offset within the current text. |
| `UTextClose` | Provider specific close. Free storage as required. |
Not every provider type requires all of the functions. If the text type is
read-only, no implementation for Replace or Copy is required. If the text is in
UTF-16 format, no implementation of the native to UTF-16 index conversions is
required.
To fully understand what is required to support a new string type with UText, it
will be necessary to study both the provider function declarations from
[utext.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/utext.h)
and the existing text provider implementations in
[utext.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/utext.cpp).
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# UTF-8
*Note: This page is only relevant for C/C++. In Java, all strings are encoded in
UTF-16, except for conversion from bytes to strings (via InputStreamReader or
similar) and from strings to bytes (OutputStreamWriter etc.).*
While most of ICU works with UTF-16 strings and uses data structures optimized
for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
for UTF-8, or work with Unicode code points (21-bit integer values) regardless
of string encoding. Some data structures are designed to work equally well with
UTF-16 and UTF-8.
For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1
means NUL-terminated, output is NUL-terminated if there is space, output
overflow is handled with preflighting; for details see the parent [Strings
page](index.md).) Some newer APIs take an `icu::StringPiece` argument and write
to an `icu::ByteSink` or to a string class object like `std::string`.
## Conversion Between UTF-8 and UTF-16
The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
`icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and
`toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.
In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and
`u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and
`u_strFromUTF8Lenient()`.)
The conversion functions in `unicode/ucnv.h` are intended for very flexible
handling of conversion to/from external byte streams (with customizable error
handling and support for split buffers at arbitrary boundaries) which is
normally unnecessary for internal strings.
Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods
which take either a converter object or a charset name. These can be used for
UTF-8, but are not as efficient or convenient as the
`fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among
conversion methods, APIs with a charset name are more convenient but internally
open and close a converter; ones with a converter object parameter avoid this.)
## UTF-8 as Default Charset
ICU has many functions that take or return `char *` strings that are assumed to
be in the default charset which should match the system encoding. Since this
could be one of many charsets, and the charset can be different for different
processes on the same system, ICU uses its conversion framework for converting
to and from UTF-16.
If it is known that the default charset is always UTF-8 on the target platform,
then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`.
(For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
as a compiler flag.) This will change most of the implementation code to use
dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
conversion framework. (Avoiding such dependencies helps with statically linked
libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even
`UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)
## Low-Level UTF-8 String Operations
`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
macros in `unicode/utf16.h`. The macros handle many cases inline, but call
internal functions for complicated parts of the UTF-8 encoding form. For
example, the following code snippet counts white space characters in a string:
```C
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"
int32_t countWhiteSpace(StringPiece sp) {
const char *s=sp.data();
int32_t length=sp.length();
int32_t count=0;
for(int32_t i=0; i<length;) {
UChar32 c;
U8_NEXT(s, i, length, c);
if(u_isUWhiteSpace(c)) {
++count;
}
}
return count;
}
```
## Dedicated UTF-8 APIs
ICU has some APIs dedicated for UTF-8. They tend to have been added for "worker
functions" like comparing strings, to avoid the string conversion overhead,
rather than for "builder functions" like factory methods and attribute setters.
For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
incrementally, without converting all of the two strings to UTF-16 if there is
an early base letter difference.
`ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most
other charsets uses a dedicated, optimized code path, avoiding the pivot through
UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well,
but that has not been implemented yet as of ICU 4.4.)
Other examples: (This list may or may not be complete.)
* `ucasemap_utf8ToLower()`, `ucasemap_utf8ToUpper()`, `ucasemap_utf8ToTitle()`,
`ucasemap_utf8FoldCase()`
* `ucnvsel_selectForUTF8()`
* `icu::UnicodeSet::spanUTF8()`, `spanBackUTF8()` and `uset_spanUTF8()`,
`uset_spanBackUTF8()` (These are highly optimized for UTF-8 processing.)
* `ures_getUTF8String()`, `ures_getUTF8StringByIndex()`, `ures_getUTF8StringByKey()`
* `uspoof_checkUTF8()`, `uspoof_areConfusableUTF8()`, `uspoof_getSkeletonUTF8()`
## Abstract Text APIs
ICU offers several interfaces for text access, designed for different use cases.
(Some interfaces are simply newer and more modern than others.) Some ICU
services work with some of these interfaces, and for some of these interfaces
ICU offers UTF-8 implementations out of the box.
`UText` can be used with `BreakIterator` APIs (character/word/sentence/...
segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
string.
* *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
other charset with non-1:1 index conversion to UTF-16) if no dictionary is
supported. This excludes Thai word break. See [ticket
#5532](http://bugs.icu-project.org/trac/ticket/5532). No fix is currently
scheduled.*
* *As a workaround for Thai word breaking, you can convert the string to
UTF-16 and convert indexes to UTF-8 string indexes via
`u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index)`.*
* *ICU 4.4 has a technology preview for UText in the regular expression API,
but some of the UText regex API and semantics are likely to change for ICU
4.6. (Especially indexing semantics.)*
A `UCharIterator` can be used with several collation APIs (although there is
also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`.
`uiter_setUTF8()` creates a UCharIterator for a UTF-8 string.
It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
but `CharacterIterator` has a lot of virtual methods and it requires UTF-16
string index semantics.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# BiDi Algorithm
## Overview
Bidirectional text consists of mainly right-to-left text with some left-to-right
nested segments (such as an Arabic text with some information in English), or
vice versa (such as an English letter with a Hebrew address nested within it.)
The predominant direction is called the global orientation.
Languages involving bidirectional text are used mainly in the Middle East. They
include Arabic, Urdu, Farsi, Hebrew, and Yiddish.
In such a language, the general flow of text proceeds horizontally from right to
left, but numbers are written from left to right, the same way as they are
written in English. In addition, if some text (addresses, acronyms, or
quotations) in English or another left-to-right language is embedded, it is also
written from left to right.
*Libraries that perform a bidirectional algorithm and reorder strings
accordingly are sometimes called "Storage Layout Engines". ICU's BiDi (`ubidi.h`)
and shaping (`ushape.h`) APIs can be used at the core of such "Storage Layout
Engines".*
## Countries with Languages that Require Bidirectional Scripting
There are over 300 million people who depend on bidirectional scripts, including
Farsi and Urdu, which share the same script as Arabic but have additional
characters.
| Language | Number of Countries |
|----------|------------------------------------------------------|
| Arabic | 18 |
| Farsi | 1 (Iran) |
| Urdu | 2 (India, Pakistan) |
| Hebrew | 1 (Israel) |
| Yiddish | Israel, North America, South America, Russia, Europe |
## Logical Order versus Visual Order
When reading bidirectional text, whenever the eye of the experienced reader
encounters an embedded segment, it "automatically" jumps to the other end of the
segment and reads it in the opposite direction. The sequence in which the
characters are pronounced is thus a logical sequence which differs from the
visual sequence in which they are presented on the screen or page.
The logical order of bidirectional text is also the order in which it is usually
keyed, and in which it is stored in memory.
Consider the following example, where Arabic or Hebrew letters are represented
by uppercase English letters and English text is represented by lowercase
letters:
```
english CIBARA text
```
The English letter h is visually followed by the Arabic letter C, but logically
h is followed by the rightmost letter A. The next letter, in logical order, will
be R. In other words, the logical and storage order of the same text would be:
```
english ARABIC text
```
Text is stored and processed in logical order to make processing feasible: A
contiguous substring of logical-order text (e.g., from a copy&paste operation)
contains a logically contiguous piece of the text. For example, "ish ARA" is a
logically contiguous piece of the sample text above. By contrast, a contiguous
substring of visual-order text may contain pieces of the text from distant parts
of a paragraph. ("ish" and "CIB" from the sample text above are not logically
adjacent.) Sorting and searching in text (establishing lexical order among
strings) as well as any other kind of context-sensitive text analysis also rely
on the storage of text in logical order because such processing must match user
expectations.
When text is displayed or printed, it must be "reordered" into visual order with
some parts of the text laid out left-to-right, and other parts laid out
right-to-left. The Unicode standard specifies an algorithm for this
logical-to-visual reordering. It always works on a paragraph as a whole; the
actual positioning of the text on the screen or paper must then take line breaks
into account, based on the output of the bidirectional algorithm. The reordering
output is also used for cursor movement and selection.
Legacy systems frequently stored text in visual order to avoid reordering for
display. When exchanging data with such systems for processing in Unicode it is
necessary to reorder the data from visual order to logical order and back. Such
not-for-display transformations are sometimes referred to as "storage layout"
transformations.
There are two problems with an "inverse reordering" from visual to logical order:
There may be more than one logical order of text that results in the same
display (logical-to-visual reordering is a many-to-one function), and there is
no standard algorithm for it. ICU's BiDi API provides a setting for "inverse"
operation that modifies the standard Unicode Bidi algorithm. However, it may not
always produce the expected results. Bidirectional data should be converted to
Unicode and reordered to logical order only once to avoid roundtrip losses. Just
as it is best to never convert to non-Unicode charsets, data should not be
reordered from logical to visual order except for display and printing.
## References
ICU provides an implementation of the Unicode BiDi algorithm, as well as simple
functions to write a reordered version of the string using the generated
metadata. An "inverse" flag can be set to **approximate** visual-to-logical
reordering. See the `ubidi.h` header file and the [BiDi API
References](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html).
See [Unicode Standard Annex #9: The Bidirectional
Algorithm](http://www.unicode.org/unicode/reports/tr9/).
## Programming Examples in C and C++
See the [BiDi API reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ubidi_8h.html)
for more information.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Case Mappings
## Overview
Case mapping is used to handle the mapping of upper-case, lower-case, and title
case characters for a given language. Case is a normative property of characters
in specific alphabets (e.g. Latin, Greek, Cyrillic, Armenian, and Georgian)
whereby characters are considered to be variants of a single letter. ICU refers
to these variants, which may differ markedly in shape and size, as uppercase
letters (also known as capital or majuscule) and lower-case letters (also known
as small or minuscule). Alphabets with case differences are called bicameral and
alphabets without case differences are called unicameral.
Due to the inclusion of certain composite characters for compatibility, such as
the Latin capital letter 'DZ' (\\u01F1 'DZ'), there is a third case called title
case. Title case is used to capitalize the first character of a word such as the
Latin capital letter 'D' with small letter 'z' ( \\u01F2 'Dz'). The term "title
case" can also be used to refer to words whose first letter is an uppercase or
title case letter and the rest are lowercase letters. However, not all words in
the title of a document or first words in a sentence will be title case. The use
of title case words is language dependent. For example, in English, "Taming of
the Shrew" would be the appropriate capitalization and not "Taming Of The
Shrew".
> :point_right: **Note**: *As of Unicode 11, Georgian now has Mkhedruli (lowercase) and Mtavruli
(uppercase) which form case pairs, but are not used in title case.*
Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ustring/ustring.cpp).
Please refer to the following sections in the [The Unicode Standard](http://www.unicode.org/versions/latest/)
for more information about case mapping:
* 3.13 Default Case Algorithms
* 4.2 Case
* 5.18 Case Mappings
## Simple (Single-Character) Case Mapping
The general case mapping in ICU is non-language based and a 1 to 1 generic
character map.
A character is considered to have a lowercase, uppercase, or title case
equivalent if there is a respective "simple" case mapping specified for the
character in the [Unicode Character Database](http://unicode.org/ucd/) (UnicodeData.txt).
If a character has no mapping equivalent, the result is the character itself.
The APIs provided for the general case mapping, located in the `uchar.h` file,
handle only single characters of type `UChar32` and return only single
characters. To convert a string to a non-language-based specific case, use the
APIs in either the `unistr.h` or `ustring.h` files with a `NULL` locale argument.
## Full (Language-Specific) Case Mapping
There are different case mappings for different locales. For instance, unlike
English, the character Latin small letter 'i' in Turkish has an equivalent Latin
capital letter 'I' with dot above ( \\u0130 'İ').
Similar to the simple case mapping API, a character is considered to have a
lowercase, uppercase or title case equivalent if there is a respective mapping
specified for the character in the Unicode Character database (UnicodeData.txt).
In the case where a character has no mapping equivalent, the result is the
character itself.
To convert a string to a language-specific case, use the APIs in `ustring.h`
and `unistr.h` with the intended locale as an argument.
ICU implements full Unicode string case mappings.
**In general, case mapping:**
* **can change the number of code points and/or code units of a string,**
* **is language-sensitive (results may differ depending on language), and**
* **is context-sensitive (a character in the input string may map differently
depending on surrounding characters).**
## Case Folding
Case folding maps strings to a canonical form where case differences are erased.
Using the case folding API, ICU supports fast matches without regard to case in
lookups, since only binary comparison is required.
The CaseFolding.txt file in the Unicode Character Database is used for
performing locale-independent case folding. This text file is generated from the
case mappings in the Unicode Character Database, using both the single-character
and the multi-character mappings. The CaseFolding.txt file transforms all
characters having different case forms into a common form. To compare two
strings for non-case-sensitive matching, you can transform each string and then
use a binary comparison. There are also functions to compare two strings
case-insensitively using the same case folding data.
Unicode case folding is not context-sensitive. It is also not
language-sensitive, although there is a flag for whether to apply special
mappings for use with Turkic (Turkish/Azerbaijani) text data.
Case folding API implementations are located in:
1. `uchar.h` for single character folding
2. `ustring.h` and `unistr.h` for character string folding.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Transform Rule Tutorial
This tutorial describes the process of building a custom transform based on a
set of rules. The tutorial does not describe, in detail, the features of
transform; instead, it explains the process of building rules and describes the
features needed to perform different tasks. The focus is on building a script
transform since this process provides concrete examples that incorporate most
of the rules.
## Script Transliterators
The first task in building a script transform is to determine which system of
transliteration to use as a model. There are dozens of different systems for
each language and script.
The International Organization for Standardization
([ISO](http://www.elot.gr/tc46sc2/)) uses a strict definition of
transliteration, which requires it to be reversible. Although the goal for ICU
script transforms is to be reversible, they do not have to adhere to this
definition. In general, most transliteration systems in use are not reversible.
This tutorial will describe the process for building a reversible transform
since it illustrates more of the issues involved in the rules. (For guidelines
on building transforms, see "Guidelines for Designing Script Transliterations"
(§) in the [General Transforms](index.md) chapter. For external sources for
script transforms, see Script Transliterator Sources (§) in that same chapter.)
> :point_right: **Note**: *See [Properties and ICU Rule Syntax](../../strings/properties.md) for
information regarding syntax characters.*
In this example, we start with a set of rules for Greek since they provide a
real example based on mathematics. We will use the rules that do not involve the
pronunciation of Modern Greek; instead, we will use rules that correspond to the
way that Greek words were incorporated into the English language. For example,
we will transliterate "Βιολογία-Φυσιολογία" as "Biología-Physiología", not as
"Violohía-Fisiolohía". To illustrate some of the trickier cases, we will also
transliterate the Greek accents that are no longer in use in modern Greek.
> :point_right: **Note**: *Some of the characters may not be visible on the screen unless you have a
Unicode font with all the Greek letters. If you have a licensed copy of
Microsoft® Office, you can use the "Arial Unicode MS" font, or you can download
the [CODE2000](http://www.code2000.net/) font for free. For more information,
see [Display Problems?](http://www.unicode.org/help/display_problems.html) on
the Unicode web site.*
We will also verify that every Latin letter maps to a Greek letter. This ensures
that, when we reverse the transliteration, the process can handle all the Latin
letters.
> :point_right: **Note**: *This direction is not reversible. The following table illustrates this
situation:*
| Direction | Reversibility | Example |
|---------------|------------------------------|------------|
| Source→Target | Reversible | φ → ph → φ |
| Target→Source | Not (necessarily) reversible | f → φ → ph |
## Basics
In non-complex cases, we have a one-to-one relationship between letters in both
Greek and Latin. These rules map between a source string and a target string.
The following shows this relationship:
```
π <> p;
```
This rule states that when you transliterate from Greek to Latin, convert π to p
and when you transliterate from Latin to Greek, convert p to π. The syntax is
```
string1 <> string2 ;
```
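Conceptually, a batch of such simple two-way rules behaves like a pair of lookup tables, one per direction. The following is a minimal, hypothetical Python sketch of that idea (it is not how ICU implements transforms; the table and function names are illustrative only):

```python
# Illustrative sketch only: one table entry per "x <> y;" rule.
GREEK_TO_LATIN = {"π": "p", "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e"}
LATIN_TO_GREEK = {latin: greek for greek, latin in GREEK_TO_LATIN.items()}

def to_latin(text):
    # Replace each Greek letter with its Latin counterpart; pass others through.
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)

def to_greek(text):
    # The reverse direction simply uses the inverted table.
    return "".join(LATIN_TO_GREEK.get(ch, ch) for ch in text)
```
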
We will start by adding a whole batch of simple mappings. These mappings will
not work yet, but we will start with them. For now, we will not use the
uppercase versions of characters.
```
# One to One Mappings
α <> a;
β <> b;
γ <> g;
δ <> d;
ε <> e;
```
We will also add rules for completeness. These provide fallback mappings for
Latin characters that do not normally result from transliterating Greek
characters.
```
# Completeness Mappings
κ < c;
κ < q;
```
## Context and Range
We have completed the simple one-to-one mappings and the rules for completeness.
The next step is to look at the characters in context. In Greek, for example,
the transform converts a "γ" to an "n" if it comes before any of the following
characters: γ, κ, ξ, or χ. Otherwise the transform converts it to a "g". The
following lists all of the possibilities:
```
γγ > ng;
γκ > nk;
γξ > nx;
γχ > nch;
γ > g;
```
All the rules are evaluated in the order they are listed. The transform will
first try to match the first four rules. If all of these rules fail, it will use
the last one.
However, this method quickly becomes tiresome when you consider all the possible
uppercase and lowercase combinations. An alternative is to use two additional
features: context and range.
### Context
First, we will consider the impact of context on a transform. We already have
rules for converting γ, κ, ξ, and χ. We must consider how to convert the γ
character when it is followed by γ, κ, ξ, and χ. Otherwise we must permit
those characters to be converted using their specific rules. This is done with
the following:
```
γ } γ > n;
γ } κ > n;
γ } ξ > n;
γ } χ > n;
γ > g;
```
The curly brace separates the character to be converted from its context. The
context must be present for the rule to match, but it is not itself converted;
text after a right curly brace (}) is the following context. For example, given
the sequence γγ, the transform converts the first γ into an "n" using the first
rule. The second γ, which served only as context, is then matched afresh and
converted into a "g" by the last rule, giving "ng". Likewise, γκ becomes "nk".
### Range
Using context, we have the same number of rules. But, by using range, we can
collapse the first four rules into one. The following shows how we can use
range:
```
γ } [γκξχ] > n;
γ > g;
```
A list of characters within square brackets will match any one of the
characters. We can then add the uppercase variants for completeness, to get:
γ } [ΓΚΞΧγκξχ] > n;
γ > g;
Remember that we can use spaces for clarity. We can also write this rule as the
following:
```
γ } [ Γ Κ Ξ Χ γ κ ξ χ ] > n ;
γ > g ;
```
If a range of characters happens to have adjacent code numbers, we can just use
a hyphen to abbreviate it. For example, instead of writing `[a b c d e f g m n o]`,
we can simplify the range by writing `[a-g m-o]`.
## Styled Text
Another reason to use context is that transforms can convert styled text. When
transforms convert styled text, they copy the styles of the source text to the
target text. However, styles can only be carried across whole replacements,
since it is impossible to know how any boundaries within the replaced source
text correspond to the target text. Context helps here, because characters
matched only as context are not part of the replacement.

For example, suppose that we were to convert "γγ" to "ng". By using context, if
there is a different style on the first gamma than on the second (such as font,
size, or color), then that style difference is preserved in the resulting two
characters. That is, the "n" will have the style of the first gamma, while the
"g" will have the style of the second gamma.
> :point_right: **Note**: *Contexts preserve the styles at a much finer granularity.*
## Case
When converting from Greek to Latin, we can just convert "θ" to and from "th".
But what happens with the uppercase theta (Θ)? Sometimes we need to convert it
to uppercase "TH", and sometimes to uppercase "T" and lowercase "h". We can
choose between these based on the letters before and afterwards. If there is a
lowercase letter after an uppercase letter, we can choose "Th", otherwise we
will use "TH".
We could manually list all the lowercase letters, but we also can use ranges.
Ranges not only list characters explicitly, but they also give you access to all
the characters that have a given Unicode property. Although the abbreviations
are a bit arcane, we can specify common sets of characters such as all the
uppercase letters. The following example shows how case and range can be used
together:
```
Θ } [:LowercaseLetter:] <> Th;
Θ <> TH;
```
The example allows words like Θεολογικές to map to Theologikés and not
THeologikés.
> :point_right: **Note**: *You can either specify properties with the POSIX-style syntax, such as
`[:LowercaseLetter:]`, or with the Perl-style syntax, such as
`\p{LowercaseLetter}`.*
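The effect of the two Θ rules can be mimicked with a short Python sketch that checks whether the character after Θ is lowercase (purely illustrative; ICU evaluates the real `[:LowercaseLetter:]` property, not `str.islower`):

```python
import re

def convert_theta(text):
    # "Th" before a lowercase letter, "TH" otherwise (including end of text).
    def repl(m):
        return "Th" if m.group(1).islower() else "TH"
    # The lookahead captures the following character without consuming it.
    return re.sub(r"Θ(?=(.?))", repl, text)
```
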
## Properties and Values
A Greek sigma is written as "ς" if it is at the end of a word (but not
completely separate) and as "σ" otherwise. When we convert characters from Greek
to Latin, this is not a problem. However, it is a problem when we convert the
character back to Greek from Latin. We need to convert an s depending on the
context. While we could list all the possible letters in a range, we can also
use a character property. Although the range `[:Letter:]` stands for all
letters, we really want all the characters that aren't letters. To accomplish
this, we can use a negated range: `[:^Letter:]`. The following shows a negated
range:
```
σ < [:^Letter:] { s } [:^Letter:] ;
ς < s } [:^Letter:] ;
σ < s ;
```
These rules state that if an "s" is surrounded by non-letters, convert it to
"σ". Otherwise, if the "s" is followed by a non-letter, convert it to "ς". If
all else fails, convert it to "σ".
> :point_right: **Note**: *Negated ranges [^...] will match at the beginning and the end of a string.
This makes the rules much easier to write.*
To make the rules clearer, you can use variables. Instead of the example above,
we can write the following:
```
$nonletter = [:^Letter:] ;
σ < $nonletter { s } $nonletter ;
ς < s } $nonletter ;
σ < s ;
```
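The three sigma rules can be approximated in Python with ordered regular expressions, using lookaround in place of the `$nonletter` contexts (an illustrative sketch, not ICU's matching engine; `[^\W\d_]` stands in for `[:Letter:]`):

```python
import re

def latin_s_to_greek(text):
    # Rule 1: s surrounded by non-letters (or string edges) becomes σ.
    text = re.sub(r"(?<![^\W\d_])s(?![^\W\d_])", "σ", text)
    # Rule 2: remaining s followed by a non-letter (or end) becomes final ς.
    text = re.sub(r"s(?![^\W\d_])", "ς", text)
    # Rule 3: anything left becomes σ.
    return text.replace("s", "σ")
```
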
There are many more properties available that can be used in combination. The
following table lists some examples:
| Combination | Example | Description: All code points that are: |
|----------------|--------------------------|--------------------------------------------|
| Union | [[:Greek:] [:letter:]] | either in the Greek script, or are letters |
| Intersection | [[:Greek:] & [:letter:]] | are both Greek and letters |
| Set Difference | [[:Greek:] - [:letter:]] | are Greek but not letters |
| Complement | [^[:Greek:] [:letter:]] | are neither Greek nor letters |
For more on properties, see the [UnicodeSet](../../strings/unicodeset.md) and
[Properties](../../strings/properties.md) chapters.
## Repetition
Elements in a rule can also repeat. For example, in the following rules, the
transform converts an iota-subscript into a capital I if the preceding base
letter is an uppercase character. Otherwise, the transform converts the
iota-subscript into a lowercase character.
```
[:Uppercase Letter:] { ͅ } > I;
ͅ > i;
```
However, this is not sufficient, since the base letter may be optionally
followed by non-spacing marks. To capture that, we can use the \* syntax, which
means repeat zero or more times. The following shows this syntax:
```
[:Uppercase Letter:] [:Nonspacing Mark:] * { ͅ } > I ;
ͅ > i ;
```
The following operators can be used for repetition:
| Operator | Meaning          |
|----------|------------------|
| X*       | zero or more X's |
| X+       | one or more X's  |
| X?       | zero or one X    |
We can also use these operators as sequences with parentheses for grouping. For
example, "a ( b c ) \* d" will match against "ad" or "abcd" or "abcbcd".
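These repetition operators correspond directly to regular-expression quantifiers, so the grouped example can be checked quickly in Python (illustrative only):

```python
import re

# "a ( b c ) * d" matches "ad", "abcd", "abcbcd", and so on.
pattern = re.compile(r"a(?:bc)*d")

matches = [bool(pattern.fullmatch(s)) for s in ("ad", "abcd", "abcbcd", "abd")]
```
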
*Currently, any repetition will cause the sequence to match as many times as
possible, even if that causes the rest of the rule to fail. For example, suppose
we have the (contrived) rules `a [:Letter:]* { e > æ ;` followed by `e > é ;`.*

*The intent was to transform a sequence like "able blue" into "ablæ blué". The
rules do not work as intended: they produce "ablé blué". The problem is that
when the left side of the first rule is matched against the text, the
`[:Letter:]*` greedily matches backward through all of the letters "abl", so
there is no "a" left to match and the rule fails. To have it match properly, we
must subtract the 'a', as in `a [[:Letter:]-[a]]* { e > æ ;`.*
## Æther
The start and end of a string are treated specially. Essentially, characters off
the end of the string are handled as if they were the noncharacter \\uFFFF,
which is called "æther". (The code point \\uFFFF will never occur in any valid
Unicode text.) In particular, a negated Unicode set will generally also match
against the start/end of a string. For example, the following rule will execute
on the first **a** in a string, as well as an **a** that is actually preceded by
a non-letter.
| Rule | [:^L:] { a > b ; |
|---------|------------------|
| Source | a xa a |
| Results | b xa b |
This is because \\uFFFF is an element of `[:^L:]`, which includes all codepoints
that do not represent letters. To refer explicitly to æther, you can use a **$**
at the end of a range, such as in the following rules:
| Rules | [0-9$] { a > b ; a } [0-9$] > b ;|
|------------------|------------------|
| Source | a 5a a |
| Results | b 5b a |
In these rules, an **a** before or after a number -- or at the start or end of a
string -- will be matched. (You could also use \\uFFFF explicitly, but the $ is
recommended).
Thus to disallow a match against æther in a negation, you need to add the $ to
the list of negated items. For example, the first rule and results from above
would change to the following (notice that the first a is not replaced):
| Rule | [^[:L:]$] { a > b ; |
|---------|---------------------|
| Source | a xa a |
| Results | a xa b |
> :point_right: **Note**: *Characters that are outside the context limits -- contextStart to contextEnd -- are also treated as
æther.*
The property `[:any:]` can be used to match all code points, including æther.
Thus the following are equivalent:
| Rule1 | [\u0000-\U0010FFFF] { a > A ; |
|-------|-------------------------------|
| Rule2 | [:any:] { a > A ; |
However, since the transform is always greedy with no backup, this property is
not very useful in practice. What is more often required is dealing with the end
of lines. If you want to match the start or end of a line, then you can define a
variable that includes all the line separator characters, and then use it in the
context of your rules. For example:
| Rules | $break = [[:Zp:][:Zl:] \u000A-\u000D \u0085 $] ; $break { a > A ;|
|------------------|--------------------------------------------------|
| Source | a a a a |
| Results | A a A a |
There is also a special character, the period (.), that is equivalent to the
**negation** of the $break variable we defined above. It can be used to match
any characters excluding those for linebreaks or æther. However, it cannot be
used within a range: you can't have `[[.] - \u000A]`, for example. If you
want to have different behavior you can define your own variables and use them
instead of the period.
> :point_right: **Note**: *There are a few other special escapes that can be used in ranges. These are
listed in the table below. However, instead of the latter two, it is safest to
use the above $break definition since it works for line endings across different
platforms.*
| Escape | Meaning | Code |
|--------|-----------------|--------|
| \t | Tab | \u0009 |
| \n | Linefeed | \u000A |
| \r | Carriage Return | \u000D |
## Accents
We could handle each accented character by itself with rules such as the
following:
```
ά > á;
έ > é;
...
```
This procedure is very complicated when we consider all the possible
combinations of accents and the fact that the text might not be normalized. In
ICU 1.8, we can add other transforms as rules either before or after all the
other rules. We then can modify the rules to the following:
```
:: NFD (NFC) ;
α <> a;
...
ω <> ō;
:: NFC (NFD);
```
These modified rules first separate accents from their base characters and then
put them in a canonical order. We can then deal with the individual components,
as desired. The `:: NFC (NFD);` at the end puts the entire result into standard
canonical form. The inverse uses the transform rules in reverse order, so the
(NFD) goes at the bottom and the (NFC) at the top.
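The NFD/NFC "sandwich" can be imitated in Python with `unicodedata` to show why it simplifies the rules: after NFD, accents are separate characters, so only the base letters need mappings (an illustrative sketch with a tiny sample table, not the full Greek-Latin transform):

```python
import unicodedata

BASE_MAP = {"α": "a", "ε": "e"}  # a tiny sample of the one-to-one rules

def greek_to_latin(text):
    decomposed = unicodedata.normalize("NFD", text)      # like :: NFD ;
    # With accents split off, only base letters need entries in the table.
    mapped = "".join(BASE_MAP.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFC", mapped)          # like :: NFC ;
```
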
A global filter can also be used with the transform rules. The following example
shows a filter used in the rules:
```
:: [[:Greek:][:Inherited:]];
:: NFD (NFC) ;
α <> a;
...
ω <> ō;
:: NFC (NFD);
:: ([[:Latin:][:Inherited:]]);
```
The global filter causes any other characters to be unaffected. In particular,
the NFD then applies only to Greek characters and accents, leaving all other
characters untouched.
## Disambiguation
If the transliteration is to be completely reversible, what would happen if we
happened to have the Greek combination νγ? Because ν converts to n, both νγ and
γγ convert to "ng" and we have an ambiguity. Normally, this sequence does not
occur in the Greek language. However, for consistency, and especially to aid
in mechanical testing, we must consider this situation. (There are other cases
in this and other languages where both sequences occur.)
To resolve this ambiguity, use the mechanism recommended by the Japanese and
Korean transliteration standards by inserting an apostrophe or hyphen to
disambiguate the results. We can add a rule like the following that inserts an
apostrophe after an "n" if we need to reverse the transliteration process:
```
ν } [ΓΚΞΧγκξχ] > n\';
```
In ICU, there are several of these mechanisms for the Greek rules. The ICU rules
undergo some fairly rigorous mechanical testing to ensure reversibility. Adding
these disambiguation rules ensures that the rules can pass these tests and
handle all possible sequences of characters correctly.
There are some character forms that never occur in normal context. By
convention, we use tilde (\~) for such cases to allow for reverse
transliteration. Thus, if you had the text "Θεολογικές (ς)", it would
transliterate to "Theologikés (\~s)". Using the tilde allows the reverse
transliteration to detect the character and convert correctly back to the
original: "Θεολογικές (ς)". Similarly, if we had the phrase "Θεολογικέσ", it
would transliterate to "Theologiké~s". These are called anomalous characters.
## Revisiting
Rules allow for characters to be revisited after they are replaced. For example,
the following converts "C" back to "S" in front of "E", "I", or "Y", and to "K"
otherwise. The vertical bar means that the result will be revisited, so that the
"S" or "K" rules of the Greek transform will then apply to it and eventually
produce a sigma (Σ, σ, or ς) or kappa (Κ or κ).
```
$softener = [eiyEIY] ;
| S < C } $softener ;
| K < C ;
| s < c } $softener ;
| k < c ;
```
The ability to revisit is particularly useful in reducing the number of rules
required for a given language. For example, in Japanese there are a large number
of cases that follow the same pattern: "kyo" maps to a large hiragana for "ki"
(き) followed by a small hiragana for "yo" (ょ). This can be done with a small
number of rules with the following pattern:
First, the ASCII punctuation mark, tilde "~", represents characters that never
normally occur in isolation. This is a general convention for anomalous
characters within the ICU rules in any event.
```
'~yu' > ゅ;
'~ye' > ぇ;
'~yo' > ょ;
```
Second, any syllables that use this pattern are broken into the first hiragana
and are followed by letters that will form the small hiragana.
```
by > び|'~y';
ch > ち|'~y';
dj > ぢ|'~y';
gy > ぎ|'~y';
j > じ|'~y';
ky > き|'~y';
my > み|'~y';
ny > に|'~y';
py > ぴ|'~y';
ry > り|'~y';
sh > し|'~y';
```
Using these rules, "kyo" is first converted into "き~yo". Since the "~yo" is then
revisited, this produces the desired final result, "きょ". Thus, a small number of
rules (3 + 11 = 14) provide for a large number of cases. If all of the
combinations of rules were used instead, it would require 3 x 11 = 33 rules.
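The two-stage trick can be sketched in Python by running the syllable rules first and then the "~y" rules over the intermediate result, which stands in for ICU's revisiting cursor (illustrative only; the real transform revisits within a single pass):

```python
FIRST = {"ky": "き~y", "ny": "に~y", "ry": "り~y"}   # sample of the 11 rules
SECOND = {"~yu": "ゅ", "~ye": "ぇ", "~yo": "ょ"}

def to_hiragana(text):
    for latin, kana in FIRST.items():
        text = text.replace(latin, kana)
    for marker, small in SECOND.items():   # the revisited "~y" sequences
        text = text.replace(marker, small)
    return text
```
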
You can set the new revisit point (called the cursor) anywhere in the
replacement text. You can even set the revisit point before or after the target
text. The at-sign, as in the following example, is used as a filler to indicate
the position, for those cases:
```
[aeiou] { x > | @ ks ;
ak > ack ;
```
The first rule will convert "x", when preceded by a vowel, into "ks". The
transform will then back up to the position before the vowel and continue. In
the next pass, the "ak" matches the second rule. Thus, if the source text is
"ax", the first rule produces "aks", and revisiting then yields "acks".
> :point_right: **Note**: *Although you can move the cursor forward or backward, it is limited in two
ways: (a) to the text that is matched, (b) within the original substring that is
to be converted. For example, if we have the rule "a b\* {x} > |@@@@@y" and it
matches in the text "mabbx", the result will be "m|abby" (| represents the
cursor position). Even though there are five @ signs, the cursor will only
backup to the first character that is matched.*
## Copying
We can copy part of the matched string to the target text. Use parentheses to
group the text to copy, and use "$n" (where n is a number from 1 to 99) to
indicate which group. For example, in Korean, any vowel that does not have a
consonant before it gets the null consonant (ᄋ) inserted before it. The
following example shows this rule:

```
([aeiouwy]) > ᄋ | $1 ;
```
To revisit the vowel again, the rule inserts the null consonant and then backs
up to before the vowel to reconsider it. Similarly, the following rule inserts
a null vowel if no real vowel is found after a consonant:

```
([b-dg-hj-km-npr-t]) > | $1 eu;
```

In this case, since we are going to reconsider the text again, we insert the
Latin equivalent of the Korean null vowel, which is "eu".
## Order Matters
Two rules overlap when there is a string that both rules could match at the
start. For example, the first part of the following rule does not overlap, but
the last two parts do overlap:
```
β > b;
γ } [ Γ Κ Ξ Χ γ κ ξ χ ] > n ;
γ > g ;
```
When rules do not overlap, they will produce the same result no matter what
order they are in. It does not matter whether we have either of the following:

```
β > b;
γ > g ;
```

or

```
γ > g ;
β > b;
```
When rules do overlap, order is important. In fact, a rule could be rendered
completely useless. Suppose we have:

```
β } [aeiou] > b;
β } [^aeiou] > v;
β > p;
```

In this case, the last rule is masked: any text that would match it is already
matched by one of the previous rules. If a rule is masked, a warning is issued
when you attempt to build a transform with the rules.
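Masking can be seen mechanically: with first-match-wins evaluation, the fallback rule is unreachable because every context hits one of the first two rules. A toy Python illustration of the β rules above (illustrative; ICU performs a real masking analysis when the transform is built):

```python
def convert_beta(following):
    # First-match-wins over the three rules for β.
    if following in "aeiou":          # β } [aeiou] > b;
        return "b"
    return "v"                        # β } [^aeiou] > v;  -- so "β > p;" is masked

# The fallback "p" never occurs: every following character
# is handled by one of the first two rules.
results = {convert_beta(ch) for ch in "aeiouxyz "}
```
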
## Combinations
In Greek, a rough breathing mark on one of the first two vowels in a word
represents an "H". This mark is invalid anywhere else in the language. In the
normalized (NFD) form, the rough breathing mark will be the first accent after
the vowel (with perhaps other accents following). So, we will start with the
following variables and rule. The rule transforms a rough breathing mark into an
"H" and moves it to before the vowels.

```
$gvowel = [ΑΕΗΙΟΥΩαεηιουω];
($gvowel + ) ̔ > H | $1;
```
A word like "ὍΤΑΝ" is transformed into "HOTAN". This transformation does not
work with a lowercase word like "ὅταν". To handle lowercase words, we insert
another rule that moves the "H" over lowercase vowels and changes it to
lowercase. The following shows this rule:

```
$gvowel = [ΑΕΗΙΟΥΩαεηιουω];
$lcgvowel = [αεηιουω];
($lcgvowel +) ̔ > h | $1; # fix lowercase
($gvowel + ) ̔ > H | $1;
```
This rule provides the correct results as the lowercase word "ὅταν" is
transformed into "hotan".
There are also titlecase words such as "Ὅταν". For this situation, we need to
lowercase the uppercase letters as the transform passes over them. We need to do
that in two circumstances: (a) the breathing mark is on a capital letter
followed by a lowercase, or (b) the breathing mark is on a lowercase vowel. The
following shows how to write a rule for this situation:
```
$gvowel = [ΑΕΗΙΟΥΩαεηιουω];
$lcgvowel = [αεηιουω];
# fix Titlecase
{Ο ̔ } [:Nonspacing Mark:]* [:Ll:] > H | ο;
# fix Titlecase
{Ο ( $lcgvowel * ) ̔ } > H | ο $1;
# fix lowercase
( $lcgvowel + ) ̔ > h | $1 ;
($gvowel + ) ̔ > H | $1 ;
```
This rule gives the correct results for titlecase, as "Ὅταν" is transformed into
"Hotan". We must copy the above insertion and modify it for each of the vowels,
since each has a different lowercase form.
We must also write a rule to handle a single letter word like "ὃ". In that case,
we would need to look beyond the word, either forward or backward, to know
whether to transform it to "HO" or to transform it to "Ho". Unlike the case of a
capital theta (Θ), there are cases in the Greek language where single-vowel
words have rough breathing marks. In this case, we would use several rules to
match either before or after the word and ignore certain characters like
punctuation and space (watch out for combining marks).
## Pitfalls
1. **Case** When executing script conversions, if the source script has
uppercase and lowercase characters, and the target is lowercase, then
lowercase everything before your first rule. For example:
```
# lowercase target before applying forward rules
:: [:Latin:] lower ();
```
This will allow the rules to work even when they are given a mixture of
uppercase and lowercase characters. This procedure is done in the following ICU
transforms:
- Latin-Hangul
- Latin-Greek
- Latin-Cyrillic
- Latin-Devanagari
- Latin-Gujarati
- etc
1. **Punctuation** When executing script conversions, remember that scripts
have different punctuation conventions. For example, in the Greek language,
the ";" means a question mark. Generally, these punctuation marks also
should be converted when transliterating scripts.
2. **Normalization** Always design transform rules so that they work no matter
whether the source is normalized or not. (This is also true for the target,
in the case of backwards rules.) Generally, the best way to do this is to
have `:: NFD (NFC);` as the first line of the rules, and `:: NFC (NFD);` as the
last line. To supply filters, as described above, break each of these lines
into two separate lines. Then, apply the filter to either the normal or
inverse direction. Each of the accents then can be manipulated as separate
items that are always in a canonical order. If we are not using any accent
manipulation, we could use `:: NFC (NFC) ;` at the top of the rules instead.
3. **Ignorable Characters** Letters may be followed by accents, as in the
following example:
```
# convert z after letters into s
[:lowercase letter:] } z > s ;
```
Normally, we want to ignore any accents that are on the z in performing the
rule. To do that, restate the rule as:
```
# convert z after letters into s
[:lowercase letter:] [:mark:]* } z > s ;
```
Even if we are not using NFD, this is still a good idea since some languages
use separate accents that cannot be combined.
Moreover, some languages may have embedded format codes, such as a
Left-Right Mark, or a Non-Joiner. Because of that, it is even safer to use
the following:
```
# define at the top of your file
$ignore = [ [:mark:] [:format:] ] * ;
...
# convert z after letters into s
[:letter:] $ignore } z > s ;
```
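The `$ignore` idea maps naturally onto a regular expression that skips combining marks. A hedged Python sketch (illustrative only; `[^\W\d_]` stands in for `[:letter:]`, and the range `\u0300-\u036f` covers only the common combining diacritical marks, not the full `[:mark:][:format:]` set):

```python
import re
import unicodedata

def z_to_s(text):
    # Convert z to s when preceded by a letter plus any combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    converted = re.sub(r"([^\W\d_][\u0300-\u036f]*)z", r"\1s", decomposed)
    return unicodedata.normalize("NFC", converted)
```
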
> :point_right: **Note**: *Remember that the rules themselves must be in the same normalization format,
otherwise nothing will match. To do this, run NFD on the rules themselves. In
some cases, we must rearrange the order of the rules because of masking: a rule
written with a precomposed accented character can mask another once both are in
NFD form, because the NFD representation has the accents separate from the base
character. We will not be able to see this on the screen if accents are rendered
correctly.*
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Transforms
## Overview
Transforms are used to process Unicode text in many different ways. Some include
case mapping, normalization, transliteration and bidirectional text handling.
### Case Mappings
[Case mapping](casemappings.md) is used to handle mappings of uppercase,
lowercase, and titlecase characters between writing systems that use letters of
the same alphabet. It provides for certain language-specific mappings as well.
### Normalization
[Normalization](normalization/index.md) is used to convert text to a unique, equivalent form. Systems can
normalize Unicode-encoded text to one particular sequence, such as a normalizing
composite character sequences into precomposed characters. While Normalization
Forms are specified for Unicode text, they can also be extended to non-Unicode
(legacy) character encodings. This is based on mapping the legacy character set
strings to and from Unicode.
### Transforms
[Transforms](general/index.md) provide a general-purpose package for processing Unicode text. They
are a powerful and flexible mechanism for handling a variety of different tasks,
including:
* Uppercase, Lowercase, Titlecase, Full/Halfwidth conversions
* Normalization
* Hex and Character Name conversions
* Script to Script conversion
### Bidirectional Algorithm
The [Bidirectional Algorithm](bidi.md) was developed to specify the direction of text in a
text flow.
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Normalization Examples (Obsolete)
## This page contained examples showing obsolete APIs
The examples have been removed, and updated examples added to the main [Normalization page](index.md).
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Normalization
## Overview
Normalization is used to convert text to a unique, equivalent form. Software can
normalize equivalent strings to one particular sequence, such as normalizing
composite character sequences into pre-composed characters.
Normalization allows for easier sorting and searching of text. The ICU
normalization APIs support the standard normalization forms which are described
in great detail in [Unicode Technical Report #15 (Unicode Normalization
Forms)](http://www.unicode.org/reports/tr15/) and the Normalization, Sorting and
Searching sections of chapter 5 of the [Unicode
Standard](http://www.unicode.org/versions/latest/). ICU also supports related,
additional operations. Some of them are described in [Unicode Technical Note #5
(Canonical Equivalence in Applications)](http://www.unicode.org/notes/tn5/).
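For a quick illustration of what the standard forms do, Python's `unicodedata.normalize` (which implements the same Unicode normalization forms) can be used; the ICU APIs themselves are linked below:

```python
import unicodedata

composed = "\u00e9"            # é as a single precomposed code point
decomposed = "e\u0301"         # e followed by COMBINING ACUTE ACCENT

# NFC composes, NFD decomposes; both strings are canonically equivalent.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# NFKC additionally applies compatibility mappings,
# e.g. U+FB01 LATIN SMALL LIGATURE FI becomes "fi".
nfkc = unicodedata.normalize("NFKC", "\ufb01")
```
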
## New API
ICU 4.4 adds the Normalizer2 API (in
[Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/Normalizer2.html),
[C++](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNormalizer2.html) and
[C](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html)), replacing almost all
of the old Normalizer API. There is a [design
doc](http://site.icu-project.org/design/normalization/custom) with many details.
All of the replaced old API is now implemented as a thin wrapper around the new
API.
Here is a summary of the differences:
* Custom data: The new API uses non-static functions. A Normalizer2 instance
can be created from standard Unicode normalization data, or from a custom
(application-specific) data file with custom data processed by the new
gennorm2 tool.
* Examples for possible custom data include UTS #46 IDNA mappings, MacOS X
file system normalization, and a combination of NFKC with case folding
(see the Unicode FC_NFKC_Closure property).
* By using a single data file and a single processing step for
combinations like NFKC + case folding, the performance for such
operations is improved.
* NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and
removing ignorable characters which was introduced with Unicode 5.2.
* The old unorm.icu data file (used in Java, was hardcoded in the common
library in C/C++) has been replaced with two new files, nfc.nrm and
nfkc.nrm. If only canonical or only compatibility mappings are needed, then
the other data file can be removed. There is also a new nfkc_cf.nrm file for
NFKC_Casefold.
* FCD: The old API supports [FCD
processing](http://www.unicode.org/notes/tn5/#FCD) only for NFC/NFD data.
Normalizer2 supports it for any data file, including NFKC/NFKD and custom
data.
* FCC: Normalizer2 optionally supports [contiguous
composition](http://www.unicode.org/notes/tn5/#FCC) which is almost the same
as NFC/NFKC except that the normalized form also passes the FCD test. This
is also supported for any standard or custom data file.
* Quick check: There is a new spanQuickCheckYes() function for an optimized
combination of quick check and normalization.
* Filtered: The new FilteredNormalizer2 class combines a Normalizer2 instance
with a UnicodeSet to limit normalization to certain characters. For example,
the old API's UNICODE_3_2 option is implemented via a FilteredNormalizer2
using a UnicodeSet with the pattern `[:age=3.2:]`. (In other words, Unicode
3.2 normalization now requires the uprops.icu data.)
* Ease of use: In general, the switch to a factory method, otherwise
non-static functions, and multiple data files, simplifies all of the
function signatures.
* Iteration: Support for iterative normalization is now provided by functions
that test properties of code points, rather than requiring a particular type
of ICU character iterator. (The old implementation simply fetched the code
points anyway and used equivalent code point test functions.) The new API
also provides a wider variety of such test functions.
* String interfaces: In Java, input parameters are now CharSequence
references, and output is to StringBuilder or Appendable.
The new API does not replace a few pieces of the old API:
* The string comparison functions are still provided only on the old API,
although reimplemented using the new code. They use multiple Normalizer2
instances (FCD and NFD) and are therefore a poor fit for the new Normalizer2
class. If necessary, a modernized replacement taking multiple Normalizer2
instances as parameters is possible, but not planned.
* The old QuickCheck return values are used by the new API as well.
## Data File Syntax
The gennorm2 tool accepts one or more .txt files and generates a .nrm binary
data file for Normalizer2.getInstance(). For gennorm2 command line options,
invoke gennorm2 --help.
gennorm2 starts with no data. If you want to include standard Unicode
Normalization data, use the files in
[{ICU4C}/source/data/unidata/norm2/](http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/unidata/norm2)
. You can modify one of them, or provide it together with one or more additional
files that add or remove mappings.
Hangul/Jamo data (mappings and ccc=0) are predefined and cannot be modified.
Mappings in one text file can override mappings in previous files of the same
gennorm2 invocation.
Comments start with #. White space between tokens is ignored. Characters are
written as hexadecimal code points. Combining class values are written as
decimal numbers.
In each file, each character can have at most one mapping and at most one ccc
(canonical combining class) value. A ccc value must not be 0. (ccc=0 is the
default.)
Each line defines data for either a single code point (`00E1`) or a range of
code points (`0300..0314`).
A two-way mapping must map to a sequence of exactly two characters. Multi-code
point ranges cannot have two-way mappings.
A one-way mapping can map to zero, one, two or more characters. Mapping to zero
characters removes the original character in normalization.
The generator tool applies the mappings recursively to each other. Groups of
mappings that are forbidden by the Unicode Normalization algorithms are reported
as errors. For example, if a character has a two-way mapping, then neither of
its mapping characters can have a one-way mapping.
```
* Unicode 6.1         # Optional Unicode version (since ICU 49; default: uchar.h U_UNICODE_VERSION)
00E1=0061 0301        # Two-way mapping
00AA>0061             # One-way mapping
0300..0314:230        # ccc for a code point range
0315:232              # ccc for a single code point
0132..0133>0069 006A  # Range, each code point mapping to "ij"
E0000..E0FFF>         # Range, each code point mapping to the empty string
```
It is possible to override mappings from previous source files, including
removing a mapping:
```
00AA-
E0000..E0FFF-
```
## Data Generation Tool
Normally, data from one or more input files is combined as described above,
processed, and a binary data file is written for use by the ICU library (same
file for C++ and Java). The binary data file format changes occasionally in
order to support additional functionality.
```sh
bin/gennorm2 -v -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
```
For the complete set of options, invoke `gennorm2 --help`.
Instead of the binary data file, the processed data can be written into a C
file. This is closely tied to the needs of the ICU library. The format may
change from one ICU version to the next.
```sh
bin/gennorm2 -v -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
```
With the --combined option, gennorm2 writes the combined data of the input
files. The following example writes the combined NFKC_Casefold data. (New in ICU
60.)
```sh
bin/gennorm2 -o /tmp/nfkc_cf.txt -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt --combined
```
With the "minus" operator, gennorm2 writes the diffs of the combined data from
two sets of input files. (New in ICU 60.)
For example, the nfkc_cf.txt file in ICU contains the Unicode NFKC_CF mappings,
extracted from the UCD file DerivedNormalizationProps.txt. It is not minimal.
The following command line generates the minimal differences of NFKC_Casefold
compared with NFKC.
```sh
bin/gennorm2 -o /tmp/nfkc_cf-minus-nfkc.txt -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt minus nfc.txt nfkc.txt
```
## Example
```c++
class NormSample {
public:
  // ICU service objects should be cached and reused, as usual.
  NormSample(UErrorCode &errorCode)
      : nfkc(*Normalizer2::getNFKCInstance(errorCode)),
        fcd(*Normalizer2::getInstance(NULL, "nfc", UNORM2_FCD, errorCode)) {}

  // Normalize a string.
  UnicodeString toNFKC(const UnicodeString &s, UErrorCode &errorCode) {
    return nfkc.normalize(s, errorCode);
  }

  // Ensure FCD before processing (like in sort key generation).
  // In practice, almost all strings pass the FCD test, so it might make sense to
  // test for it and only normalize when necessary, rather than always normalizing.
  void processText(const UnicodeString &s, UErrorCode &errorCode) {
    UnicodeString fcdString;
    const UnicodeString *ps;  // points to either s or fcdString
    int32_t spanQCYes=fcd.spanQuickCheckYes(s, errorCode);
    if(U_FAILURE(errorCode)) {
      return;  // report error
    }
    if(spanQCYes==s.length()) {
      ps=&s;  // s is already in FCD
    } else {
      // unnormalized suffix as a read-only alias (does not copy characters)
      UnicodeString unnormalized=s.tempSubString(spanQCYes);
      // set the fcdString to the FCD prefix as a read-only alias
      fcdString.setTo(FALSE, s.getBuffer(), spanQCYes);
      // automatic copy-on-write, and append the FCD'ed suffix
      fcd.normalizeSecondAndAppend(fcdString, unnormalized, errorCode);
      ps=&fcdString;
      if(U_FAILURE(errorCode)) {
        return;  // report error
      }
    }
    // ... now process the string *ps which is in FCD ...
  }
private:
  const Normalizer2 &nfkc;
  const Normalizer2 &fcd;
};
```

*docs/userguide/unicode.md*
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Unicode Basics
## Introduction to Unicode
Unicode is a standard that precisely defines a character set as well as a small
number of encodings for it. It enables you to handle text in any language
efficiently. It allows a single application executable to work for a global
audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other
modern systems, provides Internationalization solutions based on Unicode.
This chapter is intended as an introduction to codepages in general and Unicode
in particular. For further information, see:
1. [The Web site of the Unicode consortium](http://www.unicode.org/)
2. [What is
Unicode?](http://www.unicode.org/unicode/standard/WhatIsUnicode.html)
3. [IBM® Globalization](http://www.ibm.com/software/globalization/)
Go to the [online ICU demos](http://demo.icu-project.org/icu-bin/icudemos) to
see how a Unicode-based server application can handle text in many languages and
many encodings.
## Traditional Character Sets and Unicode
Representing text-format data in computers is a matter of defining a set of
characters and assigning each of them a number and a bit representation.
Underlying this basic idea are three related concepts:
1. A character set or repertoire is an unordered collection of characters that
can be represented by numeric values.
2. A coded character set maps characters from a character set or repertoire to
numeric values.
3. A character encoding scheme defines the representation of numeric values
from one or more coded character sets in bits and bytes.
For simple encodings such as ASCII, the last two concepts are basically the
same: ASCII assigns 128 characters and control codes to consecutive numbers from
0 to 127. These characters and control codes are encoded as simple, unsigned,
binary integers. Therefore, ASCII is both a coded character set and a character
encoding scheme.
ASCII only encodes 128 characters, 33 of which are control codes rather than
graphic, displayable characters. It was designed to represent English-language
text for an American user base, and is therefore insufficient for representing
text in almost any language other than American English. In fact, most
traditional encodings were limited to one or a few languages and scripts.
ASCII offered a natural way to extend it: it was designed in the 1960s for
systems with 7-bit bytes, while most computers and Internet protocols since the
1970s use 8-bit bytes, so the extra bit allowed another 128 byte values to
represent more characters. Various encodings were developed that supported
different languages. Some of these were based on ASCII, others were not.
Languages such as Japanese need to encode considerably more than 256 characters.
Various encoding schemes enable large character sets with thousands or tens of
thousands of characters to be represented. Most of those encodings are still
byte-based, which means that many characters require two or more bytes of
storage space. A process must be developed to interpret some byte values.
Various character sets and encoding schemes have been developed independently,
cover only one or a few languages each, and are incompatible. This makes it very
difficult for a single system to handle text in more than one language at a
time, and especially difficult to do so in a way that is interoperable across
different systems.
Generally, the minimum requirement for the interoperable exchange of text data
is that the encoding (character set & encoding scheme) must be properly
specified in the document and in the protocol. For example, email/SMTP and
HTML/HTTP provide the means to specify the "charset", as it is called in
Internet standards. However, very often the encoding is not specified, specified
incorrectly, or the sender and receiver disagree on its implementation.
The ISO 2022 encoding scheme was created to store text in many different
languages. It allows other encodings to be embedded by first announcing them and
then switching between them. Full support for all features and possible
encodings with ISO 2022 requires complicated processing and the need to support
many encodings. For East Asian languages, subsets were developed that cover only
one language or a few at a time, but they are much more manageable. ISO 2022 is
not well-suited for use in internal processing. It is designed for data
exchange.
## Glyphs versus Characters
Programmers often need to distinguish between characters and glyphs. A character
is the smallest semantic unit in a writing system. It is an abstract concept
such as the letter A or the exclamation point. A glyph is the visual
presentation of one or more characters, and is often dependent on adjacent
characters.
There is not always a one-to-one mapping between characters and glyphs. In many
languages (Arabic is a prime example), the way a character looks depends heavily
on the surrounding characters. Standard printed Arabic has as many as four
different printed representations (glyphs) for every letter of the alphabet. In
many languages, two or more letters may combine together into a single glyph
(called a ligature), or a single character might be displayed with more than one
glyph.
Despite the different visual variants of a particular letter, it still retains
its identity. For example, the Arabic letter heh has four different visual
representations in common use. Whichever one is used, it still keeps its
identity as the letter heh. It is this identity that Unicode encodes, not the
visual representation. This also cuts down on the number of independent
character values required.
## Overview of Unicode
Unicode was developed as a single-coded character set that contains support for
all languages in the world. The first version of Unicode used 16-bit numbers,
which allowed for encoding 65,536 characters without complicated multibyte
schemes. With the inclusion of more characters, and following implementation
needs of many different platforms, Unicode was extended to allow more than one
million characters. Several other encoding schemes were added. This introduced
more complexity into the Unicode standard, but far less than managing a large
number of different encodings.
Starting with Unicode 2.0 (published in 1996), the Unicode standard began
assigning numbers from 0 to 10FFFF<sub>16</sub>, which requires 21 bits but does not use
them completely. This gives more than enough room for all written languages in
the world. The original repertoire covered all major languages commonly used in
computing. Unicode continues to grow, and it includes more scripts.
The design of Unicode differs in several ways from traditional character sets
and encoding schemes:
1. Its repertoire enables users to include text efficiently in almost all
languages within a single document.
2. It can be encoded in a byte-based way with one or more bytes per character,
but the default encoding scheme uses 16-bit units that allow much simpler
processing for all common characters.
3. Many characters, such as letters with accents and umlauts, can be combined
from the base character and accent or umlaut modifiers. This combining
reduces the number of different characters that need to be encoded
separately. "Precomposed" variants for characters that existed in common
character sets at the time were included for compatibility.
4. Characters and their usage are well-defined and described. While traditional
character sets typically only provide the name or a picture of a character
and its number and byte encoding, Unicode has a comprehensive database of
properties available for download. It also defines a number of processes and
algorithms for dealing with many aspects of text processing to make it more
interoperable.
The early inclusion of all characters of commonly used character sets makes
Unicode a useful "pivot" point for converting between traditional character
sets, and makes it feasible to process non-Unicode text by first converting it
into Unicode, processing the text, and then converting it back to the original
encoding without loss of data.
> :point_right: *The first 128 Unicode code point values are assigned to the same characters as
in US-ASCII. For example, the same number is assigned to the same character. The
same is true for the first 256 code point values of Unicode compared to ISO
8859-1 (Latin-1) which itself is a direct superset of US-ASCII. This makes it
easy to adapt many applications to Unicode because the numbers for many
syntactically important characters are the same.*
## Character Encoding Forms and Schemes for Unicode
Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room
to allow for unambiguous encoding of every character in common use. Such a
character number is called a "code point".
> :point_right: *Unicode code points are just non-negative integer numbers in a certain range.
They do not have an implicit binary representation or a width of 21 or 32 bits.
Binary representation and unit widths are defined for encoding forms.*
For internal processing, the standard defines three encoding forms, and for file
storage and protocols, some of these encoding forms have encoding schemes that
differ in their byte ordering. The difference between an encoding form and an
encoding scheme is that an encoding form maps the character set codes to values
that fit into internal data types (like a short in C), while an encoding scheme
maps to bits and bytes. For traditional encodings, they are the same since the
encoding forms already map to bytes. The different Unicode encoding forms are
optimized for a variety of different uses:
1. UTF-16, the default encoding form, maps a character code point to either one
or two 16-bit integers.
2. UTF-8 is a byte-based encoding that offers backwards compatibility with
ASCII-based, byte-oriented APIs and protocols. A character is stored with 1,
2, 3, or 4 bytes.
3. UTF-32 is the simplest but most memory-intensive encoding form: It uses one
32-bit integer per Unicode character.
4. SCSU is an encoding scheme that provides a simple compression of Unicode
text. It is designed only for input and output, not for internal use.
ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters
(with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial
support for supplementary characters.
For input/output, character encoding schemes define a byte serialization of
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one
that serializes the code units in big-endian byte order (most significant byte
first), and one that serializes the code units in little-endian byte order
(least significant byte first). The corresponding encoding schemes are called
UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
> :point_right: *The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they refer
either to character encoding forms where 16/32-bit words are processed and are
naturally stored in the platform endianness, or they refer to the
IANA-registered charset names, i.e., to character encoding schemes or byte
serializations. In addition to simple byte serialization, the charsets with
these names also use optional Byte Order Marks (see Serialized Formats (§)
below).*
## Overview of UTF-16
The default encoding form of the Unicode Standard uses 16-bit code units. Code
point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and
are encoded with just one 16-bit unit of the same value. Code points from
10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called
"surrogates", and they are called a "surrogate pair" when, together, they
correctly encode one Unicode character. The first surrogate in a pair must be in
the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to
DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with
either one code unit that is not a surrogate or with a correct pair of
surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this
mechanism and will never, by themselves, be assigned any characters.
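The surrogate mechanism described above can be sketched in a few lines of standalone C++. The function names here are illustrative only, not part of any ICU API:

```cpp
#include <cstdint>
#include <utility>

// Encode a code point above U+FFFF as a UTF-16 surrogate pair.
// Assumes 0x10000 <= cp <= 0x10FFFF (callers must validate).
std::pair<char16_t, char16_t> toSurrogatePair(uint32_t cp) {
    cp -= 0x10000;  // 20 significant bits remain
    char16_t lead  = static_cast<char16_t>(0xD800 + (cp >> 10));    // high 10 bits
    char16_t trail = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low 10 bits
    return {lead, trail};
}

// Recombine a surrogate pair into the original code point.
uint32_t fromSurrogatePair(char16_t lead, char16_t trail) {
    return 0x10000 + ((static_cast<uint32_t>(lead) - 0xD800) << 10)
                   + (static_cast<uint32_t>(trail) - 0xDC00);
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) encodes as the pair D834<sub>16</sub> DD1E<sub>16</sub>, and decoding that pair yields U+1D11E again.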
Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1
assigns more than 40,000 supplementary characters that make use of surrogate
pairs in UTF-16.
Note that comparing UTF-16 strings lexically based on their 16-bit code units
does not result in the same order as comparing the code points. This is not
usually an issue since only rarely-used characters are affected. Most processes
do not rely on the same results in such comparisons. Where necessary, a simple
modification to a string comparison can be performed that still allows efficient
code unit-based comparisons and makes them compatible with code point
comparisons. ICU has C and C++ API functions for this.
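The discrepancy can be demonstrated with a small standalone sketch that decodes to code points before comparing; the helper name is hypothetical and unrelated to ICU's own, faster API:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-16 string (assumed well-formed) into its code points.
std::vector<uint32_t> codePoints(const std::u16string &s) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i < s.size(); ) {
        char16_t c = s[i++];
        if (c >= 0xD800 && c <= 0xDBFF && i < s.size()) {  // lead surrogate
            out.push_back(0x10000 + ((uint32_t(c) - 0xD800) << 10)
                                  + (uint32_t(s[i++]) - 0xDC00));
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

U+FFFD is a smaller code point than U+10000, but its single code unit FFFD<sub>16</sub> compares greater than the lead surrogate D800<sub>16</sub> that starts the UTF-16 encoding of U+10000, so the two orderings disagree for such strings.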
## Overview of UTF-8
To meet the requirements of byte-oriented, ASCII-based systems, the Unicode
Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that
preserves ASCII transparency.
UTF-8 maintains transparency for all of the ASCII code values (0..127). These
values do not appear in any byte of a transformed result except as the direct
representation of the ASCII values. Thus, ASCII text is also UTF-8 text.
Characteristics of UTF-8 include:
1. Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the
same value. Therefore, ASCII characters take up 50% less space with UTF-8
encoding than with UTF-16.
2. All other code points are encoded with multibyte sequences, with the first
byte (lead byte) indicating the number of bytes that follow (trail bytes).
This results in very efficient parsing. The lead bytes are in the range C0<sub>16</sub>
to FD<sub>16</sub>, the trail bytes are in the range 80<sub>16</sub> to BF<sub>16</sub>. The byte values FE<sub>16</sub>
and FF<sub>16</sub> are never used.
3. UTF-8 is relatively compact and resource conservative in its use of the
bytes required for encoding text in European scripts, but uses 50% more
space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two
bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16),
and all others four.
4. Binary comparisons of UTF-8 strings based on their bytes result in the same
order as comparing code point values.
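The byte layout described in the list above can be written out by hand; this standalone sketch (not ICU code) encodes a single, already-validated code point:

```cpp
#include <cstdint>
#include <string>

// Encode one code point (<= 0x10FFFF, not a surrogate) as UTF-8 bytes.
std::string encodeUtf8(uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {               // 1 byte: identical to ASCII
        out += char(cp);
    } else if (cp <= 0x7FF) {       // 2 bytes: lead 110xxxxx, trail 10xxxxxx
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {      // 3 bytes: lead 1110xxxx
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {                        // 4 bytes: lead 11110xxx
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, U+00E1 becomes C3<sub>16</sub> A1<sub>16</sub>, U+20AC (the euro sign) becomes E2<sub>16</sub> 82<sub>16</sub> AC<sub>16</sub>, and the supplementary character U+10348 becomes the four bytes F0<sub>16</sub> 90<sub>16</sub> 8D<sub>16</sub> 88<sub>16</sub>.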
## Overview of UTF-32
The UTF-32 encoding form always uses one single 32-bit integer per Unicode code
point. This results in a very simple encoding.
The drawback is its memory consumption: Since code point values use only 21
bits, one-third of the memory is always unused, and since most commonly used
characters have code point values of up to FFFF<sub>16</sub>, they take up only one 16-bit
unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).
UTF-32 is mainly used in APIs that are defined with the same data type for both
code points and code units. Modern versions of the C standard library that
support Unicode use a 32-bit wchar_t with UTF-32 semantics.
## Overview of SCSU
SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of
Unicode text for both input and output. It is a simple compression that
transforms the text into a byte stream. It typically uses one byte per character
in small scripts, and two bytes per character in large, East Asian scripts.
It is usually shorter than any of the UTFs. However, SCSU is stateful, which
makes it unsuitable for internal processing. It also uses all possible byte
values, which might require additional processing for protocols such as SMTP
(email).
See also <http://www.unicode.org/unicode/reports/tr6/>.
## Other Unicode Encodings
Other Unicode encodings have been developed over time for various purposes. Most
of them are implemented in ICU, see
[source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt)
1. BOCU-1: Binary-Ordered Compression of Unicode
An encoding of Unicode that is about as compact as SCSU but has a much
smaller amount of state. Unlike SCSU, it preserves code point order and can
be used in 8-bit emails without a transfer encoding. BOCU-1 does **not**
preserve ASCII characters in ASCII-readable form. See [Unicode Technical
Note #6](http://www.unicode.org/notes/tn6/) .
2. UTF-7: Designed for 7-bit emails; simple and not very compact. Since email
systems have been 8-bit safe for several years, UTF-7 is not necessary any
more and not recommended. Most ASCII characters are readable, others are
base64-encoded. See [RFC 2152](http://www.ietf.org/rfc/rfc2152.txt) .
3. IMAP-mailbox-name: A variant of UTF-7 that is suitable for expressing
Unicode strings as ASCII characters for Unix filenames.
**The name "IMAP-mailbox-name" is specific to ICU!**
See [RFC 2060 INTERNET MESSAGE ACCESS PROTOCOL - VERSION
4rev1](http://www.ietf.org/rfc/rfc2060.txt) section 5.1.3. Mailbox
International Naming Convention.
4. UTF-EBCDIC: An EBCDIC-friendly encoding that is similar to UTF-8. See
[Unicode Technical Report #16](http://www.unicode.org/reports/tr16/) . **As
of ICU 2.6, UTF-EBCDIC is not implemented in ICU.**
5. CESU-8: Compatibility Encoding Scheme for UTF-16: 8-Bit
An incompatible variant of UTF-8 that preserves 16-bit-Unicode (UTF-16)
string order instead of code point order. Not for open interchange. See
[Unicode Technical Report #26](http://www.unicode.org/reports/tr26/) .
## Programming using UTFs
Programming using any of the UTFs is much more straightforward than with
traditional multi-byte character encodings, even though UTF-8 and UTF-16 are
also variable-width encodings.
Within each Unicode encoding form, the code unit values for singletons (code
units that alone encode characters), lead units, and trailing units are all
disjoint. This has crucial implications for implementations. The following
lists these implications:
1. Determines the number of units for one code point using the lead unit. This
is especially important for UTF-8, where there can be up to 4 bytes per
character.
2. Determines boundaries. If ICU users randomly access text, you can always
determine the nearest code-point boundaries with a small number of machine
instructions.
3. Does not have any overlap. If ICU users search for string A in string B, you
never get a false match on code points. Users do not need to convert to code
points for string searching. False matches never occur since the end of one
sequence is never the same as the start of another sequence. Overlap is one
of the biggest problems with common multi-byte encodings like Shift-JIS. All
of the UTFs avoid this problem.
4. Uses simple iteration. Getting the next or previous code point is
straightforward, and only takes a small number of machine instructions.
5. Can use UTF-16 encoding, which is actually fully symmetric. ICU users can
determine from any single code unit whether it is the first, last, or only
one for a code point. Moving (iterating) in either direction through UTF-16
text is equally fast and efficient.
6. Uses slow indexing by code points. This indexing procedure is a disadvantage
of all variable-width encodings. Except in UTF-32, it is inefficient to find
code unit boundaries corresponding to the nth code point or to find the code
point offset containing the nth code unit. Both involve scanning from the
start of the text or from a last known boundary. ICU, like most common APIs,
always indexes by code units. It counts code units and not code points.
Conversion between different UTFs is very fast. Unlike converting to and from
legacy encodings like Latin-2, conversion between UTFs does not require table
look-ups.
ICU provides two basic data type definitions for Unicode. UChar32 is a 32-bit
type for code points, and used for single Unicode characters. It may be signed
or unsigned. It is the same as wchar_t if it is 32 bits wide. UChar is an
unsigned 16-bit integer for UTF-16 code units. It is the base type for strings
(`UChar *`), and it is the same as wchar_t if it is 16 bits wide.
Some higher-level APIs, used especially for formatting, use characters closer to
a representation for a glyph. Such "user characters" are also called "graphemes"
or "grapheme clusters" and require strings so that combining sequences can be
included.
## Serialized Formats
In files, input, output, and network protocols, text must be accompanied by the
specification of its character encoding scheme for a client to be able to
interpret it correctly. (This is called a "charset" in Internet protocols.)
However, an encoding scheme specification is not necessary if the text is only
used within a single platform, protocol, or application where it is otherwise
clear what the encoding is. (The language and text directionality should usually
be specified to enable spell checking, text-to-speech transformation, etc.)
*The discussion of encoding specifications in this section applies to standard
Internet protocols where charset name strings are used. Other protocols may use
numeric encoding identifiers and assign different semantics to those identifiers
than Internet protocols.*
Typically, the encoding specification is done in a protocol- and document
format-dependent way. However, the Unicode standard offers a mechanism for
tagging text files with a "signature" for cases where protocols do not identify
character encoding schemes.
The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by
prepending it to a file or stream. The alternative function of U+FEFF as a
format control character has been copied to U+2060 WORD JOINER, and U+FEFF
should only be used for Unicode signatures.
The different character encoding schemes generate different, distinct byte
sequences for U+FEFF:
1. UTF-8: EF BB BF
2. UTF-16BE: FE FF
3. UTF-16LE: FF FE
4. UTF-32BE: 00 00 FE FF
5. UTF-32LE: FF FE 00 00
6. SCSU: 0E FE FF
7. BOCU-1: FB EE 28
8. UTF-7: 2B 2F 76 ( 38 | 39 | 2B | 2F )
9. UTF-EBCDIC: DD 73 66 73
ICU provides the function ucnv_detectUnicodeSignature() for Unicode signature
detection.
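`ucnv_detectUnicodeSignature()` is the real ICU API; for illustration only, signature detection can be sketched as a plain byte-prefix check. Note that the UTF-32 patterns must be tested before the UTF-16 patterns they begin with:

```cpp
#include <cstddef>
#include <string>

// Return the charset name matching the Unicode signature at the start of
// the buffer, or "" if none of the common signatures is present.
// Simplified sketch: checks longer (UTF-32) patterns before shorter ones.
std::string detectSignature(const unsigned char *b, size_t len) {
    if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE";
    if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE";
    if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    return "";
}
```

The byte sequence FF FE 00 00 is inherently ambiguous (a UTF-32LE signature, or a UTF-16LE signature followed by U+0000); this sketch resolves it in favor of UTF-32LE, which is a heuristic choice, not a rule from the standard.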
*There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and
CESU-8 encode U+FEFF and in fact all BMP code points with the same bytes. The
opportunity for misidentification of one as the other is one of the reasons why
CESU-8 should only be used in limited, closed, specific environments.*
In UTF-16 and UTF-32, where the signature also distinguishes between big-endian
and little-endian byte orders, it is also called a byte order mark (BOM). The
signature works for UTF-16 since the code point that has the byte-swapped
encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a
"non-character" code point.) In Internet protocols, if an encoding specification
of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte
sequence (BOM) that identifies the byte ordering, which is not the case for the
encoding scheme/charset names with "BE" or "LE".
*If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not
begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE,
respectively.*
A signature is not part of the content, and must be stripped when processing.
For example, blindly concatenating two files will give an incorrect result.
If a signature was detected, then the signature "character" U+FEFF should be
removed from the Unicode stream **after** conversion. Removing the signature
bytes before conversion could cause the conversion to fail for stateful
encodings like BOCU-1 and UTF-7.
Whether a signature is to be recognized or not depends on the protocol or
application.
1. If a protocol specifies a charset name, then the byte stream must be
interpreted according to how that name is defined. Only the "UTF-16" and
"UTF-32" names include recognition of the byte order marks that are specific
to them (and the ICU converters for these names do this automatically). None
of the other Unicode charsets are defined to include any signature/BOM
handling.
2. If no charset name is provided, for example for text files in most
filesystems, then applications must usually rely on heuristics to determine
the file encoding. Many document formats contain an embedded or implicit
encoding declaration, but for plain text files it is reasonable to use
Unicode signatures as simple and reliable heuristics. This is especially
common on Windows systems. However, some tools for plain text file handling
(e.g., many Unix command line tools) are not prepared for Unicode
signatures.
## The Unicode Standard Is An Industry Standard
The Unicode standard is an industry standard and parallels ISO 10646-1. Around
1993, these two standards were effectively merged into the same character set
standard. Both standards have the same character repertoire and the same
encoding forms and schemes.
One difference used to be that the ISO standard defined code point values to be
from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add
an amendment to the standard. The amendment removes this difference by declaring
that no characters will ever be assigned code points above 10FFFF16. The main
reason for the ISO work group's decision is interoperability between the UTFs.
UTF-16 cannot encode any code points above this limit.
This means that the code point space for both Unicode and ISO 10646 is now the
same! **These changes to ISO 10646 have been made recently and should be
complete in the edition ISO 10646:2003 which also combines all parts of the
standard into one.**
The former, larger code space is the reason why the ISO definition of UTF-8
specifies sequences of five and six bytes to cover that whole range.
Another difference is that the ISO standard defines encoding forms "UCS-4" and
"UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of
7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee
has accepted that the characters above 10FFFF will not be encoded, so there is
essentially no difference between the forms. The "4" stands for "four-byte
form".
UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF,
excluding the surrogate code points. Thus, it cannot represent the characters
with code points above FFFF (called supplementary characters).
*There is no conversion necessary between UCS-2 and UTF-16. The difference is
only in the interpretation of surrogates.*
The standards differ in what kind of information they provide: The Unicode
standard provides more character properties and describes algorithms etc., while
the ISO standard defines collections, subsets and similar.
The standards are synchronized and the respective committees work together to
add new characters and assign code point values.

<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# How To Use ICU4C From COBOL
## Overview
This document describes how to use ICU functions within a COBOL program. It is
assumed that the programmer understands the concepts behind ICU, and is able to
identify which ICU APIs are appropriate for his/her purpose. The programmer must
also understand the meaning of the arguments passed to these APIs and of the
returned value, if any. This is all explained in the ICU documentation, although
in C/C++ style. This document's objective is to facilitate the adaptation of
these explanations to COBOL syntax.
It must be understood that the packaging of ICU data and executable code into
libraries is platform dependent. Consequently, the calling conventions between
COBOL programs and the C/C++ functions in ICU may vary from platform to
platform. In a lesser way, the C/C++ types of arguments and return values may
have different equivalents in COBOL, depending on the platform and even the
specific COBOL compiler used.
This document is supplemented with three [sample
programs](https://sourceforge.net/projects/icu/files/OldFiles/samples/ICU-COBOL.zip)
illustrating using ICU APIs for code page conversion, collation and
normalization. Description of the sample programs appears in the appendix at the
end of this document.
## ICU API invocation in COBOL
1. Invocation of ICU APIs is done with the COBOL “CALL” statement.
2. Variables, pointers and constants appearing in ICU \*.H files (for C/C++)
must be defined in the WORKING-STORAGE section for COBOL.
3. Arguments to a C/C++ API translate into arguments to a COBOL CALL statement,
passed by value or by reference as will be detailed below.
4. For a C/C++ API with a non-void return value, the RETURNING clause will be
used for the CALL statement.
5. Character string arguments to C/C++ must be null-terminated. In COBOL, this
means using the `Z"xxx"` format for literals, and adding `X"00"` at the end of
the content of variables.
6. Special consideration must be given when a pointer is the value returned by
an API, since COBOL implements a more limited concept of pointers than
C/C++. How to handle this case will be explained below.
### COBOL and C/C++ Data Types
The following table (extracted from IBM VisualAge COBOL documentation) shows the
correspondence between the data types available in COBOL and C/C++.
> :point_right: **Note**: Parts of identifier names in COBOL are separated by `-`, not by `_` as in C.
| C/C++ data types | COBOL data types |
|--------------------------- |--------------------------------------------------------------------------------------------------- |
| wchar_t | "DISPLAY-1 (PICTURE N, G) wchar_t is the processing code whereas DISPLAY-1 is the file code." |
| char | PIC X. |
| signed char | No appropriate COBOL equivalent. |
| unsigned char | No appropriate COBOL equivalent. |
| short signed int | PIC S9-S9(4) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| short unsigned int | PIC 9-9(4) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| long int | PIC 9(5)-9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| long long int | PIC 9(10)-9(18) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| float | COMP-1. |
| double | COMP-2. |
| enumeration | Equivalent to level 88, but not identical. |
| char(n) | PICTURE X(n). |
| array pointer (*) to type | No appropriate COBOL equivalent. |
| pointer(*) to function | PROCEDURE-POINTER. |
A number of C definitions specific to ICU (and many other compilers on POSIX
platforms) that are not presented in the table above can also be translated into
COBOL definitions.
| C/C++ data types | COBOL data types |
|------------------------------------------|---------------------------------------------------------------------------------------------|
| int8_t | PIC X. Not really equivalent. |
| uint8_t | PIC X. Not really equivalent. |
| int16_t | PIC S9(4) BINARY. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| uint16_t | PIC 9(4) BINARY. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| int32_t | PIC S9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| uint32_t | PIC 9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| UChar | PIC 9(4) BINARY. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| UChar32 | PIC 9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| UNormalizationMode | PIC S9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| UErrorCode | PIC S9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| pointer(*) to object (e.g. UConverter *) | PIC S9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
| Windows Handle | PIC S9(9) COMP-5. Can be COMP, COMP-4, or BINARY if you use the TRUNC(BIN) compiler option. |
### Enumerations (first possibility)
C Enumeration types do not translate very well into COBOL. There are two
possible ways to simulate these enumerations.
#### C example
```c
typedef enum {
/** No decomposition/composition. @draft ICU 1.8 */
UNORM_NONE = 1,
/** Canonical decomposition. @draft ICU 1.8 */
UNORM_NFD = 2,
. . .
} UNormalizationMode;
```
#### COBOL example
```cobol
WORKING-STORAGE section.
*--------------- Ported from unorm.h ------------
* enum UNormalizationMode {
77 UNORM-NONE PIC S9(9) Binary value 1.
77 UNORM-NFD  PIC S9(9) Binary value 2.
```
### Enumerations (second possibility)
#### C example
```c
/*==== utypes.h ========*/
typedef enum UErrorCode {
U_USING_FALLBACK_WARNING = -128, /* (not an error) */
U_USING_DEFAULT_WARNING = -127, /* (not an error) */
. . .
} UErrorCode;
```
#### COBOL example
```cobol
*==== utypes.h ========
01 UErrorCode PIC S9(9) Binary value 0.
* A resource bundle lookup returned a fallback
* (not an error)
88 U-USING-FALLBACK-WARNING value -128.
* (not an error)
88 U-USING-DEFAULT-WARNING value -127.
. . .
```
## Call statement, calling by value or by reference
In general, arguments defined in C as pointers (`*`) must be listed in the COBOL
Call statement with the using by reference clause, while arguments which are not
pointers must be transferred with the using by value clause. The exception to
this requirement is an argument that is a pointer already assigned to a COBOL
variable (e.g. as a value returned by an ICU API); such an argument must be
passed by value, for instance a pointer to a converter passed as an argument to
the conversion APIs.
### Conversion Declaration Examples
#### C (API definition in `*.h` file)
```c
/*--------------------- UCNV.H ---------------------------*/
U_CAPI int32_t U_EXPORT2
ucnv_toUChars(UConverter * cnv,
UChar * dest,
int32_t destCapacity,
const char * src,
int32_t srcLength,
UErrorCode * pErrorCode);
```
#### COBOL
```cobol
PROCEDURE DIVISION.
Call API-Pointer using
by value Converter-toU-Pointer
by reference Unicode-Input-Buffer
by value destCapacity
by reference Input-Buffer
by value srcLength
by reference UErrorCode
Returning Text-Length.
```
## Call statement, Returning clause
### Returned value is Pointer or Binary
#### C (API definition in `*.h` file)
```c
U_CAPI UConverter * U_EXPORT2
ucnv_open(const char * converterName,
UErrorCode * err);
```
#### COBOL
```cobol
WORKING-STORAGE section.
01 Converter-Pointer PIC S9(9) BINARY.
PROCEDURE DIVISION
Move Z"iso-8859-8" to converterNameSource.
. . .
Call API-Pointer using
by reference converterNameSource
by reference UErrorCode
Returning Converter-Pointer.
```
### Returned value is a Pointer to string
If the returned value in C is a string pointer (`char *`), then in COBOL we
must use a pointer to a string defined in the LINKAGE section.
#### C (API definition in `*.h` file)
```c
U_CAPI const char * U_EXPORT2
ucnv_getAvailableName(int32_t n);
```
#### COBOL
```cobol
DATA DIVISION.
WORKING-STORAGE section.
01 Converter-Name-Link-Pointer Usage is Pointer.
LINKAGE section.
01 Converter-Name-Link.
03 Converter-Name-String pic X(80).
PROCEDURE DIVISION using Converter-Name-Link.
Call API-Pointer using by value Converters-Index
Returning Converter-Name-Link-Pointer.
SET Address of Converter-Name-Link
to Converter-Name-Link-Pointer.
. . .
Move Converter-Name-String to Debug-Value.
```
## How to invoke ICU APIs
Inter-language communication is often problematic. This is certainly the case
when calling C/C++ functions from COBOL, because of the very different roots of
the two languages. How to invoke the ICU APIs from a COBOL program is likely to
depend on the operating system and even on the specific compilers in use. The
section below deals with COBOL to C calls on a Windows platform. Similar
sections should be added for other platforms.
### Windows platforms
The following instructions were tested on a Windows 2000 platform, with the IBM
VisualAge COBOL compiler and the Microsoft Visual C/C++ compiler.
For Windows, ICU APIs are normally packaged as DLLs (Dynamic Load Libraries).
For technical reasons, COBOL calls to C/C++ functions need to be done via
dynamic loading of the DLLs at execution time (load on call).
The COBOL program must be compiled with the following compiler options:

```cobol
CBL PGMNAME(MIXED) CALLINT(SYSTEM) NODYNAM
```
In order to call an ICU API, two preparation steps are needed:
1. Load the DLL which contains the API into memory
2. Get the address of the API
For performance, it is better to perform these steps once, before the first
call, and to save the returned values for future use. (The sample programs get
the address of each API on every call for the sake of logging; production
programs should get the address once and reuse it as many times as needed.)
When no more APIs from a DLL are needed, the DLL should be unloaded in order to
free the associated memory.
#### Load DLL into Memory
This is done as follows:

```cobol
Call "LoadLibraryA" using by reference DLL-Name
                    Returning DLL-Handle.
IF DLL-Handle = ZEROS
   Perform error handling . . .
```

Return value: DLL handle, defined as `PIC S9(9) BINARY`

Input value: DLL name (null-terminated string)
Errors may happen if the DLL name is not correct, or the string is not
null-terminated, or the DLL file is not available (in the current directory or
in a directory included in the PATH system variable).
#### Get API address
This is done as follows:

```cobol
Call "GetProcAddress" using by value DLL-Handle
                      by reference API-Name
                      Returning API-Pointer.
IF API-Pointer = NULL
   Perform error handling...
```

Return value: API address, defined as `PROCEDURE-POINTER`

Input values: DLL handle (returned by the call to LoadLibraryA) and procedure
name (null-terminated string)
Errors may happen if the API name is not correct (remember that API names are
case-sensitive), or the string is not null-terminated, or the API is not
included in the specified DLL. If the API pointer is not null, the API is
called according to its arguments and return value:

```cobol
Call API-Pointer using . . . returning . . .
```
After calling an API, the returned error code should be checked when relevant.
Code to check for error conditions is illustrated in the sample programs.
#### Unload DLL from Memory
This is done as follows:

```cobol
Call "FreeLibrary" using DLL-Handle.
```

Return value: none

Input value: DLL handle (returned by the call to LoadLibraryA)
## Sample Programs
Three sample programs are supplied with this document. The sample programs
were developed on and for a Windows 2000 platform; some adaptations may be
necessary for other platforms.
Before running the sample programs, you must perform the following steps:
1. Install the version of ICU appropriate for your platform
2. Build ICU libraries if needed (see the ICU Readme file)
3. Make the libraries accessible (for instance on Windows systems, add the
directory containing the libraries to the PATH system variable)
4. Compile the sample programs with appropriate compiler options
5. Copy the test files to a work directory
Each program is supplied with input test files and with a model log file. If the
log file that you create by running a sample program is equivalent to the model
log file, your setup is probably correct.
Each of the three sample programs focuses on a certain area of ICU
functionality:
1. Conversion
2. Collation
3. Normalization
### Conversion sample program
The sample program includes the following steps:

* Display the names of the converters from a list of all converters contained
  in the alias file.
* Display the current default converter name.
* Set a new default converter name.
* Read a string from input file "ICU_Conv_Input_8.txt" (file in UTF-8 format).
* Convert this string from UTF-8 to code page iso-8859-8.
* Write the result to output file "ICU_Conv_Output.txt".
* Read a line from input file "ICU_Conv_Input.txt" (file in ANSI format, code
  page 862).
* Convert this string from code page ibm-862 to UTF-16.
* Convert the resulting string from UTF-16 to code page windows-1255.
* Write the result to output file "ICU_Conv_Output.txt".
* Write debugging information to the display and to log file
  "ICU_Conv_Log.txt" (file in ANSI format).
* Repeat for all lines in the input file.

The following ICU APIs are used:

* ucnv_countAvailable
* ucnv_getAvailableName
* ucnv_getDefaultName
* ucnv_setDefaultName
* ucnv_convert
* ucnv_open
* ucnv_toUChars
* ucnv_fromUChars
* ucnv_close
The ucnv_xxx APIs are documented in file "UCNV.H".
### Collation sample program
The sample program includes the following steps:

* Read a string array from input file "ICU_Coll_Input.txt" (file in ANSI
  format).
* Convert the string array from the code page into UTF-16 format.
* Normalize the string array into canonical composed form.
* Perform a bubble sort of the string array, according to Unicode string
  equivalence comparisons.
* Convert the string array from Unicode into code page format.
* Write the result to output file "ICU_Coll_Output.txt" (file in ANSI format).
* Write debugging information to the display and to log file
  "ICU_Coll_Log.txt" (file in ANSI format).

The following ICU APIs are used:

* ucol_open
* ucol_strcoll
* ucol_close
* ucnv_open
* ucnv_toUChars
* ucnv_fromUChars
* ucnv_close
The ucol_xxx APIs are documented in file "UCOL.H".
The ucnv_xxx APIs are documented in file "UCNV.H".
### Normalization sample program
The sample program includes the following steps:

* Read a string from input file "ICU_NORM_Input.txt" (file in ANSI format).
* Convert the string from the code page into UTF-16 format.
* Perform a quick check on the string, to determine whether it is in NFD
  (canonical decomposition) normalization form.
* Normalize the string into canonical composed form (FCD and decomposed).
* Perform a quick check on the result string, to determine whether it is in
  NFD normalization form.
* Convert the string from Unicode into the code page format.
* Write the result to output file "ICU_NORM_Output.txt" (file in ANSI format).
* Write debugging information to the display and to log file
  "ICU_NORM_Log.txt" (file in ANSI format).

The following ICU APIs are used:

* ucnv_open
* ucnv_toUChars
* unorm_normalize
* unorm_quickCheck
* ucnv_fromUChars
* ucnv_close
The unorm_xxx APIs are documented in file "UNORM.H".
The ucnv_xxx APIs are documented in file "UCNV.H".
View file
@ -0,0 +1,11 @@
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Use From...
ICU4C can be used from other programming languages and environments. Please
refer to the subpages listed below for details.
* [How To Use ICU4C From COBOL](cobol.md)