From ec45aaf1a2ea982c8b2d3abdd8045fd06a65af6e Mon Sep 17 00:00:00 2001 From: Craig Cornelius Date: Wed, 5 Aug 2020 18:00:48 +0000 Subject: [PATCH] ICU-20088 Move User Guide to Markdown See #919 --- docs/processes/rules_update.md | 2 +- .../userguide/boundaryanalysis/break-rules.md | 437 ++++ docs/userguide/boundaryanalysis/index.md | 529 +++++ docs/userguide/collation/api.md | 696 ++++++ docs/userguide/collation/architecture.md | 562 +++++ docs/userguide/collation/concepts.md | 814 +++++++ .../collation/customization/ignorepunct.md | 161 ++ .../collation/customization/index.md | 1059 +++++++++ docs/userguide/collation/examples.md | 317 +++ docs/userguide/collation/faq.md | 55 + docs/userguide/collation/index.md | 142 ++ docs/userguide/collation/string-search.md | 318 +++ docs/userguide/conversion/compression.md | 92 + docs/userguide/conversion/converters.md | 786 +++++++ docs/userguide/conversion/data.md | 673 ++++++ docs/userguide/conversion/detection.md | 345 +++ docs/userguide/conversion/index.md | 141 ++ docs/userguide/datetime/calendar/examples.md | 254 ++ docs/userguide/datetime/calendar/index.md | 313 +++ docs/userguide/datetime/index.md | 137 ++ docs/userguide/datetime/timezone/examples.md | 76 + docs/userguide/datetime/timezone/index.md | 242 ++ docs/userguide/datetime/universaltimescale.md | 256 ++ docs/userguide/design.md | 899 +++++++ docs/userguide/dev/codingguidelines.md | 2069 +++++++++++++++++ docs/userguide/dev/contributions.md | 122 + docs/userguide/dev/index.md | 15 + docs/userguide/dev/sync/custom.md | 226 ++ docs/userguide/dev/sync/index.md | 71 + docs/userguide/editing.md | 84 + .../format_parse/datetime/examples.md | 277 +++ docs/userguide/format_parse/datetime/index.md | 371 +++ .../userguide/format_parse/formatted_value.md | 6 +- docs/userguide/format_parse/index.md | 210 ++ .../format_parse/messages/examples.md | 381 +++ docs/userguide/format_parse/messages/index.md | 217 ++ docs/userguide/format_parse/numbers/index.md | 23 + 
.../numbers/legacy-numberformat.md | 196 ++ .../format_parse/numbers/rbnf-examples.md | 95 + docs/userguide/format_parse/numbers/rbnf.md | 120 + .../format_parse/numbers/rounding-modes.md | 121 + .../format_parse/numbers/skeletons.md | 13 +- docs/userguide/glossary.md | 137 ++ docs/userguide/howtouseicu.md | 207 ++ docs/userguide/i18n.md | 272 +++ .../icu4j-locale-service-provider.md | 137 ++ docs/userguide/icudata.md | 1071 +++++++++ docs/userguide/icufaq/icu4j-faq.md | 267 +++ docs/userguide/icufaq/index.md | 455 ++++ docs/userguide/index.md | 103 + docs/userguide/io/index.md | 8 + docs/userguide/io/ustdio.md | 25 + docs/userguide/io/ustream.md | 11 + docs/userguide/layoutengine/index.md | 184 ++ docs/userguide/layoutengine/paragraph.md | 56 + docs/userguide/locale/examples.md | 141 ++ docs/userguide/locale/index.md | 572 +++++ docs/userguide/locale/localizing.md | 516 ++++ docs/userguide/locale/resources.md | 929 ++++++++ docs/userguide/packaging/index.md | 196 ++ docs/userguide/packaging/plug-ins.md | 166 ++ docs/userguide/posix.md | 246 ++ docs/userguide/services.md | 352 +++ docs/userguide/sitemap.md | 83 + docs/userguide/strings/characteriterator.md | 169 ++ docs/userguide/strings/index.md | 696 ++++++ docs/userguide/strings/properties.md | 332 +++ docs/userguide/strings/regexp.md | 504 ++++ docs/userguide/strings/stringprep.md | 393 ++++ docs/userguide/strings/unicodeset.md | 272 +++ docs/userguide/strings/utext.md | 382 +++ docs/userguide/strings/utf-8.md | 147 ++ docs/userguide/transforms/bidi.md | 116 + docs/userguide/transforms/casemappings.md | 107 + docs/userguide/transforms/general/index.md | 1405 +++++++++++ docs/userguide/transforms/general/rules.md | 668 ++++++ docs/userguide/transforms/index.md | 46 + .../transforms/normalization/examples.md | 10 + .../transforms/normalization/index.md | 215 ++ docs/userguide/unicode.md | 531 +++++ docs/userguide/usefrom/cobol.md | 456 ++++ docs/userguide/usefrom/index.md | 11 + 82 files changed, 26506 
insertions(+), 11 deletions(-) create mode 100644 docs/userguide/boundaryanalysis/break-rules.md create mode 100644 docs/userguide/boundaryanalysis/index.md create mode 100644 docs/userguide/collation/api.md create mode 100644 docs/userguide/collation/architecture.md create mode 100644 docs/userguide/collation/concepts.md create mode 100644 docs/userguide/collation/customization/ignorepunct.md create mode 100644 docs/userguide/collation/customization/index.md create mode 100644 docs/userguide/collation/examples.md create mode 100644 docs/userguide/collation/faq.md create mode 100644 docs/userguide/collation/index.md create mode 100644 docs/userguide/collation/string-search.md create mode 100644 docs/userguide/conversion/compression.md create mode 100644 docs/userguide/conversion/converters.md create mode 100644 docs/userguide/conversion/data.md create mode 100644 docs/userguide/conversion/detection.md create mode 100644 docs/userguide/conversion/index.md create mode 100644 docs/userguide/datetime/calendar/examples.md create mode 100644 docs/userguide/datetime/calendar/index.md create mode 100644 docs/userguide/datetime/index.md create mode 100644 docs/userguide/datetime/timezone/examples.md create mode 100644 docs/userguide/datetime/timezone/index.md create mode 100644 docs/userguide/datetime/universaltimescale.md create mode 100644 docs/userguide/design.md create mode 100644 docs/userguide/dev/codingguidelines.md create mode 100644 docs/userguide/dev/contributions.md create mode 100644 docs/userguide/dev/index.md create mode 100644 docs/userguide/dev/sync/custom.md create mode 100644 docs/userguide/dev/sync/index.md create mode 100644 docs/userguide/editing.md create mode 100644 docs/userguide/format_parse/datetime/examples.md create mode 100644 docs/userguide/format_parse/datetime/index.md create mode 100644 docs/userguide/format_parse/index.md create mode 100644 docs/userguide/format_parse/messages/examples.md create mode 100644 
docs/userguide/format_parse/messages/index.md create mode 100644 docs/userguide/format_parse/numbers/index.md create mode 100644 docs/userguide/format_parse/numbers/legacy-numberformat.md create mode 100644 docs/userguide/format_parse/numbers/rbnf-examples.md create mode 100644 docs/userguide/format_parse/numbers/rbnf.md create mode 100644 docs/userguide/format_parse/numbers/rounding-modes.md create mode 100644 docs/userguide/glossary.md create mode 100644 docs/userguide/howtouseicu.md create mode 100644 docs/userguide/i18n.md create mode 100644 docs/userguide/icu4j-locale-service-provider.md create mode 100644 docs/userguide/icudata.md create mode 100644 docs/userguide/icufaq/icu4j-faq.md create mode 100644 docs/userguide/icufaq/index.md create mode 100644 docs/userguide/index.md create mode 100644 docs/userguide/io/index.md create mode 100644 docs/userguide/io/ustdio.md create mode 100644 docs/userguide/io/ustream.md create mode 100644 docs/userguide/layoutengine/index.md create mode 100644 docs/userguide/layoutengine/paragraph.md create mode 100644 docs/userguide/locale/examples.md create mode 100644 docs/userguide/locale/index.md create mode 100644 docs/userguide/locale/localizing.md create mode 100644 docs/userguide/locale/resources.md create mode 100644 docs/userguide/packaging/index.md create mode 100644 docs/userguide/packaging/plug-ins.md create mode 100644 docs/userguide/posix.md create mode 100644 docs/userguide/services.md create mode 100644 docs/userguide/sitemap.md create mode 100644 docs/userguide/strings/characteriterator.md create mode 100644 docs/userguide/strings/index.md create mode 100644 docs/userguide/strings/properties.md create mode 100644 docs/userguide/strings/regexp.md create mode 100644 docs/userguide/strings/stringprep.md create mode 100644 docs/userguide/strings/unicodeset.md create mode 100644 docs/userguide/strings/utext.md create mode 100644 docs/userguide/strings/utf-8.md create mode 100644 docs/userguide/transforms/bidi.md create 
mode 100644 docs/userguide/transforms/casemappings.md create mode 100644 docs/userguide/transforms/general/index.md create mode 100644 docs/userguide/transforms/general/rules.md create mode 100644 docs/userguide/transforms/index.md create mode 100644 docs/userguide/transforms/normalization/examples.md create mode 100644 docs/userguide/transforms/normalization/index.md create mode 100644 docs/userguide/unicode.md create mode 100644 docs/userguide/usefrom/cobol.md create mode 100644 docs/userguide/usefrom/index.md diff --git a/docs/processes/rules_update.md b/docs/processes/rules_update.md index 7cf7674c0c4..df6bfbda778 100644 --- a/docs/processes/rules_update.md +++ b/docs/processes/rules_update.md @@ -110,7 +110,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`. (If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.) - Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](http://userguide.icu-project.org/boundaryanalysis/break-rules) for an explanation of rule syntax and behavior. + Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../userguide/boundaryanalysis/break-rules.md) for an explanation of rule syntax and behavior. The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include: diff --git a/docs/userguide/boundaryanalysis/break-rules.md b/docs/userguide/boundaryanalysis/break-rules.md new file mode 100644 index 00000000000..03dec09d470 --- /dev/null +++ b/docs/userguide/boundaryanalysis/break-rules.md @@ -0,0 +1,437 @@ + + +# Break Rules + +## Introduction + +ICU locates boundary positions within text by means of rules, which are a form +of regular expressions. 
The form of the rules is similar, but not identical,
+to the boundary rules from the Unicode specifications
+[[UAX-14](https://unicode.org/reports/tr14/),
+[UAX-29](https://unicode.org/reports/tr29/)], and there is a reasonably close
+correspondence between the two.
+
+Taken as a set, the ICU rules describe how to move forward to the next boundary,
+starting from a known boundary.
+ICU includes rules for the standard boundary types (word, line, etc.).
+Applications may also create customized break iterators from their own rules.
+
+ICU's built-in rules are located at
+[icu/icu4c/source/data/brkitr/rules/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
+These can serve as examples when writing your own, and as a starting point for
+customizations.
+
+### Rule Tutorial
+
+Rules most commonly describe a range of text that should remain together,
+unbroken. For example, this rule
+
+    [\p{Letter}]+;
+
+matches a run of one or more letters, and would cause them to remain unbroken.
+
+The part within `[`brackets`]` follows normal ICU [UnicodeSet pattern
+syntax](../strings/unicodeset.md).
+
+The qualifier, '`+`' in this case, can be one of
+
+| Qualifier | Meaning                  |
+| --------- | ------------------------ |
+| empty     | Match exactly once       |
+| `?`       | Match zero or one time   |
+| `+`       | Match one or more times  |
+| `*`       | Match zero or more times |
+
+#### Variables
+
+A variable names a set or rule sub-expression. Variables are useful for documenting
+what something represents, and for simplifying complex expressions by breaking
+them up.
+
+"Variable" is something of a misnomer; variables cannot be reassigned, and behave
+more like named constant expressions.
+
+They start with a '`$`', both in the definition and use.
+
+    # Variable Definition
+    $ASCIILetNum = [A-Za-z0-9];
+    # Variable Use
+    $ASCIILetNum+;
+
+#### Comments and Semicolons
+
+'`#`' begins a comment, which extends to the end of a line.
+
+Comments may stand alone, or appear after another statement on a line.
+
+All rule statements or expressions are terminated by semicolons.
+
+#### Chained Matching
+
+Most ICU rule sets use the concept of "chained matching". The idea is that a
+complete match can be composed from multiple pieces, with each piece coming from
+an individual rule of a rule set.
+
+This idea is unique to ICU break rules; it is not a concept found in other
+regular-expression-based matchers. Some of the Unicode standard break rules
+would be difficult to implement without it.
+
+Starting with an example,
+
+    !!chain;
+    $word_char = [\p{Letter}];
+    $word_joiner = [_-];
+    $word_char+;
+    $word_char $word_joiner $word_char;
+
+These rules will match "`abc`", "`hello_world`", "`hi-there`",
+"`a-bunch_of-joiners-here`".
+
+They will not match "`-abc`", "`multiple__joiners`", or "`tail-`".
+
+A full match is composed of pieces or submatches, possibly from different rules,
+with adjacent submatches linked by at least one overlapping character.
+
+In the example below, matching "`hello_world`",
+
+* '`1`' shows matches of the first rule, `$word_char+`
+
+* '`2`' shows matches of the second rule, `$word_char $word_joiner $word_char`
+
+    hello_world
+    11111 11111
+        222
+
+There is an overlap of the matched regions, which causes the chaining mechanism
+to join them into a single overall match.
+
+The mechanism is a good match to, for example, [Unicode's word break
+rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), where rules
+WB5 through WB13 combine to piece together longer words from multiple short
+segments.
+
+`!!chain;` enables chaining in a rule set. It is disabled by default for backward
+compatibility—very old versions of ICU did not support it, and it was
+originally introduced as an option.
+
+#### Parentheses and Alternation
+
+Rule expressions can contain parentheses and '`|`' operators, representing
+alternation or "or" operations.
This follows conventional regular expression
+behavior.
+
+For example, the following would match a simplified identifier:
+
+    $Letter ($Letter | $Digit)*;
+
+#### String and Character Literals
+
+Similarly to common regular expressions, literal characters that do not have
+other special meaning represent themselves. So the rule
+
+    Hello;
+
+would match the literal input "`Hello`".
+
+In practice, nearly all break rules are composed from `[`sets`]` based on Unicode
+character properties; literal characters in rules are very rare.
+
+To prevent random typos in rules from being treated as literals, use this
+option:
+
+    !!quoted_literals_only;
+
+With the option, the naked `Hello` becomes a rule syntax error, while the quoted
+`"Hello"` still matches the literal input.
+
+`!!quoted_literals_only` is strongly recommended for all rule sets. The random
+typo problem is very real, and surprisingly hard to recognize and debug.
+
+#### Explicit Break Rules
+
+A rule containing a slash (`/`) will force a boundary when it matches, even when
+other rules or chaining would otherwise lead to a longer match. Also called Hard
+Break Rules, these have the form
+
+    pre-context / post-context;
+
+where the pre- and post-contexts look like normal break rules. Both the pre-
+and post-context are required, and must not allow a zero-length match. There should
+be no overlap between characters that end a match of the pre-context and those
+that begin a match of the post-context.
+
+Chaining into a hard break rule operates normally. There is no chaining out of a
+hard break rule; when the post-context matches, a break is forced immediately.
+
+> :point_right: **Note**: Future versions of ICU may loosen the restrictions on explicit break
+rules. The behavior of rules with missing or overlapping contexts is subject to
+change.
+
+#### Chaining Control
+
+Chaining into a rule can be disallowed by beginning that rule with a '`^`'.
Rules +so marked can begin a match after a preceding boundary or at the start of text, +but cannot extend a match via chaining from another rule. + +~~The !!LBCMNoChain; statement modifies chaining behavior by preventing chaining +from one rule to another from occurring on any character whose Line Break +property is Combining Mark. This option is subject to change or removal, and +should not be used in general. Within ICU, it is used only with the line break +rules. We hope to replace it with something more general.~~ + +> :point_right: **Note**: `!!LBCMNoChain` is deprecated, and will be removed completely from a future +version of ICU. + +## Rule Status Values + +Break rules can be tagged with a number, which is called the *rule status*. +After a boundary has been located, the status number of the specific rule that +determined the boundary position is available to the application through the +function `getRuleStatus()`. + +For the predefined word boundary rules, status values are available to +distinguish between boundaries associated with words, numbers, and those around +spaces or punctuation. Similarly for line break boundaries, status values +distinguish between mandatory line endings (new line characters) and break +opportunities that are appropriate points for line wrapping. Refer to the ICU +API documentation for the C header file `ubrk.h` or to Java class +`RuleBasedBreakIterator` for a complete list of the predefined boundary +classifications. + +When creating custom sets of break rules, integer status values can be +associated with boundary rules in whatever way will be convenient for the +application. There is no need to remain restricted to the predefined values and +classifications from the standard rules. + +It is possible for a set of break rules to contain more than a single rule that +produces some boundary in an input text. 
In this event, `getRuleStatus()` will
+return the numerically largest status value from the matching rules, and the
+alternate function `getRuleStatusVec()` will return a vector of the values from
+all of the matching rules.
+
+In the source form of the break rules, status numbers appear at the end of a rule,
+and are enclosed in `{`braces`}`.
+
+Hard break rules that also have a status value place the status at the end, for
+example
+
+    pre-context / post-context {1234};
+
+### Word Dictionaries
+
+For some languages that don't normally use spaces between words, break iterators
+are able to supplement the rules with dictionary-based breaking. Some languages,
+Thai or Lao, for example, use a dictionary for both word and line breaking.
+Others, such as Japanese, use a dictionary for word breaking, but not for line
+breaking.
+
+To enable dictionary use,
+
+1. The break rules must select, as unbroken chunks, ranges of text to be passed
+   off to the word dictionary for further subdivision.
+2. The break rules must define a character class named `$dictionary` that
+   contains the characters (letters) to be handled by the dictionary.
+
+The dictionary implementation, on receiving a range of text, will map it to a
+specific dictionary based on script, and then delegate to that dictionary for
+subdividing the range into words.
+
+See, for example, this snippet from the [line break
+rules](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/brkitr/rules/line.txt):
+
+    # Dictionary character set, for triggering language-based break engines. Currently
+    # limited to LineBreak=Complex_Context (SA).
+    $dictionary = [$SA];
+
+## Rule Options
+
+| Option | Description |
+| --------------- | ----------- |
+| `!!chain` | Enable rule chaining. Default is no chaining. |
+| `!!forward` | The rules that follow are for forward iteration. Forward rules are now the only type of rules needed or used. |
### Deprecated Rule Options
+
+| Deprecated Option | Description |
+| --------------- | ----------- |
+| ~~`!!reverse`~~ | ~~*[deprecated]* The rules that follow are for reverse iteration. No longer needed; any rules in a Reverse rule section are ignored.~~ |
+| ~~`!!safe_forward`~~ | ~~*[deprecated]* The rules that follow are for safe forward iteration. No longer needed; any rules in such a section are ignored.~~ |
+| ~~`!!safe_reverse`~~ | ~~*[deprecated]* The rules that follow are for safe reverse iteration. No longer needed; any rules in such a section are ignored.~~ |
+| ~~`!!LBCMNoChain`~~ | ~~*[deprecated]* Disable chaining when the overlap character matches `\p{Line_Break=Combining_Mark}`.~~ |
+
+## Rule Syntax
+
+Here is the syntax for the boundary rules. (The EBNF syntax is given below.)
+
+| Rule Name | Rule Values | Notes |
+| ---------- | ----------- | ----- |
+| rules | statement+ | |
+| statement | assignment \| rule \| control | |
+| control | (`!!forward` \| `!!reverse` \| `!!safe_forward` \| `!!safe_reverse` \| `!!chain`) `;` | |
+| assignment | variable `=` expr `;` | 5 |
+| rule | `^`? expr (`{`number`}`)? 
`;` | 8,9 |
+| number | [0-9]+ | 1 |
+| break-point | `/` | 10 |
+| expr | expr-q \| expr `\|` expr \| expr expr | 3 |
+| expr-q | term \| term `*` \| term `?` \| term `+` | |
+| term | rule-char \| unicode-set \| variable \| quoted-sequence \| `(` expr `)` \| break-point | |
+| rule-special | *any printing ASCII character except letters or numbers* \| white-space | |
+| rule-char | *any non-escaped character that is not rule-special* \| `.` \| *any escaped character except* `\p` *or* `\P` | |
+| variable | `$` name-start-char name-char* | 7 |
+| name-start-char | `_` \| \p{L} | |
+| name-char | name-start-char \| \p{N} | |
+| quoted-sequence | `'` *(any char except single quote or line terminator or two adjacent single quotes)*+ `'` | |
+| escaped-char | *See "Character Quoting and Escaping" in the [UnicodeSet](../strings/unicodeset.md) chapter* | |
+| unicode-set | See [UnicodeSet](../strings/unicodeset.md) | 4 |
+| comment | unescaped `#` *(any char except new-line)*\* new-line | 2 |
+| s | unescaped \p{Z}, tab, LF, FF, CR, NEL | 6 |
+| new-line | LF, CR, NEL | 2 |
+
+### Rule Syntax Notes
+
+1. The number associated with a rule that actually determined a break position
+   is available to the application after the break has been returned. These
+   numbers are *not* Perl regular expression repeat counts.
+
+2. Comments are recognized and removed in a separate pass, before the rules
+   themselves are parsed. They may appear wherever a space would be allowed
+   (and ignored).
+
+3. The implicit concatenation of adjacent terms has higher precedence than the
+   `|` operation. "`ab|cd`" is interpreted as "`(ab)|(cd)`", not as "`a(b|c)d`" or
+   "`(((ab)|c)d)`".
+
+4. The syntax for [unicode-set](../strings/unicodeset.md) is defined (and parsed) by the `UnicodeSet` class.
+   It is not repeated here.
+
+5. For `$`variables that will be referenced from inside of a `UnicodeSet`, the
+   definition must consist only of a Unicode Set.
For example, when variable `$a` + is used in a rule like `[$a$b$c]`, then this definition of `$a` is ok: + “`$a=[:Lu:];`” while this one “`$a=abcd;`” would cause an error when `$a` was + used. + +6. Spaces are allowed nearly anywhere, and are not significant unless escaped. + Exceptions to this are noted. + +7. No spaces are allowed within a variable name. The variable name `$dictionary` + is special. If defined, it must be a Unicode Set, the characters of which + will trigger the use of word dictionary based boundaries. + +8. A leading `^` on a rule prevents chaining into that rule. It can only match + immediately after a preceding boundary, or at the start of text. + +9. `{`nnn`}` appearing at the end of a rule is a Rule Status number, not a repeat + count as it would be with conventional regular expression syntax. + +10. A `/` in a rule specifies a hard break point. If the rule matches, a + boundary will be forced at the position of the `/` within the match. + +### EBNF Syntax used for the RBBI rules syntax description + +| syntax | description | +| -- | ------------------------- | +| a? | zero or one instance of a | +| a+ | one or more instances of a | +| a* | zero or more instances of a | +| a \| b | either a or b, but not both | +| `a` "`a`" | the literal string between the quotes or displayed as `monospace` | + +## Planned Changes and Removed or Deprecated Rule Features + +1. Reverse rules could formerly be indicated by beginning them with an + exclamation `!`. This syntax is deprecated, and will be removed from a + future version of ICU. + +2. `!!LBCMNoChain` was a global option that specified that characters with the + line break property of "Combining Character" would not participate in rule + chaining. This option was always considered internal, is deprecated and will + be removed from a future version of ICU. + +3. Naked rule characters. 
Plain text, in the context of a rule, is treated as
+literal text to be matched, much like normal regular expressions. This turns
+out to be very error-prone, has been the source of bugs in released versions
+of ICU, and is not useful in implementing normal text boundary rules. A
+future version will reject literal text that is not escaped.
+
+4. Exact reverse rules and safe forward rules: planned changes to the break
+   engine implementation will remove the need for exact reverse rules and safe
+   forward rules.
+
+5. `{bof}` and `{eof}`, appearing within `[`sets`]`, match the beginning or ending of
+   the input text, respectively. This is an internal (not documented) feature
+   that will probably be removed in a future version of ICU. They are currently
+   used by the standard rules for word, line and sentence breaking. An
+   alternative is probably needed. The existing implementation is incomplete.
+
+## Additional Sample Code
+
+**C/C++**: See
+[icu/source/samples/break/](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/break/)
+in the ICU source distribution for code samples showing the use of ICU boundary
+analysis.
+
+## Details about Dictionary-Based Break Iteration
+
+> :point_right: **Note**: This section dates from August 2012.
+> It is probably out of date; for example, `brkfiles.mk` does not exist anymore.
+
+Certain Unicode characters have a "dictionary" bit set in the break iteration
+rules, and text made up of these characters cannot be handled by the rules-based
+break iteration code for lines or words. Rather, they must be handled by a
+dictionary-based approach. The ICU approach is as follows:
+
+Once the Dictionary bit is detected, the set of characters with that bit is
+handed off to "dictionary code." This code then inspects the characters more
+carefully, and splits them by script (Thai, Khmer, Chinese, Japanese, Korean).
+If text in this script has not yet been handled, it loads the appropriate
+dictionary from disk, and initializes a specialized "BreakEngine" class for that
+script.
+
+There are three such specialized classes: Thai, Khmer and CJK.
+
+Thai and Khmer use very similar approaches. They look through a dictionary that
+is not weighted by word frequency, and attempt to find the longest total "match"
+that can be made in the text.
+
+For Chinese and Japanese text, on the other hand, we have a unified dictionary
+(because the two languages share many of the same characters, it is difficult to
+distinguish them) that contains information about word frequencies. The
+algorithm to match text then uses dynamic programming to find the set of breaks
+it considers "most likely" based on the frequency of the words created by the
+breaks. This algorithm could also be used for Thai, Khmer, or Korean, but we do
+not currently have sufficient data to do so.
+
+Code of interest is in `source/common/dictbe.{h, cpp}`, `source/common/brkeng.{h,
+cpp}`, `source/common/dictionarydata.{h, cpp}`. The dictionaries use the `BytesTrie`
+and `UCharsTrie` as their data store. The binary form of these dictionaries is
+produced by the `gendict` tool, which has source in `source/tools/gendict`.
+
+In order to add new dictionary implementations, a few changes have to be made.
+First, you should create a new subclass of `DictionaryBreakEngine` or
+`LanguageBreakEngine` in `dictbe.cpp` that implements your algorithm. Then, in
+`brkeng.cpp`, you should add logic to create this dictionary break engine when the
+appropriate script is encountered, which should only take a few lines of code.
+Lastly, you should add the correct data file.
If your data is to be +represented as a `.dict` file - as is recommended, and in fact required if you +don't want to make substantial code changes to the engine loader - you need to +simply add a file in the correct format for gendict to the `source/data/brkitr` +directory, and add its name to the list of `BRK_DICT_SOURCE` in +`source/data/brkitr/brkfiles.mk`. This will cause your dictionary (say, `foo.txt`) +to be added as a `UCharsTrie` dictionary with the name foo.dict. If you want your +dictionary to be a `BytesTrie` dictionary, you will need to specify a transform +within the `Makefile`. To do so, find the part of `source/data/Makefile.in` and +`source/data/makedata.mak` that deals with `thaidict.dict` and `khmerdict.dict` and +add a similar set of lines for your script. Lastly, in +`source/data/brkitr/root.txt`, add a line to the dictionaries `{}` section of the +form: + + shortscriptname:process(dependency){"dictionaryname.dict"} + +For example, for Katakana: + + Kata:process(dependency){"cjdict.dict"} + +Make sure to add appropriate tests for the new implementation. diff --git a/docs/userguide/boundaryanalysis/index.md b/docs/userguide/boundaryanalysis/index.md new file mode 100644 index 00000000000..3003c5bddab --- /dev/null +++ b/docs/userguide/boundaryanalysis/index.md @@ -0,0 +1,529 @@ + + +# Boundary Analysis + +## Overview of Text Boundary Analysis + +Text boundary analysis is the process of locating linguistic boundaries while +formatting and handling text. Examples of this process include: + +1. Locating appropriate points to word-wrap text to fit within specific margins + while displaying or printing. + +2. Locating the beginning of a word that the user has selected. + +3. Counting characters, words, sentences, or paragraphs. + +4. Determining how far to move the text cursor when the user hits an arrow key + (Some characters require more than one position in the text store and some + characters in the text store do not display at all). + +5. 
Making a list of the unique words in a document.
+
+6. Figuring out if a given range of text contains only whole words.
+
+7. Capitalizing the first letter of each word.
+
+8. Locating a particular unit of the text (for example, finding the third word
+   in the document).
+
+The `BreakIterator` classes were designed to support these kinds of tasks.
+`BreakIterator` objects maintain a location between two characters in the text.
+This location will always be a text boundary. Clients can move the location
+forward to the next boundary or backward to the previous boundary. Clients can
+also check if a particular location within a source text is on a boundary, or
+find the boundary which is before or after a particular location.
+
+## Four Types of BreakIterator
+
+ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:
+
+1. Character Boundary
+
+2. Word Boundary
+
+3. Line-break Boundary
+
+4. Sentence Boundary
+
+Each type of boundary is found in accordance with the rules specified by Unicode
+Standard Annex #29, *Unicode Text Segmentation*
+(<https://unicode.org/reports/tr29/>), or Unicode Standard Annex #14, *Unicode
+Line Breaking Algorithm* (<https://unicode.org/reports/tr14/>).
+
+### Character Boundary
+
+The character-boundary iterator locates the boundaries according to the rules
+defined in <https://unicode.org/reports/tr29/>.
+These boundaries try to match what a user would think of as a "character"—a
+basic unit of a writing system for a language—which may be more than just a
+single Unicode code point.
+
+The letter `Ä`, for example, can be represented in Unicode either with a single
+code-point value or with two code-point values (one representing the `A` and
+another representing the umlaut `¨`). The character-boundary iterator will treat
+either representation as a single character.
+
+End-user characters, as described above, are also called grapheme clusters, in
+an attempt to limit the confusion caused by multiple meanings for the word
+"character".
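This single-character behavior for a decomposed letter can be observed with the JDK's `java.text.BreakIterator`, which exposes essentially the same interface as the ICU classes described here (ICU4J's `com.ibm.icu.text.BreakIterator` is a superset of it). A minimal sketch, assuming only a standard JDK:

```java
import java.text.BreakIterator;

public class GraphemeBoundaries {
    public static void main(String[] args) {
        // "Ä" in decomposed form: the letter A (U+0041) followed by a
        // combining diaeresis (U+0308) -- two code points, one user character.
        String decomposed = "A\u0308";

        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(decomposed);

        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) {
            graphemes++;
        }
        // Both code points fall inside a single grapheme cluster.
        System.out.println(graphemes + " user character(s) in "
                + decomposed.length() + " code units");
    }
}
```

Running the same loop over the precomposed form `"\u00C4"` also reports one character, which is exactly the equivalence described above.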

### Word Boundary

The word-boundary iterator locates the boundaries of words, for purposes such as
double-click selection or "Find whole words" operations.

Word boundaries are identified according to the rules in
<https://www.unicode.org/reports/tr29/#Word_Boundaries>, supplemented by a word
dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for
locating word breaks take into account the alphabets and conventions used by
different languages.

Here's an example of a sentence, showing the boundary locations that will be
identified by a word break iterator:

> :point_right: **Note**: TODO: An example needs to be added here.

### Line-break Boundary

The line-break iterator locates positions that would be appropriate points to
wrap lines when displaying the text. The boundary rules are defined here:
<https://www.unicode.org/reports/tr14/>

This example shows the differences in the break locations produced by word and
line break iterators:

> :point_right: **Note**: TODO: An example needs to be added here.

### Sentence Boundary

A sentence-break iterator locates sentence boundaries according to the rules
defined here: <https://www.unicode.org/reports/tr29/#Sentence_Boundaries>

## Dictionary-Based BreakIterator

Some languages are written without spaces, and word and line breaking requires
more than rules over character sequences. ICU provides dictionary support for
word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.

Use of the dictionaries is automatic when text in one of the dictionary
languages is encountered. There is no separate API, and no extra programming
steps required by applications making use of the dictionaries.

## Usage

To locate boundaries in a document, create a BreakIterator using the
`BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()`
function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`,
depending on the type of iterator wanted. These factory methods also take a
parameter that specifies the locale for the language of the text to be processed.
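As a sketch of the factory pattern just described, the JDK's `java.text.BreakIterator` offers the same four factory methods (`getCharacterInstance`, `getWordInstance`, `getLineInstance`, `getSentenceInstance`); the ICU C and C++ factories behave analogously. The class used here is JDK, not ICU:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class IteratorTypesDemo {
    // Count every boundary reported by an iterator, including the one
    // at position 0 returned by first().
    public static int countBoundaries(BreakIterator it, String text) {
        it.setText(text);
        int count = 0;
        for (int p = it.first(); p != BreakIterator.DONE; p = it.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello world.";
        // The four factory methods parallel ICU's create***Instance family.
        System.out.println(countBoundaries(BreakIterator.getCharacterInstance(Locale.US), text)); // 13
        System.out.println(countBoundaries(BreakIterator.getWordInstance(Locale.US), text));      // 5
        System.out.println(countBoundaries(BreakIterator.getLineInstance(Locale.US), text));
        System.out.println(countBoundaries(BreakIterator.getSentenceInstance(Locale.US), text));
    }
}
```

Each iterator type reports a different set of boundaries for the same text, which is the point of having four factories.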
+ +When creating a `BreakIterator`, a locale is also specified, and the behavior of +the BreakIterator obtained may be specialized in some way for that locale. For +most locales the default break iterator behavior is used. + +Applications also may register customized BreakIterators for use in specific +locales. Once such a break iterator has been registered, any requests for break +iterators for that locale will return copies of the registered break iterator. + +ICU may cache service instances. Therefore, registration should be done during +startup, before opening services by locale ID. + +In the general-usage-model, applications will use the following basic steps to +analyze a piece of text for boundaries: + +1. Create a `BreakIterator` with the desired behavior + +2. Use the `setText()` method to set the iterator to analyze a particular piece + of text. + +3. Locate the desired boundaries using the appropriate combination of `first()`, + `last()`, `next()`, `previous()`, `preceding()`, and `following()` methods. + +The `setText()` method can be called more than once, allowing reuse of a +BreakIterator on new pieces of text. Because the creation of a `BreakIterator` can +be relatively time-consuming, it makes good sense to reuse them when practical. + +The iterator always points to a boundary position between two characters. The +numerical value of the position, as returned by `current()` is the zero-based +index of the character following the boundary. Thus a position of zero +represents a boundary preceding the first character of the text, and a position +of one represents a boundary between the first and second characters. + +The `first()` and `last()` methods reset the iterator's current position to the +beginning or end of the text (the beginning and the end are always considered +boundaries). The `next()` and `previous()` methods advance the iterator one boundary +forward or backward from the current position. 
If the `next()` or `previous()`
methods run off the beginning or end of the text, they return DONE. The `current()`
method returns the current position.

The `following()` and `preceding()` methods are used for random access, to move the
iterator to an arbitrary position within the text. Since a BreakIterator always
points to a boundary position, the `following()` and `preceding()` methods will
never set the iterator to point to the position specified by the caller (even if
it is, in fact, a boundary position). `BreakIterator` will, however, set the
iterator to the nearest boundary position before or after the specified
position.

`isBoundary()` returns true if the specified position is a boundary.

### Thread Safety

`BreakIterator`s are not thread safe. This is inherent in their design—break
iterators are stateful, holding a reference to and position in the text, meaning
that a single instance cannot operate in parallel on multiple texts.

For concurrent break iteration, each thread must use its own break iterator.
These can be obtained by creating separate break iterators of the desired type,
or by initially creating a master break iterator and then creating a clone for
each thread.

### Line Breaking Strictness, a CSS Property

CSS has the concept of "[Line Breaking
Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
property specifies the strictness of line-breaking rules applied within an
element: especially how wrapping interacts with punctuation and symbols. ICU
line break iterators can choose a strictness using locale tags:

| Locale | Behavior |
| --------------------------------- | ----------- |
| `en@lb=strict`<br/>`ja@lb=strict` | Breaks text using the most stringent set of line-breaking rules. |
| `en@lb=normal`<br/>`ja@lb=normal` | Breaks text using the most common set of line-breaking rules. |
| `en@lb=loose`<br/>`ja@lb=loose` | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. |

### Sentence Break Filters

Sentence breaking can return false positives (an indication that a sentence ends
in an incorrect position) in the presence of abbreviations. For example,
consider the sentence

> In the meantime Mr. Weston arrived with his small ship.

The default sentence break rules show a false boundary following "Mr."

ICU includes lists of common abbreviations that can be used to filter out, and
thus ignore, these false sentence boundaries. Filtering is enabled by the presence of
the `ss` locale tag when creating the break iterator.

| Locale | Behavior |
| ---------------- | ------------------------------------------------------- |
| `en` | No filtering. |
| `en@ss=standard` | Filter based on common English language abbreviations. |
| `es@ss=standard` | Filter with common Spanish abbreviations. |

Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
French, Italian and Portuguese.

## Accuracy

ICU's break iterators are based on the default boundary rules described in the
Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
[29](https://www.unicode.org/reports/tr29/). These are relatively
simple boundary rules that can be implemented efficiently, and are sufficient
for many purposes and languages. However, some languages and applications will
require a more sophisticated linguistic analysis of the text in order to find
boundaries with good accuracy. Such an analysis is not directly available from
ICU at this time.

Break Iterators based on custom, user-supplied boundary rules can be created and
used by applications with requirements that are not met by the standard default
boundary rules.
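The abbreviation problem described above can be reproduced with the default rules. The JDK's `java.text.BreakIterator` implements only the default (unfiltered) sentence rules; the `ss=standard` filtering is ICU-specific, so this sketch shows the unfiltered behavior:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceDemo {
    // Slice a text into the sentences reported by the default rules.
    public static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two real sentences: reported correctly.
        System.out.println(sentences("It works. Try it."));
        // Default rules carry no abbreviation list, so "Mr." may be
        // reported as a sentence end; ICU's en@ss=standard fixes this.
        System.out.println(sentences("In the meantime Mr. Weston arrived with his small ship."));
    }
}
```

The second call illustrates the false positive that the ICU suppression data is designed to remove.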

## BreakIterator Boundary Analysis Examples

### Print out all the word-boundary positions in a UnicodeString

**In C++:**

```c++
void listWordBoundaries(const UnicodeString& s) {
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
    if (U_FAILURE(status)) {
        return;
    }
    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d\n", p);
        p = bi->next();
    }
    delete bi;
}
```

**In C:**

```c
void listWordBoundaries(const UChar* s, int32_t len) {
    UBreakIterator* bi;
    int32_t p;
    UErrorCode err = U_ZERO_ERROR;
    bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
    if (U_FAILURE(err)) return;
    p = ubrk_first(bi);
    while (p != UBRK_DONE) {
        printf("Boundary at position %d\n", p);
        p = ubrk_next(bi);
    }
    ubrk_close(bi);
}
```

### Get the boundaries of the word that contains a double-click position

**In C++:**

```c++
void wordContaining(BreakIterator& wordBrk,
                    int32_t idx,
                    const UnicodeString& s,
                    int32_t& start,
                    int32_t& end) {
    // this function is written to assume that we have an
    // appropriate BreakIterator stored in an object or a
    // global variable somewhere -- when possible, programmers
    // should avoid having the create() and delete calls in
    // a function of this nature.
    if (s.isEmpty())
        return;
    wordBrk.setText(s);
    start = wordBrk.preceding(idx + 1);
    end = wordBrk.next();
    // NOTE: for this and similar operations, use preceding() and next()
    // as shown here, not following() and previous(). preceding() is
    // faster than following() and next() is faster than previous()
    // NOTE: By using preceding(idx + 1) above, we're adopting the convention
    // that if the double-click comes right on top of a word boundary, it
    // selects the word that _begins_ on that boundary (preceding(idx) would
    // instead select the word that _ends_ on that boundary).
+} +``` + +**In C:** + +```c +void wordContaining(UBreakIterator* wordBrk, + int32_t idx, + const UChar* s, + int32_t sLen, + int32_t* start, + int32_t* end, + UErrorCode* err) { + if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) { + *err = U_ILLEGAL_ARGUMENT_ERROR; + return; + } + ubrk_setText(wordBrk, s, sLen, err); + if (U_SUCCESS(*err)) { + *start = ubrk_preceding(wordBrk, idx + 1); + *end = ubrk_next(wordBrk); + } +} +``` + +### Check for Whole Words + +Use the following to check if a range of text is a "whole word": + +**In C++:** + +```c++ +UBool isWholeWord(BreakIterator& wordBrk, + const UnicodeString& s, + int32_t start, + int32_t end) { + if (s.isEmpty()) + return FALSE; + wordBrk.setText(s); + if (!wordBrk.isBoundary(start)) + return FALSE; + return wordBrk.isBoundary(end); +} +``` + +**In C:** + +```c +UBool isWholeWord(UBreakIterator* wordBrk, + const UChar* s, + int32_t sLen, + int32_t start, + int32_t end, + UErrorCode* err) { + UBool result = FALSE; + if (wordBrk == NULL || s == NULL) { + *err = U_ILLEGAL_ARGUMENT_ERROR; + return FALSE; + } + ubrk_setText(wordBrk, s, sLen, err); + if (U_SUCCESS(*err)) { + result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end); + } + return result; +} +``` + +Count the words in a document (C++ only): + +```c++ +int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) { + bi.setText(s); + int32_t count = 0; + while (start != BreakIterator::DONE) { + int breakType = bi.getRuleStatus(); + if (breakType != UBRK_WORD_NONE) { + // Exclude spaces, punctuation, and the like. + // A status value UBRK_WORD_NONE indicates that the boundary does + // not start a word or number. + // + ++count; + } + start = bi.next(); + } + return count; +} +``` + +The function `getRuleStatus()` returns an enum giving additional information on +the text preceding the last break position found. 
Using this value, it is +possible to distinguish between numbers, words, words containing kana +characters, words containing ideographic characters, and non-word characters, +such as spaces or punctuation. The sample uses the break status value to filter +out, and not count, boundaries associated with non-word characters. + +### Word-wrap a document (C++ only) + +The sample function below wraps a paragraph so that each line is less than or +equal to 72 characters. The function fills in an array passed in by the caller +with the starting offsets of +each line in the document. Also, it fills in a second array to track how many +trailing white space characters there are in the line. For simplicity, it is +assumed that an outside process has already broken the document into paragraphs. +For example, it is assumed that every string the function is passed has a single +newline at the end only. + +```c++ +int32_t wrapParagraph(const UnicodeString& s, + const Locale& locale, + int32_t lineStarts[], + int32_t trailingwhitespace[], + int32_t maxLines, + UErrorCode &status) { + + int32_t numLines = 0; + int32_t p, q; + const int32_t MAX_CHARS_PER_LINE = 72; + UChar c; + + BreakIterator *bi = BreakIterator::createLineInstance(locale, status); + if (U_FAILURE(status)) { + delete bi; + return 0; + } + bi->setText(s); + + + p = 0; + while (p < s.length()) { + // jump ahead in the paragraph by the maximum number of + // characters that will fit + q = p + MAX_CHARS_PER_LINE; + + // if this puts us on a white space character, a control character + // (which includes newlines), or a non-spacing mark, seek forward + // and stop on the next character that is not any of these things + // since none of these characters will be visible at the end of a + // line, we can ignore them for the purposes of figuring out how + // many characters will fit on the line) + if (q < s.length()) { + c = s[q]; + while (q < s.length() + && (u_isspace(c) + || u_charType(c) == U_CONTROL_CHAR + || 
u_charType(c) == U_NON_SPACING_MARK + )) { + ++q; + c = s[q]; + } + } + + // then locate the last legal line-break decision at or before + // the current position ("at or before" is what causes the "+ 1") + q = bi->preceding(q + 1); + + // if this causes us to wind back to where we started, then the + // line has no legal line-break positions. Break the line at + // the maximum number of characters + if (q == p) { + p += MAX_CHARS_PER_LINE; + lineStarts[numLines] = p; + trailingwhitespace[numLines] = 0; + ++numLines; + } + // otherwise, we got a good line-break position. Record the start of this + // line (p) and then seek back from the end of this line (q) until you find + // a non-white space character (same criteria as above) and + // record the number of white space characters at the end of the + // line in the other results array + else { + lineStarts[numLines] = p; + int32_t nextLineStart = q; + + for (q--; q > p; q--) { + c = s[q]; + if (!(u_isspace(c) + || u_charType(c) == U_CONTROL_CHAR + || u_charType(c) == U_NON_SPACING_MARK)) { + break; + } + } + trailingwhitespace[numLines] = nextLineStart - q -1; + p = nextLineStart; + ++numLines; + } + if (numLines >= maxLines) { + break; + } + } + delete bi; + return numLines; +} +``` + +Most text editors would not break lines based on the number of characters on a +line. Even with a monospaced font, there are still many Unicode characters that +are not displayed and therefore should be filtered out of the calculation. With +a proportional font, character widths are added up until a maximum line width is +exceeded or an end of the paragraph marker is reached. + +Trailing white space does not need to be counted in the line-width measurement +because it does not need to be displayed at the end of a line. 
The sample code +above returns an array of trailing white space values because an external +rendering process needs to be able to measure the length of the line (without +the trailing white space) to justify the lines. For example, if the text is +right-justified, the invisible white space would be drawn outside the margin. +The line would actually end with the last visible character. + +In either case, the basic principle is to jump ahead in the text to the location +where the line would break (without taking word breaks into account). Then, move +backwards using the preceding() method to find the last legal breaking position +before that location. Iterating straight through the text with next() method +will generally be slower. + +## ICU BreakIterator Data Files + +The source code for the ICU break rules for the standard boundary types is +located in the directory +[icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules). +These files will be built, and the corresponding binary state tables +incorporated into ICU's data, by the standard ICU4C build process. + +The dictionary word lists used by word break, and for some languages, line break +are in +[icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries). + +The same data is used by both ICU4C and ICU4J. In the normal ICU build process, +the source data is processed into a binary form using ICU4C, and the resulting +binary tables are incorporated into ICU4J. diff --git a/docs/userguide/collation/api.md b/docs/userguide/collation/api.md new file mode 100644 index 00000000000..36d979d6884 --- /dev/null +++ b/docs/userguide/collation/api.md @@ -0,0 +1,696 @@ + + +# Collation API Details + +This section describes some of the usage conventions for the ICU Collation +Service API. + +## Collator Instantiation + +To use the Collation Service, you must instantiate a `Collator`. 
The Collator defines the properties and behavior of the sort ordering. The Collator
can be referenced repeatedly until all collation activities have been performed.
The Collator can then be closed and removed.

### Instantiating the Predefined Collators

ICU comes with a large set of predefined collators that are suited for
specific locales. Most of the ICU locales have a predefined collator. In the worst
case, the CLDR default set of rules,
which is mostly equivalent to the UCA default ordering (DUCET), is used.
The default sort order itself is designed to work well for many languages.
(For example, there are no tailorings for the standard sort orders for
English, German, French, etc.)

To instantiate a predefined collator, use the APIs `ucol_open`, `createInstance` and
`getInstance` for C, C++ and Java code respectively. The C API takes a locale ID
(or language tag) string argument, C++ takes a Locale object, and Java takes a
Locale or ULocale.

For some languages, multiple collation types are available; for example,
"de-u-co-phonebk" / "de@collation=phonebook". They can be enumerated via
`Collator::getKeywordValuesForLocale()`. See also the list of available collation
tailorings in the online [ICU Collation
Demo](http://demo.icu-project.org/icu-bin/collation.html).

Starting with ICU 54, collation attributes can be specified via locale keywords
as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in
language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive.

See the [LDML Collation spec, Collation
Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
and the [data
file](https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml) listing
the valid collation keywords and their values. (The deprecated attributes
kh/colHiraganaQuaternary and vt/variableTop are not supported.)
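The strength attribute that the `ks`/`colStrength` keyword controls in ICU can also be set programmatically. The JDK's `java.text.Collator` (a subset of the ICU4J API) exposes the same attribute, so the effect can be sketched without ICU; the class and constants below are JDK, not ICU:

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    // Compare two strings at a given strength, one of
    // Collator.PRIMARY / SECONDARY / TERTIARY.
    public static int compareAt(int strength, String a, String b) {
        Collator coll = Collator.getInstance(Locale.US);
        coll.setStrength(strength);
        return coll.compare(a, b);
    }

    public static void main(String[] args) {
        // TERTIARY (the default) sees case differences...
        System.out.println(compareAt(Collator.TERTIARY, "abc", "ABC") != 0); // true
        // ...while SECONDARY ignores them, since case is a tertiary difference.
        System.out.println(compareAt(Collator.SECONDARY, "abc", "ABC"));     // 0
    }
}
```

This is the programmatic equivalent of choosing between, say, "en" and "en-u-ks-level2".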
+ +For the [old locale extension +syntax](http://www.unicode.org/reports/tr35/tr35.html#Old_Locale_Extension_Syntax), +the data file's alias names are used (first alias, if defined, otherwise the +name): "de@collation=phonebook;colCaseLevel=yes;kv=space" + +For the language tag syntax, the non-alias names are used, and "true" values can +be omitted: "de-u-co-phonebk-kc-kv-space" + +This example demonstrates the instantiation of a collator. + +**C:** + +```C +UErrorCode status = U_ZERO_ERROR; +UCollator *coll = ucol_open("en_US", &status); +if(U_SUCCESS(status)) { + /* close the collator*/ + ucol_close(coll); +} +``` + +**C++:** + +```C++ +UErrorCode status = U_ZERO_ERROR; +Collator *coll = Collator::createInstance(Locale("en", "US"), status); +if(U_SUCCESS(status)) { + //close the collator + delete coll; +} +``` + +**Java:** + +```Java +Collator col = null; +try { + col = Collator.getInstance(Locale.US); +} catch (Exception e) { + System.err.println("English collation creation failed."); + e.printStackTrace(); +} +``` + +### Instantiating Collators Using Custom Rules + +If the ICU predefined collators are not appropriate for your intended usage, you +can +define your own set of rules and instantiate a collator that uses them. For more +details, please see [the section on collation +customization](customization/index.md). + +This example demonstrates the instantiation of a collator. 

**C:**

```C
UErrorCode status = U_ZERO_ERROR;
U_STRING_DECL(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
UCollator *coll;

U_STRING_INIT(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
coll = ucol_openRules(rules, -1, UCOL_ON, UCOL_DEFAULT_STRENGTH, NULL, &status);
if(U_SUCCESS(status)) {
    /* close the collator */
    ucol_close(coll);
}
```

**C++:**

```C++
UErrorCode status = U_ZERO_ERROR;
UnicodeString rules(u"&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E");
Collator *coll = new RuleBasedCollator(rules, status);
if(U_SUCCESS(status)) {
    //close the collator
    delete coll;
}
```

**Java:**

```Java
RuleBasedCollator coll = null;
String ruleset = "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E";
try {
    coll = new RuleBasedCollator(ruleset);
} catch (Exception e) {
    System.err.println("Customized collation creation failed.");
    e.printStackTrace();
}
```

## Compare

Two of the most used functions in the ICU collation API, `ucol_strcoll` and `ucol_getSortKey`, have their counterparts in both Win32 and ANSI APIs:

ICU C | ICU C++ | ICU Java | ANSI/POSIX | WIN32
----------------- | --------------------------- | -------------------------- | ---------- | -----
`ucol_strcoll` | `Collator::compare` | `Collator.compare` | `strcoll` | `CompareString`
`ucol_getSortKey` | `Collator::getSortKey` | `Collator.getCollationKey` | `strxfrm` | `LCMapString`
  | `Collator::getCollationKey` |   |   |

For more sophisticated usage, such as user-controlled language-sensitive text
searching, an iterating interface to collation is provided. Please refer to the
section below on `CollationElementIterator` for more details.

The `ucol_strcoll` function compares one pair of strings at a time. Comparing two
strings is much faster than calculating sort keys for both of them.
However, if +comparisons should be done repeatedly on a very large number of strings, generating +and storing sort keys can improve performance. In all other cases (such as quick +sort or bubble sort of a +moderately-sized list of strings), comparing strings works very well. + +The C API used for comparing two strings is `ucol_strcoll`. It requires two +`UChar *` strings and their lengths as parameters, as well as a pointer to a valid +`UCollator` instance. The result is a `UCollationResult` constant, which can be one +of `UCOL_LESS`, `UCOL_EQUAL` or `UCOL_GREATER`. + +The C++ API offers the method `Collator::compare` with several overloads. +Acceptable input arguments are `UChar *` with length of strings, or `UnicodeString` +instances. The result is a member of the `UCollationResult` or `EComparisonResult` enums. + +The Java API provides the method `Collator.compare` with one overload. Acceptable +input arguments are Strings or Objects. The result is an int value, which is +less than zero if source is less than target, zero if source and target are +equal, or greater than zero if source is greater than target. + +There are also several convenience functions and methods returning a boolean +value, such as `ucol_greater`, `ucol_greaterOrEqual`, `ucol_equal` (in C) +`Collator::greater`, `Collator::greaterOrEqual`, `Collator::equal` (in C++) and +`Collator.equals` (in Java). 
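The sign conventions above can be sketched with the JDK's `java.text.Collator`, which mirrors the ICU Java API (negative, zero, and positive correspond to `UCOL_LESS`, `UCOL_EQUAL`, and `UCOL_GREATER`):

```java
import java.text.Collator;
import java.util.Locale;

public class CompareDemo {
    // Reduce a comparison to its sign: -1, 0 or +1.
    public static int sign(String a, String b) {
        return Integer.signum(Collator.getInstance(Locale.US).compare(a, b));
    }

    public static void main(String[] args) {
        System.out.println(sign("apple", "banana")); // -1 (UCOL_LESS)
        System.out.println(sign("apple", "apple"));  //  0 (UCOL_EQUAL)
        System.out.println(sign("banana", "apple")); //  1 (UCOL_GREATER)
        // Boolean convenience form, like ucol_equal / Collator::equal:
        System.out.println(Collator.getInstance(Locale.US).equals("apple", "apple")); // true
    }
}
```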

### Examples

**C:**

```C
UChar *s [] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if(U_SUCCESS(status)) {
    for(i=listSize-1; i>=1; i--) {
        for(j=0; j<i; j++) {
            if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_LESS) {
                swap(s[j], s[j+1]);
            }
        }
    }
    ucol_close(coll);
}
```

**C++:**

```C++
UnicodeString s[] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
uint32_t i, j;
if(U_SUCCESS(status)) {
    for(i=listSize-1; i>=1; i--) {
        for(j=0; j<i; j++) {
            if(coll->compare(s[j], s[j+1]) == UCOL_LESS) {
                swap(s[j], s[j+1]);
            }
        }
    }
    delete coll;
}
```

**Java:**

```Java
String s [] = { /* list of Unicode strings */ };
try {
    Collator coll = Collator.getInstance(Locale.US);
    for (int i = s.length - 1; i >= 1; i--) {
        for (int j = 0; j < i; j++) {
            if (coll.compare(s[j], s[j+1]) < 0) {
                String tmp = s[j];
                s[j] = s[j+1];
                s[j+1] = tmp;
            }
        }
    }
} catch (Exception e) {
    System.err.println("English collation creation failed.");
    e.printStackTrace();
}
```

## Sort Keys

To generate a sort key, use `ucol_getSortKey` (C), `Collator::getSortKey` or
`Collator::getCollationKey` (C++), or `Collator.getCollationKey` (Java), as
listed in the table above. Sort keys are byte arrays that can be compared with a
binary comparison such as `strcmp`. The following example fills a sort key for
each string into a scratch buffer, growing the buffer on the heap when a key
does not fit:

```C
void getSortKeys(const UCollator *coll, const UChar * const *source, uint32_t arrayLength) {
    uint8_t buffer[1000];
    uint8_t *currBuffer = buffer;
    int32_t bufferCapacity = sizeof(buffer);
    int32_t keyLength;
    uint32_t i;

    for(i = 0; i < arrayLength; i++) {
        keyLength = ucol_getSortKey(coll, source[i], -1, currBuffer, bufferCapacity);
        if(keyLength > bufferCapacity) {
            /* the key did not fit: grow the buffer and get the key again */
            if (currBuffer == buffer) {
                currBuffer = (uint8_t*)malloc(keyLength);
            } else {
                currBuffer = (uint8_t*)realloc(currBuffer, keyLength);
            }
            bufferCapacity = keyLength;
            keyLength = ucol_getSortKey(coll, source[i], -1, currBuffer, bufferCapacity);
        }
        processSortKey(i, currBuffer, keyLength); /* application callback */
    }

    if (currBuffer != buffer && currBuffer != NULL) {
        free(currBuffer);
    }
}
```

> :point_right: **Note** Although the API allows you to call
> `ucol_getSortKey` with `NULL` to see what the
> sort key length is, it is strongly recommended that you NOT determine the length
> first, then allocate and fill the sort key buffer. If you do, it requires twice
> the processing since computing the length has to do the same calculation as
> actually getting the sort key. Instead, the example shown above uses a stack buffer.

### Using Iterators for String Comparison

ICU4C's `ucol_strcollIter` API allows for comparing two strings that are supplied
as character iterators (`UCharIterator`). This is useful when you need to compare
differently encoded strings using `strcoll`. In that case, converting the strings
first would probably be wasteful, since `strcoll` usually gives the result
before whole strings are processed. This API is implemented only as a C function
in ICU4C. There are no equivalent C++ or ICU4J functions.

```C
...
+/* we are arriving with two char*: utf8Source and utf8Target, with their +* lengths in utf8SourceLen and utf8TargetLen +*/ + UCharIterator sIter, tIter; + uiter_setUTF8(&sIter, utf8Source, utf8SourceLen); + uiter_setUTF8(&tIter, utf8Target, utf8TargetLen); + compareResultUTF8 = ucol_strcollIter(myCollation, &sIter, &tIter, &status); +... +``` + +### Obtaining Partial Sort Keys + +When using different sort algorithms, such as radix sort, sometimes it is useful +to process strings only as much as needed to feed into the sorting algorithm. +For that purpose, ICU provides the `ucol_nextSortKeyPart` API, which also takes +character iterators. This API allows for iterating over subsequent pieces of an +uncompressed sort key. Between calls to the API you need to save a 64-bit state. +Following is an example of simulating a string compare function using the partial +sort key API. Your usage model is bound to look much different. + +```C +static UCollationResult compareUsingPartials(UCollator *coll, + const UChar source[], int32_t sLen, + const UChar target[], int32_t tLen, + int32_t pieceSize, UErrorCode *status) { + int32_t partialSKResult = 0; + UCharIterator sIter, tIter; + uint32_t sState[2], tState[2]; + int32_t sSize = pieceSize, tSize = pieceSize; + int32_t i = 0; + uint8_t sBuf[16384], tBuf[16384]; + if(pieceSize > 16384) { + *status = U_BUFFER_OVERFLOW_ERROR; + return UCOL_EQUAL; + } + *status = U_ZERO_ERROR; + sState[0] = 0; sState[1] = 0; + tState[0] = 0; tState[1] = 0; + while(sSize == pieceSize && tSize == pieceSize && partialSKResult == 0) { + uiter_setString(&sIter, source, sLen); + uiter_setString(&tIter, target, tLen); + sSize = ucol_nextSortKeyPart(coll, &sIter, sState, sBuf, pieceSize, status); + tSize = ucol_nextSortKeyPart(coll, &tIter, tState, tBuf, pieceSize, status); + partialSKResult = memcmp(sBuf, tBuf, pieceSize); + } + + if(partialSKResult < 0) { + return UCOL_LESS; + } else if(partialSKResult > 0) { + return UCOL_GREATER; + } else { + return 
UCOL_EQUAL;
    }
}
```

### Other Examples

A longer example is presented in the 'Examples' section. Here is an illustration
of the usage model.

**C:**

```C
#define MAX_KEY_SIZE 100
#define MAX_BUFFER_SIZE 10000
#define MAX_LIST_LENGTH 5

static int compareKeys(const void *a, const void *b) {
    /* sort keys are NUL-terminated byte strings, so strcmp() gives
       the correct binary comparison */
    return strcmp((const char *)a, (const char *)b);
}

const char *text[] = {
    "Quick",
    "fox",
    "Moving",
    "trucks",
    "riddle"
};
UChar s[MAX_LIST_LENGTH][20];
uint8_t keys[MAX_LIST_LENGTH][MAX_KEY_SIZE];
uint8_t temp[MAX_BUFFER_SIZE];
uint8_t *temp2 = temp;
int32_t i, length;
UErrorCode status = U_ZERO_ERROR;

for(i = 0; i < MAX_LIST_LENGTH; i++) {
    u_uastrcpy(s[i], text[i]);
}
UCollator *coll = ucol_open("en_US", &status);
if(U_SUCCESS(status)) {
    for(i = 0; i < MAX_LIST_LENGTH; i++) {
        length = ucol_getSortKey(coll, s[i], -1, temp2, MAX_BUFFER_SIZE);
        if (length > MAX_BUFFER_SIZE) {
            /* the key did not fit; grow the scratch buffer and retry */
            if (temp2 == temp) {
                temp2 = (uint8_t *)malloc(length);
            } else {
                temp2 = (uint8_t *)realloc(temp2, length);
            }
            length = ucol_getSortKey(coll, s[i], -1, temp2, length);
        }
        memcpy(keys[i], temp2, length);
    }
    qsort(keys, MAX_LIST_LENGTH, MAX_KEY_SIZE, compareKeys);
    if (temp2 != temp) {
        free(temp2);
    }
    ucol_close(coll);
}
```

**C++:**

```C++
#define MAX_LIST_LENGTH 5
const UnicodeString s[] = {
    "Quick",
    "fox",
    "Moving",
    "trucks",
    "riddle"
};
CollationKey keys[MAX_LIST_LENGTH];
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en_US"), status);
uint32_t i;
if(U_SUCCESS(status)) {
    for(i = 0; i < MAX_LIST_LENGTH; i++) {
        coll->getCollationKey(s[i], keys[i], status);
    }
    std::sort(keys, keys + MAX_LIST_LENGTH,
              [](const CollationKey &a, const CollationKey &b) {
                  UErrorCode ec = U_ZERO_ERROR;
                  return a.compareTo(b, ec) == UCOL_LESS;
              });
    delete coll;
}
```

**Java:**

```Java
String s [] = {
    "Quick",
    "fox",
    "Moving",
    "trucks",
    "riddle"
};
CollationKey keys[] = new CollationKey[s.length];
try {
    Collator coll = Collator.getInstance(Locale.US);
    for (int i = 0; i < s.length; i++) {
        keys[i] = coll.getCollationKey(s[i]);
    }

    Arrays.sort(keys);
}
catch (Exception e) {
    System.err.println("Error creating English collator");
    e.printStackTrace();
}
```

## CollationElementIterator

A collation element iterator can only be used in one direction. This is
established at the time of the first call to retrieve a collation element. Once
`ucol_next` (C), `CollationElementIterator::next` (C++) or
`CollationElementIterator.next` (Java) are invoked,
`ucol_previous` (C),
`CollationElementIterator::previous` (C++) or `CollationElementIterator.previous`
(Java) should not be used (and vice versa). The direction can be changed
immediately after `ucol_first`, `ucol_last`, `ucol_reset` (in C),
`CollationElementIterator::first`, `CollationElementIterator::last`,
`CollationElementIterator::reset` (in C++) or `CollationElementIterator.first`,
`CollationElementIterator.last`, `CollationElementIterator.reset` (in Java) is
called, or when it reaches the end of the string while traversing the string.

When `ucol_next` is called at the end of the string buffer, `UCOL_NULLORDER` is
always returned with any subsequent calls to `ucol_next`. The same applies to
`ucol_previous`.

An example of how iterators are used is the Boyer-Moore search implementation,
which can be found in the samples section.
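The forward-iteration contract can be sketched with the JDK's `java.text.CollationElementIterator`, which mirrors the ICU Java class. The element count asserted below assumes the JDK's en_US rules, under which each plain ASCII letter maps to one collation element:

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class ElementsDemo {
    // Walk the collation elements of a string in the forward direction
    // until NULLORDER marks the end.
    public static int countElements(String text) {
        RuleBasedCollator coll = (RuleBasedCollator) Collator.getInstance(Locale.US);
        CollationElementIterator it = coll.getCollationElementIterator(text);
        int count = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One element per ASCII letter under the en_US rules.
        System.out.println(countElements("text"));
    }
}
```

Once `next()` has returned `NULLORDER`, every further call to `next()` keeps returning it, matching the `UCOL_NULLORDER` behavior described above.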

### API Example

**C:**

```C
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
UChar text[20];
UCollationElements *collelemitr;
int32_t collelem;

u_uastrcpy(text, "text");
collelemitr = ucol_openElements(coll, text, -1, &status);
collelem = 0;
do {
    collelem = ucol_next(collelemitr, &status);
} while (collelem != UCOL_NULLORDER);

ucol_closeElements(collelemitr);
ucol_close(coll);
```

**C++:**

```C++
UErrorCode status = U_ZERO_ERROR;
RuleBasedCollator *coll =
    (RuleBasedCollator *)Collator::createInstance(Locale::getUS(), status);
UnicodeString text("text");
CollationElementIterator *collelemitr = coll->createCollationElementIterator(text);
int32_t collelem = 0;
do {
    collelem = collelemitr->next(status);
} while (collelem != CollationElementIterator::NULLORDER);

delete collelemitr;
delete coll;
```

**Java:**

```Java
try {
    RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance(Locale.US);
    String text = "text";
    CollationElementIterator collelemitr = coll.getCollationElementIterator(text);
    int collelem = 0;
    do {
        collelem = collelemitr.next();
    } while (collelem != CollationElementIterator.NULLORDER);
} catch (Exception e) {
    System.err.println("Error in collation iteration");
    e.printStackTrace();
}
```

## Setting and Getting Attributes

The general attribute setting APIs are `ucol_setAttribute` (in C) and
`Collator::setAttribute` (in C++). These APIs take an attribute name and an
attribute value. If the name and the value pass a syntax and range check, the
property of the collator is changed. If the name and value do not pass a syntax
and range check, however, the state is not changed and the error code variable
is set to an error condition. The Java version does not provide general
attribute setting APIs; instead, each attribute has its own setter API of
the form `RuleBasedCollator.setATTRIBUTE_NAME(arguments)`.
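The per-attribute setter/getter pattern described for Java can be sketched with the JDK's `java.text.Collator`, which uses the same style (`setStrength`/`getStrength`, `setDecomposition`/`getDecomposition`); an out-of-range value is rejected up front, analogous to the error-code behavior described for `ucol_setAttribute`:

```java
import java.text.Collator;
import java.util.Locale;

public class AttributesDemo {
    // Set an attribute through its dedicated setter, then read it back
    // through the matching getter.
    public static int setAndGetStrength(int strength) {
        Collator coll = Collator.getInstance(Locale.US);
        coll.setStrength(strength);
        return coll.getStrength();
    }

    public static void main(String[] args) {
        System.out.println(setAndGetStrength(Collator.PRIMARY) == Collator.PRIMARY); // true

        // An invalid value leaves the collator unchanged and signals an error.
        try {
            Collator.getInstance(Locale.US).setStrength(42);
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected");
        }
    }
}
```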
+
+The attribute getting APIs are `ucol_getAttribute` (C) and `Collator::getAttribute`
+(C++). Both APIs require an attribute name as an argument and return its
+value if a valid attribute name was supplied. If a valid attribute
+name was not supplied, however, they return an undefined result and set the
+error code. As with the setter APIs, the Java version provides no generic
+getter API. Each attribute has its own getter API of the form
+`RuleBasedCollator.getATTRIBUTE_NAME()` in the Java version.
+
+## References
+
+1. Ken Whistler, Markus Scherer: "Unicode Technical Standard #10, Unicode Collation
+   Algorithm" ()
+
+2. ICU Design doc: "Collation v2" ()
+
+3. Mark Davis: "ICU Collation Design Document"
+   ()
+
+4. The Unicode Standard, chapter 5, "Implementation guidelines"
+   ()
+
+5. Laura Werner: "Efficient text searching in Java: Finding the right string in
+   any language"
+   ()
+
+6. Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization
+   Forms" ().
diff --git a/docs/userguide/collation/architecture.md b/docs/userguide/collation/architecture.md
new file mode 100644
index 00000000000..16c78a45697
--- /dev/null
+++ b/docs/userguide/collation/architecture.md
@@ -0,0 +1,562 @@
+
+
+# Collation Service Architecture
+
+This section describes the design principles, architecture and coding
+conventions of the ICU Collation Service.
+
+## Collator
+
+To use the Collation Service, a Collator must first be instantiated. A
+Collator is a data structure or object that maintains all of the property
+and state information necessary to define and support the specific collation
+behavior provided. Examples of properties described in the Collator are the
+locale, whether normalization is to be performed, and how many levels of
+collation are to be evaluated.
Examples of the state information described in
+the Collator include the direction of a Collation Element Iterator (forward
+or backward) and the status of the last API executed.
+
+The Collator is instantiated either by referencing a locale or by defining a
+custom set of rules (a tailoring).
+
+The Collation Service uses the following paradigm:
+
+1. Open a Collator,
+
+2. Use it while necessary,
+
+3. Close the Collator.
+
+Collator instances cannot be shared among threads: instead, open a separate
+collator for each thread. The safe clone function is supported for cloning
+collators in a thread-safe fashion.
+
+The Collation Service follows the ICU conventions for locale designation
+when opening collators:
+
+1. NULL means the default locale.
+
+2. The empty locale name ("") means the root locale.
+
+The Collation Service also adheres to the ICU conventions described in the
+"[ICU Architectural Design](../design.md)" section of the users guide.
+In particular:
+
+1. The standard error code convention is usually followed. (Functions that do
+   not take an error code parameter do so for backward compatibility.)
+
+2. The string length convention is followed: when passing a `UChar *`, the
+   length is required in a separate argument. If -1 is passed for the length,
+   it is assumed that the string is zero terminated.
+
+### Collation locale and keyword handling
+
+When a collator is created from a locale, the collation service (like all ICU
+services) must map the requested locale to the localized collation data
+available to ICU at the time. It does so using the standard ICU locale fallback
+mechanism. See the fallback section of the [locale
+chapter](../locale/index.md) for more details.
+
+If you pass a regular locale in, like "en_US", the collation service first
+searches with fallback for the "collations/default" key.
The first such key it finds
+will have an associated string value; this is the keyword name for the collation
+that is the default for this locale. If the search falls all the way back to the
+root locale, the collation service will use the "collations/default" key there,
+which has the value "standard".
+
+If there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the
+collation service searches with fallback for "collations/phonebook". If the
+search is successful, the collation service uses the string value it finds to
+instantiate a Collator. If the search fails because no such key is present in
+any of ICU's locale data (e.g., "de@collation=funky"), the service returns a
+collator implementing the default tailoring of the locale.
+If the fallback goes all the way to the root locale, then
+the returned `UErrorCode` is `U_USING_DEFAULT_WARNING`.
+
+## Input values for collation
+
+Collation deals with processing strings. ICU generally requires that all the
+strings be in UTF-16 format, and that all the required conversion be
+done before ICU functions are used. In the case of collation, there are APIs
+that can also take instances of character iterators (`UCharIterator`)
+or UTF-8 directly.
+
+Theoretically, character iterators can iterate strings
+in any encoding. ICU currently provides character iterator implementations for
+UTF-8 and UTF-16BE (useful when processing data from a big-endian platform on a
+little-endian machine). It should be noted, however, that using iterators for
+collation APIs has a performance impact. They should be used in situations when
+it is not desirable to convert whole strings before the operation - such as when
+using a string compare function.
+
+## Collation Elements
+
+As discussed in the introduction, there are many possible orderings for sorted
+text, depending on language and other factors.
Ideally, there is a way to
+describe each ordering as a set of rules for calculating numeric values for each
+string of text. The collation process then becomes one of simply comparing these
+numeric values.
+
+This essentially describes the way the Collation Service works. To implement
+a particular sort ordering, first the relationship between each character or
+character sequence is derived. For example, a Spanish ordering defines the
+letter sequence "CH" to be between the letters "C" and "D". As also discussed in
+the introduction, ordering strings properly requires that the comparison of base
+letters be considered separately from the comparison of accents. Letter case
+must also be considered separately from either base letters or accents. Any
+ordering specification language must provide a way to define the relationships
+between characters or character sequences on multiple levels. ICU supports this
+by using "<" to describe a relationship at the primary level, using "<<" to
+describe a relationship at the secondary level, and using "<<<" to describe a
+relationship at the tertiary level. Here are some example usages:
+
+Symbol | Example | Description
+------ | -------- | -----------
+`<` | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch"
+`<<` | `a << ä` | Make a secondary (accent) difference between "a" and "ä"
+`<<<` | `a <<< A` | Make a tertiary (case) difference between "a" and "A"
+
+### Sort key size
+
+One of the more important issues when considering using sort keys is the sort
+key size. Unfortunately, it is very hard to give a fast exact answer to the
+following question: "What is the maximum size for sort keys generated for
+strings of size X?" This problem is twofold:
+
+1. The maximum size of the sort key depends on the size of the collation
+   elements that are used to build it. Sizes of collation elements vary greatly
+   and depend both on the alphabet in question and on the locale used.
+
+2. Compression is used in building sort keys.
Most 'regular' sequences of
+   characters produce very compact sort keys.
+
+If one assumes the worst case and uses too-big buffers, a lot of space will
+be wasted. However, with too-small buffers, performance suffers if generated
+sort keys are longer than the supplied buffers too often
+(since the buffer must then be reallocated each time).
+A good strategy
+for this problem would be to manually manage a large buffer for storing sort keys
+and keep a list of indices to sort keys in this buffer (see the "large buffers"
+[Collation Example](examples.md#using-large-buffers-to-manage-sort-keys)
+for more details).
+
+Here are some rules of thumb; please do not rely on them. If you are looking
+at the East Asian locales, you probably want to go with 5 bytes per code point.
+For Thai, 3 bytes per code point should be sufficient. For all the other locales
+(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point.
+These values are based on average lengths of sort keys generated with tertiary
+strength. If you need quaternary and identical strength (you should not), add 3
+bytes per code point to each of these.
+
+### Partial sort keys
+
+In some cases, most notably when implementing [radix
+sorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only
+parts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing
+parts of sort keys (the `ucol_nextSortKeyPart` API). These sort keys may or may
+not be compressed; that is, they may or may not be compatible with regular sort
+keys.
+
+### Merging sort keys
+
+Sometimes, it is useful to be able to merge sort keys. One example is having
+separate sort keys for first and last names. If you need to perform an operation
+that requires a sort key generated on the whole name, instead of concatenating
+strings and regenerating sort keys, you should merge the sort keys. The merging
+is done by merging the corresponding levels while inserting a terminator between
+merged parts.
The reserved sort key byte value for the merge terminator is 0x02.
+For more details see [UCA section 1.6, Merging Sort
+Keys](http://www.unicode.org/reports/tr10/#Interleaved_Levels).
+
+* C API: unicode/ucol.h `ucol_mergeSortkeys()`
+* Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)`
+
+CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is
+intended to allow concatenating strings like firstName+\\uFFFE+lastName to yield
+the same results as merging their individual sort keys.
+This has been fully implemented in ICU since version 53.
+
+### Generating bounds for a sort key (prefix matching)
+
+Having sort keys for strings allows for easy creation of bounds - sort keys that
+are guaranteed to be smaller or larger than any sort key from a given range. For
+example, if bounds are produced for the sort key of the string "smith", strings
+between the upper and lower bounds with one level would include "Smith", "SMITH",
+"sMiTh". Two kinds of upper bounds can be generated - the first one will match
+only strings of equal length, while the second one will match all the strings
+with the same initial prefix.
+
+CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum
+primary weight, so that for example the string "smith\\uFFFF" can be used as the
+upper bound rather than modifying the sort key for "smith".
+
+## Collation Element Iterator
+
+The collation element iterator is used for traversing Unicode string collation
+elements one at a time. It can be used to implement language-sensitive text
+search algorithms like Boyer-Moore.
+
+For most applications, the two API categories, compare and sort key, are
+sufficient. Most people do not need to manipulate collation elements directly.
+
+Example:
+
+Consider iterating over "apple" and "äpple".
Here are the sequences of collation
+elements:
+
+String 1 | String 1 Collation Elements
+-------- | ---------------------------
+a | `[1900.05.05]`
+p | `[3700.05.05]`
+p | `[3700.05.05]`
+l | `[2F00.05.05]`
+e | `[2100.05.05]`
+
+String 2 | String 2 Collation Elements
+-------- | ---------------------------
+a | `[1900.05.05]`
+\\u0308 | `[0000.9D.05]`
+p | `[3700.05.05]`
+p | `[3700.05.05]`
+l | `[2F00.05.05]`
+e | `[2100.05.05]`
+
+The resulting CEs are typically masked according to the desired strength, and
+zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for
+primary strength) zeroes out the secondary and tertiary weights, which turns the
+element for U+0308 into an all-zero CE that is discarded. The collator then
+finds no remaining differences and declares a match. For more details see the
+paper "Efficient text searching in Java™: Finding the right string in any
+language" by Laura Werner ().
+
+## Collation Attributes
+
+The Collation Service has a number of attributes whose values can be changed
+during run time. These attributes affect both the functionality and the
+performance of the Collation Service. This section describes these
+attributes and, where possible, their performance impact. Performance
+indications are only approximate and timings may vary significantly depending on
+the CPU, compiler, etc.
+
+Although string comparison by ICU and comparison of each string's sort key give
+the same results, attribute settings can impact the execution time of each
+method differently. To be precise in the discussion of performance, this section
+refers to the API employed in the measurement. The `ucol_strcoll` function is the
+API for string comparison. The `ucol_getSortKey` function is used to create sort
+keys.
+
+> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`,
+> that can be used to set any attribute to its default value
+> (which is inherited from the UCA and the tailoring).
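The claim that string comparison and sort-key comparison give the same results can be checked directly. This sketch uses the JDK's `java.text` analogs (`compare` for `ucol_strcoll`, `getCollationKey` for `ucol_getSortKey`) rather than ICU itself; the strings are the "apple"/"äpple" pair from the example above:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CompareVsSortKey {
    // Sign of a direct comparison (the ucol_strcoll analog).
    static int direct(Collator coll, String a, String b) {
        return Integer.signum(coll.compare(a, b));
    }

    // Sign of a sort-key comparison (the ucol_getSortKey analog).
    static int viaKeys(Collator coll, String a, String b) {
        CollationKey ka = coll.getCollationKey(a);
        CollationKey kb = coll.getCollationKey(b);
        return Integer.signum(ka.compareTo(kb));
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.US);
        // The two methods are guaranteed to order any pair identically.
        System.out.println(direct(coll, "apple", "äpple") == viaKeys(coll, "apple", "äpple"));

        // At primary strength with decomposition, the a-vs-ä accent difference
        // disappears, mirroring the 0xFFFF0000 masking described above.
        coll.setStrength(Collator.PRIMARY);
        coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println(direct(coll, "apple", "äpple") == 0);
    }
}
```

The second check is the code-level counterpart of discarding zeroed collation elements: once secondary and tertiary weights are ignored, the combining diaeresis contributes nothing.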
+
+### Attribute Types
+
+#### Strength level
+
+Collation strength, or the maximum collation level used for comparison, is set
+by using the `UCOL_STRENGTH` attribute. Valid values are:
+
+1. `UCOL_PRIMARY`
+
+2. `UCOL_SECONDARY`
+
+3. `UCOL_TERTIARY` (default)
+
+4. `UCOL_QUATERNARY`
+
+5. `UCOL_IDENTICAL`
+
+#### French collation
+
+The `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary
+differences in reverse order. Valid values are:
+
+1. `UCOL_OFF` (default): compares secondary differences in the order they appear
+   in the string.
+
+2. `UCOL_ON`: causes secondary differences to be considered in reverse order, as
+   is done in the French language.
+
+#### Normalization mode
+
+The `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`,
+controls whether text normalization is performed on the input strings. Valid
+values are:
+
+1. `UCOL_OFF` (default): turns off the normalization check.
+
+2. `UCOL_ON`: normalization is checked and the collator performs normalization
+   if it is needed.
+
+Character sequence | FCD | NFC | NFD
+------------------ | --- | --- | ---
+A-ring | Y | Y |
+Angstrom | Y | |
+A + ring | Y | | Y
+A + grave | Y | Y |
+A-ring + grave | Y | |
+A + cedilla + ring | Y | | Y
+A + ring + cedilla | | |
+A-ring + cedilla | | Y |
+
+With normalization mode turned on, the `ucol_strcoll` function slows down by 10%.
+In addition, the time to generate a sort key also increases by about 25%.
+
+#### Alternate handling
+
+This attribute allows shifting of the variable characters (usually spaces and
+punctuation, in the UCA also most symbols) from the primary to the quaternary
+strength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For
+details see [UCA: Variable
+Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML:
+Collation
+Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
+and [“Ignore Punctuation” Options](customization/ignorepunct.md).
+
+1. `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as
+   all the other characters
+
+2. `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at
+   the primary, secondary and tertiary levels and their primary weights will
+   be shifted to the quaternary level.
+
+#### Case Ordering
+
+Some conventions require uppercase letters to sort before lowercase ones, while
+others require the opposite. This attribute is controlled by the value of the
+`UCOL_CASE_FIRST` attribute. The case difference in the UCA is contained in the
+tertiary weights along with other appearance characteristics (like circling of
+letters). The case-first attribute allows for emphasizing the case property of
+the letters by reordering the tertiary weights with either uppercase first or
+lowercase first. This difference gets the most significant bit in the weight.
+Valid values for this attribute are:
+
+1. `UCOL_OFF` (default): leave tertiary weights unaffected
+
+2. `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort
+   before uppercase
+
+3. `UCOL_UPPER_FIRST`: causes uppercase letters to sort first
+
+The case-first attribute does not affect performance substantially.
+
+#### Case level
+
+When this attribute is set, an additional level is formed between the secondary
+and tertiary levels, known as the Case Level. The case level is used to
+distinguish large and small Japanese Kana characters. The case level can also be
+used in other situations, for example to distinguish certain Pinyin characters.
+The case level is controlled by the `UCOL_CASE_LEVEL` attribute. Valid values
+for this attribute are:
+
+1. `UCOL_OFF` (default): no additional case level
+
+2. `UCOL_ON`: adds a case level
+
+#### Hiragana Quaternary
+
+*This setting is deprecated and ignored in recent versions of ICU.*
+
+Hiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points
+will sort before everything else on the quaternary level.
If set to `UCOL_OFF`,
+Hiragana letters are treated the same as all the other code points. This setting
+can be changed at run time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute.
+You probably won't need to use it.
+
+#### Variable Top
+
+Variable Top is a boundary which decides whether code points will be treated
+as variable (shifted to the quaternary level in the **shifted** mode) or
+non-ignorable. Special APIs are used for setting the variable top. It can
+basically be set either to a code point or to a primary strength value.
+
+## Performance
+
+ICU collation is designed to be fast, small and customizable. Several techniques
+are used to enhance the performance:
+
+1. Providing optimized processing for Latin characters.
+
+2. Comparing strings incrementally and stopping at the first significant
+   difference.
+
+3. Tuning to eliminate unnecessary file access or memory allocation.
+
+4. Providing efficient preflight functions that allow fast sort key size
+   generation.
+
+5. Using a single, shared copy of the UCA in memory for the read-only default
+   sort order. Only small tailoring tables are kept in memory for
+   locale-specific customization.
+
+6. Compressing sort keys efficiently.
+
+7. Making the sort order data-driven.
+
+In general, the best performance from the Collation Service is expected by
+doing the following:
+
+1. After opening a collator, keep and reuse it until done. Do not open new
+   collators for the same sort order. (Note the restriction on
+   multi-threading.)
+
+2. Use `ucol_strcoll` etc. when comparing strings. If it is necessary to
+   compare strings thousands or millions of times,
+   create the sort keys first and compare the sort keys instead.
+   Generating the sort keys of two strings is about 5-10
+   times slower than just comparing them directly.
+
+3. Follow the best practice guidelines for generating sort keys.
Do not call
+   `ucol_getSortKey` twice, first to size the key so that the sort key
+   buffer can be allocated, and then again to fill in the buffer.
+
+### Performance and Storage Implications of Attributes
+
+Most people use the default attributes when comparing strings or when creating
+sort keys. When they do want to customize the ordering, the most common options
+are the following:
+
+`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\
+Used to ignore space and punctuation characters
+
+`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\
+Used to ignore the space and punctuation characters except when there are no
+previous letter, accent, or case/variable differences.
+
+`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\
+Used to change the ordering of upper vs. lower case letters (as
+well as small vs. large kana)
+
+`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\
+Used to ignore only the accent differences.
+
+`UCOL_NORMALIZATION_MODE == UCOL_ON`\
+Forces a normalization check on every input string. This
+is used if the input text may not be in FCD form.
+
+`UCOL_FRENCH_COLLATION == UCOL_OFF`\
+This is only useful for languages like French and Catalan that may turn this
+attribute on. (It is the default only for Canadian French ("fr-CA").)
+
+In string comparison, most of these options have little or no effect on
+performance. The only noticeable one is normalization, which can cost 10%-40% in
+performance.
+
+For sort keys, most of these options either leave the storage alone or reduce
+it. Shifting can reduce the storage by about 10%-20%; case level + primary-only
+can decrease it by about 20%-40%. Using no French accents can reduce the storage
+by about 38%, but only for languages like French and Catalan that turn it on by
+default. On the other hand, using Shifted + Quaternary can increase the storage
+by 10%-15%.
(The Identical Level also increases the length, but this option is not
+recommended).
+
+> :point_right: **Note** All of the above numbers are based on
+> tests run on a particular machine, with a particular set of data.
+> (The data for each language is a large number of names
+> in that language in the format , .)
+> The performance and storage may vary, depending on the particular computer,
+> operating system, and data.
+
+## Versioning
+
+Sort keys are often stored on disk for later reuse. A common example is the use
+of keys to build indexes in databases. When comparing keys, it is important to
+know that both keys were generated by the same algorithms and weightings.
+Otherwise, identical strings with keys generated on two different dates, for
+example, might compare as unequal. Sort keys can be affected by new versions of
+ICU or its data tables, new sort key formats, or changes to the Collator.
+Starting with release 1.8.1, ICU provides a versioning mechanism to identify the
+version information of the following (among others):
+
+1. The run-time executable
+
+2. The collation element content
+
+3. The Unicode/UCA database
+
+4. The tailoring table
+
+The version information of a Collator is a 32-bit integer. If a new version of
+ICU has changes affecting the content of collation elements, the version
+information will be changed. In that case, using the new version of the ICU
+collator will require regenerating any saved or stored sort keys.
+
+However, it is possible to modify ICU code or data without changing the relevant
+version numbers, so it is safer to regenerate sort keys any time after any part
+of ICU has been updated.
+
+Since ICU4C 1.8.1,
+it is possible to build your program so that it uses more than one version of
+ICU (only in C/C++, not in Java). Therefore, you could use the current version
+for the features you need and use the older version for collation.
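To illustrate the storage-and-versioning discipline described above, here is a hedged JDK-based sketch: `CollationKey.toByteArray` supplies the persistable bytes, the version tag is a made-up placeholder (the JDK exposes no collation data version; with ICU4C you would record the result of `ucol_getVersion`), and stored key bytes must be compared as *unsigned* values:

```java
import java.text.Collator;
import java.util.Locale;

public class StoredKeys {
    // Hypothetical tag to persist next to the keys; regenerate all stored
    // keys whenever this tag no longer matches the current environment.
    static final String KEY_FORMAT_TAG = "jdk-" + System.getProperty("java.version");

    // Compare two persisted sort keys byte by byte, treating bytes as unsigned.
    static int compareKeyBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) return d;
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        Collator coll = Collator.getInstance(Locale.US);
        byte[] ka = coll.getCollationKey("apple").toByteArray();
        byte[] kb = coll.getCollationKey("banana").toByteArray();
        // The stored bytes order exactly like the source strings.
        System.out.println(compareKeyBytes(ka, kb) < 0);
    }
}
```

The unsigned comparison matters because sort key bytes routinely exceed 0x7F; a signed byte comparison would order such keys incorrectly.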
+
+## Programming Examples
+
+See the [Collation Examples](examples.md) chapter for an example of how to
+compare and create sort keys with the default locale in C, C++ and Java.
diff --git a/docs/userguide/collation/concepts.md b/docs/userguide/collation/concepts.md
new file mode 100644
index 00000000000..c8468b54db8
--- /dev/null
+++ b/docs/userguide/collation/concepts.md
@@ -0,0 +1,814 @@
+
+
+# Collation Concepts
+
+The previous section demonstrated many of the requirements imposed on string
+comparison routines that try to correctly collate strings according to the
+conventions of more than a hundred different languages, written in many
+different scripts. This section describes the principles and architecture behind
+the ICU Collation Service.
+
+## Sortkeys vs Comparison
+
+Sort keys are most useful in databases, where the overhead of calling a function
+for each comparison is very large.
+
+Generating a sort key from a Collator is many times more expensive than doing a
+compare with the Collator (for common use cases), at least when the two
+functions are called from Java or C. So for those languages, unless there is a
+very large number of comparisons, it is better to call the compare function.
+
+Here is an example, with a little back-of-the-envelope calculation. Let's
+suppose that with a given language on a given platform, the compare performance
+(CP) is 100 times faster than sortKey performance (SP), and that you are doing a
+binary search of a list with 1,000 elements. The binary comparison performance
+is BP. We'd do about 10 comparisons, getting:
+
+compare: 10 \* CP
+
+sortkey: 1 \* SP + 10 \* BP
+
+Even if BP is free, compare would be better. One has to get up to where log2(n)
+= 100 before the two approaches break even.
+
+But even this calculation is only a rough guide. First, the binary comparison is
+not completely free. Secondly, the performance of the compare function varies
+radically with the source data.
We optimized for maximizing performance of +collation in sorting and binary search, so comparing strings that are "close" is +optimized to be much faster than comparing strings that are "far away". That +optimization is important because normal sort/lookup operations compare close +strings far more often -- think of binary search, where the last few comparisons +are always with the closest strings. So even the above calculation is not very +accurate. + +## Comparison Levels + +In general, when comparing and sorting objects, some properties can take +precedence over others. For example, in geometry, you might consider first the +number of sides a shape has, followed by the number of sides of equal length. +This causes triangles to be sorted together, then rectangles, then pentagons, +etc. Within each category, the shapes would be ordered according to whether they +had 0, 2, 3 or more sides of the same length. However, this is not the only way +the shapes can be sorted. For example, it might be preferable to sort shapes by +color first, so that all red shapes are grouped together, then blue, etc. +Another approach would be to sort the shapes by the amount of area they enclose. + +Similarly, character strings have properties, some of which can take precedence +over others. There is more than one way to prioritize the properties. + +For example, a common approach is to distinguish characters first by their +unadorned base letter (for example, without accents, vowels or tone marks), then +by accents, and then by the case of the letter (upper vs. lower). Ideographic +characters might be sorted by their component radicals and then by the number of +strokes it takes to draw the character. +An alternative ordering would be to sort these characters by strokes first and +then by their radicals. + +The ICU Collation Service supports many levels of comparison (named "Levels", +but also known as "Strengths"). 
Having these categories enables ICU to sort +strings precisely according to local conventions. However, by allowing the +levels to be selectively employed, searching for a string in text can be +performed with various matching conditions. + +Performance optimizations have been made for ICU collation with the default +level settings. Performance specific impacts are discussed in the Performance +section below. + +Following is a list of the names for each level and an example usage: + +1. Primary Level: Typically, this is used to denote differences between base + characters (for example, "a" < "b"). It is the strongest difference. For + example, dictionaries are divided into different sections by base character. + This is also called the level-1 strength. + +2. Secondary Level: Accents in the characters are considered secondary + differences (for example, "as" < "às" < "at"). Other differences between + letters can also be considered secondary differences, depending on the + language. A secondary difference is ignored when there is a primary + difference anywhere in the strings. This is also called the level-2 + strength. + Note: In some languages (such as Danish), certain accented letters are + considered to be separate base characters. In most languages, however, an + accented letter only has a secondary difference from the unaccented version + of that letter. + +3. Tertiary Level: Upper and lower case differences in characters are + distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In + addition, a variant of a letter differs from the base form on the tertiary + level (such as "A" and "Ⓐ"). Another example is the difference between large + and small Kana. A tertiary difference is ignored when there is a primary or + secondary difference anywhere in the strings. This is also called the + level-3 strength. + +4. 
Quaternary Level: When punctuation is ignored (see Ignoring Punctuation
+   (§)) at levels 1-3, an additional level can be used to distinguish words with
+   and without punctuation (for example, "ab" < "a-b" < "aB"). This difference
+   is ignored when there is a primary, secondary or tertiary difference. This
+   is also known as the level-4 strength. The quaternary level should only be
+   used if ignoring punctuation is required or when processing Japanese text
+   (see Hiragana processing (§)).
+
+5. Identical Level: When all other levels are equal, the identical level is
+   used as a tiebreaker. The Unicode code point values of the NFD form of each
+   string are compared at this level, just in case there is no difference at
+   levels 1-4. For example, Hebrew cantillation marks are only distinguished
+   at this level. This level should be used sparingly: strings that differ
+   only in code point values are extremely rare, while using this level
+   substantially decreases the performance of both incremental comparison and
+   sort key generation (as well as increasing the sort key length). It is also
+   known as the level-5 strength.
+
+## Backward Secondary Sorting
+
+Some languages require words to be ordered on the secondary level according to
+the *last* accent difference, as opposed to the *first* accent difference. This
+was previously the default for all French locales, based on some French
+dictionary ordering traditions, but is currently only applicable to Canadian
+French (locale **fr_CA**), for conformance with the [Canadian sorting
+standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
+ordering is only noticeable for a small number of pairs of real words. For more
+information see [UCA: Contextual
+Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).
+
+Example:
+
+Forward secondary | Backward secondary
+----------------- | ------------------
+cote | cote
+coté | côte
+côte | coté
+côté | côté
+
+## Contractions
+
+A contraction is a sequence consisting of two or more letters. It is considered
+a single letter in sorting.
+
+For example, in the traditional Spanish sorting order, "ch" is considered a
+single letter. All words that begin with "ch" sort after all other words
+beginning with "c", but before words starting with "d".
+
+Other examples of contractions are "ch" in Czech, which sorts after "h", and
+"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
+respectively.
+
+Example:
+
+Order without contraction | Order with contraction "lj" sorting after letter "l"
+------------------------- | ----------------------------------------------------
+la | la
+li | li
+lj | lk
+lja | lz
+ljz | lj
+lk | lja
+lz | ljz
+ma | ma
+
+Contracting sequences such as the above are not very common in most languages.
+
+> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
+> if a completely ignorable code point
+> appears in text in the middle of a contraction, it will not break the
+> contraction. For example, in Czech sorting, cU+0000h will sort as if it were
+> ch.
+
+## Expansions
+
+If a letter sorts as if it were a sequence of more than one letter, it is called
+an expansion.
+
+For example, in German phonebook sorting (de@collation=phonebook or BCP 47
+de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae."
+All words starting with "ä" will sort between words starting with "ad" and words
+starting with "af".
+
+In the case of Unicode encoding, characters can often be represented either as
+pre-composed characters or in decomposed form. For example, the letter "à" can
+be represented in its decomposed (a+\`) and pre-composed (à) form. Most
+applications do not want to distinguish text by the way it is encoded.
A search for "à" should find all instances of the letter, regardless of whether
the instance is in pre-composed or decomposed form. Therefore, either form of
the letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.

## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where a vowel followed by the prolonged sound
mark is treated as equivalent to the doubled vowel:

カアー <<< カアア and\
キイー <<< キイイ

> :point_right: **Note** Since ICU 2.0 the Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contractions producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed form. There are other types of
equivalence possible in Unicode, including canonical and compatibility
equivalence. The process of normalization ensures that text is written in a
predictable way, so that searches are not made unnecessarily complicated by
having to match on equivalences. Not all text is normalized, however, so it is
useful to have a collation service that can handle text that is not normalized,
and do so efficiently.

The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.

In practice, most data that is encountered is already in normalized or
semi-normalized form. The ICU Collation Service is designed so that it can
process a wide range of normalized or un-normalized text without a need for
normalization processing. When a case is encountered that requires
normalization, the ICU Collation Service drops into code specific to this
purpose. This maximizes performance for the majority of text that does not
require normalization.
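The pre-composed/decomposed equivalence described above can be illustrated
with the JDK's `java.text.Collator` (used here instead of ICU's API purely as
a minimal, self-contained sketch; ICU's collators produce the same result
without an explicit decomposition switch):

```java
import java.text.Collator;
import java.util.Locale;

public class NormalizationDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);

        // java.text defaults to NO_DECOMPOSITION; turning on canonical
        // decomposition is the analogue of normalization checking being on,
        // so canonically equivalent spellings compare as equal.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        String composed = "\u00E0";    // "à" as one pre-composed code point
        String decomposed = "a\u0300"; // "a" + combining grave accent

        // Prints true: both encodings of "à" sort identically.
        System.out.println(collator.compare(composed, decomposed) == 0);
    }
}
```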

In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service has the ability to turn normalization checking either
on or off. If normalization checking is turned off, it is the user's
responsibility to ensure that all text is already in the appropriate form. This
is true for the great majority of the world's languages, so normalization
checking is turned off by default for most locales.

If the text requires normalization processing, normalization checking should be
on. Any language that uses multiple combining characters, such as Arabic,
ancient Greek, Hebrew, Hindi, Thai or Vietnamese, either requires normalization
checking to be on, or requires the text to go through a normalization process
before collation.

For more information about normalization-related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15](http://www.unicode.org/reports/tr15/).

> :point_right: **Note** ICU supports two modes of normalization: on and off.
> The java.text.\* classes offer a compatibility decomposition mode, which is
> not supported in ICU.

## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.

These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except the quaternary level. If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are distinguished at the fourth level based on their
punctuation (if any).
If the comparison function ignores differences at the fourth level, then
strings that differ only by punctuation are compared as equal.

The following table shows the results of sorting a list of terms in three
different ways. In the first column, punctuation characters (the space " " and
the hyphen "-") are not ignored (" " < "-" < "b"). In the second column,
punctuation characters are ignored in the first three levels and compared only
at the fourth level. In the third column, punctuation characters are ignored in
the first three levels and the fourth level is not considered; punctuated terms
are then equivalent to the identical terms without punctuation.

For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.

Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird | black bird | **black bird**
black Bird | black-bird | **black-bird**
black birds | blackbird | **blackbird**
black-bird | black Bird | black Bird
black-Bird | black-Bird | black-Bird
black-birds | blackBird | blackBird
blackbird | black birds | black birds
blackBird | black-birds | black-birds
blackbirds | blackbirds | blackbirds

> :point_right: **Note** The strings shown in bold in the last column are
> compared as equal by the ICU Collator.\
> Since ICU 2.2, and as prescribed by the UCA, primary ignorable code points
> that follow shifted code points are completely ignored. This means that an
> accent following a space compares as if it were a space alone.

## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large
Kana, and by other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting
with the same case sort together.
Some Japanese applications require that the difference between small and large
Kana be emphasized over other tertiary differences.

The UCA does not provide a means to separate out either case or Kana
differences from the remaining tertiary differences. However, the ICU Collation
Service has two options that help customize case and/or Kana differences. Both
options are turned off by default.

### CaseFirst

The case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The case-first option can be set to
make either lowercase sort before uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply
a reordering of the tertiary level.

ICU makes use of the following three case categories for sorting:

1. uppercase: "ABC"

2. mixed case: "Abc", "aBc"

3. normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the
case-first option is set.

### CaseLevel

The case-level option makes a separate level for case differences. This is an
extra level positioned between the secondary and tertiary levels. The case
level is used in Japanese to make the difference between small and large Kana
more important than the other tertiary differences. It can also be used to
ignore other tertiary differences, or even secondary differences. This is
especially useful in matching. For example, if the strength is set to primary
only (level 1) and the case level is turned on, the comparison ignores accents
and all tertiary differences except case. The contents of the case level are
affected by the case-first option.

The case level is independent of the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on.
This provides for comparison that takes into account the case differences,
while at the same time ignoring accents and tertiary differences other than
case. This may be used in searching.

Example:

**Case-first off, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Lowercase-first, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit

**Uppercase-first, Case level off**

Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich

**Lowercase-first, Case level on**

apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Uppercase-first, Case level on**

Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich

## Script Reordering

Script reordering allows scripts and some other groups of characters to be
moved relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script groups. These special groups
of characters are space, punctuation, symbol, currency, and digit. Script
groups can be intermingled with these special non-script groups if those
special groups are explicitly specified in the reordering.

The special code `others` stands for any script that is not explicitly
mentioned in the list. Anything that is after `others` will go at the very end
of the list, in the order given. For example, `[Grek, others, Latn]` will
result in an ordering that puts all scripts other than Greek and Latin between
them.

### Examples

Note: All examples below use the string equivalents for the scripts and
reorder codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.

**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by: set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`

That is, setting a reordering always modifies the DUCET/CLDR order, replacing
whatever was previously set, rather than adding on to it. In order to
cumulatively modify an ordering, you have to retrieve the existing ordering,
modify it, and then set it.

**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`

**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`

**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR

**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale, which may or may not be DUCET/CLDR

**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[]`\
result - original reordering for the locale, which may or may not be DUCET/CLDR

**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example
Hiragana and Katakana.

In ICU 4.8-54:

* Scripts were reordered in groups, each normally starting with a [Recommended
  Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
* Reorder codes moved as a group (were “equivalent”) if their scripts shared a
  primary-weight lead byte.
* For example, Hebr and Phnx were “equivalent” reordering codes and were
  reordered together. Their order relative to each other could not be changed.
* Only one code out of any group could be reordered, not multiple codes of the
  same group.

## Sorting of Japanese Text (JIS X 4061)

The Japanese standard JIS X 4061 requires two changes to the collation
procedures: special processing of Hiragana characters and (for performance
reasons) prefix analysis of text.

### Hiragana Processing

The JIS X 4061 standard requires more levels than provided by the UCA. To offer
a conformant sorting order, ICU uses the quaternary level to distinguish
between Hiragana and Katakana. Hiragana symbols are given smaller values than
Katakana symbols on the quaternary level, thus causing Hiragana sequences to
sort before the corresponding Katakana sequences.

### Prefix Analysis

Another characteristic of sorting according to JIS X 4061 is a large number of
contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana code points to be treated as
contractions, which reduces performance. The solution ICU adopted introduces
the prefix concept, which allows us to improve the performance of Japanese
sorting. More about this can be found in the [customization
chapter](customization/index.md).

## Thai/Lao reordering

The UCA requires that certain Thai and Lao prevowels be reordered with the code
point following them. This option is always on in the ICU implementation, as
prescribed by the UCA.

This rule takes effect when:

1. 
A Thai vowel of the range \\U0E40-\\U0E44 precedes a Thai consonant of the
   range \\U0E01-\\U0E2E, or

2. A Lao vowel of the range \\U0EC0-\\U0EC4 precedes a Lao consonant of the
   range \\U0E81-\\U0EAE.

In these cases the vowel is placed after the consonant for collation purposes.

> :point_right: **Note** There is a difference between the java.text.\* classes
> and ICU in regard to Thai reordering. The java.text.\* classes allow
> tailorings to turn off reordering by using the '!' modifier. ICU ignores the
> '!' modifier and always reorders Thai prevowels.

## Space Padding

In many database products, fields are padded with null. To get correct results,
the input to a Collator should omit any superfluous trailing padding spaces.
The problem arises with contractions, expansions, or normalization. Suppose
that there are two fields, one containing "aed" and the other containing "äd".
German phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk)
will compare "ä" as if it were "ae" (on a primary level), so the order will be
"äd" < "aed". But if both fields are padded with spaces to a length of 3, then
this will reverse the order, since the first will compare as if it were one
character longer. In other words, when you start with strings 1 and 2

1 | a | e | d | \<space\>
-- | -- | -- | --------- | ---------
2 | ä | d | \<space\> | \<space\>

they end up being compared on a primary level as if they were 1' and 2'

1' | a | e | d | \<space\> |  
-- | -- | -- | -- | --------- | ---------
2' | a | e | d | \<space\> | \<space\>

Since 2' has an extra character (the extra space), it counts as having a
primary difference when it shouldn't.
The correct result occurs when the trailing padding spaces are removed, as in
1" and 2"

1" | a | e | d
-- | -- | -- | --
2" | a | e | d

## Collator naming scheme

***Starting with ICU 54, the following naming scheme and its API functions are
deprecated.*** Use ucol_open() with language tag collation keywords instead
(see [Collation API Details](api.md)). For example,
ucol_open("de-u-co-phonebk-ka-shifted", &errorCode) for German Phonebook order
with "ignore punctuation" mode.

When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and their implications for string comparison
performance and sort key length. It also includes single-letter abbreviations
for both the attributes and their values. These abbreviations allow a
'short-form' specification of a set of collation options, such as
"UCA4.0.0_AS_LSV_S", which can be used to specify that the desired options are:
UCA version 4.0.0; ignore spaces, punctuation and symbols; use Swedish
linguistic conventions; compare case-insensitively.

A number of attribute values are common across different attributes; these
include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
otherwise stated, the examples use the UCA alone with default settings.

> :point_right: **Note** In order to achieve uniqueness, a collator name always
> has the attribute abbreviations sorted.

### Main References

1. For a full list of supported locales in ICU, see [Locale
   Explorer](http://demo.icu-project.org/icu-bin/locexp), which also contains
   an online demo showing sorting for each locale. The demo allows you to try
   different attribute values to see how they affect sorting.

2. To see tabular results for the UCA table itself, see the [Unicode Collation
   Charts](http://www.unicode.org/charts/collation/).

3. 
For the UCA specification, see [UTS #10: Unicode Collation
   Algorithm](http://www.unicode.org/reports/tr10/).

4. For more detail on the precise effects of these options, see [Collation
   Customization](customization/index.md).

#### Collator Naming Attributes

Attribute | Abbreviation | Possible Values
---------------------- | ------------ | ---------------
Locale | L | \<language\>
Script | Z | \<script\>
Region | R | \<region\>
Variant | V | \<variant\>
Keyword | K | \<keyword\>
  |   |  
Strength | S | 1, 2, 3, 4, I, D
Case_Level | E | X, O, D
Case_First | C | X, L, U, D
Alternate | A | N, S, D
Variable_Top | T | \<hex digits\>
Normalization Checking | N | X, O, D
French | F | X, O, D
Hiragana | H | X, O, D

#### Collator Naming Attribute Descriptions

The **Locale** attribute is typically the most important attribute for correct
sorting and matching, according to user expectations in different countries and
regions. The default UCA ordering will only sort a few languages, such as Dutch
and Portuguese, correctly ("correctly" meaning according to the normal
expectations for users of the languages). Otherwise, you need to supply the
locale to the UCA in order to properly collate text for a given language. Thus
a locale needs to be supplied so as to choose a collator that is correctly
**tailored** for that locale. The choice of a locale will automatically preset
the values for all of the attributes to something that is reasonable for that
locale. Thus most of the time the other attributes do not need to be explicitly
set. In some cases, the choice of locale will make a difference in string
comparison performance and/or sort key length.

In short attribute names,
`_