diff --git a/docs/processes/rules_update.md b/docs/processes/rules_update.md
index 7cf7674c0c4..df6bfbda778 100644
--- a/docs/processes/rules_update.md
+++ b/docs/processes/rules_update.md
@@ -110,7 +110,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`.
(If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.)
- Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](http://userguide.icu-project.org/boundaryanalysis/break-rules) for an explanation of rule syntax and behavior.
+ Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](../userguide/boundaryanalysis/break-rules.md) for an explanation of rule syntax and behavior.
The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include:
diff --git a/docs/userguide/boundaryanalysis/break-rules.md b/docs/userguide/boundaryanalysis/break-rules.md
new file mode 100644
index 00000000000..03dec09d470
--- /dev/null
+++ b/docs/userguide/boundaryanalysis/break-rules.md
@@ -0,0 +1,437 @@
+
+
+# Break Rules
+
+## Introduction
+
+ICU locates boundary positions within text by means of rules, which are a form
+of regular expressions. The form of the rules is similar, but not identical,
+to the boundary rules from the Unicode specifications
+[[UAX-14](https://unicode.org/reports/tr14/),
+[UAX-29](https://unicode.org/reports/tr29/)], and there is a reasonably close
+correspondence between the two.
+
+Taken as a set, the ICU rules describe how to move forward to the next boundary,
+starting from a known boundary.
+ICU includes rules for the standard boundary types (word, line, etc.).
+Applications may also create customized break iterators from their own rules.
+
+ICU's built-in rules are located at
+[icu/icu4c/source/data/brkitr/rules/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
+These can serve as examples when writing your own, and as a starting point for
+customizations.
+
+### Rule Tutorial
+
+Rules most commonly describe a range of text that should remain together,
+unbroken. For example, this rule
+
+ [\p{Letter}]+;
+
+matches a run of one or more letters, and would cause them to remain unbroken.
+
+The part within `[`brackets`]` follows normal ICU [UnicodeSet pattern
+syntax](../strings/unicodeset.md).
+
+The qualifier, '`+`' in this case, can be one of
+
+| Qualifier | Meaning |
+| --------- | ------------------------ |
+| empty | Match exactly once |
+| `?` | Match zero or one time |
+| `+` | Match one or more times |
+| `*` | Match zero or more times |
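
For example, this illustrative fragment (not one of ICU's shipped rules) combines all three qualifiers to keep a simple signed number together:

    # An optional sign, one or more digits, and an optional decimal fraction.
    $Sign  = [+\-];
    $Digit = [0-9];
    $Sign? $Digit+ ('.' $Digit*)?;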
+
+#### Variables
+
+A variable names a set or rule sub-expression. Variables are useful for
+documenting what something represents, and for simplifying complex expressions
+by breaking them up.
+
+"Variable" is something of a misnomer; variables cannot be reassigned, and
+behave more like named constant expressions.
+
+They start with a '`$`', both in the definition and use.
+
+ # Variable Definition
+ $ASCIILetNum = [A-Za-z0-9];
+ # Variable Use
+ $ASCIILetNum+;
+
+#### Comments and Semicolons
+
+'`#`' begins a comment, which extends to the end of a line.
+
+Comments may stand alone, or appear after another statement on a line.
+
+All rule statements or expressions are terminated by semicolons.
+
+#### Chained Matching
+
+Most ICU rule sets use the concept of "chained matching". The idea is that a
+complete match can be composed from multiple pieces, with each piece coming from
+an individual rule of a rule set.
+
+This idea is unique to ICU break rules; it is not a concept found in other
+regular-expression-based matchers. Some of the Unicode standard break rules
+would be difficult to implement without it.
+
+Starting with an example,
+
+ !!chain;
+    $word_char = [\p{Letter}];
+    $word_joiner = [_-];
+ $word_char+;
+ $word_char $word_joiner $word_char;
+
+These rules will match "`abc`", "`hello_world`", "`hi-there`", and
+"`a-bunch_of-joiners-here`".
+
+They will not match "`-abc`", "`multiple__joiners`", or "`tail-`".
+
+A full match is composed of pieces or submatches, possibly from different rules,
+with adjacent submatches linked by at least one overlapping character.
+
+In the example below, matching "`hello_world`",
+
+* '`1`' shows matches of the first rule, `$word_char+`
+
+* '`2`' shows matches of the second rule, `$word_char $word_joiner $word_char`
+
+ hello_world
+ 11111 11111
+ 222
+
+There is an overlap of the matched regions, which causes the chaining mechanism
+to join them into a single overall match.
+
+The mechanism is a good match to, for example, [Unicode's word break
+rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), where rules
+WB5 through WB13 combine to piece together longer words from multiple short
+segments.
+
+`!!chain;` enables chaining in a rule set. It is disabled by default for
+backwards compatibility: very old versions of ICU did not support chaining, and
+it was originally introduced as an option.
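
To make the mechanism concrete, here is a small plain-C++ sketch (not ICU code; the function name and approach are invented for illustration). It collects the submatches of the two example rules over an input string and then merges any submatches that share at least one character, mimicking how chaining joins pieces into one overall match:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Illustration only: simulate chained matching of the two example rules
//   $word_char+                          (rule 1)
//   $word_char $word_joiner $word_char   (rule 2)
std::vector<std::pair<int, int>> chainedMatches(const std::string& text) {
    auto isWordChar = [](char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; };
    auto isJoiner = [](char c) { return c == '_' || c == '-'; };

    std::vector<std::pair<int, int>> pieces;  // half-open [start, end) submatches
    const int n = static_cast<int>(text.size());
    // Rule 1: runs of one or more word characters.
    for (int i = 0; i < n; ++i) {
        if (isWordChar(text[i])) {
            int j = i;
            while (j < n && isWordChar(text[j])) ++j;
            pieces.push_back({i, j});
            i = j - 1;
        }
    }
    // Rule 2: word character, joiner, word character.
    for (int i = 0; i + 2 < n; ++i) {
        if (isWordChar(text[i]) && isJoiner(text[i + 1]) && isWordChar(text[i + 2])) {
            pieces.push_back({i, i + 3});
        }
    }
    // Chaining: submatches that overlap by at least one character merge.
    std::sort(pieces.begin(), pieces.end());
    std::vector<std::pair<int, int>> merged;
    for (const auto& p : pieces) {
        if (!merged.empty() && p.first < merged.back().second) {
            merged.back().second = std::max(merged.back().second, p.second);
        } else {
            merged.push_back(p);
        }
    }
    return merged;
}
```

Running this on "`hello_world`" yields a single merged range covering the whole string, while "`multiple__joiners`" yields two separate ranges, matching the behavior described above.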
+
+#### Parentheses and Alternation
+
+Rule expressions can contain parentheses and '`|`' operators, representing
+alternation or "or" operations. This follows conventional regular expression
+behavior.
+
+For example, the following would match a simplified identifier:
+
+ $Letter ($Letter | $Digit)*;
+
+#### String and Character Literals
+
+Similarly to common regular expressions, literal characters that do not have
+other special meaning represent themselves. So the rule
+
+ Hello;
+
+would match the literal input "`Hello`".
+
+In practice, nearly all break rules are composed from `[`sets`]` based on Unicode
+character properties; literal characters in rules are very rare.
+
+To prevent random typos in rules from being treated as literals, use this
+option:
+
+ !!quoted_literals_only;
+
+With the option, the naked `Hello` becomes a rule syntax error, while the quoted
+form `'Hello'` still matches the literal input.
+
+`!!quoted_literals_only` is strongly recommended for all rule sets. The random
+typo problem is very real, and surprisingly hard to recognize and debug.
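
As an illustrative fragment (not one of ICU's shipped rules), with the option in effect only quoted text is treated literally:

    !!quoted_literals_only;
    $Letter = [\p{Letter}];
    # The quoted form is matched literally; a naked Hello would be a syntax error.
    'Hello' $Letter*;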
+
+#### Explicit Break Rules
+
+A rule containing a slash (`/`) will force a boundary when it matches, even when
+other rules or chaining would otherwise lead to a longer match. Also called Hard
+Break Rules, these have the form
+
+ pre-context / post-context;
+
+where the pre and post-context look like normal break rules. Both the pre and
+post context are required, and must not allow a zero-length match. There should
+be no overlap between characters that end a match of the pre-context and those
+that begin a match of the post-context.
+
+Chaining into a hard break rule operates normally. There is no chaining out of a
+hard break rule; when the post-context matches, a break is forced immediately.
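
For illustration (a made-up fragment, not ICU's actual rules), this hard break rule forces a boundary between a closing parenthesis and a following letter; note that the pre-context and post-context sets do not overlap and neither allows a zero-length match:

    $Close  = [)];
    $Letter = [\p{Letter}];
    # Force a boundary between ")" and a letter, even if other rules
    # could have continued the match.
    $Close / $Letter;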
+
+Note: future versions of ICU may loosen the restrictions on explicit break
+rules. The behavior of rules with missing or overlapping contexts is subject to
+change.
+
+#### Chaining Control
+
+Chaining into a rule can be disallowed by beginning that rule with a '`^`'. Rules
+so marked can begin a match after a preceding boundary or at the start of text,
+but cannot extend a match via chaining from another rule.
+
+~~The !!LBCMNoChain; statement modifies chaining behavior by preventing chaining
+from one rule to another from occurring on any character whose Line Break
+property is Combining Mark. This option is subject to change or removal, and
+should not be used in general. Within ICU, it is used only with the line break
+rules. We hope to replace it with something more general.~~
+
+> :point_right: **Note**: `!!LBCMNoChain` is deprecated, and will be removed completely from a future
+> version of ICU.
+
+## Rule Status Values
+
+Break rules can be tagged with a number, which is called the *rule status*.
+After a boundary has been located, the status number of the specific rule that
+determined the boundary position is available to the application through the
+function `getRuleStatus()`.
+
+For the predefined word boundary rules, status values are available to
+distinguish between boundaries associated with words, numbers, and those around
+spaces or punctuation. Similarly for line break boundaries, status values
+distinguish between mandatory line endings (new line characters) and break
+opportunities that are appropriate points for line wrapping. Refer to the ICU
+API documentation for the C header file `ubrk.h` or to Java class
+`RuleBasedBreakIterator` for a complete list of the predefined boundary
+classifications.
+
+When creating custom sets of break rules, integer status values can be
+associated with boundary rules in whatever way will be convenient for the
+application. There is no need to remain restricted to the predefined values and
+classifications from the standard rules.
+
+It is possible for a set of break rules to contain more than a single rule that
+produces some boundary in an input text. In this event, `getRuleStatus()` will
+return the numerically largest status value from the matching rules, and the
+alternate function `getRuleStatusVec()` will return a vector of the values from
+all of the matching rules.
+
+In the source form of the break rules, status numbers appear at the end of a
+rule, enclosed in `{`braces`}`.
+
+Hard break rules that also have a status value place the status at the end, for
+example
+
+ pre-context / post-context {1234};
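
For example, a custom rule set might tag its boundaries like this (the status values 100 and 200 are arbitrary, application-chosen numbers):

    $Letter = [\p{Letter}];
    $Digit  = [\p{Number}];
    $Digit+  {100};   # getRuleStatus() returns 100 after a run of digits
    $Letter+ {200};   # getRuleStatus() returns 200 after a run of letters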
+
+### Word Dictionaries
+
+For some languages that don't normally use spaces between words, break iterators
+are able to supplement the rules with dictionary-based breaking. Some languages,
+such as Thai and Lao, use a dictionary for both word and line breaking. Others,
+such as Japanese, use a dictionary for word breaking, but not for line breaking.
+
+To enable dictionary use,
+
+1. The break rules must select, as unbroken chunks, ranges of text to be passed
+ off to the word dictionary for further subdivision.
+2. The break rules must define a character class named `$dictionary` that
+ contains the characters (letters) to be handled by the dictionary.
+
+The dictionary implementation, on receiving a range of text, will map it to a
+specific dictionary based on script, and then delegate to that dictionary for
+subdividing the range into words.
+
+See, for example, this snippet from the [line break
+rules](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/brkitr/rules/line.txt):
+
+ # Dictionary character set, for triggering language-based break engines. Currently
+ # limited to LineBreak=Complex_Context (SA).
+ $dictionary = [$SA];
+
+## Rule Options
+
+| Option | Description |
+| --------------- | ----------- |
+| `!!chain` | Enable rule chaining. Default is no chaining. |
+| `!!forward` | The rules that follow are for forward iteration. Forward rules are now the only type of rules needed or used. |
+
+### Deprecated Rule Options
+
+| Deprecated Option | Description |
+| --------------- | ----------- |
+| ~~`!!reverse`~~ | ~~*[deprecated]* The rules that follow are for reverse iteration. No longer needed; any rules in a Reverse rule section are ignored.~~ |
+| ~~`!!safe_forward`~~ | ~~*[deprecated]* The rules that follow are for safe forward iteration. No longer needed; any rules in such a section are ignored.~~ |
+| ~~`!!safe_reverse`~~ | ~~*[deprecated]* The rules that follow are for safe reverse iteration. No longer needed; any rules in such a section are ignored.~~ |
+| ~~`!!LBCMNoChain`~~ | ~~*[deprecated]* Disable chaining when the overlap character matches `\p{Line_Break=Combining_Mark}`~~ |
+
+## Rule Syntax
+
+Here is the syntax for the boundary rules. (The EBNF Syntax is given below.)
+
+| Rule Name | Rule Values | Notes |
+| ---------- | ----------- | ----- |
+| rules | statement+ | |
+| statement | assignment \| rule \| control | |
+| control | (`!!forward` \| `!!reverse` \| `!!safe_forward` \| `!!safe_reverse` \| `!!chain`) `;` | |
+| assignment | variable `=` expr `;` | 5 |
+| rule | `^`? expr (`{`number`}`)? `;` | 8,9 |
+| number | [0-9]+ | 1 |
+| break-point | `/` | 10 |
+| expr | expr-q \| expr `\|` expr \| expr expr | 3 |
+| expr-q | term \| term `*` \| term `?` \| term `+` |
+| term | rule-char \| unicode-set \| variable \| quoted-sequence \| `(` expr `)` \| break-point |
+| rule-special | *any printing ascii character except letters or numbers* \| white-space |
+| rule-char | *any non-escaped character that is not rule-special* \| `.` \| *any escaped character except* `\p` *or* `\P` |
+| variable | `$` name-start-char name-char* | 7 |
+| name-start-char | `_` \| \p{L} | |
+| name-char | name-start-char \| \p{N} | |
+| quoted-sequence | `'` *(any char except single quote or line terminator or two adjacent single quotes)*+ `'` |
+| escaped-char | *See “Character Quoting and Escaping” in the [UnicodeSet](../strings/unicodeset.md) chapter* |
+| unicode-set | See [UnicodeSet](../strings/unicodeset.md) | 4 |
+| comment | unescaped `#` *(any char except new-line)** new-line | 2 |
+| s | unescaped \p{Z}, tab, LF, FF, CR, NEL | 6 |
+| new-line | LF, CR, NEL | 2 |
+
+### Rule Syntax Notes
+
+1. The number associated with a rule that actually determined a break position
+ is available to the application after the break has been returned. These
+ numbers are *not* Perl regular expression repeat counts.
+
+2. Comments are recognized and removed separately from otherwise parsing the
+   rules. They may appear wherever a space would be allowed (and ignored).
+
+3. The implicit concatenation of adjacent terms has higher precedence than the
+ `|` operation. "`ab|cd`" is interpreted as "`(ab)|(cd)`", not as "`a(b|c)d`" or
+ "`(((ab)|c)d)`"
+
+4. The syntax for [unicode-set](../strings/unicodeset.md) is defined (and parsed) by the `UnicodeSet` class.
+ It is not repeated here.
+
+5. For `$`variables that will be referenced from inside of a `UnicodeSet`, the
+ definition must consist only of a Unicode Set. For example, when variable `$a`
+ is used in a rule like `[$a$b$c]`, then this definition of `$a` is ok:
+ “`$a=[:Lu:];`” while this one “`$a=abcd;`” would cause an error when `$a` was
+ used.
+
+6. Spaces are allowed nearly anywhere, and are not significant unless escaped.
+ Exceptions to this are noted.
+
+7. No spaces are allowed within a variable name. The variable name `$dictionary`
+ is special. If defined, it must be a Unicode Set, the characters of which
+ will trigger the use of word dictionary based boundaries.
+
+8. A leading `^` on a rule prevents chaining into that rule. It can only match
+ immediately after a preceding boundary, or at the start of text.
+
+9. `{`nnn`}` appearing at the end of a rule is a Rule Status number, not a repeat
+ count as it would be with conventional regular expression syntax.
+
+10. A `/` in a rule specifies a hard break point. If the rule matches, a
+ boundary will be forced at the position of the `/` within the match.
+
+### EBNF Syntax used for the RBBI rules syntax description
+
+| syntax | description |
+| -- | ------------------------- |
+| a? | zero or one instance of a |
+| a+ | one or more instances of a |
+| a* | zero or more instances of a |
+| a \| b | either a or b, but not both |
+| `a` "`a`" | the literal string between the quotes or displayed as `monospace` |
+
+## Planned Changes and Removed or Deprecated Rule Features
+
+1. Reverse rules could formerly be indicated by beginning them with an
+ exclamation `!`. This syntax is deprecated, and will be removed from a
+ future version of ICU.
+
+2. `!!LBCMNoChain` was a global option that specified that characters with the
+ line break property of "Combining Character" would not participate in rule
+ chaining. This option was always considered internal, is deprecated and will
+ be removed from a future version of ICU.
+
+3. Naked rule characters. Plain text, in the context of a rule, is treated as
+ literal text to be matched, much like normal regular expressions. This turns
+ out to be very error prone, has been the source of bugs in released versions
+ of ICU, and is not useful in implementing normal text boundary rules. A
+ future version will reject literal text that is not escaped.
+
+4. Exact reverse rules and safe forward rules: planned changes to the break
+ engine implementation will remove the need for exact reverse rules and safe
+ forward rules.
+
+5. `{bof}` and `{eof}`, appearing within `[`sets`]`, match the beginning or ending of
+ the input text, respectively. This is an internal (not documented) feature
+ that will probably be removed in a future version of ICU. They are currently
+ used by the standard rules for word, line and sentence breaking. An
+ alternative is probably needed. The existing implementation is incomplete.
+
+## Additional Sample Code
+
+**C/C++**: See
+[icu/source/samples/break/](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/break/)
+in the ICU source distribution for code samples showing the use of ICU boundary
+analysis.
+
+## Details about Dictionary-Based Break Iteration
+
+> :point_right: **Note**: This section originally from August 2012.
+> It is probably out of date; for example, `brkfiles.mk` does not exist anymore.
+
+Certain Unicode characters have a "dictionary" bit set in the break iteration
+rules, and text made up of these characters cannot be handled by the rules-based
+break iteration code for lines or words. Rather, they must be handled by a
+dictionary-based approach. The ICU approach is as follows:
+
+Once the Dictionary bit is detected, the set of characters with that bit is
+handed off to "dictionary code." This code then inspects the characters more
+carefully, and splits them by script (Thai, Khmer, Chinese, Japanese, Korean).
+If text in this script has not yet been handled, it loads the appropriate
+dictionary from disk, and initializes a specialized "BreakEngine" class for that
+script.
+
+There are three such specialized classes: Thai, Khmer and CJK.
+
+Thai and Khmer use very similar approaches. They look through a dictionary that
+is not weighted by word frequency, and attempt to find the longest total "match"
+that can be made in the text.
+
+For Chinese and Japanese text, on the other hand, we have a unified dictionary
+(due to the fact that both use some of the same characters, it is difficult to
+distinguish them) that contains information about word frequencies. The
+algorithm to match text then uses dynamic programming to find the set of breaks
+it considers "most likely" based on the frequency of the words created by the
+breaks. This algorithm could also be used for Thai and Khmer, but we do not have
+sufficient data to do so. This algorithm could also be used for Korean, but once
+again we do not have the data to do so.
+
+Code of interest is in `source/common/dictbe.{h, cpp}`, `source/common/brkeng.{h,
+cpp}`, `source/common/dictionarydata.{h, cpp}`. The dictionaries use the `BytesTrie`
+and `UCharsTrie` as their data store. The binary form of these dictionaries is
+produced by the `gendict` tool, which has source in `source/tools/gendict`.
+
+In order to add new dictionary implementations, a few changes have to be made.
+First, you should create a new subclass of `DictionaryBreakEngine` or
+`LanguageBreakEngine` in `dictbe.cpp` that implements your algorithm. Then, in
+`brkeng.cpp`, you should add logic to create this dictionary break engine when
+the appropriate script is encountered, which should only take a few lines of
+code. Lastly, you should add the correct data file. If your data is to be
+represented as a `.dict` file - as is recommended, and in fact required if you
+don't want to make substantial code changes to the engine loader - you need to
+simply add a file in the correct format for gendict to the `source/data/brkitr`
+directory, and add its name to the list of `BRK_DICT_SOURCE` in
+`source/data/brkitr/brkfiles.mk`. This will cause your dictionary (say, `foo.txt`)
+to be added as a `UCharsTrie` dictionary with the name `foo.dict`. If you want your
+dictionary to be a `BytesTrie` dictionary, you will need to specify a transform
+within the `Makefile`. To do so, find the part of `source/data/Makefile.in` and
+`source/data/makedata.mak` that deals with `thaidict.dict` and `khmerdict.dict` and
+add a similar set of lines for your script. Lastly, in
+`source/data/brkitr/root.txt`, add a line to the dictionaries `{}` section of the
+form:
+
+ shortscriptname:process(dependency){"dictionaryname.dict"}
+
+For example, for Katakana:
+
+ Kata:process(dependency){"cjdict.dict"}
+
+Make sure to add appropriate tests for the new implementation.
diff --git a/docs/userguide/boundaryanalysis/index.md b/docs/userguide/boundaryanalysis/index.md
new file mode 100644
index 00000000000..3003c5bddab
--- /dev/null
+++ b/docs/userguide/boundaryanalysis/index.md
@@ -0,0 +1,529 @@
+
+
+# Boundary Analysis
+
+## Overview of Text Boundary Analysis
+
+Text boundary analysis is the process of locating linguistic boundaries while
+formatting and handling text. Examples of this process include:
+
+1. Locating appropriate points to word-wrap text to fit within specific margins
+ while displaying or printing.
+
+2. Locating the beginning of a word that the user has selected.
+
+3. Counting characters, words, sentences, or paragraphs.
+
+4. Determining how far to move the text cursor when the user hits an arrow key
+ (Some characters require more than one position in the text store and some
+ characters in the text store do not display at all).
+
+5. Making a list of the unique words in a document.
+
+6. Figuring out if a given range of text contains only whole words.
+
+7. Capitalizing the first letter of each word.
+
+8. Locating a particular unit of the text (For example, finding the third word
+ in the document).
+
+The `BreakIterator` classes were designed to support these kinds of tasks. The
+`BreakIterator` objects maintain a location between two characters in the text.
+This location will always be a text boundary. Clients can move the location
+forward to the next boundary or backward to the previous boundary. Clients can
+also check if a particular location within a source text is on a boundary or
+find the boundary which is before or after a particular location.
+
+## Four Types of BreakIterator
+
+ICU `BreakIterator`s can be used to locate the following kinds of text boundaries:
+
+1. Character Boundary
+
+2. Word Boundary
+
+3. Line-break Boundary
+
+4. Sentence Boundary
+
+Each type of boundary is found in accordance with the rules specified by Unicode
+Standard Annex #29, *Unicode Text Segmentation*
+(<https://www.unicode.org/reports/tr29/>) or Unicode Standard Annex #14, *Unicode
+Line Breaking Algorithm* (<https://www.unicode.org/reports/tr14/>).
+
+### Character Boundary
+
+The character-boundary iterator locates the boundaries according to the rules
+defined in [UAX #29](https://www.unicode.org/reports/tr29/).
+These boundaries try to match what a user would think of as a "character"—a
+basic unit of a writing system for a language—which may be more than just a
+single Unicode code point.
+
+The letter `Ä`, for example, can be represented in Unicode either with a single
+code-point value or with two code-point values (one representing the `A` and
+another representing the umlaut `¨`). The character-boundary iterator will treat
+either representation as a single character.
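
The two representations can be seen directly in code. This small standalone C++ fragment (illustrative only; it does not use ICU) builds both forms of `Ä`, which are different code point sequences even though they display as the same user-perceived character:

```cpp
#include <cassert>
#include <string>

// Composed form: the single code point U+00C4
// (LATIN CAPITAL LETTER A WITH DIAERESIS).
inline std::u16string composedA() { return u"\u00C4"; }

// Decomposed form: 'A' followed by U+0308 (COMBINING DIAERESIS).
inline std::u16string decomposedA() { return u"A\u0308"; }
```

A character break iterator treats both sequences as a single character, even though their lengths in code points differ (1 versus 2).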
+
+End-user characters, as described above, are also called grapheme clusters, in
+an attempt to limit the confusion caused by multiple meanings for the word
+"character".
+
+### Word Boundary
+
+The word-boundary iterator locates the boundaries of words, for purposes such as
+double click selection or "Find whole words" operations.
+
+Word boundaries are identified according to the rules in
+[UAX #29](https://www.unicode.org/reports/tr29/), supplemented by a word
+dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for
+locating word breaks take into account the alphabets and conventions used by
+different languages.
+
+Here's an example of a sentence, showing the boundary locations that will be
+identified by a word break iterator:
+
+> :point_right: **Note**: TODO: An example needs to be added here.
+
+### Line-break Boundary
+
+The line-break iterator locates positions that would be appropriate points to
+wrap lines when displaying the text. The boundary rules are defined in
+[UAX #14](https://www.unicode.org/reports/tr14/).
+
+This example shows the differences in the break locations produced by word and
+line break iterators:
+
+> :point_right: **Note**: TODO: An example needs to be added here.
+
+### Sentence Boundary
+
+A sentence-break iterator locates sentence boundaries according to the rules
+defined in [UAX #29](https://www.unicode.org/reports/tr29/).
+
+## Dictionary-Based BreakIterator
+
+Some languages are written without spaces, and word and line breaking requires
+more than rules over character sequences. ICU provides dictionary support for
+word boundaries in Chinese, Japanese, Thai, Lao, Khmer and Burmese.
+
+Use of the dictionaries is automatic when text in one of the dictionary
+languages is encountered. There is no separate API, and no extra programming
+steps required by applications making use of the dictionaries.
+
+## Usage
+
+To locate boundaries in a document, create a BreakIterator using the
+`BreakIterator::create***Instance` family of methods in C++, or the `ubrk_open()`
+function (C), where "`***`" is `Character`, `Word`, `Line` or `Sentence`,
+depending on the type of iterator wanted. These factory methods also take a
+parameter that specifies the locale for the language of the text to be processed.
+
+When creating a `BreakIterator`, a locale is also specified, and the behavior of
+the BreakIterator obtained may be specialized in some way for that locale. For
+most locales the default break iterator behavior is used.
+
+Applications also may register customized BreakIterators for use in specific
+locales. Once such a break iterator has been registered, any requests for break
+iterators for that locale will return copies of the registered break iterator.
+
+ICU may cache service instances. Therefore, registration should be done during
+startup, before opening services by locale ID.
+
+In the general usage model, applications will use the following basic steps to
+analyze a piece of text for boundaries:
+
+1. Create a `BreakIterator` with the desired behavior
+
+2. Use the `setText()` method to set the iterator to analyze a particular piece
+ of text.
+
+3. Locate the desired boundaries using the appropriate combination of `first()`,
+ `last()`, `next()`, `previous()`, `preceding()`, and `following()` methods.
+
+The `setText()` method can be called more than once, allowing reuse of a
+BreakIterator on new pieces of text. Because the creation of a `BreakIterator` can
+be relatively time-consuming, it makes good sense to reuse them when practical.
+
+The iterator always points to a boundary position between two characters. The
+numerical value of the position, as returned by `current()` is the zero-based
+index of the character following the boundary. Thus a position of zero
+represents a boundary preceding the first character of the text, and a position
+of one represents a boundary between the first and second characters.
+
+The `first()` and `last()` methods reset the iterator's current position to the
+beginning or end of the text (the beginning and the end are always considered
+boundaries). The `next()` and `previous()` methods advance the iterator one boundary
+forward or backward from the current position. If `next()` or `previous()`
+runs off the beginning or end of the text, it returns `DONE`. The `current()`
+method returns the current position.
+
+The `following()` and `preceding()` methods are used for random access, to move the
+iterator to an arbitrary position within the text. Since a BreakIterator always
+points to a boundary position, the `following()` and `preceding()` methods will
+never set the iterator to point to the position specified by the caller (even if
+it is, in fact, a boundary position). `BreakIterator` will, however, set the
+iterator to the nearest boundary position before or after the specified
+position.
+
+`isBoundary()` returns true if the specified position is a boundary.
+
+### Thread Safety
+
+`BreakIterator`s are not thread safe. This is inherent in their design: break
+iterators are stateful, holding a reference to and a position in the text, so a
+single instance cannot operate in parallel on multiple texts.
+
+For concurrent break iteration, each thread must use its own break iterator.
+These can be obtained by creating separate break iterators of the desired type,
+or by initially creating a master break iterator and then creating a clone for
+each thread.
+
+### Line Breaking Strictness, a CSS Property
+
+CSS has the concept of "[Line Breaking
+Strictness](https://www.w3.org/TR/css-text-3/#line-break-property)". This
+property specifies the strictness of line-breaking rules applied within an
+element: especially how wrapping interacts with punctuation and symbols. ICU
+line break iterators can choose a strictness using locale tags:
+
+| Locale | Behavior |
+| ------------------------------ | ----------- |
+| `en@lb=strict`, `ja@lb=strict` | Breaks text using the most stringent set of line-breaking rules. |
+| `en@lb=normal`, `ja@lb=normal` | Breaks text using the most common set of line-breaking rules. |
+| `en@lb=loose`, `ja@lb=loose` | Breaks text using the least restrictive set of line-breaking rules. Typically used for short lines, such as in newspapers. |
+
+### Sentence Break Filters
+
+Sentence breaking can return false positives, an indication that a sentence ends
+at an incorrect position, in the presence of abbreviations. For example,
+consider the sentence
+
+> In the meantime Mr. Weston arrived with his small ship.
+
+The default sentence break rules show a false boundary following the "Mr."
+
+ICU includes lists of common abbreviations that can be used to filter out these
+false sentence boundaries. Filtering is enabled by the presence of the `ss`
+locale tag when creating the break iterator.
+
+| Locale | Behavior |
+| ---------------- | ------------------------------------------------------- |
+| `en` | no filtering |
+| `en@ss=standard` | Filter based on common English language abbreviations. |
+| `es@ss=standard` | Filter with common Spanish abbreviations. |
+
+Abbreviation lists are available (as of ICU 64) for English, German, Spanish,
+French, Italian and Portuguese.
+
+## Accuracy
+
+ICU's break iterators are based on the default boundary rules described in the
+Unicode Standard Annexes [14](https://www.unicode.org/reports/tr14/) and
+[29](https://www.unicode.org/reports/tr29/). These are relatively
+simple boundary rules that can be implemented efficiently, and are sufficient
+for many purposes and languages. However, some languages and applications will
+require a more sophisticated linguistic analysis of the text in order to find
+boundaries with good accuracy. Such an analysis is not directly available from
+ICU at this time.
+
+Break Iterators based on custom, user-supplied boundary rules can be created and
+used by applications with requirements that are not met by the standard default
+boundary rules.
+
+## BreakIterator Boundary Analysis Examples
+
+### Print out all the word-boundary positions in a UnicodeString
+
+**In C++:**
+
+```c++
+void listWordBoundaries(const UnicodeString& s) {
+ UErrorCode status = U_ZERO_ERROR;
+ BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
+ bi->setText(s);
+ int32_t p = bi->first();
+ while (p != BreakIterator::DONE) {
+ printf("Boundary at position %d\n", p);
+ p = bi->next();
+ }
+ delete bi;
+}
+```
+
+**In C:**
+
+```c
+void listWordBoundaries(const UChar* s, int32_t len) {
+ UBreakIterator* bi;
+ int32_t p;
+ UErrorCode err = U_ZERO_ERROR;
+ bi = ubrk_open(UBRK_WORD, 0, s, len, &err);
+ if (U_FAILURE(err)) return;
+ p = ubrk_first(bi);
+ while (p != UBRK_DONE) {
+ printf("Boundary at position %d\n", p);
+ p = ubrk_next(bi);
+ }
+ ubrk_close(bi);
+}
+```
+
+### Get the boundaries of the word that contains a double-click position
+
+**In C++:**
+
+```c++
+void wordContaining(BreakIterator& wordBrk,
+ int32_t idx,
+ const UnicodeString& s,
+ int32_t& start,
+ int32_t& end) {
+ // this function is written to assume that we have an
+ // appropriate BreakIterator stored in an object or a
+ // global variable somewhere-- When possible, programmers
+ // should avoid having the create() and delete calls in
+ // a function of this nature.
+ if (s.isEmpty())
+ return;
+ wordBrk.setText(s);
+ start = wordBrk.preceding(idx + 1);
+ end = wordBrk.next();
+ // NOTE: for this and similar operations, use preceding() and next()
+ // as shown here, not following() and previous(). preceding() is
+ // faster than following() and next() is faster than previous()
+ // NOTE: By using preceding(idx + 1) above, we're adopting the convention
+ // that if the double-click comes right on top of a word boundary, it
+ // selects the word that _begins_ on that boundary (preceding(idx) would
+ // instead select the word that _ends_ on that boundary).
+}
+```
+
+**In C:**
+
+```c
+void wordContaining(UBreakIterator* wordBrk,
+ int32_t idx,
+ const UChar* s,
+ int32_t sLen,
+ int32_t* start,
+ int32_t* end,
+ UErrorCode* err) {
+ if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) {
+ *err = U_ILLEGAL_ARGUMENT_ERROR;
+ return;
+ }
+ ubrk_setText(wordBrk, s, sLen, err);
+ if (U_SUCCESS(*err)) {
+ *start = ubrk_preceding(wordBrk, idx + 1);
+ *end = ubrk_next(wordBrk);
+ }
+}
+```
+
+### Check for Whole Words
+
+Use the following to check if a range of text is a "whole word":
+
+**In C++:**
+
+```c++
+UBool isWholeWord(BreakIterator& wordBrk,
+ const UnicodeString& s,
+ int32_t start,
+ int32_t end) {
+ if (s.isEmpty())
+ return FALSE;
+ wordBrk.setText(s);
+ if (!wordBrk.isBoundary(start))
+ return FALSE;
+ return wordBrk.isBoundary(end);
+}
+```
+
+**In C:**
+
+```c
+UBool isWholeWord(UBreakIterator* wordBrk,
+ const UChar* s,
+ int32_t sLen,
+ int32_t start,
+ int32_t end,
+ UErrorCode* err) {
+ UBool result = FALSE;
+ if (wordBrk == NULL || s == NULL) {
+ *err = U_ILLEGAL_ARGUMENT_ERROR;
+ return FALSE;
+ }
+ ubrk_setText(wordBrk, s, sLen, err);
+ if (U_SUCCESS(*err)) {
+ result = ubrk_isBoundary(wordBrk, start) && ubrk_isBoundary(wordBrk, end);
+ }
+ return result;
+}
+```
+
+### Count the words in a document (C++ only)
+
+```c++
+int32_t countWords(RuleBasedBreakIterator& bi, const UnicodeString& s) {
+    bi.setText(s);
+    int32_t count = 0;
+    int32_t p = bi.first();
+    while (p != BreakIterator::DONE) {
+        int32_t breakType = bi.getRuleStatus();
+        if (breakType != UBRK_WORD_NONE) {
+            // Exclude spaces, punctuation, and the like.
+            // A status value of UBRK_WORD_NONE indicates that the boundary
+            // does not start a word or number.
+            //
+            ++count;
+        }
+        p = bi.next();
+    }
+    return count;
+}
+```
+
+The function `getRuleStatus()` returns a status value giving additional
+information about the text preceding the last break position found. Using this
+value, it is
+possible to distinguish between numbers, words, words containing kana
+characters, words containing ideographic characters, and non-word characters,
+such as spaces or punctuation. The sample uses the break status value to filter
+out, and not count, boundaries associated with non-word characters.
+
+### Word-wrap a document (C++ only)
+
+The sample function below wraps a paragraph so that each line is less than or
+equal to 72 characters. The function fills in an array passed in by the caller
+with the starting offsets of
+each line in the document. Also, it fills in a second array to track how many
+trailing white space characters there are in the line. For simplicity, it is
+assumed that an outside process has already broken the document into
+paragraphs; every string passed to the function is assumed to end with exactly
+one newline.
+
+```c++
+int32_t wrapParagraph(const UnicodeString& s,
+ const Locale& locale,
+ int32_t lineStarts[],
+ int32_t trailingwhitespace[],
+ int32_t maxLines,
+ UErrorCode &status) {
+
+ int32_t numLines = 0;
+ int32_t p, q;
+ const int32_t MAX_CHARS_PER_LINE = 72;
+ UChar c;
+
+ BreakIterator *bi = BreakIterator::createLineInstance(locale, status);
+ if (U_FAILURE(status)) {
+ delete bi;
+ return 0;
+ }
+ bi->setText(s);
+
+
+ p = 0;
+ while (p < s.length()) {
+ // jump ahead in the paragraph by the maximum number of
+ // characters that will fit
+ q = p + MAX_CHARS_PER_LINE;
+
+ // if this puts us on a white space character, a control character
+ // (which includes newlines), or a non-spacing mark, seek forward
+ // and stop on the next character that is not any of these things
+ // since none of these characters will be visible at the end of a
+ // line, we can ignore them for the purposes of figuring out how
+ // many characters will fit on the line)
+ if (q < s.length()) {
+ c = s[q];
+ while (q < s.length()
+ && (u_isspace(c)
+ || u_charType(c) == U_CONTROL_CHAR
+ || u_charType(c) == U_NON_SPACING_MARK
+ )) {
+ ++q;
+ c = s[q];
+ }
+ }
+
+ // then locate the last legal line-break decision at or before
+ // the current position ("at or before" is what causes the "+ 1")
+ q = bi->preceding(q + 1);
+
+ // if this causes us to wind back to where we started, then the
+ // line has no legal line-break positions. Break the line at
+ // the maximum number of characters
+ if (q == p) {
+ p += MAX_CHARS_PER_LINE;
+ lineStarts[numLines] = p;
+ trailingwhitespace[numLines] = 0;
+ ++numLines;
+ }
+ // otherwise, we got a good line-break position. Record the start of this
+ // line (p) and then seek back from the end of this line (q) until you find
+ // a non-white space character (same criteria as above) and
+ // record the number of white space characters at the end of the
+ // line in the other results array
+ else {
+ lineStarts[numLines] = p;
+ int32_t nextLineStart = q;
+
+ for (q--; q > p; q--) {
+ c = s[q];
+ if (!(u_isspace(c)
+ || u_charType(c) == U_CONTROL_CHAR
+ || u_charType(c) == U_NON_SPACING_MARK)) {
+ break;
+ }
+ }
+        trailingwhitespace[numLines] = nextLineStart - q - 1;
+ p = nextLineStart;
+ ++numLines;
+ }
+ if (numLines >= maxLines) {
+ break;
+ }
+ }
+ delete bi;
+ return numLines;
+}
+```
+
+Most text editors would not break lines based on the number of characters on a
+line. Even with a monospaced font, there are still many Unicode characters that
+are not displayed and therefore should be filtered out of the calculation. With
+a proportional font, character widths are added up until a maximum line width is
+exceeded or an end of the paragraph marker is reached.
+
+Trailing white space does not need to be counted in the line-width measurement
+because it does not need to be displayed at the end of a line. The sample code
+above returns an array of trailing white space values because an external
+rendering process needs to be able to measure the length of the line (without
+the trailing white space) to justify the lines. For example, if the text is
+right-justified, the invisible white space would be drawn outside the margin.
+The line would actually end with the last visible character.
+
+In either case, the basic principle is to jump ahead in the text to the location
+where the line would break (without taking word breaks into account). Then, move
+backwards using the preceding() method to find the last legal breaking position
+before that location. Iterating straight through the text with next() method
+will generally be slower.
+
+## ICU BreakIterator Data Files
+
+The source code for the ICU break rules for the standard boundary types is
+located in the directory
+[icu4c/source/data/brkitr/rules](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
+These files will be built, and the corresponding binary state tables
+incorporated into ICU's data, by the standard ICU4C build process.
+
+The dictionary word lists used by word break, and for some languages, line break
+are in
+[icu4c/source/data/brkitr/dictionaries](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/dictionaries).
+
+The same data is used by both ICU4C and ICU4J. In the normal ICU build process,
+the source data is processed into a binary form using ICU4C, and the resulting
+binary tables are incorporated into ICU4J.
diff --git a/docs/userguide/collation/api.md b/docs/userguide/collation/api.md
new file mode 100644
index 00000000000..36d979d6884
--- /dev/null
+++ b/docs/userguide/collation/api.md
@@ -0,0 +1,696 @@
+
+
+# Collation API Details
+
+This section describes some of the usage conventions for the ICU Collation
+Service API.
+
+## Collator Instantiation
+
+To use the Collation Service, you must instantiate a `Collator`. The
+Collator defines the properties and behavior of the sort ordering. The Collator
+can be repeatedly referenced until all collation activities have been performed.
+The Collator can then be closed and removed.
+
+### Instantiating the Predefined Collators
+
+ICU comes with a large set of predefined collators suited for specific
+locales. Most ICU locales have a predefined collator. In the worst case, the
+CLDR default set of rules, which is mostly equivalent to the UCA default
+ordering (DUCET), is used.
+The default sort order itself is designed to work well for many languages.
+(For example, there are no tailorings for the standard sort orders for
+English, German, French, etc.)
+
+To instantiate a predefined collator, use `ucol_open` (C),
+`Collator::createInstance` (C++) or `Collator.getInstance` (Java). The C API
+takes a locale ID (or language tag) string argument, the C++ API takes a
+Locale object, and the Java API takes a Locale or ULocale.
+
+For some languages, multiple collation types are available; for example,
+"de-u-co-phonebk" / "de@collation=phonebook". They can be enumerated via
+`Collator::getKeywordValuesForLocale()`. See also the list of available collation
+tailorings in the online [ICU Collation
+Demo](http://demo.icu-project.org/icu-bin/collation.html).
+
+Starting with ICU 54, collation attributes can be specified via locale keywords
+as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in
+language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive.
+
+See the [LDML Collation spec, Collation
+Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
+and the [data
+file](https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml) listing
+the valid collation keywords and their values. (The deprecated attributes
+kh/colHiraganaQuaternary and vt/variableTop are not supported.)
+
+For the [old locale extension
+syntax](http://www.unicode.org/reports/tr35/tr35.html#Old_Locale_Extension_Syntax),
+the data file's alias names are used (first alias, if defined, otherwise the
+name): "de@collation=phonebook;colCaseLevel=yes;kv=space"
+
+For the language tag syntax, the non-alias names are used, and "true" values can
+be omitted: "de-u-co-phonebk-kc-kv-space"
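+
+For instance (a short sketch with error handling trimmed), keywords supplied
+via a language tag are applied when the collator is created:
+
+```C++
+#include <unicode/coll.h>
+#include <unicode/locid.h>
+#include <unicode/unistr.h>
+#include <memory>
+#include <cstdio>
+
+using icu::Collator;
+using icu::Locale;
+using icu::UnicodeString;
+
+int main() {
+    UErrorCode status = U_ZERO_ERROR;
+    // "kf-upper" requests uppercase-first ordering ("co-phonebk" would
+    // select the phonebook tailoring in the same way).
+    Locale loc = Locale::forLanguageTag("de-u-kf-upper", status);
+    std::unique_ptr<Collator> coll(Collator::createInstance(loc, status));
+    if (U_FAILURE(status)) { return 1; }
+    // With uppercase-first, "Ab" sorts before "ab".
+    UCollationResult r = coll->compare(UnicodeString(u"Ab"),
+                                       UnicodeString(u"ab"), status);
+    printf("compare(\"Ab\", \"ab\") = %d\n", (int)r);
+    return 0;
+}
+```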
+
+This example demonstrates the instantiation of a collator.
+
+**C:**
+
+```C
+UErrorCode status = U_ZERO_ERROR;
+UCollator *coll = ucol_open("en_US", &status);
+if(U_SUCCESS(status)) {
+ /* close the collator*/
+ ucol_close(coll);
+}
+```
+
+**C++:**
+
+```C++
+UErrorCode status = U_ZERO_ERROR;
+Collator *coll = Collator::createInstance(Locale("en", "US"), status);
+if(U_SUCCESS(status)) {
+ //close the collator
+ delete coll;
+}
+```
+
+**Java:**
+
+```Java
+Collator col = null;
+try {
+ col = Collator.getInstance(Locale.US);
+} catch (Exception e) {
+ System.err.println("English collation creation failed.");
+ e.printStackTrace();
+}
+```
+
+### Instantiating Collators Using Custom Rules
+
+If the ICU predefined collators are not appropriate for your intended usage, you
+can
+define your own set of rules and instantiate a collator that uses them. For more
+details, please see [the section on collation
+customization](customization/index.md).
+
+This example demonstrates the instantiation of a collator.
+
+**C:**
+
+```C
+UErrorCode status = U_ZERO_ERROR;
+U_STRING_DECL(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
+UCollator *coll;
+
+U_STRING_INIT(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
+coll = ucol_openRules(rules, -1, UCOL_ON, UCOL_DEFAULT_STRENGTH, NULL, &status);
+if(U_SUCCESS(status)) {
+ /* close the collator*/
+ ucol_close(coll);
+}
+```
+
+**C++:**
+
+```C++
+UErrorCode status = U_ZERO_ERROR;
+UnicodeString rules(u"&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E");
+Collator *coll = new RuleBasedCollator(rules, status);
+if(U_SUCCESS(status)) {
+ //close the collator
+ delete coll;
+}
+```
+
+**Java:**
+
+```Java
+RuleBasedCollator coll = null;
+String ruleset = "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E";
+try {
+ coll = new RuleBasedCollator(ruleset);
+} catch (Exception e) {
+ System.err.println("Customized collation creation failed.");
+ e.printStackTrace();
+}
+```
+
+## Compare
+
+Two of the most used functions in ICU collation API, `ucol_strcoll` and `ucol_getSortKey`, have their counterparts in both Win32 and ANSI APIs:
+
+ICU C | ICU C++ | ICU Java | ANSI/POSIX | WIN32
+----------------- | --------------------------- | -------------------------- | ---------- | -----
+`ucol_strcoll` | `Collator::compare` | `Collator.compare` | `strcoll` | `CompareString`
+`ucol_getSortKey` | `Collator::getSortKey` | `Collator.getCollationKey` | `strxfrm` | `LCMapString`
+ | `Collator::getCollationKey` | | |
+
+For more sophisticated usage, such as user-controlled language-sensitive text
+searching, an iterating interface to collation is provided. Please refer to the
+section below on `CollationElementIterator` for more details.
+
+The `ucol_strcoll` function compares one pair of strings at a time. Comparing two
+strings is much faster than calculating sort keys for both of them. However, if
+comparisons should be done repeatedly on a very large number of strings, generating
+and storing sort keys can improve performance. In all other cases (such as quick
+sort or bubble sort of a
+moderately-sized list of strings), comparing strings works very well.
+
+The C API used for comparing two strings is `ucol_strcoll`. It requires two
+`UChar *` strings and their lengths as parameters, as well as a pointer to a valid
+`UCollator` instance. The result is a `UCollationResult` constant, which can be one
+of `UCOL_LESS`, `UCOL_EQUAL` or `UCOL_GREATER`.
+
+The C++ API offers the method `Collator::compare` with several overloads.
+Acceptable input arguments are `UChar *` with length of strings, or `UnicodeString`
+instances. The result is a member of the `UCollationResult` or `EComparisonResult` enums.
+
+The Java API provides the method `Collator.compare` with one overload. Acceptable
+input arguments are Strings or Objects. The result is an int value, which is
+less than zero if source is less than target, zero if source and target are
+equal, or greater than zero if source is greater than target.
+
+There are also several convenience functions and methods returning a boolean
+value, such as `ucol_greater`, `ucol_greaterOrEqual`, `ucol_equal` (in C)
+`Collator::greater`, `Collator::greaterOrEqual`, `Collator::equal` (in C++) and
+`Collator.equals` (in Java).
+
+### Examples
+
+**C:**
+
+```C
+UChar *s[] = { /* list of Unicode strings */ };
+uint32_t listSize = sizeof(s)/sizeof(s[0]);
+UErrorCode status = U_ZERO_ERROR;
+UCollator *coll = ucol_open("en_US", &status);
+uint32_t i, j;
+if(U_SUCCESS(status)) {
+    for(i=listSize-1; i>=1; i--) {
+        for(j=0; j<i; j++) {
+            if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_GREATER) {
+                swap(s[j], s[j+1]);
+            }
+        }
+    }
+    ucol_close(coll);
+}
+```
+
+**C++:**
+
+```C++
+UnicodeString s[] = { /* list of Unicode strings */ };
+uint32_t listSize = sizeof(s)/sizeof(s[0]);
+UErrorCode status = U_ZERO_ERROR;
+Collator *coll = Collator::createInstance(Locale("en", "US"), status);
+uint32_t i, j;
+if(U_SUCCESS(status)) {
+    for(i=listSize-1; i>=1; i--) {
+        for(j=0; j<i; j++) {
+            if(coll->compare(s[j], s[j+1]) == UCOL_GREATER) {
+                swap(s[j], s[j+1]);
+            }
+        }
+    }
+    delete coll;
+}
+```
+
+**Java:**
+
+```Java
+String s [] = { /* list of Unicode strings */ };
+try {
+    Collator coll = Collator.getInstance(Locale.US);
+    for (int i = s.length - 1; i >= 1; i--) {
+        for (int j = 0; j < i; j++) {
+            if (coll.compare(s[j], s[j + 1]) > 0) {
+                String temp = s[j];
+                s[j] = s[j + 1];
+                s[j + 1] = temp;
+            }
+        }
+    }
+} catch (Exception e) {
+    System.err.println("English collation creation failed.");
+    e.printStackTrace();
+}
+```
+
+### Getting Sort Keys
+
+To generate sort keys in C, use `ucol_getSortKey`. The function below computes
+the sort key for each string in an array, reusing a stack buffer and falling
+back to heap allocation only when a key does not fit. (`processSortKey`
+represents whatever processing the application performs on each key.)
+
+```C
+void getSortKeys(const UCollator *coll, const UChar * const *source,
+                 uint32_t arrayLength) {
+    uint8_t buffer[1000];   /* stack buffer, reused for each sort key */
+    uint8_t *currBuffer = buffer;
+    int32_t bufferLen = sizeof(buffer);
+    int32_t expectedLen = 0;
+    uint32_t i;
+
+    for (i = 0; i < arrayLength; i++) {
+        expectedLen = ucol_getSortKey(coll, source[i], -1, currBuffer, bufferLen);
+        if (expectedLen > bufferLen) {
+            /* The key did not fit: grow the buffer and try again. */
+            if (currBuffer == buffer) {
+                currBuffer = (uint8_t *)malloc(expectedLen);
+            } else {
+                currBuffer = (uint8_t *)realloc(currBuffer, expectedLen);
+            }
+            bufferLen = ucol_getSortKey(coll, source[i], -1, currBuffer, expectedLen);
+        }
+        processSortKey(i, currBuffer, bufferLen);
+    }
+
+    if (currBuffer != buffer && currBuffer != NULL) {
+        free(currBuffer);
+    }
+}
+```
+
+> :point_right: **Note** Although the API allows you to call
+> `ucol_getSortKey` with `NULL` to see what the
+> sort key length is, it is strongly recommended that you NOT determine the length
+> first, then allocate and fill the sort key buffer. If you do, it requires twice
+> the processing since computing the length has to do the same calculation as
+> actually getting the sort key. Instead, the example shown above uses a stack buffer.
+
+### Using Iterators for String Comparison
+
+ICU4C's `ucol_strcollIter` API allows for comparing two strings that are supplied
+as character iterators (`UCharIterator`). This is useful when you need to compare
+differently encoded strings using `strcoll`. In that case, converting the strings
+first would probably be wasteful, since `strcoll` usually gives the result
+before whole strings are processed. This API is implemented only as a C function
+in ICU4C. There are no equivalent C++ or ICU4J functions.
+
+```C
+...
+/* we are arriving with two char*: utf8Source and utf8Target, with their
+* lengths in utf8SourceLen and utf8TargetLen
+*/
+ UCharIterator sIter, tIter;
+ uiter_setUTF8(&sIter, utf8Source, utf8SourceLen);
+ uiter_setUTF8(&tIter, utf8Target, utf8TargetLen);
+ compareResultUTF8 = ucol_strcollIter(myCollation, &sIter, &tIter, &status);
+...
+```
+
+### Obtaining Partial Sort Keys
+
+When using different sort algorithms, such as radix sort, sometimes it is useful
+to process strings only as much as needed to feed into the sorting algorithm.
+For that purpose, ICU provides the `ucol_nextSortKeyPart` API, which also takes
+character iterators. This API allows for iterating over subsequent pieces of an
+uncompressed sort key. Between calls to the API you need to save a 64-bit state.
+Following is an example of simulating a string compare function using the partial
+sort key API. Your usage model is bound to look much different.
+
+```C
+static UCollationResult compareUsingPartials(UCollator *coll,
+ const UChar source[], int32_t sLen,
+ const UChar target[], int32_t tLen,
+ int32_t pieceSize, UErrorCode *status) {
+ int32_t partialSKResult = 0;
+ UCharIterator sIter, tIter;
+ uint32_t sState[2], tState[2];
+ int32_t sSize = pieceSize, tSize = pieceSize;
+ int32_t i = 0;
+ uint8_t sBuf[16384], tBuf[16384];
+ if(pieceSize > 16384) {
+ *status = U_BUFFER_OVERFLOW_ERROR;
+ return UCOL_EQUAL;
+ }
+ *status = U_ZERO_ERROR;
+ sState[0] = 0; sState[1] = 0;
+ tState[0] = 0; tState[1] = 0;
+ while(sSize == pieceSize && tSize == pieceSize && partialSKResult == 0) {
+ uiter_setString(&sIter, source, sLen);
+ uiter_setString(&tIter, target, tLen);
+ sSize = ucol_nextSortKeyPart(coll, &sIter, sState, sBuf, pieceSize, status);
+ tSize = ucol_nextSortKeyPart(coll, &tIter, tState, tBuf, pieceSize, status);
+ partialSKResult = memcmp(sBuf, tBuf, pieceSize);
+ }
+
+ if(partialSKResult < 0) {
+ return UCOL_LESS;
+ } else if(partialSKResult > 0) {
+ return UCOL_GREATER;
+ } else {
+ return UCOL_EQUAL;
+ }
+}
+```
+
+### Other Examples
+
+A longer example is presented in the 'Examples' section. Here is an illustration
+of the usage model.
+
+**C:**
+
+```C
+#define MAX_KEY_SIZE 100
+#define MAX_BUFFER_SIZE 10000
+#define MAX_LIST_LENGTH 5
+const char *text[] = {
+    "Quick",
+    "fox",
+    "Moving",
+    "trucks",
+    "riddle"
+};
+UChar s[MAX_LIST_LENGTH][20];
+int32_t i;
+int32_t length, expectedLen;
+uint8_t temp[MAX_BUFFER_SIZE];
+uint8_t *temp2 = temp;
+uint8_t keys[MAX_LIST_LENGTH][MAX_KEY_SIZE];
+UErrorCode status = U_ZERO_ERROR;
+
+for (i = 0; i < MAX_LIST_LENGTH; i++) {
+    u_uastrcpy(s[i], text[i]);
+}
+
+UCollator *coll = ucol_open("en_US", &status);
+if (U_SUCCESS(status)) {
+    length = MAX_BUFFER_SIZE;
+    for (i = 0; i < MAX_LIST_LENGTH; i++) {
+        expectedLen = ucol_getSortKey(coll, s[i], -1, temp2, length);
+        if (expectedLen > length) {
+            if (temp2 == temp) {
+                temp2 = (uint8_t *)malloc(expectedLen);
+            } else {
+                temp2 = (uint8_t *)realloc(temp2, expectedLen);
+            }
+            length = ucol_getSortKey(coll, s[i], -1, temp2, expectedLen);
+        }
+        memcpy(keys[i], temp2, length);
+    }
+    /* Sort keys are zero-terminated byte strings, so they can be
+       ordered with a binary comparison such as strcmp. */
+    qsort(keys, MAX_LIST_LENGTH, MAX_KEY_SIZE * sizeof(uint8_t),
+          (int (*)(const void *, const void *))strcmp);
+    if (temp2 != temp) {
+        free(temp2);
+    }
+    ucol_close(coll);
+}
+```
+
+**C++:**
+
+```C++
+#define MAX_LIST_LENGTH 5
+const UnicodeString s [] = {
+ "Quick",
+ "fox",
+ "Moving",
+ "trucks",
+ "riddle"
+};
+CollationKey *keys[MAX_LIST_LENGTH];
+UErrorCode status = U_ZERO_ERROR;
+Collator *coll = Collator::createInstance(Locale("en_US"), status);
+uint32_t i;
+if(U_SUCCESS(status)) {
+    for(i=0; i<MAX_LIST_LENGTH; i++) {
+        keys[i] = new CollationKey();
+        coll->getCollationKey(s[i], *keys[i], status);
+    }
+    /* compareKeys is a user-supplied comparator for qsort */
+    qsort(keys, MAX_LIST_LENGTH, sizeof(CollationKey*), compareKeys);
+    for(i=0; i<MAX_LIST_LENGTH; i++) {
+        delete keys[i];
+    }
+    delete coll;
+}
+```
+
+**Java:**
+
+```Java
+String s [] = {
+ "Quick",
+ "fox",
+ "Moving",
+ "trucks",
+ "riddle"
+};
+CollationKey keys[] = new CollationKey[s.length];
+try {
+ Collator coll = Collator.getInstance(Locale.US);
+ for (int i = 0; i < s.length; i ++) {
+ keys[i] = coll.getCollationKey(s[i]);
+ }
+
+ Arrays.sort(keys);
+}
+catch (Exception e) {
+ System.err.println("Error creating English collator");
+ e.printStackTrace();
+}
+```
+
+## CollationElementIterator
+
+A collation element iterator can only be used in one direction. This is
+established at the time of the first call to retrieve a collation element. Once
+`ucol_next` (C), `CollationElementIterator::next` (C++) or
+`CollationElementIterator.next` (Java) are invoked,
+`ucol_previous` (C),
+`CollationElementIterator::previous` (C++) or `CollationElementIterator.previous`
+(Java) should not be used (and vice versa). The direction can be changed
+immediately after `ucol_first`, `ucol_last`, `ucol_reset` (in C),
+`CollationElementIterator::first`, `CollationElementIterator::last`,
+`CollationElementIterator::reset` (in C++) or `CollationElementIterator.first`,
+`CollationElementIterator.last`, `CollationElementIterator.reset` (in Java) is
+called, or when it reaches the end of string while traversing the string.
+
+When `ucol_next` is called at the end of the string buffer, `UCOL_NULLORDER` is
+always returned with any subsequent calls to `ucol_next`. The same applies to
+`ucol_previous`.
+
+An example of how iterators are used is the Boyer-Moore search implementation,
+which can be found in the samples section.
+
+### API Example
+
+**C:**
+
+```C
+UErrorCode status = U_ZERO_ERROR;
+UCollator *coll = ucol_open("en_US", &status);
+UChar text[20];
+UCollationElements *collelemitr;
+uint32_t collelem;
+
+u_uastrcpy(text, "text");
+collelemitr = ucol_openElements(coll, text, -1, &status);
+collelem = 0;
+do {
+ collelem = ucol_next(collelemitr, &status);
+} while (collelem != UCOL_NULLORDER);
+
+ucol_closeElements(collelemitr);
+ucol_close(coll);
+```
+
+**C++:**
+
+```C++
+UErrorCode status = U_ZERO_ERROR;
+Collator *coll = Collator::createInstance(Locale::getUS(), status);
+UnicodeString text("text");
+CollationElementIterator *collelemitr = coll->createCollationElementIterator(text);
+uint32_t collelem = 0;
+do {
+ collelem = collelemitr->next(status);
+} while (collelem != CollationElementIterator::NULLORDER);
+
+delete collelemitr;
+delete coll;
+```
+
+**Java:**
+
+```Java
+try {
+ RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance(Locale.US);
+ String text = "text";
+ CollationElementIterator collelemitr = coll.getCollationElementIterator(text);
+ int collelem = 0;
+ do {
+ collelem = collelemitr.next();
+ } while (collelem != CollationElementIterator.NULLORDER);
+} catch (Exception e) {
+ System.err.println("Error in collation iteration");
+ e.printStackTrace();
+}
+```
+
+## Setting and Getting Attributes
+
+The general attribute setting APIs are `ucol_setAttribute` (in C) and
+`Collator::setAttribute` (in C++). These APIs take an attribute name and an
+attribute value. If the name and the value pass a syntax and range check, the
+property of the collator is changed. If the name and value do not pass a syntax
+and range check, however, the state is not changed and the error code variable
+is set to an error condition. The Java version does not provide general
+attribute setting APIs; instead, each attribute has its own setter API of
+the form `RuleBasedCollator.setATTRIBUTE_NAME(arguments)`.
+
+The attribute getting APIs are `ucol_getAttribute` (C) and `Collator::getAttribute`
+(C++). Both APIs require an attribute name as an argument and return an
+attribute value if a valid attribute name was supplied. If a valid attribute
+name was not supplied, however, they return an undefined result and set the
+error code. Similarly to the setter APIs for the Java version, no generic getter
+API is provided. Each attribute has its own setter API of the form
+`RuleBasedCollator.getATTRIBUTE_NAME()` in the Java version.
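+
+As a brief sketch in C++, lowering the strength to primary makes case and
+accent differences irrelevant for comparison:
+
+```C++
+#include <unicode/coll.h>
+#include <unicode/locid.h>
+#include <unicode/unistr.h>
+#include <memory>
+#include <cassert>
+
+using icu::Collator;
+using icu::Locale;
+using icu::UnicodeString;
+
+int main() {
+    UErrorCode status = U_ZERO_ERROR;
+    std::unique_ptr<Collator> coll(
+        Collator::createInstance(Locale::getUS(), status));
+    if (U_FAILURE(status)) { return 1; }
+
+    // At the default (tertiary) strength, accents are significant.
+    assert(coll->compare(UnicodeString(u"role"), UnicodeString(u"rôle"),
+                         status) != UCOL_EQUAL);
+
+    // At primary strength, only base letters are compared.
+    coll->setAttribute(UCOL_STRENGTH, UCOL_PRIMARY, status);
+    assert(U_SUCCESS(status));
+    assert(coll->compare(UnicodeString(u"role"), UnicodeString(u"rôle"),
+                         status) == UCOL_EQUAL);
+    return 0;
+}
+```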
+
+## References
+
+1. Ken Whistler, Markus Scherer: "Unicode Technical Standard #10, Unicode
+   Collation Algorithm"
+
+2. ICU Design doc: "Collation v2"
+
+3. Mark Davis: "ICU Collation Design Document"
+
+4. The Unicode Standard, chapter 5, "Implementation Guidelines"
+
+5. Laura Werner: "Efficient text searching in Java: Finding the right string in
+   any language"
+
+6. Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization
+   Forms"
diff --git a/docs/userguide/collation/architecture.md b/docs/userguide/collation/architecture.md
new file mode 100644
index 00000000000..16c78a45697
--- /dev/null
+++ b/docs/userguide/collation/architecture.md
@@ -0,0 +1,562 @@
+
+
+# Collation Service Architecture
+
+This section describes the design principles, architecture and coding
+conventions of the ICU Collation Service.
+
+## Collator
+
+To use the Collation Service, a Collator must first be instantiated. A
+Collator is a data structure or object that maintains all of the property
+and state information necessary to define and support the specific collation
+behavior provided. Examples of properties described in the Collator are the
+locale, whether normalization is to be performed, and how many levels of
+collation are to be evaluated. Examples of the state information described in
+the Collator include the direction of a Collation Element Iterator (forward
+or backward) and the status of the last API executed.
+
+The Collator is instantiated either by referencing a locale or by defining a
+custom set of rules (a tailoring).
+
+The Collation Service uses the paradigm:
+
+1. Open a Collator,
+
+2. Use while necessary,
+
+3. Close the Collator.
+
+A Collator instance must not be shared among threads. Instead, open a
+separate collator for each thread. The safe clone function is supported for
+cloning collators in a thread-safe fashion.
+
+The Collation Service follows the ICU conventions for locale designation
+when opening collators:
+
+1. NULL means the default locale.
+
+2. The empty locale name ("") means the root locale.
+
+The Collation Service adheres to the ICU conventions described in the
+"[ICU Architectural Design](../design.md)" section of the user guide.
+In particular:
+
+1. The standard error code convention is usually followed. (Functions that do
+   not take an error code parameter do so for backward compatibility.)
+
+2. The string length convention is followed: when passing a `UChar *`, the
+   length is required in a separate argument. If -1 is passed for the length,
+   the string is assumed to be zero terminated.
+
+### Collation locale and keyword handling
+
+When a collator is created from a locale, the collation service (like all ICU
+services) must map the requested locale to the localized collation data
+available to ICU at the time. It does so using the standard ICU locale fallback
+mechanism. See the fallback section of the [locale
+chapter](../locale/index.md) for more details.
+
+If you pass in a regular locale, like "en_US", the collation service first
+searches with fallback for the "collations/default" key. The first such key it
+finds will have an associated string value; this is the keyword name for the
+collation that is default for this locale. If the search falls all the way back
+to the root locale, the collation service will use the "collations/default" key
+there, which has the value "standard".
+
+If there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the
+collation service searches with fallback for "collations/phonebook". If the
+search is successful, the collation service uses the string value it finds to
+instantiate a Collator. If the search fails because no such key is present in
+any of ICU's locale data (e.g., "de@collation=funky"), the service returns a
+collator implementing the default tailoring of the locale.
+If the fallback is all the way to the root locale, then
+the return `UErrorCode` is `U_USING_DEFAULT_WARNING`.
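+
+The resolved locale can be inspected with the C API
+`ucol_getFunctionalEquivalent`, which reports the locale whose collation data
+would actually be used (a sketch; the queried locale and output formatting are
+illustrative):
+
+```C++
+#include <unicode/ucol.h>
+#include <cstdio>
+
+int main() {
+    UErrorCode status = U_ZERO_ERROR;
+    char equiv[64];
+    UBool isAvailable = 0;
+    // Which locale's data serves "de_AT@collation=phonebook"?
+    ucol_getFunctionalEquivalent(equiv, sizeof(equiv), "collation",
+                                 "de_AT@collation=phonebook",
+                                 &isAvailable, &status);
+    if (U_SUCCESS(status)) {
+        printf("functional equivalent: %s\n", equiv);
+    }
+    return 0;
+}
+```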
+
+## Input values for collation
+
+Collation deals with processing strings. ICU generally requires that all
+strings be in UTF-16 format, and that any required conversion be done before
+ICU functions are called. For collation, however, there are also APIs that
+can take instances of character iterators (`UCharIterator`) or UTF-8
+directly.
+
+Theoretically, character iterators can iterate strings in any encoding. ICU
+currently provides character iterator implementations for UTF-8 and UTF-16BE
+(useful when processing data from a big endian platform on a little endian
+machine). Note, however, that using iterators with the collation APIs has a
+performance impact. Iterators should be used in situations where it is not
+desirable to convert whole strings before the operation, such as when using a
+string compare function.
+
+## Collation Elements
+
+As discussed in the introduction, there are many possible orderings for sorted
+text, depending on language and other factors. Ideally, there is a way to
+describe each ordering as a set of rules for calculating numeric values for each
+string of text. The collation process then becomes one of simply comparing these
+numeric values.
+
+This essentially describes the way the Collation Service works. To implement
+a particular sort ordering, first the relationship between each character or
+character sequence is derived. For example, a Spanish ordering defines the
+letter sequence "CH" to be between the letters "C" and "D". As also discussed in
+the introduction, to order strings properly requires that comparison of base
+letters must be considered separately from comparison of accents. Letter case
+must also be considered separately from either base letters or accents. Any
+ordering specification language must provide a way to define the relationships
+between characters or character sequences on multiple levels. ICU supports this
+by using "<" to describe a relationship at the primary level, using "<<" to
+describe a relationship at the secondary level, and using "<<<" to describe a
+relationship at the tertiary level. Here are some example usages:
+
+Symbol | Example | Description
+------ | -------- | -----------
+`<` | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch"
+`<<` | `a << ä` | Make a secondary (accent) difference between "a" and "ä"
+`<<<` | `a <<< A` | Make a tertiary (case) difference between "a" and "A"
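+
+The Spanish example can be expressed directly as a tailoring. The sketch
+below (rule string and test words chosen for illustration) builds a
+`RuleBasedCollator` from the rule and checks the resulting order:
+
+```C++
+#include <unicode/tblcoll.h>
+#include <unicode/unistr.h>
+#include <cassert>
+
+using icu::RuleBasedCollator;
+using icu::UnicodeString;
+
+int main() {
+    UErrorCode status = U_ZERO_ERROR;
+    // Traditional Spanish: "ch" is a letter sorting between "c" and "d".
+    RuleBasedCollator coll(UnicodeString(u"&C < ch <<< Ch <<< CH"), status);
+    if (U_FAILURE(status)) { return 1; }
+    // "ch" now sorts after every other string beginning with "c" ...
+    assert(coll.compare(UnicodeString(u"cz"), UnicodeString(u"ch"),
+                        status) == UCOL_LESS);
+    // ... but still before "d".
+    assert(coll.compare(UnicodeString(u"ch"), UnicodeString(u"d"),
+                        status) == UCOL_LESS);
+    return 0;
+}
+```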
+
+## Sort Keys
+
+A sort key is a sequence of bytes whose binary comparison gives the same
+ordering as a full collation comparison of the original strings.
+
+### Sort key size
+
+One of the more important issues when considering using sort keys is the sort
+key size. Unfortunately, it is very hard to give a fast exact answer to the
+following question: "What is the maximum size for sort keys generated for
+strings of size X". This problem is twofold:
+
+1. The maximum size of the sort key depends on the size of the collation
+ elements that are used to build it. Size of collation elements vary greatly
+ and depends both on the alphabet in question and on the locale used.
+
+2. Compression is used in building sort keys. Most 'regular' sequences of
+ characters produce very compact sort keys.
+
+If you assume the worst case and use overly large buffers, a lot of space will
+be wasted. However, if you use too-small buffers, you will lose performance if
+generated sort keys are longer than supplied buffers too often
+(and you have to reallocate for each of those).
+A good strategy
+for this problem would be to manually manage a large buffer for storing sortkeys
+and keep a list of indices to sort keys in this buffer (see the "large buffers"
+[Collation Example](examples.md#using-large-buffers-to-manage-sort-keys)
+for more details).
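The bookkeeping behind that strategy can be sketched in a few lines of Python (the class name and layout are illustrative, not an ICU API): one big buffer holds all keys, and a side list records where each key starts.

```python
class SortKeyStore:
    """One shared buffer for all sort keys, plus per-key offsets.

    Illustrative sketch only: real code would store the bytes produced
    by ucol_getSortKey(), but the bookkeeping is the same.
    """

    def __init__(self):
        self.buf = bytearray()
        self.offsets = [0]          # offsets[i]..offsets[i+1] is key i

    def add(self, key_bytes):
        """Append one sort key; return its index."""
        self.buf += key_bytes
        self.offsets.append(len(self.buf))
        return len(self.offsets) - 2

    def get(self, i):
        return bytes(self.buf[self.offsets[i]:self.offsets[i + 1]])
```

This avoids one heap allocation per key and keeps the keys contiguous in memory, which also helps cache behavior during the sort itself.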
+
+Here are some rules of thumb, but please do not rely on them. If you are looking
+at the East Asian locales, you probably want to go with 5 bytes per code point.
+For Thai, 3 bytes per code point should be sufficient. For all the other locales
+(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point.
+These values are based on average lengths of sort keys generated with tertiary
+strength. If you need quaternary and identical strength (you should not), add 3
+bytes per code point to each of these.
+
+### Partial sort keys
+
+In some cases, most notably when implementing [radix
+sorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only
+parts of sort keys at a time. ICU4C 2.6+ provides a function for producing
+sort keys in parts (`ucol_nextSortKeyPart`). These sort keys may or may not be
+compressed; that is, they may or may not be compatible with regular sort keys.
+
+### Merging sort keys
+
+Sometimes, it is useful to be able to merge sort keys. One example is having
+separate sort keys for first and last names. If you need to perform an operation
+that requires a sort key generated on the whole name, instead of concatenating
+strings and regenerating sort keys, you should merge the sort keys. The merging
+is done by merging the corresponding levels while inserting a terminator between
+merged parts. The reserved sort key byte value for the merge terminator is 0x02.
+For more details see [UCA section 1.6, Merging Sort
+Keys](http://www.unicode.org/reports/tr10/#Interleaved_Levels).
+
+* C API: `unicode/ucol.h` `ucol_mergeSortkeys()`
+* Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)`
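The level-wise merge can be sketched in Python, assuming a simplified key layout of `level1 0x01 level2 0x01 … 0x00` (the real ICU byte layout is more involved, but the separator values are the ones described above):

```python
LEVEL_SEPARATOR = b"\x01"   # separates levels inside a sort key
MERGE_SEPARATOR = b"\x02"   # reserved byte between merged parts
TERMINATOR = b"\x00"

def merge_sort_keys(key1, key2):
    """Merge two sort keys level by level, inserting the 0x02 merge
    separator between the corresponding parts of each level."""
    levels1 = bytes(key1).rstrip(TERMINATOR).split(LEVEL_SEPARATOR)
    levels2 = bytes(key2).rstrip(TERMINATOR).split(LEVEL_SEPARATOR)
    merged = [a + MERGE_SEPARATOR + b for a, b in zip(levels1, levels2)]
    return LEVEL_SEPARATOR.join(merged) + TERMINATOR
```

For example, merging `b"\x30\x01\x05\x00"` and `b"\x40\x01\x05\x00"` yields `b"\x30\x02\x40\x01\x05\x02\x05\x00"`. Real applications should call `ucol_mergeSortkeys()` or `CollationKey.merge()` rather than reimplementing this.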
+
+CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is
+intended to allow concatenating strings like firstName+\\uFFFE+lastName to yield
+the same results as merging their individual sort keys.
+This has been fully implemented in ICU since version 53.
+
+### Generating bounds for a sort key (prefix matching)
+
+Having sort keys for strings allows for easy creation of bounds: sort keys that
+are guaranteed to be smaller or larger than any sort key from a given range. For
+example, if bounds are produced for the sort key of the string "smith", the
+strings between the primary-level upper and lower bounds would include "Smith",
+"SMITH", "sMiTh". Two kinds of upper bounds can be generated: the first one will match only
+strings of equal length, while the second one will match all the strings with
+the same initial prefix.
+
+CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum
+primary weight, so that for example the string "smith\\uFFFF" can be used as the
+upper bound rather than modifying the sort key for "smith".
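A minimal sketch of the U+FFFF technique, using plain code point comparison in place of real collation (ICU would compare sort keys or use `ucol_strcoll`, but the bounding idea is the same):

```python
def prefix_bounds(prefix):
    """Lower/upper bounds that bracket every string starting with prefix;
    U+FFFF compares greater than any BMP character that can follow."""
    return prefix, prefix + "\uffff"

lo, hi = prefix_bounds("smith")
names = ["smithson", "smith", "smythe", "smith-jones"]
matches = [n for n in names if lo <= n <= hi]   # prefix matches only
```

Here `matches` keeps "smithson", "smith", and "smith-jones" while excluding "smythe", without modifying any sort key by hand.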
+
+## Collation Element Iterator
+
+The collation element iterator is used for traversing Unicode string collation
+elements one at a time. It can be used to implement language-sensitive text
+search algorithms like Boyer-Moore.
+
+For most applications, the two API categories, compare and sort key, are
+sufficient. Most people do not need to manipulate collation elements directly.
+
+Example:
+
+Consider iterating over "apple" and "äpple". Here are sequences of collation
+elements:
+
+String 1 | String 1 Collation Elements
+-------- | ---------------------------
+a | `[1900.05.05]`
+p | `[3700.05.05]`
+p | `[3700.05.05]`
+l | `[2F00.05.05]`
+e | `[2100.05.05]`
+
+String 2 | String 2 Collation Elements
+-------- | ---------------------------
+a | `[1900.05.05]`
+\\u0308 | `[0000.9D.05]`
+p | `[3700.05.05]`
+p | `[3700.05.05]`
+l | `[2F00.05.05]`
+e | `[2100.05.05]`
+
+The resulting CEs are typically masked according to the desired strength, and
+zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for
+primary strength) zeroes out the secondary and tertiary weights; the CE for the
+combining diaeresis becomes zero and is discarded. The collator then sees
+identical primary sequences and declares a match. For more details see the
+paper "Efficient text searching in Java™: Finding the right string in any
+language" by Laura Werner.
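Using the collation elements from the tables above, the masking step can be reproduced in a few lines of Python. The mask value and CE numbers are taken directly from the example; this is a model of the idea, not the ICU implementation:

```python
PRIMARY_MASK = 0xFFFF0000          # keep only the primary weight

APPLE = [0x19000505, 0x37000505, 0x37000505, 0x2F000505, 0x21000505]
AEPPLE = [0x19000505, 0x00009D05,  # a, then the combining diaeresis
          0x37000505, 0x37000505, 0x2F000505, 0x21000505]

def primary_ces(ces):
    """Mask each CE to its primary weight; drop CEs that become zero."""
    return [ce & PRIMARY_MASK for ce in ces if ce & PRIMARY_MASK]

# "apple" and "äpple" match at primary strength:
assert primary_ces(APPLE) == primary_ces(AEPPLE)
```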
+
+## Collation Attributes
+
+The Collation Service has a number of attributes whose values can be changed
+during run time. These attributes affect both the functionality and the
+performance of the Collation Service. This section describes these
+attributes and, where possible, their performance impact. Performance
+indications are only approximate and timings may vary significantly depending on
+the CPU, compiler, etc.
+
+Although string comparison by ICU and comparison of each string's sort key give
+the same results, attribute settings can impact the execution time of each
+method differently. To be precise in the discussion of performance, this section
+refers to the API employed in the measurement. The `ucol_strcoll` function is the
+API for string comparison. The `ucol_getSortKey` function is used to create sort
+keys.
+
+> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`,
+> that can be used to set any attribute to its default value
+> (which is inherited from the UCA and the tailoring).
+
+### Attribute Types
+
+#### Strength level
+
+Collation strength, or the maximum collation level used for comparison, is set
+by using the `UCOL_STRENGTH` attribute. Valid values are:
+
+1. `UCOL_PRIMARY`
+
+2. `UCOL_SECONDARY`
+
+3. `UCOL_TERTIARY` (default)
+
+4. `UCOL_QUATERNARY`
+
+5. `UCOL_IDENTICAL`
+
+#### French collation
+
+The `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary
+differences in reverse order. Valid values are:
+
+1. `UCOL_OFF` (default): compares secondary differences in the order they appear
+ in the string.
+
+2. `UCOL_ON`: causes secondary differences to be considered in reverse order, as
+ it is done in the French language.
+
+#### Normalization mode
+
+The `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`,
+controls whether text normalization is performed on the input strings. Valid
+values are:
+
+1. `UCOL_OFF` (default): turns off normalization check
+
+2. `UCOL_ON`: normalization is checked and the collator performs normalization
+   if it is needed.
+
+For example, the following table shows which character sequences are in FCD,
+NFC, and/or NFD form:
+
+Character sequence | FCD | NFC | NFD
+------------------ | --- | --- | ---
+A-ring | Y | Y |
+Angstrom | Y | |
+A + ring | Y | | Y
+A + grave | Y | Y |
+A-ring + grave | Y | |
+A + cedilla + ring | Y | | Y
+A + ring + cedilla | | |
+A-ring + cedilla | | Y |
+
+With normalization mode turned on, the `ucol_strcoll` function slows down by 10%.
+In addition, the time to generate a sort key also increases by about 25%.
+
+#### Alternate handling
+
+This attribute allows shifting of the variable characters (usually spaces and
+punctuation, in the UCA also most symbols) from the primary to the quaternary
+strength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For
+details see [UCA: Variable
+Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML:
+Collation
+Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
+and [“Ignore Punctuation” Options](customization/ignorepunct.md).
+
+1. `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as
+ all the other characters
+
+2. `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at
+ the primary, secondary and tertiary levels and their primary strengths will
+ be shifted to the quaternary level.
+
+#### Case Ordering
+
+Some conventions require uppercase letters to sort before lowercase ones, while
+others require the opposite. This attribute is controlled by the value of the
+`UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary
+weights along with other appearance characteristics (like circling of letters).
+The case-first attribute emphasizes the case property of the letters by
+reordering the tertiary weights so that either uppercase or lowercase letters
+sort first. This difference gets the most significant bit in the weight.
+Valid values for this attribute are:
+
+1. `UCOL_OFF` (default): leave tertiary weights unaffected
+
+2. `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort
+ before uppercase
+
+3. `UCOL_UPPER_FIRST`: causes uppercase letters to sort first
+
+The case-first attribute does not affect the performance substantially.
+
+#### Case level
+
+When this attribute is set, an additional level is formed between the secondary
+and tertiary levels, known as the Case Level. The case level is used to
+distinguish large and small Japanese Kana characters. The case level can also be
+used in other situations, for example to distinguish certain Pinyin characters.
+The case level is controlled by the `UCOL_CASE_LEVEL` attribute. Valid values for
+this attribute are:
+
+1. `UCOL_OFF` (default): no additional case level
+
+2. `UCOL_ON`: adds a case level
+
+#### Hiragana Quaternary
+
+*This setting is deprecated and ignored in recent versions of ICU.*
+
+Hiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points
+will sort before everything else on the quaternary level. If set to `UCOL_OFF`
+Hiragana letters are treated the same as all the other code points. This setting
+can be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute.
+You probably won't need to use it.
+
+#### Variable Top
+
+Variable Top is a boundary which decides whether the code points will be treated
+as variable (shifted to quaternary level in the **shifted** mode) or
+non-ignorable. Special APIs are used for setting the variable top. It can
+be set either to a code point or to a primary weight value.
+
+## Performance
+
+ICU collation is designed to be fast, small and customizable. Several techniques
+are used to enhance the performance:
+
+1. Providing optimized processing for Latin characters.
+
+2. Comparing strings incrementally and stopping at the first significant
+ difference.
+
+3. Tuning to eliminate unnecessary file access or memory allocation.
+
+4. Providing efficient preflight functions that allow fast estimation of the
+   sort key size.
+
+5. Using a single, shared copy of UCA in memory for the read-only default sort
+ order. Only small tailoring tables are kept in memory for locale-specific
+ customization.
+
+6. Compressing sort keys efficiently.
+
+7. Making the sort order data-driven.
+
+In general, you can expect the best performance from the Collation Service by
+doing the following:
+
+1. After opening a collator, keep and reuse it until done. Do not open new
+ collators for the same sort order. (Note the restriction on
+ multi-threading.)
+
+2. Use `ucol_strcoll` etc. when comparing strings. If it is necessary to
+ compare strings thousands or millions of times,
+ create the sort keys first and compare the sort keys instead.
+ Generating the sort keys of two strings is about 5-10
+ times slower than just comparing them directly.
+
+3. Follow the best practice guidelines for generating sort keys. Do not call
+ `ucol_getSortKey` twice to first size the key and then allocate the sort key
+ buffer and repeat the call to the function to fill in the buffer.
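The recommended pattern can be sketched as follows. Here `get_sort_key` is a hypothetical stand-in for `ucol_getSortKey` (which returns the full key length even when the supplied buffer is too small), so only an occasional second call is needed:

```python
def get_sort_key(s, buf):
    """Hypothetical stand-in for ucol_getSortKey: fills buf as far as it
    can and always returns the full key length (placeholder derivation)."""
    key = s.encode("utf-8") + b"\x00"
    n = min(len(key), len(buf))
    buf[:n] = key[:n]
    return len(key)

def make_sort_key(s, initial_size=64):
    """One call with a reasonably sized buffer; grow and retry only on
    the rare overflow, instead of always calling twice."""
    buf = bytearray(initial_size)
    needed = get_sort_key(s, buf)
    if needed > len(buf):               # rare case: buffer was too small
        buf = bytearray(needed)
        needed = get_sort_key(s, buf)
    return bytes(buf[:needed])
```

Most strings are served by the single call; only oversized keys pay for a second one.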
+
+### Performance and Storage Implications of Attributes
+
+Most people use the default attributes when comparing strings or when creating
+sort keys. When they do want to customize the ordering, the most common options
+are the following:
+
+`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\
+Used to ignore space and punctuation characters
+
+`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\
+Used to ignore space and punctuation characters except when there are no other letter, accent, or case/variable differences.
+
+`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\
+Used to change the ordering of upper vs. lower case letters (as
+well as small vs. large kana)
+
+`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\
+Used to ignore only the accent differences.
+
+`UCOL_NORMALIZATION_MODE == UCOL_ON`\
+Force to always check for normalization. This
+is used if the input text may not be in FCD form.
+
+`UCOL_FRENCH_COLLATION == UCOL_OFF`\
+This is only useful for languages like French and Catalan that may turn this attribute on.
+(It is the default only for Canadian French ("fr-CA").)
+
+In String Comparison, most of these options have little or no effect on
+performance. The only noticeable one is normalization, which can cost 10%-40% in
+performance.
+
+For Sort Keys, most of these options either leave the storage alone or reduce
+it. Shifting can reduce the storage by about 10%-20%; case level + primary-only
+can decrease it about 20% to 40%. Using no French accents can reduce the storage
+by about 38%, but only for languages like French and Catalan that turn it on by
+default. On the other hand, using Shifted + Quaternary can increase the storage by
+10%-15%. (The Identical Level also increases the length, but this option is not
+recommended).
+
+> :point_right: **Note** All of the above numbers are based on
+> tests run on a particular machine, with a particular set of data.
+> (The data for each language is a large number of names
+> in that language.)
+> The performance and storage may vary, depending on the particular computer,
+> operating system, and data.
+
+## Versioning
+
+Sort keys are often stored on disk for later reuse. A common example is the use
+of keys to build indexes in databases. When comparing keys, it is important to
+know that both keys were generated by the same algorithms and weightings.
+Otherwise, identical strings with keys generated on two different dates, for
+example, might compare as unequal. Sort keys can be affected by new versions of
+ICU or its data tables, new sort key formats, or changes to the Collator.
+Starting with release 1.8.1, ICU provides a versioning mechanism to identify the
+version information of the following (among others):
+
+1. The run-time executable
+
+2. The collation element content
+
+3. The Unicode/UCA database
+
+4. The tailoring table
+
+The version information of Collator is a 32-bit integer. If a new version of ICU
+has changes affecting the content of collation elements, the version information
+will be changed. In that case, using the new version of the ICU collator will
+require regenerating any saved or stored sort keys.
+
+However, it is possible to modify ICU code or data without changing relevant version numbers,
+so it is safer to regenerate sort keys any time after any part of ICU has been updated.
+
+Since ICU4C 1.8.1,
+it is possible to build your program so that it uses more than one version of
+ICU (only in C/C++, not in Java). Therefore, you could use the current version
+for the features you need and use the older version for collation.
+
+## Programming Examples
+
+See the [Collation Examples](examples.md) chapter for an example of how to
+compare and create sort keys with the default locale in C, C++ and Java.
diff --git a/docs/userguide/collation/concepts.md b/docs/userguide/collation/concepts.md
new file mode 100644
index 00000000000..c8468b54db8
--- /dev/null
+++ b/docs/userguide/collation/concepts.md
@@ -0,0 +1,814 @@
+
+
+# Collation Concepts
+
+The previous section demonstrated many of the requirements imposed on string
+comparison routines that try to correctly collate strings according to
+conventions of more than a hundred different languages, written in many
+different scripts. This section describes the principles and architecture behind
+the ICU Collation Service.
+
+## Sortkeys vs Comparison
+
+Sort keys are most useful in databases, where the overhead of calling a function
+for each comparison is very large.
+
+Generating a sort key from a Collator is many times more expensive than doing a
+compare with the same Collator (for common use cases), at least when the two
+functions are called from Java or C. So for those languages, unless there is a
+very large number of comparisons, it is better to call the compare function.
+
+Here is an example, with a little back-of-the-envelope calculation. Let's
+suppose that with a given language on a given platform, the compare performance
+(CP) is 100 times faster than sortKey performance (SP), and that you are doing a
+binary search of a list with 1,000 elements. The binary comparison performance
+is BP. We'd do about 10 comparisons, getting:
+
+compare: 10 \* CP
+
+sortkey: 1 \* SP + 10 \* BP
+
+Even if BP were free, compare would still be better. The list would have to grow
+until log2(n) = 100 before the two approaches break even.
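The arithmetic can be written out explicitly; the unit costs below are the assumptions from the text (compare 100 times faster than sort key generation, binary key comparison treated as free):

```python
import math

CP = 1.0           # cost of one incremental compare (arbitrary unit)
SP = 100.0 * CP    # cost of generating one sort key (assumed 100x CP)
BP = 0.0           # cost of one binary key comparison (assumed free)

n = 1000
comparisons = math.ceil(math.log2(n))      # about 10 for a 1,000-element list

compare_cost = comparisons * CP            # search by calling compare
sortkey_cost = 1 * SP + comparisons * BP   # build one key, then compare keys

assert compare_cost < sortkey_cost         # compare wins for a single search
```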
+
+But even this calculation is only a rough guide. First, the binary comparison is
+not completely free. Second, the performance of the compare function varies
+radically with the source data. We optimized for maximizing performance of
+collation in sorting and binary search, so comparing strings that are "close" is
+optimized to be much faster than comparing strings that are "far away". That
+optimization is important because normal sort/lookup operations compare close
+strings far more often -- think of binary search, where the last few comparisons
+are always with the closest strings. So even the above calculation is not very
+accurate.
+
+## Comparison Levels
+
+In general, when comparing and sorting objects, some properties can take
+precedence over others. For example, in geometry, you might consider first the
+number of sides a shape has, followed by the number of sides of equal length.
+This causes triangles to be sorted together, then rectangles, then pentagons,
+etc. Within each category, the shapes would be ordered according to whether they
+had 0, 2, 3 or more sides of the same length. However, this is not the only way
+the shapes can be sorted. For example, it might be preferable to sort shapes by
+color first, so that all red shapes are grouped together, then blue, etc.
+Another approach would be to sort the shapes by the amount of area they enclose.
+
+Similarly, character strings have properties, some of which can take precedence
+over others. There is more than one way to prioritize the properties.
+
+For example, a common approach is to distinguish characters first by their
+unadorned base letter (for example, without accents, vowels or tone marks), then
+by accents, and then by the case of the letter (upper vs. lower). Ideographic
+characters might be sorted by their component radicals and then by the number of
+strokes it takes to draw the character.
+An alternative ordering would be to sort these characters by strokes first and
+then by their radicals.
+
+The ICU Collation Service supports many levels of comparison (named "Levels",
+but also known as "Strengths"). Having these categories enables ICU to sort
+strings precisely according to local conventions. However, by allowing the
+levels to be selectively employed, searching for a string in text can be
+performed with various matching conditions.
+
+Performance optimizations have been made for ICU collation with the default
+level settings. Specific performance impacts are discussed in the Performance
+section below.
+
+Following is a list of the names for each level and an example usage:
+
+1. Primary Level: Typically, this is used to denote differences between base
+ characters (for example, "a" < "b"). It is the strongest difference. For
+ example, dictionaries are divided into different sections by base character.
+ This is also called the level-1 strength.
+
+2. Secondary Level: Accents in the characters are considered secondary
+ differences (for example, "as" < "às" < "at"). Other differences between
+ letters can also be considered secondary differences, depending on the
+ language. A secondary difference is ignored when there is a primary
+ difference anywhere in the strings. This is also called the level-2
+ strength.
+ Note: In some languages (such as Danish), certain accented letters are
+ considered to be separate base characters. In most languages, however, an
+ accented letter only has a secondary difference from the unaccented version
+ of that letter.
+
+3. Tertiary Level: Upper and lower case differences in characters are
+ distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
+ addition, a variant of a letter differs from the base form on the tertiary
+ level (such as "A" and "Ⓐ"). Another example is the difference between large
+ and small Kana. A tertiary difference is ignored when there is a primary or
+ secondary difference anywhere in the strings. This is also called the
+ level-3 strength.
+
+4. Quaternary Level: When punctuation is ignored (see Ignoring Punctuation
+   (§)) at levels 1-3, an additional level can be used to distinguish words with
+ and without punctuation (for example, "ab" < "a-b" < "aB"). This difference
+ is ignored when there is a primary, secondary or tertiary difference. This
+ is also known as the level-4 strength. The quaternary level should only be
+ used if ignoring punctuation is required or when processing Japanese text
+ (see Hiragana processing (§)).
+
+5. Identical Level: When all other levels are equal, the identical level is
+ used as a tiebreaker. The Unicode code point values of the NFD form of each
+   string are compared at this level, in case there is no difference at
+   levels 1-4. For example, Hebrew cantillation marks are only distinguished
+   at this level. This level should be used sparingly, as strings that differ
+   only in code point values are extremely rare.
+ Using this level substantially decreases the performance for
+ both incremental comparison and sort key generation (as well as increasing
+ the sort key length). It is also known as level 5 strength.
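A toy three-level key in Python illustrates how the levels nest; the weights are invented for the example and bear no relation to real ICU weights:

```python
import unicodedata

def collation_key(word):
    """Toy three-level key: base letters (primary), accents (secondary),
    case (tertiary). Weights are illustrative only."""
    nfd = unicodedata.normalize("NFD", word)
    primary, secondary, tertiary = [], [], []
    for ch in nfd:
        if unicodedata.combining(ch):
            secondary[-1] = 1     # accent attaches to the previous letter
        else:
            primary.append(ch.lower())
            secondary.append(0)
            tertiary.append(1 if ch.isupper() else 0)
    return (tuple(primary), tuple(secondary), tuple(tertiary))

# secondary differences are ignored when a primary difference exists:
assert sorted(["at", "às", "as"], key=collation_key) == ["as", "às", "at"]
# tertiary (case) differences rank below secondary (accent) ones:
assert sorted(["aò", "Ao", "ao"], key=collation_key) == ["ao", "Ao", "aò"]
```

Because Python compares tuples element by element, the primary tuple always dominates the secondary, and the secondary dominates the tertiary, which is exactly the level behavior described above.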
+
+## Backward Secondary Sorting
+
+Some languages require words to be ordered on the secondary level according to
+the *last* accent difference, as opposed to the *first* accent difference. This
+was previously the default for all French locales, based on some French
+dictionary ordering traditions, but is currently only applicable to Canadian
+French (locale **fr_CA**), for conformance with the [Canadian sorting
+standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
+ordering is only noticeable for a small number of pairs of real words. For more
+information see [UCA: Contextual
+Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).
+
+Example:
+
+Forward secondary | Backward secondary
+----------------- | ------------------
+cote | cote
+coté | côte
+côte | coté
+côté | côté
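The table above can be reproduced with a small Python model that keeps one accent weight per base letter and, for the backward case, simply compares those weights from the end of the word (the weights are illustrative, not ICU's):

```python
import unicodedata

def french_key(word, backward=False):
    """Base letters plus one accent weight per letter; backward=True
    compares the accent weights from the end of the word."""
    nfd = unicodedata.normalize("NFD", word)
    base, accents = [], []
    for ch in nfd:
        if unicodedata.combining(ch):
            accents[-1] = 1       # accent on the preceding base letter
        else:
            base.append(ch)
            accents.append(0)
    if backward:
        accents.reverse()
    return (tuple(base), tuple(accents))

words = ["côté", "côte", "coté", "cote"]
assert sorted(words, key=french_key) == ["cote", "coté", "côte", "côté"]
assert sorted(words, key=lambda w: french_key(w, backward=True)) == \
    ["cote", "côte", "coté", "côté"]
```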
+
+## Contractions
+
+A contraction is a sequence consisting of two or more letters. It is considered
+a single letter in sorting.
+
+For example, in the traditional Spanish sorting order, "ch" is considered a
+single letter. All words that begin with "ch" sort after all other words
+beginning with "c", but before words starting with "d".
+
+Other examples of contractions are "ch" in Czech, which sorts after "h", and
+"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
+respectively.
+
+Example:
+
+Order without contraction | Order with contraction "lj" sorting after letter "l"
+------------------------- | ----------------------------------------------------
+la | la
+li | li
+lj | lk
+lja | lz
+ljz | lj
+lk | lja
+lz | ljz
+ma | ma
+
+Contracting sequences such as the above are not very common in most languages.
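A sketch of how a contraction changes the ordering, reproducing the right-hand column of the table; the tokenizer and weights are invented for the example:

```python
def tokenize(word, contractions=("lj",)):
    """Split a word into collating units, treating each contraction as one unit."""
    units, i = [], 0
    while i < len(word):
        for c in contractions:
            if word.startswith(c, i):
                units.append(c)
                i += len(c)
                break
        else:
            units.append(word[i])
            i += 1
    return units

def contraction_key(word):
    # "lj" weighs just after a plain "l": (ord("l"), 1) vs. (ord("l"), 0)
    return [(ord(u[0]), 1) if len(u) > 1 else (ord(u), 0)
            for u in tokenize(word)]

words = ["la", "li", "lj", "lja", "ljz", "lk", "lz", "ma"]
assert sorted(words, key=contraction_key) == \
    ["la", "li", "lk", "lz", "lj", "lja", "ljz", "ma"]
```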
+
+> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
+> if a completely ignorable code point
+> appears in text in the middle of a contraction, it will not break the contraction.
+> For example, in Czech sorting, c U+0000 h will sort as if it were ch.
+
+## Expansions
+
+If a letter sorts as if it were a sequence of more than one letter, it is called
+an expansion.
+
+For example, in German phonebook sorting (de@collation=phonebook or BCP 47
+de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae."
+All words starting with "ä" will sort between words starting with "ad" and words
+starting with "af".
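A minimal model of an expansion: map the letter to its multi-letter primary equivalent before comparing. The mapping table is illustrative, and input is normalized to NFC so the single-character lookup works:

```python
import unicodedata

EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}   # German phonebook style

def phonebook_primary(word):
    """Primary key with expansions applied (case and accents beyond the
    mapped letters are ignored in this sketch)."""
    nfc = unicodedata.normalize("NFC", word.lower())
    return "".join(EXPANSIONS.get(ch, ch) for ch in nfc)

# "ä" sorts between words starting with "ad" and words starting with "af":
assert sorted(["af", "ä", "ad"], key=phonebook_primary) == ["ad", "ä", "af"]
```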
+
+In the case of Unicode encoding, characters can often be represented either as
+pre-composed characters or in decomposed form. For example, the letter "à" can
+be represented in its decomposed (a+\`) and pre-composed (à) form. Most
+applications do not want to distinguish text by the way it is encoded. A search
+for "à" should find all instances of the letter, regardless of whether the
+instance is in pre-composed or decomposed form. Therefore, either form of the
+letter must result in the same sort ordering. The architecture of the ICU
+Collation Service supports this.
+
+## Contractions Producing Expansions
+
+It is possible to have contractions that produce expansions.
+
+One example occurs in Japanese, where a vowel followed by the prolonged sound
+mark is treated as equivalent to the long vowel version:
+
+カアー <<< カアア and\
+キイー <<< キイイ
+
+> :point_right: **Note** Since ICU 2.0, the Japanese tailoring uses
+> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
+> instead of contractions that produce expansions.
+
+## Normalization
+
+In the section on expansions, we discussed that text in Unicode can often be
+represented in either pre-composed or decomposed forms. There are other types of
+equivalences possible with Unicode, including Canonical and Compatibility. The
+process of
+Normalization ensures that text is written in a predictable way so that searches
+are not made unnecessarily complicated by having to match on equivalences. Not
+all text is normalized, however, so it is useful to have a collation service
+that can address text that is not normalized, but do so with efficiency.
+
+The ICU Collation Service handles un-normalized text properly, producing the
+same results as if the text were normalized.
+
+In practice, most data that is encountered is in normalized or semi-normalized
+form already. The ICU Collation Service is designed so that it can process a
+wide range of normalized or un-normalized text without a need for normalization
+processing. When a case is encountered that requires normalization, the ICU
+Collation Service drops into code specific to this purpose. This maximizes
+performance for the majority of text that does not require normalization.
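This check-then-normalize strategy is easy to model with Python's `unicodedata` module (`is_normalized` requires Python 3.8+); ICU's FCD check is more refined, but the shape is the same:

```python
import unicodedata

def nfd_if_needed(s):
    """Normalize to NFD only when a quick check says it is necessary."""
    if unicodedata.is_normalized("NFD", s):
        return s                       # common case: no work at all
    return unicodedata.normalize("NFD", s)

composed = "\u00e0"      # "à", precomposed
decomposed = "a\u0300"   # "a" + combining grave accent
assert nfd_if_needed(composed) == nfd_if_needed(decomposed)
```

Either encoding of the letter yields the same normalized form, so both produce the same ordering, while already-normalized text (the majority) skips the conversion entirely.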
+
+In addition, if the text is known with certainty not to contain un-normalized
+text, then even the overhead of checking for normalization can be eliminated.
+The ICU Collation Service has the ability to turn Normalization Checking either
+on or off. If Normalization Checking is turned off, it is the user's
+responsibility to ensure that all text is already in the appropriate form. This
+is true in a great majority of the world languages, so normalization checking is
+turned off by default for most locales.
+
+If the text requires normalization processing, Normalization Checking should be
+on. Any language that uses multiple combining characters such as Arabic, ancient
+Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
+to be on, or the text to go through a normalization process before collation.
+
+For more information about Normalization related reordering please see
+[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
+[UAX #15.](http://www.unicode.org/reports/tr15/)
+
+> :point_right: **Note** ICU supports two modes of normalization: on and off.
+> The `java.text.*` classes also offer a compatibility decomposition mode, which
+> is not supported in ICU.
+
+## Ignoring Punctuation
+
+In some cases, punctuation can be ignored while searching or sorting data. For
+example, this enables a search for "biweekly" to also return instances of
+"bi-weekly". In other cases, it is desirable for punctuated text to be
+distinguished from text without punctuation, but to have the text sort close
+together.
+
+These two behaviors can be accomplished if there is a way for a character to be
+ignored on all levels except for the quaternary level. If this is the case, then
+two strings which compare as identical on the first three levels (base letter,
+accents, and case) are then distinguished at the fourth level based on their
+punctuation (if any). If the comparison function ignores differences at the
+fourth level, then strings that differ by punctuation only are compared as
+equal.
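A toy model of this behavior: punctuation contributes nothing to the first three levels and reappears only when a quaternary level is added (the level structure is simplified; real keys use collation weights):

```python
def punct_key(s, strength=3):
    """Simplified key: letters (primary) and case bits (tertiary); with
    strength >= 4 the raw string is appended as a quaternary level so
    punctuation finally participates."""
    letters = tuple(c.lower() for c in s if c.isalpha())
    cases = tuple(c.isupper() for c in s if c.isalpha())
    key = [letters, cases]
    if strength >= 4:
        key.append(s)
    return tuple(key)

# punctuation-only differences vanish at tertiary strength...
assert punct_key("bi-weekly") == punct_key("biweekly")
# ...but are distinguished when the quaternary level is included
assert punct_key("bi-weekly", strength=4) != punct_key("biweekly", strength=4)
```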
+
+The following table shows the results of sorting a list of terms in three
+different ways. In the first column, punctuation characters (space " ", and
+hyphen "-") are not ignored (" " < "-" < "b"). In the second column, punctuation
+characters are ignored in the first three levels and compared only at the
+fourth level. In the third column, punctuation characters are ignored in the
+first three levels and the fourth level is not considered, so punctuated terms
+are equivalent to the identical terms without punctuation (shown in bold).
+
+For more options and details see the [“Ignore Punctuation”
+Options](customization/ignorepunct.md) page.
+
+Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
+------------- | --------------------------------- | -------------------------------
+black bird | black bird | **black bird**
+black Bird | black-bird | **black-bird**
+black birds | blackbird | **blackbird**
+black-bird | black Bird | black Bird
+black-Bird | black-Bird | black-Bird
+black-birds | blackBird | blackBird
+blackbird | black birds | black birds
+blackBird | black-birds | black-birds
+blackbirds | blackbirds | blackbirds
+
+> :point_right: **Note** The strings shown in bold in the last column are
+> compared as equal by the ICU Collator.\
+> Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
+> follow shifted code points will be completely ignored. This means that an accent
+> following a space will compare as if it were a space alone.
+
+## Case Ordering
+
+The tertiary level is used to distinguish text by case, by small versus large
+Kana, and other letter variants as noted above.
+
+Some applications prefer to emphasize case differences so that words starting
+with the same case sort together. Some Japanese applications require the
+difference between small and large Kana be emphasized over other tertiary
+differences.
+
+The UCA does not provide a means to separate out either case or Kana differences
+from the remaining tertiary differences. However, the ICU Collation Service has
+two options that help customize case and/or Kana differences. Both options are
+turned off by default.
+
+### CaseFirst
+
+The Case-first option makes case the most significant part of the tertiary
+level. Primary and secondary levels are unaffected. With this option, words
+starting with the same case sort together. The Case-first option can be set to
+make either lowercase sort before
+uppercase or uppercase sort before lowercase.
+
+Note: The case-first option does not constitute a separate level; it is simply a
+reordering of the tertiary level.
+
+ICU makes use of the following three case categories for sorting:
+
+1. uppercase: "ABC"
+
+2. mixed case: "Abc", "aBc"
+
+3. normal (lowercase or no case): "abc", "123"
+
+Mixed case is always sorted between uppercase and normal case when the
+"case-first" option is set.
+
+### CaseLevel
+
+The Case Level option makes a separate level for case differences. This is an
+extra level positioned between secondary and tertiary. The case level is used in
+Japanese to make the difference between small and large Kana more important than
+the other tertiary differences. It also can be used to ignore other tertiary
+differences, or even secondary differences. This is especially useful in
+matching. For example, if the strength is set to primary only (level-1) and the
+case level is turned on, the comparison ignores accents and tertiary differences
+except for case. The contents of the case level are affected by the case-first
+option.
+
+The case level is independent from the strength of comparison. It is possible to
+have a collator set to primary strength with the case level turned on. This
+provides for comparison that takes into account the case differences, while at
+the same time ignoring accents and tertiary differences other than case. This
+may be used in searching.
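The searching configuration just described, primary strength with the case level turned on, can be sketched with a toy key (plain Python, not ICU's implementation): accents and tertiary differences are stripped from the primary part, while per-letter case bits survive as a separate level.

```python
import unicodedata

def primary_plus_case_key(word):
    # Toy sketch (not ICU): primary strength with the case level on.
    # Primary part: case and accents removed; case part: per-letter case bits.
    decomposed = unicodedata.normalize("NFD", word.casefold())
    primary = "".join(c for c in decomposed if not unicodedata.combining(c))
    case_bits = tuple(c.isupper() for c in word if c.isalpha())
    return (primary, case_bits)

# "Äpfel" vs "apfel": the accent difference is ignored (equal primary parts),
# but the case difference is still visible in the full key.
print(primary_plus_case_key("Äpfel")[0] == primary_plus_case_key("apfel")[0])  # True
print(primary_plus_case_key("Äpfel") == primary_plus_case_key("apfel"))        # False
```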
+
+Example:
+
+**Case-first off, Case level off**
+
+apple\
+ⓐⓟⓟⓛⓔ\
+Abernathy\
+ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
+ähnlich\
+Ähnlichkeit
+
+**Lowercase-first, Case level off**
+
+apple\
+ⓐⓟⓟⓛⓔ\
+ähnlich\
+Abernathy\
+ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
+Ähnlichkeit
+
+**Uppercase-first, Case level off**
+
+Abernathy\
+ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
+Ähnlichkeit\
+apple\
+ⓐⓟⓟⓛⓔ\
+ähnlich
+
+**Lowercase-first, Case level on**
+
+apple\
+Abernathy\
+ⓐⓟⓟⓛⓔ\
+ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
+ähnlich\
+Ähnlichkeit
+
+**Uppercase-first, Case level on**
+
+Abernathy\
+apple\
+ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
+ⓐⓟⓟⓛⓔ\
+Ähnlichkeit\
+ähnlich
+
+## Script Reordering
+
+Script reordering allows scripts and some other groups of characters to be moved
+relative to each other. This reordering is done on top of the DUCET/CLDR
+standard collation order. Reordering can specify groups to be placed at the
+start and/or the end of the collation order.
+
+By default, reordering codes specified for the start of the order are placed in
+the order given after several special non-script blocks. These special groups of
+characters are space, punctuation, symbol, currency, and digit. Script groups
+can be intermingled with these special non-script groups if those special groups
+are explicitly specified in the reordering.
+
+The special code `others` stands for any script that is not explicitly mentioned
+in the list. Anything that comes after `others` goes at the very end of the
+list, in the order given. For example, `[Grek, others, Latn]` results in an
+ordering that puts all scripts other than Greek and Latin between them.
+
+### Examples:
+
+Note: All examples below use the string equivalents for the scripts and reorder
+codes that would be used in collator rules. The script and reorder code
+constants that would be used in API calls will be different.
+
+**Example 1:**\
+set reorder code - `[Grek]`\
+result - `[space, punctuation, symbol, currency, digit, Grek, others]`
+
+**Example 2:**\
+set reorder code - `[Grek]`\
+result - `[space, punctuation, symbol, currency, digit, Grek, others]`
+
+followed by: set reorder code - `[Hani]`\
+result - `[space, punctuation, symbol, currency, digit, Hani, others]`
+
+That is, setting a reordering always modifies
+the DUCET/CLDR order, replacing whatever was previously set, rather than adding
+on to it. In order to cumulatively modify an ordering, you have to retrieve the
+existing ordering, modify it, and then set it.
+
+**Example 3:**\
+set reorder code - `[others, digit]`\
+result - `[space, punctuation, symbol, currency, others, digit]`
+
+**Example 4:**\
+set reorder code - `[space, Grek, punctuation]`\
+result - `[symbol, currency, digit, space, Grek, punctuation, others]`
+
+**Example 5:**\
+set reorder code - `[Grek, others, Hani]`\
+result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`
+
+**Example 6:**\
+set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
+result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`
+
+followed by:\
+set reorder code - `[NONE]`\
+result - DUCET/CLDR
+
+**Example 7:**\
+set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
+result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`
+
+followed by:\
+set reorder code - `[DEFAULT]`\
+result - original reordering for the locale which may or may not be DUCET/CLDR
+
+**Example 8:**\
+set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
+result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`
+
+followed by:\
+set reorder code - `[]`\
+result - original reordering for the locale which may or may not be DUCET/CLDR
+
+**Example 9:**\
+set reorder code - `[Hebr, Phnx]`\
+result - error
+
+Beginning with ICU 55, scripts only reorder together if they are primary-equal,
+for example Hiragana and Katakana.
+
+ICU 4.8-54:
+
+* Scripts were reordered in groups, each normally starting with a [Recommended
+ Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
+* Reorder codes moved as a group (were “equivalent”) if their scripts shared a
+ primary-weight lead byte.
+* For example, Hebr and Phnx were “equivalent” reordering codes and were
+ reordered together. Their order relative to each other could not be changed.
+* Only one code out of any given group could be reordered, not multiple codes
+  from the same group.
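The reordering semantics walked through in the examples above can be modeled with a short sketch (illustrative Python, not the ICU API; error cases such as Example 9 and per-locale `DEFAULT` orderings are not modeled):

```python
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]

def apply_reorder(codes):
    # Sketch of the reordering semantics described above.
    # "NONE" (or an empty list) restores the DUCET/CLDR order; special groups
    # not mentioned in the list stay in place ahead of the first listed code;
    # "others" stands for every script not explicitly listed.
    if not codes or codes == ["NONE"]:
        return SPECIAL_GROUPS + ["others"]
    order = list(codes) if "others" in codes else list(codes) + ["others"]
    leading = [g for g in SPECIAL_GROUPS if g not in order]
    return leading + order

print(apply_reorder(["Grek"]))
# ['space', 'punctuation', 'symbol', 'currency', 'digit', 'Grek', 'others']
print(apply_reorder(["space", "Grek", "punctuation"]))
# ['symbol', 'currency', 'digit', 'space', 'Grek', 'punctuation', 'others']
```

Note that each call starts from the DUCET/CLDR order, matching the non-cumulative behavior shown in Example 2.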
+
+## Sorting of Japanese Text (JIS X 4061)
+
+Japanese standard JIS X 4061 requires two changes to the collation procedures:
+special processing of Hiragana characters and (for performance reasons) prefix
+analysis of text.
+
+### Hiragana Processing
+
+The JIS X 4061 standard requires more levels than the UCA provides. To offer a
+conformant sorting order, ICU uses the quaternary level to distinguish between
+Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
+symbols on the quaternary level, thus causing Hiragana sequences to sort before
+the corresponding Katakana sequences.
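This can be imitated with a toy sort key (plain Python, not ICU's implementation): fold Hiragana to Katakana for the primary comparison, and break ties with a quaternary level on which Hiragana is smaller.

```python
def kana_key(s):
    # Toy sketch (not ICU): the primary level ignores the Hiragana/Katakana
    # distinction (Hiragana U+3041..U+3096 is folded to Katakana by adding
    # 0x60); the quaternary level then orders Hiragana (0) before Katakana (1).
    primary = "".join(chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in s)
    quaternary = tuple(0 if "ぁ" <= c <= "ゖ" else 1 for c in s)
    return (primary, quaternary)

print(sorted(["カナ", "かな", "かナ"], key=kana_key))
# ['かな', 'かナ', 'カナ']
```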
+
+### Prefix Analysis
+
+Another characteristic of sorting according to JIS X 4061 is a large number
+of contractions followed by expansions (see
+[Contractions Producing Expansions](#contractions-producing-expansions)).
+This causes all the Hiragana and Katakana codepoints to be treated as
+contractions, which reduces performance. The solution we adopted introduces the
+prefix concept which allows us to improve the performance of Japanese sorting.
+More about this can be found in the [customization
+chapter](customization/index.md).
+
+## Thai/Lao reordering
+
+UCA requires that certain Thai and Lao prevowels be reordered with a code point
+following them. This option is always on in the ICU implementation, as
+prescribed by the UCA.
+
+This rule takes effect when:
+
+1. A Thai vowel of the range \\U0E40-\\U0E44 precedes a Thai consonant of the
+   range \\U0E01-\\U0E2E, or
+
+2. A Lao vowel of the range \\U0EC0-\\U0EC4 precedes a Lao consonant of the
+   range \\U0E81-\\U0EAE.
+
+In these cases the vowel is placed after the consonant for collation purposes.
+
+> :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai
+> reordering. Java.text.\* classes allow tailorings to turn off reordering by
+> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
+> prevowels.
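The reordering itself is simple to sketch (illustrative Python; ICU applies this internally while producing collation elements, not as a string transform):

```python
def reorder_thai_prevowels(text):
    # Toy sketch of the rule above: a Thai prevowel (U+0E40..U+0E44) followed
    # by a Thai consonant (U+0E01..U+0E2E) is swapped with that consonant
    # before comparison. (The Lao rule is analogous.)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if "\u0E40" <= chars[i] <= "\u0E44" and "\u0E01" <= chars[i + 1] <= "\u0E2E":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2          # each prevowel pairs with exactly one consonant
        else:
            i += 1
    return "".join(chars)

# U+0E40 THAI CHARACTER SARA E + U+0E01 THAI CHARACTER KO KAI
print(reorder_thai_prevowels("\u0E40\u0E01") == "\u0E01\u0E40")  # True
```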
+
+## Space Padding
+
+In many database products, fields are padded with null. To get correct results,
+the input to a Collator should omit any superfluous trailing padding spaces. The
+problem arises with contractions, expansions, or normalization. Suppose that
+there are two fields, one containing "aed" and the other with "äd". German
+phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
+compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
+"aed". But if both fields are padded with spaces to a length of 3, then this
+will reverse the order, since the first will compare as if it were one character
+longer. In other words, when you start with strings 1 and 2
+
+1 | a | e | d |
+-- | -- | -- | -- | --
+2 | ä | d | *space* |
+
+they end up being compared on a primary level as if they were 1' and 2'
+
+1' | a | e | d |
+-- | -- | -- | -- | --
+2' | a | e | d | *space*
+
+Since 2' has an extra character (the extra space), it counts as having a primary
+difference when it shouldn't. The correct result occurs when the trailing
+padding spaces are removed, as in 1" and 2"
+
+1" | a | e | d
+-- | -- | -- | --
+2" | a | e | d
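A minimal sketch of the fix (plain Python; the `"ä" -> "ae"` replacement stands in for ICU's German phonebook primary-level expansion, it is not the ICU API):

```python
def phonebook_primary(field):
    # Toy primary-level key: strip trailing padding spaces first, then apply
    # the German phonebook expansion "ä" -> "ae" (stand-in for ICU's rules).
    return field.rstrip(" ").casefold().replace("ä", "ae")

# "äd" padded to length 3 vs "aed": without stripping, the padded field would
# carry a spurious extra character at the primary level.
print(phonebook_primary("äd ") == phonebook_primary("aed"))  # True
```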
+
+## Collator naming scheme
+
+***Starting with ICU 54, the following naming scheme and its API functions are
+deprecated.*** Use ucol_open() with language tag collation keywords instead (see
+[Collation API Details](api.md)). For example,
+ucol_open("de-u-co-phonebk-ka-shifted", &errorCode) for German Phonebook order
+with "ignore punctuation" mode.
+
+When collating or matching text, a number of attributes can be used to affect
+the desired result. The following describes the attributes, their values, their
+effects, their normal usage, and the string comparison performance and sort key
+length implications. It also includes single-letter abbreviations for both the
+attributes and their values. These abbreviations allow a 'short-form'
+specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
+can be used to specify that the desired options are: UCA version 4.0.0; ignore
+spaces, punctuation, and symbols; use Swedish linguistic conventions; compare
+case-insensitively.
+
+A number of attribute values are common across different attributes; these
+include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
+otherwise stated, the examples use the UCA alone with default settings.
+
+> :point_right: **Note** In order to achieve uniqueness, a collator name always
+> has the attribute abbreviations sorted.
+
+### Main References
+
+1. For a full list of supported locales in ICU, see [Locale
+   Explorer](http://demo.icu-project.org/icu-bin/locexp), which also contains
+   an on-line demo showing sorting for each locale. The demo allows you to try
+   different attribute values, to see how they affect sorting.
+
+2. To see tabular results for the UCA table itself, see the [Unicode Collation
+   Charts](http://www.unicode.org/charts/collation/).
+
+3. For the UCA specification, see [UTS #10: Unicode Collation
+   Algorithm](http://www.unicode.org/reports/tr10/).
+
+4. For more detail on the precise effects of these options, see [Collation
+   Customization](customization/index.md).
+
+#### Collator Naming Attributes
+
+Attribute | Abbreviation | Possible Values
+---------------------- | ------------ | ---------------
+Locale | L | \
+Script | Z | \