mirror of
https://github.com/unicode-org/icu.git
synced 2025-04-07 22:44:49 +00:00
ICU-1080 add Transliterator and UnicodeSet release notes
X-SVN-Rev: 6697
parent
e6919d7596
commit
dc74a1d2c0
1 changed files with 205 additions and 0 deletions
|
@@ -51,6 +51,10 @@
|
|||
license</a></li>
|
||||
|
||||
<li><a href="#NewsCollation">Collation Improvements</a></li>
|
||||
|
||||
<li><a href="#NewsTranslit">Transliterator Improvements</a></li>
|
||||
|
||||
<li><a href="#NewsUnicodeSet">UnicodeSet Improvements</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
|
||||
|
@@ -200,6 +204,207 @@
|
|||
"http://oss.software.ibm.com/icu/develop/collation/">collation design
|
||||
document</a>.</p>
|
||||
|
||||
<h3><a name="NewsTranslit">Transliterator Improvements</a></h3>
|
||||
|
||||
<p>The transliterator service has undergone an extensive overhaul,
|
||||
in both the rule-based engine and the built-in system rules.</p>
|
||||
|
||||
<ul>
|
||||
|
||||
<li><b>New or rewritten rules:</b> <tt>Any-Accents</tt>,
|
||||
<tt>Any-Publishing</tt>, <tt>Cyrillic-Latin</tt>*,
|
||||
<tt>Greek-Latin</tt>*, <tt>Greek-Latin/UNGEGN</tt> (aka
|
||||
<tt>el-Latin</tt>), <tt>Hiragana-Latin</tt>*, and
|
||||
<tt>Latin-Katakana</tt>*. New algorithmic rules include
|
||||
<tt>Any-Name</tt>*, the normalization rules <tt>Any-NFC</tt>,
|
||||
<tt>Any-NFKC</tt>, <tt>Any-NFD</tt>, and <tt>Any-NFKD</tt>, and the casing
|
||||
rules <tt>Any-Upper</tt>, <tt>Any-Lower</tt>, and
|
||||
<tt>Any-Title</tt>. <tt>Unicode-Hex</tt>* has been renamed
|
||||
<tt>Any-Hex</tt>*. <tt>Any-Remove</tt> deletes its input.
|
||||
[*<em>applies to reverse rule as well</em>]
|
||||
|
||||
<li><b>Indic script rules:</b> Transliterators between Indic
|
||||
scripts and from each script to and from Latin have been
|
||||
completely revised. Scripts included are Bengali, Devanagari,
|
||||
Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu.
|
||||
Taking Bengali as an example, transliterators <tt>Bengali-X</tt>
|
||||
and <tt>X-Bengali</tt> exist, where X is any of the other listed
|
||||
Indic scripts, or Latin.
|
||||
|
||||
<li><b>Deleted rules:</b> <tt>UnicodeName-UnicodeChar</tt> has
|
||||
been replaced by <tt>Any-Name</tt>*. <tt>Latin-Arabic</tt>* and
|
||||
<tt>Latin-Hebrew</tt>* have been removed until they can be
|
||||
rewritten. <tt>KeyboardEscape-Latin1</tt> has been replaced by
|
||||
<tt>Any-Accents</tt> and <tt>Any-Publishing</tt>.
|
||||
<tt>Latin-Kana</tt>* has been replaced by <tt>Latin-Katakana</tt>*
|
||||
and <tt>Latin-Hiragana</tt>*.
|
||||
[*<em>applies to reverse rule as well</em>]
|
||||
|
||||
<li><b>ID syntax changes:</b> Transliterator IDs ignore case and
|
||||
whitespace. They now have the standard form
|
||||
<em>[filter]source-target/variant</em>. The "<em>[filter]</em>"
|
||||
element is optional; if present, it limits the characters that the
|
||||
transliterator operates on. The "<em>source-</em>" element is
|
||||
optional; if omitted, it is taken to be <tt>Any</tt>. The
|
||||
"<em>/variant</em>" element is also optional; if present, it
|
||||
selects between different flavors of a related set of
|
||||
transliterators, for example, <tt>Greek-Latin</tt> and
|
||||
<tt>Greek-Latin/UNGEGN</tt>. The source, target, and variant
|
||||
specifiers are case-insensitive strings of the form
|
||||
<tt>/[_[:L:]][_[:L:][:N:]]*/</tt>.
|
||||
|
||||
<li><b>Locale support:</b> The source, target, or both may be
|
||||
locales. In this case the transliterator rules will be looked up
|
||||
in the system locale resource bundles. Rules are sought under
|
||||
three tags, listed below. The text after the underscore in each
|
||||
tag is always canonicalized to uppercase before lookup. <em>Note:
|
||||
The underscore is currently omitted from ICU4C tags, but will be
|
||||
restored when possible.</em>
|
||||
|
||||
<ul><li><tt>TransliterateTo_<em>SCRIPT</em></tt>:
|
||||
Unidirectional rules from the enclosing locale to another script
|
||||
or specifier.
|
||||
<li><tt>TransliterateFrom_<em>SCRIPT</em></tt>:
|
||||
Unidirectional rules from another script
|
||||
or specifier to the enclosing locale.
|
||||
<li><tt>Transliterate_<em>SCRIPT</em></tt>:
|
||||
Bidirectional rules, with the forward direction being To and
|
||||
the reverse direction being From.
|
||||
</ul>
|
||||
|
||||
Lookup proceeds in the following order:
|
||||
|
||||
<ul><li>In the dynamic registry: <em>source-target</em>
|
||||
<li>In the <em>source</em> locale:
|
||||
<tt>TransliterateTo_<em>TARGET</em></tt> then
|
||||
<tt>Transliterate_<em>TARGET</em></tt> (forward direction)
|
||||
<li>In the <em>target</em> locale:
|
||||
<tt>TransliterateFrom_<em>SOURCE</em></tt> then
|
||||
<tt>Transliterate_<em>SOURCE</em></tt> (reverse direction)
|
||||
</ul>
|
||||
|
||||
If either the source or target specifier is not a locale then the
|
||||
corresponding locale lookup is skipped. If either is a locale,
|
||||
then locale fallback from <tt>aa_BB_CCC</tt> to <tt>aa_BB</tt> to
|
||||
<tt>aa</tt> is performed (where <tt>aa</tt>, <tt>BB</tt>, and
|
||||
<tt>CCC</tt> are the locale language, country, and variant). The
|
||||
final fallback is from the specifier, whether it is a locale or
|
||||
not (e.g., script abbreviation), to the long script name
|
||||
associated with that specifier. If a tag lookup succeeds, the
|
||||
attached element should be a string array of <i>2n</i> items where
|
||||
<i>n</i> >= 1. Each pair of strings is a variant name and rule
|
||||
string. The variants are matched against the requested variant.
|
||||
If no variant is specified then the first variant is considered to
|
||||
match.
|
||||
|
||||
<li><b>Filters on compound IDs:</b> A filter on a compound
|
||||
transliterator can now be specified by giving a leading entry that
|
||||
contains a filter and no transliterator ID. For example,
|
||||
"<tt>[abc]; Latin-Katakana; Katakana-Hiragana</tt>" submits only
|
||||
the characters contained in the UnicodeSet <tt>[abc]</tt> to the
|
||||
compound transliterator <tt>Latin-Katakana;
|
||||
Katakana-Hiragana</tt>.
|
||||
|
||||
<li><b>Explicit reverse IDs:</b> Typically if a transliterator
|
||||
<tt>A-B</tt> is formed, and its inverse is requested, the system
|
||||
tries to create <tt>B-A</tt>. That is, the source and target are
|
||||
exchanged. In some cases, the user may wish a different
|
||||
transliterator to be considered the reverse. In order to do this,
|
||||
the reverse ID is specified in parentheses immediately following
|
||||
the ID. For example, "<tt>A-B (B-C)</tt>" is a transliterator
|
||||
<tt>A-B</tt> whose inverse is <tt>B-C</tt>. If the ID of the
|
||||
inverse is requested, "<tt>B-C (A-B)</tt>" is returned. The
|
||||
forward or reverse component may be empty, so "<tt>(B-C)</tt>" and
|
||||
"<tt>A-B()</tt>" are legal IDs with a <tt>Null</tt> transliterator
|
||||
for the forward and reverse direction, respectively. This is most
|
||||
useful in compounds where one element has no inverse or where a
|
||||
different inverse from the standard inverse is desired. For
|
||||
example, "<tt>Any-Lower(); Latin-Cyrillic</tt>".
|
||||
|
||||
<li><b>Quantifiers:</b> Transliterator rules may now contain
|
||||
quantifiers '<tt>*</tt>', '<tt>+</tt>', and '<tt>?</tt>'. These
|
||||
indicate zero or more, one or more, and zero or one matches,
|
||||
respectively. Quantifiers apply to the last element, be it a
|
||||
single character, a UnicodeSet, a segment definition, or a quote;
|
||||
the entire preceding element is repeated. Quantifiers are
|
||||
implemented as greedy, non-backtracking matchers, unlike their
|
||||
typical implementation in regular expressions. As a result,
|
||||
expressions that match in a traditional regular expression engine
|
||||
(e.g., Perl) may fail to match in a transliterator. For example, "[a-z]+ q >
|
||||
x;" will <em>not</em> match "abcq", since the '<tt>+</tt>'
|
||||
quantifier consumes all four characters, leaving nothing for the <tt>q</tt> to match.
|
||||
|
||||
<li><b>Dot character:</b> A new special character is recognized in
|
||||
rules, '<tt>.</tt>' (U+002E). This character matches any
|
||||
character in the set <tt>[^[:Zp:][:Zl:]\r\n$]</tt>. Note the
|
||||
trailing '<tt>$</tt>' in the set pattern, which indicates that the
|
||||
ETHER character is <em>not</em> matched by '<tt>.</tt>'.
|
||||
|
||||
<li><b>::ID blocks in rules:</b> Transliterator IDs may now be
|
||||
included in rule sets. These may occur in two locations: as one
|
||||
contiguous block before any other rules, and as one contiguous
|
||||
block after all rules. The effect of placing <tt>::ID</tt>s into
|
||||
a rule set is to enclose the rule-based transliterator within a
|
||||
compound transliterator containing the indicated IDs. The
|
||||
<tt>::ID</tt> syntax is exactly the same as the standard ID
|
||||
syntax, with the difference that each ID element is preceded by
|
||||
the special token "<tt>::</tt>".
|
||||
|
||||
<li><b>Segment definitions more flexible:</b> Segment definitions
|
||||
may be nested and are now unlimited in number. Prior to 2.0,
|
||||
segments could not be nested and were limited to nine ($1 to $9).
|
||||
|
||||
<li><b>Variable range pragma:</b> A new pragma is supported. This
|
||||
follows the syntax: <code>use variable range 0xE800 0xEFFF;</code>
|
||||
(Any two code points may be specified.) The code points are
|
||||
specified as decimal constants, octal constants with a leading
|
||||
'0', or hexadecimal constants with a leading "0x". The given
|
||||
range is used internally for stand-in characters during
|
||||
processing. The default range is <b>0xF000..0xF8FF</b>. If a
|
||||
rule set explicitly uses characters in the default variable range,
|
||||
a new range, not containing any characters in use in the rule set,
|
||||
must be specified. <em>Note:</em> This is the first of several
|
||||
planned pragmas.
|
||||
|
||||
<li><b>Factory method registration:</b> Factory methods (function
|
||||
pointers in ICU4C; functor objects in ICU4J) may be registered
|
||||
against transliterator IDs. This is generally more efficient than
|
||||
the registration of singleton prototypes, since no actual
|
||||
transliterator object need be created until the user requires one.
|
||||
See the <tt>registerFactory()</tt> method in
|
||||
<tt>Transliterator</tt>.
|
||||
|
||||
<li><b>Filtering semantics changed for subclasses:</b> Subclasses
|
||||
now need not concern themselves with filters. Instead, they may
|
||||
assume that all characters received by
|
||||
<tt>handleTransliterate()</tt> have already passed through the
|
||||
filter. This simplifies subclass code greatly.
|
||||
|
||||
</ul>
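<p>The greedy, non-backtracking quantifier behavior described in the list above can be sketched in a few lines. This is only an illustration, not ICU's implementation: the function <tt>match_no_backtrack</tt> and its representation of a rule key as a list of (character set, quantifier) pairs are hypothetical names chosen for this example. It is compared against Python's backtracking <tt>re</tt> module.</p>

```python
import re
import string


def match_no_backtrack(elements, text, start=0):
    """Match a sequence of (chars, quantifier) pairs greedily, never backtracking.

    `elements` is a hypothetical stand-in for a rule key: each entry is a
    string of allowed characters plus one of '', '*', '+', '?'.
    Returns the index just past the match, or None if the match fails.
    """
    pos = start
    for chars, quant in elements:
        # '*' and '+' may repeat without limit; '' and '?' match at most once.
        limit = None if quant in ("*", "+") else 1
        count = 0
        while (pos < len(text) and text[pos] in chars
               and (limit is None or count < limit)):
            pos += 1
            count += 1
        # '' and '+' require at least one match; a consumed character is
        # never given back to let a later element succeed.
        if count == 0 and quant in ("", "+"):
            return None
    return pos


LOWER = string.ascii_lowercase

# The key of the rule "[a-z]+ q > x;" from the text above.
rule_key = [(LOWER, "+"), ("q", "")]
print(match_no_backtrack(rule_key, "abcq"))      # None: '+' consumed the 'q'
print(re.match(r"[a-z]+q", "abcq") is not None)  # True: re backtracks

# Excluding 'q' from the set lets the same input match without backtracking.
narrower_key = [("abcdefghijklmnop", "+"), ("q", "")]
print(match_no_backtrack(narrower_key, "abcq"))  # 4
```

<p>One practical consequence, under the same assumptions: a rule author can make such a rule match by ensuring that the quantified set excludes whatever must follow it.</p>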
|
||||
|
||||
<h3><a name="NewsUnicodeSet">UnicodeSet Improvements</a></h3>
|
||||
|
||||
<ul>
|
||||
|
||||
<li><b><tt>[:Any:]</tt> set:</b> The set <tt>[:Any:]</tt> matches
|
||||
all Unicode code points, that is, U+0000..U+10FFFF.
|
||||
|
||||
<li><b><tt>\p{}</tt> syntax:</b> UnicodeSet now recognizes a
|
||||
Perl-style syntax for character properties. Any property designated
|
||||
as <tt>[:Foo:]</tt> may equivalently be designated
|
||||
<tt>\p{Foo}</tt>.
|
||||
|
||||
<li><b>Short, medium, and long property names:</b> In addition to
|
||||
the short property names, such as <tt>[:Ll:]</tt>, equivalent
|
||||
medium (e.g., <tt>[:gc=Ll:]</tt>) and long (e.g.,
|
||||
<tt>[:GeneralCategory=LowercaseLetter:]</tt>) forms are
|
||||
recognized. See the <a
|
||||
href="http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/unicodeset_properties.html">UnicodeSet
|
||||
Properties design document</a> for details. As of this release,
|
||||
general categories, numeric value, and script are supported.
|
||||
|
||||
</ul>
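<p>As a rough illustration of how the equivalent property spellings above relate to one another, the following sketch normalizes the <tt>[:Foo:]</tt> and <tt>\p{Foo}</tt> forms, and the short, medium, and long names, to a single (property, value) pair. It does not use ICU: the alias tables cover only the names quoted in the text, the helper names <tt>parse_property</tt> and <tt>contains</tt> are hypothetical, and case-insensitive name matching is an assumption made for this example.</p>

```python
import re
import unicodedata

# Hypothetical alias tables; they cover only the names used in the text above.
PROPERTY_ALIASES = {"gc": "gc", "generalcategory": "gc"}
VALUE_ALIASES = {"ll": "Ll", "lowercaseletter": "Ll"}


def parse_property(pattern):
    """Return a normalized (property, value) pair for [:X:] or \\p{X}."""
    m = re.fullmatch(r"\[:(.+):\]|\\p\{(.+)\}", pattern)
    if m is None:
        raise ValueError("not a property pattern: %r" % pattern)
    body = m.group(1) or m.group(2)
    name, sep, value = body.partition("=")
    if not sep:
        # A bare value such as "Ll" is treated as a general category.
        name, value = "gc", name
    # Assumption: names are compared case-insensitively.
    prop = PROPERTY_ALIASES[name.strip().lower()]
    val = VALUE_ALIASES[value.strip().lower()]
    return prop, val


def contains(pattern, ch):
    """Test whether the single character `ch` is in the set named by `pattern`."""
    _prop, val = parse_property(pattern)
    return unicodedata.category(ch) == val


forms = [
    "[:Ll:]",
    r"\p{Ll}",
    "[:gc=Ll:]",
    "[:GeneralCategory=LowercaseLetter:]",
]
print({parse_property(f) for f in forms})  # {('gc', 'Ll')}
print(contains("[:Ll:]", "a"))             # True
print(contains(r"\p{Ll}", "A"))            # False
```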
|
||||
|
||||
<h2><a name="WhatContain">What the International Components for Unicode
|
||||
Contain</a></h2>
|
||||
|
||||
|
|