ICU-21697 Convert ICU Site pages to markdown for Github Pages

See #1785
This commit is contained in:
Elango Cheran 2023-05-27 06:21:16 +00:00
parent de26ea8c6a
commit 5435007e6a
47 changed files with 5950 additions and 59 deletions

docs/demos/index.md Normal file

@@ -0,0 +1,74 @@
---
layout: default
title: Demos
nav_order: 350
description: Demos
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Demos
## ICU4C Demos
[List of ICU Demonstrations](https://icu4c-demos.unicode.org/icu-bin/icudemos)
## ICU4J Demos
### Server Side Demos
#### Web Demos
These demos run on the ICU server and are implemented as Java Servlets and JSP
pages.
* [Browse the Demos](http://demo.icu-project.org/icu4jweb/)
* [View Demo Source](https://github.com/unicode-org/icu-demos/tree/master/icu4jweb/)
### Client Side Demos
#### To build the client side samples:
1. Download the ICU4J source code (see [Source Code Setup](../devsetup/source)).
2. Run `ant jar` to build the ICU4J jar.
3. Run `ant jarDemos` to build the demos.
4. Run `cp icu4j.jar demos/out/lib` to copy the jar into the demo output directory.
5. Finally, run `java -jar demos/out/lib/icu4j-demos.jar` to launch the demos.
**CalendarApp** This demo compares two calendars against each other. Choose the
two calendar types, and the display language, from the pop-up menus. Navigate by
days using the < and > buttons, or by years using the << and >> buttons.
**Translit** This demonstration shows ICU Transliteration. The transliteration
mode chosen in the menu will be used as you type.
**HolidayCalendarDemo** This demo displays holidays from a certain locale,
localized into the display language of your choice. Navigate by days using the <
and > buttons, or by years using the << and >> buttons.
**RbnfDemo** This demo shows Rule Based Number Formatting. Please expand the
window to show the entire demo. A number may be entered in the top left corner,
or the navigation buttons may be used. The pop-up menus in the top right corner
will pick the rule and the variant used.
**DetectingViewer** Open a document using the Open file or Open URL menu items,
and this demo will statistically detect the file's probable character encoding.
Use the DetectedEncodings menu to see which encodings were detected.
*Note:* Due to security constraints, you must use the Downloadable Demo Jar in
order to use these demos with files on your local disk. The Java Web Start
application will not have permission to read local files.
---
### ICU Introduction Applets
#### About the Applets
This is a paper introducing ICU calendars, with live applets throughout the
text to demonstrate various features.
The paper is now archived; see <https://github.com/unicode-org/icu-demos/pull/5>

docs/design/index.md Normal file

@@ -0,0 +1,14 @@
---
layout: default
title: Design Docs
nav_order: 8000
has_children: true
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Design Docs

docs/design/props/ppucd.md Normal file

@@ -0,0 +1,333 @@
---
layout: default
title: Preparsed UCD
parent: Design Docs
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Preparsed UCD
## What
A text file with preparsed UCD ([Unicode Character
Database](http://www.unicode.org/ucd/)) data.
* Preparser script:
[tools/unicode/py/**preparseucd.py**](https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py)
* ppucd.txt output:
[icu4c/source/data/unidata/**ppucd.txt**](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt)
([raw text
version](https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/unidata/ppucd.txt))
* Parser for ppucd.txt:
[icu4c/source/tools/toolutil/**ppucd.h**](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.h)
&
[.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.cpp)
* genprops tool rewritten to use that:
[tools/unicode/c/**genprops**](https://github.com/unicode-org/icu/tree/master/tools/unicode/c/genprops)
## Syntax
```
# Preparsed UCD generated by ICU preparseucd.py
```
Only whole-line comments starting with `#` are allowed; there are no inline
comments.
```
ucd;10.0.0
```
Data lines start with a type keyword. Data fields are semicolon-separated. The
number of fields per line is highly variable.
The ucd line should be the first data line. It provides the Unicode version
number.
```
property;Binary;Alpha;Alphabetic
property;Enumerated;bc;Bidi_Class
```
Property lines define properties with a type and two or more aliases.
```
binary;N;No;F;False
binary;Y;Yes;T;True
value;bc;ON;Other_Neutral
```
Property value lines define the values of enumerated and catalog properties,
with the property short name and two or more aliases for each value.
There is only one shared definition of the values and aliases for binary
properties.
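Since parsers should accept any of these aliases, the property and value lines lend themselves to a simple lookup table. A minimal Python sketch (hypothetical; not the actual preparseucd.py or ppucd.cpp logic, and it ignores that value aliases are scoped per property):

```python
def build_alias_map(lines):
    """Map every listed alias to its short name (the first alias)."""
    aliases = {}
    for line in lines:
        fields = line.strip().split(";")
        if fields[0] == "property":
            names = fields[2:]   # property;Type;short;long;...
        elif fields[0] == "value":
            names = fields[2:]   # value;prop;short;long;...
        else:
            continue
        for name in names:
            aliases[name] = names[0]
    return aliases

aliases = build_alias_map([
    "property;Binary;Alpha;Alphabetic",
    "property;Enumerated;bc;Bidi_Class",
    "value;bc;ON;Other_Neutral",
])
```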
```
defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX
```
After the version, property, and property value lines, and before other data
lines, the defaults line defines default values for all code points
(corresponding to @missing data in the UCD). Any properties not mentioned here
default to null values according to their type, such as False or the empty
string.
The general syntax of this line is the same as for the following data lines:
1. Line type keyword.
2. Code point or start..end range (inclusive end).
3. Zero or more property values.
* Binary values are given by their property name alone if True ("Alpha"),
or with a minus sign prepended ("-Alpha").
* Other values are given as "pname=value" pairs, where pname is the
property name.
* In the ppucd.txt file, short names of properties and values are used,
but parsers should be prepared to accept any of the aliases according to
the earlier sections of the file.
* In the ppucd.txt file, properties are listed in sorted order, but this
is not required by the syntax.
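Thanks to this regular shape, a data line can be parsed with little more than `line.split(';')`. A hypothetical minimal sketch (not the toolutil ppucd.cpp parser):

```python
def parse_data_line(line):
    """Split a ppucd.txt data line into (type, start, end, properties)."""
    fields = line.strip().split(";")
    line_type = fields[0]
    start, _, end = fields[1].partition("..")
    props = {}
    for field in fields[2:]:
        if "=" in field:
            pname, _, value = field.partition("=")
            props[pname] = value          # pname=value pair
        elif field.startswith("-"):
            props[field[1:]] = False      # binary property, False
        else:
            props[field] = True           # binary property, True
    # A single code point has no "..end" part; reuse start as the end.
    return line_type, int(start, 16), int(end or start, 16), props

line_type, start, end, props = parse_data_line("cp;20001;nt=Nu;nv=7")
```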
```
block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS
# 20000..2A6D6 CJK Unified Ideographs Extension B
algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-
cp;20001;nt=Nu;nv=7
cp;20064;nt=Nu;nv=4
unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U
# No block
unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U
algnamesrange;AC00..D7A3;hangul
```
Block lines specify a Unicode Block and provide an opportunity for compact data
lines for ranges inside the block, by listing common property values once for
the whole block. Block properties override the defaults for cp and unassigned
lines with code point ranges inside the block. The file syntax and parser do not
require the presence of block lines.
cp lines provide the data for a code point or range. They override the
default+block properties. Properties that are not mentioned fall back to the
block, then to the defaults.
Unassigned lines (new in ICU 60 for Unicode 10) provide the data for an
unassigned code point or range (gc=Cn). They override only the default
properties, except for the blk=Block property (if the range is inside a block).
Properties that are not mentioned fall back to the defaults, except that the
blk=Block property applies to unassigned lines as well.
A range is considered inside a block if it is fully inside the range of the last
defined block. Otherwise it is considered outside a block and falls back only to
the defaults. This is the case even if the range is inside an earlier block, to
simplify parsing & processing (such data lines should be avoided).
A range inside the block for which there is no data line inherits all of the
default+block properties (see Han blocks). Note that this is very different from
the behavior of an unassigned line, in particular since such blocks typically
default to gc!=Cn.
Non-default properties for unassigned ranges inside and outside of blocks are
typically for [complex
defaults](http://www.unicode.org/reports/tr44/#Default_Values_Table) and for
noncharacters.
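The override chain described above can be illustrated with plain dictionaries; a hypothetical sketch of the semantics, not of the actual parser:

```python
# All-Unicode defaults (subset), then a block line, then a cp line.
defaults = {"gc": "Cn", "ea": "N", "lb": "XX", "sc": "Zzzz"}
block = {"gc": "Lo", "ea": "W", "lb": "ID", "blk": "CJK_Ext_B"}
cp_line = {"nt": "Nu", "nv": "7"}

# cp lines fall back to the block, then to the defaults.
effective_cp = {**defaults, **block, **cp_line}

# unassigned lines fall back only to the defaults, except that
# blk still applies when the range is inside the block.
unassigned_line = {"ea": "W", "lb": "ID"}
effective_unassigned = {**defaults, **unassigned_line, "blk": block["blk"]}
```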
ppucd.txt data lines are in code point order, although parsers should not
strictly require this.
Assigned characters normally have their unique na=Name property value. For
Hangul syllables with their algorithmically computed names, the entire range is
covered by the line `algnamesrange;AC00..D7A3;hangul`. For ranges of ideographic
characters, a line like `algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-`
provides a Name prefix which is to be followed by the code point (in
hexadecimal, as with `%04lX`).
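Composing such an algorithmic name is then just the prefix followed by the code point in hexadecimal; a small hypothetical sketch:

```python
# From a line like: algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-
prefix = "CJK UNIFIED IDEOGRAPH-"
cp = 0x20001
name = "%s%04X" % (prefix, cp)   # %04X mirrors the %04lX-style formatting
```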
## Why not UCD .txt files?
See [UAX #44 "Unicode Character Database"](http://www.unicode.org/reports/tr44/)
Nontrivial parsing:
* The UCD has grown from a couple of semicolon-delimited files plus an
informative "Property dump" (early PropList.txt) to a collection of dozens
of files with a variety of (now more regular) formats.
* Related properties are scattered over several files.
* Full information for Numeric_Value and Numeric_Type requires parsing two
files.
* Default values are "hidden" in comments.
* The UCD folder structure (which file where) has changed over time.
* UCD filenames change during each Unicode beta period. (A detailed version
number is inserted into each filename.)
* Many files are bloated with comments that show the General Category and name
of each character or range start/end; if the data were combined into a
single file, then all properties for a character or range would be listed
together, without need for such comments.
Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires
adding data in many of the UCD files.
ICU already preprocesses some of the UCD .txt files. We strip comments from some
files (because they are huge) and in some files merge adjacent same-property
code points into ranges.
Some changes are manual, such as updating and adding ranges of algorithmic
character names.
Then we run several tools, most of them twice, to parse different sets of .txt
files and write several output files. We use several Python and shell scripts,
and a "log" (unidata/changes.txt) with details of what was changed and run in
each Unicode version upgrade.
Markus has done ICU Unicode updates since about 2002. Someone else might have a
hard time picking this up for maintenance and future Unicode version updates.
### Why not UCD XML files?
See [UAX #42 "Unicode Character Database in
XML"](http://www.unicode.org/reports/tr42/)
Good: The UCD XML file format stores all properties in a single file with a
relatively simple structure, with property values as XML attributes.
Issues:
* **Missing data** which is needed for ICU
* Name_Alias added in UCD 5.0 but missing in UCD XML as of UCD 6.1 beta.
* Script_Extensions added in UCD 6.0 but not "blessed" as a Unicode
property as of UCD 6.1. Useful, used in ICU, but not available in UCD
XML.
* Adopting UCD XML would require either still parsing some UCD .txt
files or writing another tool to merge more data into the XML.
* Dependency on third party
* Lag time between UCD .txt vs. XML availability during beta.
* Unable to fix/update/extend XML generator tools.
* For new properties, need to wait for standardization (UAX #42), tool
update, and XML publication.
* Will not support custom/nonstandard data.
* Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in
C++ (we have a "poor man's" XML parser), but not as easy as
`line.split(";")`.
* There is no need for complex structure for the UCD.
* Could be easier to read for humans: Defaults for all of Unicode are not
stored in one place; instead, each `<group>` carries them, making it hard to
see which values are specific to a group. "Fluffy" XML also makes for longer
text lines and more horizontal scrolling.
* Hard to diff: The XML format can be used in different ways, and Unicode
publishes different forms of the same data. Also, the precise XML text
depends on the XML formatting code used.
* For diffing, a special tool needs to be run, parse old & new XML data,
compare values and generate a diff report. Unicode publishes some of
those too.
* Some data still requires nontrivial parsing.
* For algorithmic character names, the range needs to be determined by
collecting a contiguous sequence of elements with a shared name pattern.
There is not even any special notation for the algorithmic names for
Hangul syllables.
* Minor: Unnecessary data (for ICU)
* Precomputed Hangul syllable names
* Irrelevant contributory properties like "Other_Xyz"
* Properties not used by ICU
* Minor, just awkward: Blocks are treated as auxiliary data, rather than as a
core means to organize and store the data. On the other hand, the "grouped"
XML files also use them as the basis for the `<group>` elements and associated
compaction. (The "flat" files don't.)
## Goals
* Single file with all data relevant for ICU.
* Very easy to parse and use the data in C/C++ tools.
* Easily human readable.
* Easy-to-read diffs from standard diff tools.
* Compact file format.
* Conversion tool easy to write, maintain, extend.
* Convert from UCD .txt files because those are maintained directly by the UTC
& editorial committee. No waiting for third party to convert the files.
* Able to extend for new kinds of data.
* Easy format for manual data fixes/additions (e.g., PUA or proposed/draft).
* Move much of the parsing from scattered C code into one Python script.
## Details
* All-Unicode defaults in one place, but only list non-null default values.
(`blk=No_Block, cf=<code point>, ...`)
* Line-oriented, always semicolon-separated, with type-of-line in the first
field.
* Block properties override defaults; this is used only for the few properties
that have common, non-default values throughout a block.
* Effective because blocks represent actual allocation & organization of
Unicode. Maintained by UTC.
* Code point/range properties override default+block properties.
* Algorithmic names stored as ranges with type & shared name prefixes (for
CJK).
* No gratuitous white space or syntax characters.
* Mostly key=value, simpler format for binary properties. Easy to read.
* Comment lines with headings from NamesList.txt further improve readability.
(There are few of them, so no significant size bloat.)
* Simple, stable file generation allows diffing.
* E.g., list properties in sorted order of property names.
* No need to implement/store properties that are not used in ICU. (But format
& tool are easy to extend.)
## Plan
* (done) Write Python tool to preparse UCD .txt files and generate one output
ppucd.txt file.
* (done) Subsume existing ucdcopy.py.
* (done) Write toolutil C++ parser for ppucd.txt, add ppucd.txt to the unidata
folder.
* (done) Merge genbidi, gencase, gennames, gennorm into genprops
* Replace scattered many-.txt parsers with calls to the toolutil ppucd.txt
parser.
* Generate all output files in one genprops invocation.
* Update makeprops.sh (delete half of it) & changes.txt.
* (done) Make preparseucd.py also parse uchar.h & uscript.h and write the
property names data header file. (was: ~~Change genpname/preparse.pl to read
ppucd.txt rather than Property\[Value\]Aliases.txt.~~)
* (done) Consider changing pnames_data.h so that minor changes don't change
most of the file contents.
* (done) Write wiki/Markus/ReviewTicket8972 with diff links.
* 2019-sep-27: The old Trac server is going away. I copied the wiki page
contents into a comment on
[ICU-8972](https://unicode-org.atlassian.net/browse/ICU-8972).
* Move UCD tests from cintltst to intltest, change to use the toolutil
ppucd.txt parser. ([ticket
#9041](https://unicode-org.atlassian.net/browse/ICU-9041))
* Change Java UCD tests to parse & use ppucd.txt. (ticket #9041)
* (partially done) Change Python preparser to not copy input UCD .txt files
any more, delete them from unidata & Java. (ticket #9041)
## Other tool improvements
**Bad**: Until **ICU 4.8**, the process is:
```
build & install ICU -> build Unicode tools -> run genpname ->
build & install ICU (now with updated property names) ->
build Unicode tools -> run UCD parsers ->
build & install ICU (now also with case properties & normalization etc.) ->
build Unicode tools -> run genuca -> build & install ICU
```
It should be possible to
1. merge the Unicode tools into one binary
2. parameterize the relevant properties code (property name lookup, case & some
other properties, NFC)
3. inject newly built data into the common library for the next part of the
merged Unicode tool's processing.
**ICU 49**:
```
build & install ICU -> build Unicode tools -> run genprops ->
build & install ICU (now with updated properties) ->
build Unicode tools -> run genuca -> build & install ICU
```
genprops builds the property (value) names data and injects it into the live
ppucd.txt parser for further processing.
**Goal**:
```
build & install ICU -> build Unicode tool -> run it ->
build & install ICU (now with all updated Unicode data)
```
Requires [ticket #9040](https://unicode-org.atlassian.net/browse/ICU-9040),
could be "hard".


@@ -0,0 +1,20 @@
---
layout: default
title: Data Structures
parent: Design Docs
has_children: true
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Data Structures
## Subpage Listing
* [ICU Code Point Tries](./utrie)
* [ICU String Tries](./tries/)
  * [BytesTrie](./tries/bytestrie/)
  * [UCharsTrie](./tries/ucharstrie)


@@ -0,0 +1,358 @@
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
* Copyright (C) 2010, International Business Machines
* Corporation and others. All Rights Reserved.
*******************************************************************************
* file name: bytetrie.h
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2010sep25
* created by: Markus W. Scherer
*/
#ifndef __BYTETRIE_H__
#define __BYTETRIE_H__
/**
* \file
* \brief C++ API: Dictionary trie for mapping arbitrary byte sequences
* to integer values.
*/
#include "unicode/utypes.h"
#include "unicode/uobject.h"
U_NAMESPACE_BEGIN
class ByteTrieBuilder;
class ByteTrieIterator;
/**
* Light-weight, non-const reader class for a ByteTrie.
* Traverses a byte-serialized data structure with minimal state,
* for mapping byte sequences to non-negative integer values.
*/
class /*U_COMMON_API*/ ByteTrie : public UMemory {
public:
ByteTrie(const void *trieBytes)
: bytes(reinterpret_cast<const uint8_t *>(trieBytes)),
pos(bytes), remainingMatchLength(-1), value(0) {}
ByteTrie &reset() {
pos=bytes;
remainingMatchLength=-1;
return *this;
}
/**
* Traverses the trie from the current state for this input byte.
* @return TRUE if the byte continues a matching byte sequence.
*/
UBool next(int inByte);
/**
* @return TRUE if the trie contains the byte sequence so far.
* In this case, an immediately following call to getValue()
* returns the byte sequence's value.
*/
UBool contains();
/**
* Traverses the trie from the current state for this byte sequence,
* calls next(b) for each byte b in the sequence,
* and calls contains() at the end.
*/
UBool containsNext(const char *s, int32_t length);
/**
* Returns a byte sequence's value if called immediately after contains()
* returned TRUE. Otherwise undefined.
*/
int32_t getValue() const { return value; }
// TODO: For startsWith() functionality, add
// UBool getRemainder(ByteSink *remainingBytes, &value);
// Returns TRUE if exactly one byte sequence can be reached from the current iterator state.
// The remainingBytes sink will receive the remaining bytes of that one sequence.
// It might receive some bytes even when the function returns FALSE.
private:
friend class ByteTrieBuilder;
friend class ByteTrieIterator;
inline void stop() {
pos=NULL;
}
// Reads a compact 32-bit integer and post-increments pos.
// pos is already after the leadByte.
// Returns TRUE if the integer is a final value.
inline UBool readCompactInt(int32_t leadByte);
inline UBool readCompactInt() {
int32_t leadByte=*pos++;
return readCompactInt(leadByte);
}
// pos is on the leadByte.
inline void skipCompactInt(int32_t leadByte);
inline void skipCompactInt() { skipCompactInt(*pos); }
// Reads a fixed-width integer and post-increments pos.
inline int32_t readFixedInt(int32_t bytesPerValue);
// Node lead byte values.
// 0..3: Branch node with one comparison byte, 1..4 bytes for less-than jump delta,
// and compact int for equality.
// 04..0b: Branch node with a list of 2..9 bytes comparison bytes, each except last one
// followed by compact int as final value or jump delta.
static const int32_t kMinListBranch=4;
// 0c..1f: Node with 1..20 bytes to match.
static const int32_t kMinLinearMatch=0xc;
// 20..ff: Intermediate value or jump delta, or final value, with 0..4 bytes following.
static const int32_t kMinValueLead=0x20;
// It is a final value if bit 0 is set.
static const int32_t kValueIsFinal=1;
// Compact int: After testing bit 0, shift right by 1 and then use the following thresholds.
static const int32_t kMinOneByteLead=0x10;
static const int32_t kMinTwoByteLead=0x51;
static const int32_t kMinThreeByteLead=0x6d;
static const int32_t kFourByteLead=0x7e;
static const int32_t kFiveByteLead=0x7f;
static const int32_t kMaxOneByteValue=0x40; // At least 6 bits in the first byte.
static const int32_t kMaxTwoByteValue=0x1bff;
static const int32_t kMaxThreeByteValue=0x11ffff; // A little more than Unicode code points.
static const int32_t kMaxListBranchLength=kMinLinearMatch-kMinListBranch+1; // 9
static const int32_t kMaxLinearMatchLength=kMinValueLead-kMinLinearMatch; // 20
// Map a shifted-right compact-int lead byte to its number of bytes.
static const int8_t bytesPerLead[kFiveByteLead+1];
// Fixed value referencing the ByteTrie bytes.
const uint8_t *bytes;
// Iterator variables.
// Pointer to next trie byte to read. NULL if no more matches.
const uint8_t *pos;
// Remaining length of a linear-match node, minus 1. Negative if not in such a node.
int32_t remainingMatchLength;
// Value for a match, after contains() returned TRUE.
int32_t value;
};
const int8_t ByteTrie::bytesPerLead[kFiveByteLead+1]={
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 5
};
UBool
ByteTrie::readCompactInt(int32_t leadByte) {
UBool isFinal=(UBool)(leadByte&kValueIsFinal);
leadByte>>=1;
int numBytes=bytesPerLead[leadByte]-1; // -1: lead byte was already consumed.
switch(numBytes) {
case 0:
value=leadByte-kMinOneByteLead;
break;
case 1:
value=((leadByte-kMinTwoByteLead)<<8)|*pos;
break;
case 2:
value=((leadByte-kMinThreeByteLead)<<16)|(pos[0]<<8)|pos[1];
break;
case 3:
value=(pos[0]<<16)|(pos[1]<<8)|pos[2];
break;
case 4:
value=(pos[0]<<24)|(pos[1]<<16)|(pos[2]<<8)|pos[3];
break;
}
pos+=numBytes;
return isFinal;
}
void
ByteTrie::skipCompactInt(int32_t leadByte) {
pos+=bytesPerLead[leadByte>>1];
}
int32_t
ByteTrie::readFixedInt(int32_t bytesPerValue) {
int32_t fixedInt;
switch(bytesPerValue) { // Actually number of bytes minus 1.
case 0:
fixedInt=*pos;
break;
case 1:
fixedInt=(pos[0]<<8)|pos[1];
break;
case 2:
fixedInt=(pos[0]<<16)|(pos[1]<<8)|pos[2];
break;
case 3:
fixedInt=(pos[0]<<24)|(pos[1]<<16)|(pos[2]<<8)|pos[3];
break;
}
pos+=bytesPerValue+1;
return fixedInt;
}
UBool
ByteTrie::next(int inByte) {
if(pos==NULL) {
return FALSE;
}
int32_t length=remainingMatchLength; // Actual remaining match length minus 1.
if(length>=0) {
// Remaining part of a linear-match node.
if(inByte==*pos) {
remainingMatchLength=length-1;
++pos;
return TRUE;
} else {
// No match.
stop();
return FALSE;
}
}
int32_t node=*pos++;
if(node>=kMinValueLead) {
if(node&kValueIsFinal) {
// No further matching bytes.
stop();
return FALSE;
} else {
// Skip intermediate value.
skipCompactInt(node);
// The next node must not also be a value node.
node=*pos++;
// TODO: U_ASSERT(node<kMinValueLead);
}
}
if(node<kMinLinearMatch) {
// Branch according to the current byte.
while(node<kMinListBranch) {
// Branching on a byte value,
// with a jump delta for less-than, a compact int for equals,
// and continuing for greater-than.
// The less-than and greater-than branches must lead to branch nodes again.
uint8_t trieByte=*pos++;
if(inByte<trieByte) {
int32_t delta=readFixedInt(node);
pos+=delta;
} else {
pos+=node+1; // Skip fixed-width integer.
node=*pos;
if(inByte==trieByte) {
// TODO: U_ASSERT(node>=kMinValueLead);
if(node&kValueIsFinal) {
// Leave the final value for contains() to read.
} else {
// Use the non-final value as the jump delta.
++pos;
readCompactInt(node);
pos+=value;
}
return TRUE;
} else { // inByte>trieByte
skipCompactInt(node);
}
}
node=*pos++;
// TODO: U_ASSERT(node<kMinLinearMatch);
}
// Branch node with a list of key-value pairs where
// values are compact integers: either final values or jump deltas.
// If the last key byte matches, just continue after it rather
// than jumping.
length=node-(kMinListBranch-1); // Actual list length minus 1.
for(;;) {
uint8_t trieByte=*pos++;
// U_ASSERT(length==0 || *pos>=kMinValueLead);
if(inByte==trieByte) {
if(length>0) {
node=*pos;
if(node&kValueIsFinal) {
// Leave the final value for contains() to read.
} else {
// Use the non-final value as the jump delta.
++pos;
readCompactInt(node);
pos+=value;
}
}
return TRUE;
}
if(inByte<trieByte || length--==0) {
stop();
return FALSE;
}
skipCompactInt();
}
} else {
// Match the first of length+1 bytes.
length=node-kMinLinearMatch; // Actual match length minus 1.
if(inByte==*pos) {
remainingMatchLength=length-1;
++pos;
return TRUE;
} else {
// No match.
stop();
return FALSE;
}
}
}
UBool
ByteTrie::contains() {
int32_t node;
if(pos!=NULL && remainingMatchLength<0 && (node=*pos)>=kMinValueLead) {
// Deliver value for the matching bytes.
++pos;
if(readCompactInt(node)) {
stop();
}
return TRUE;
}
return FALSE;
}
UBool
ByteTrie::containsNext(const char *s, int32_t length) {
if(length<0) {
// NUL-terminated
int b;
while((b=(uint8_t)*s++)!=0) {
if(!next(b)) {
return FALSE;
}
}
} else {
while(length>0) {
if(!next((uint8_t)*s++)) {
return FALSE;
}
--length;
}
}
return contains();
}
U_NAMESPACE_END
#endif // __BYTETRIE_H__


@@ -0,0 +1,536 @@
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
* Copyright (C) 2010, International Business Machines
* Corporation and others. All Rights Reserved.
*******************************************************************************
* file name: bytetriebuilder.h
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2010sep25
* created by: Markus W. Scherer
*
* Builder class for ByteTrie dictionary trie.
*/
#ifndef __BYTETRIEBUILDER_H__
#define __BYTETRIEBUILDER_H__
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "bytetrie.h"
#include "charstr.h"
#include "cmemory.h"
#include "uarrsort.h"
U_NAMESPACE_BEGIN
class ByteTrieElement;
class /*U_TOOLUTIL_API*/ ByteTrieBuilder : public UMemory {
public:
ByteTrieBuilder()
: elements(NULL), elementsCapacity(0), elementsLength(0),
bytes(NULL), bytesCapacity(0), bytesLength(0) {}
~ByteTrieBuilder();
ByteTrieBuilder &add(const StringPiece &s, int32_t value, UErrorCode &errorCode);
StringPiece build(UErrorCode &errorCode);
ByteTrieBuilder &clear() {
strings.clear();
elementsLength=0;
bytesLength=0;
return *this;
}
private:
void makeNode(int32_t start, int32_t limit, int32_t byteIndex);
void makeListBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length);
void makeThreeWayBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length);
UBool ensureCapacity(int32_t length);
void write(int32_t byte);
void write(const char *b, int32_t length);
void writeCompactInt(int32_t i, UBool final);
int32_t writeFixedInt(int32_t i); // Returns number of bytes.
CharString strings;
ByteTrieElement *elements;
int32_t elementsCapacity;
int32_t elementsLength;
// Byte serialization of the trie.
// Grows from the back: bytesLength measures from the end of the buffer!
char *bytes;
int32_t bytesCapacity;
int32_t bytesLength;
};
/*
* Note: This builder implementation stores (bytes, value) pairs with full copies
* of the byte sequences, until the ByteTrie is built.
* It might(!) take less memory if we collected the data in a temporary, dynamic trie.
*/
class ByteTrieElement : public UMemory {
public:
// Use compiler's default constructor, initializes nothing.
void setTo(const StringPiece &s, int32_t val, CharString &strings, UErrorCode &errorCode);
StringPiece getString(const CharString &strings) const {
int32_t offset=stringOffset;
int32_t length;
if(offset>=0) {
length=(uint8_t)strings[offset++];
} else {
offset=~offset;
length=((int32_t)(uint8_t)strings[offset]<<8)|(uint8_t)strings[offset+1];
offset+=2;
}
return StringPiece(strings.data()+offset, length);
}
int32_t getStringLength(const CharString &strings) const {
int32_t offset=stringOffset;
if(offset>=0) {
return (uint8_t)strings[offset];
} else {
offset=~offset;
return ((int32_t)(uint8_t)strings[offset]<<8)|(uint8_t)strings[offset+1];
}
}
char charAt(int32_t index, const CharString &strings) const { return data(strings)[index]; }
int32_t getValue() const { return value; }
int32_t compareStringTo(const ByteTrieElement &o, const CharString &strings) const;
private:
const char *data(const CharString &strings) const {
int32_t offset=stringOffset;
if(offset>=0) {
++offset;
} else {
offset=~offset+2;
}
return strings.data()+offset;
}
// If the stringOffset is non-negative, then the first strings byte contains
// the string length.
// If the stringOffset is negative, then the first two strings bytes contain
// the string length (big-endian), and the offset needs to be bit-inverted.
// (Compared with a stringLength field here, this saves 3 bytes per string for most strings.)
int32_t stringOffset;
int32_t value;
};
void
ByteTrieElement::setTo(const StringPiece &s, int32_t val,
CharString &strings, UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return;
}
int32_t length=s.length();
if(length>0xffff) {
// Too long: We store the length in 1 or 2 bytes.
errorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return;
}
int32_t offset=strings.length();
if(length>0xff) {
offset=~offset;
strings.append((char)(length>>8), errorCode);
}
strings.append((char)length, errorCode);
stringOffset=offset;
value=val;
strings.append(s, errorCode);
}
int32_t
ByteTrieElement::compareStringTo(const ByteTrieElement &other, const CharString &strings) const {
// TODO: add StringPiece.compareTo()
StringPiece thisString=getString(strings);
StringPiece otherString=other.getString(strings);
int32_t lengthDiff=thisString.length()-otherString.length();
int32_t commonLength;
if(lengthDiff<=0) {
commonLength=thisString.length();
} else {
commonLength=otherString.length();
}
int32_t diff=uprv_memcmp(thisString.data(), otherString.data(), commonLength);
return diff!=0 ? diff : lengthDiff;
}
ByteTrieBuilder::~ByteTrieBuilder() {
delete[] elements;
uprv_free(bytes);
}
ByteTrieBuilder &
ByteTrieBuilder::add(const StringPiece &s, int32_t value, UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return *this;
}
if(bytesLength>0) {
// Cannot add elements after building.
errorCode=U_NO_WRITE_PERMISSION;
return *this;
}
bytesCapacity+=s.length()+1; // Crude bytes preallocation estimate.
if(elementsLength==elementsCapacity) {
int32_t newCapacity;
if(elementsCapacity==0) {
newCapacity=1024;
} else {
newCapacity=4*elementsCapacity;
}
ByteTrieElement *newElements=new ByteTrieElement[newCapacity];
if(newElements==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
return *this;
}
if(elementsLength>0) {
uprv_memcpy(newElements, elements, elementsLength*sizeof(ByteTrieElement));
}
delete[] elements;
elements=newElements;
elementsCapacity=newCapacity;
}
elements[elementsLength++].setTo(s, value, strings, errorCode);
return *this;
}
U_CDECL_BEGIN
static int32_t U_CALLCONV
compareElementStrings(const void *context, const void *left, const void *right) {
const CharString *strings=reinterpret_cast<const CharString *>(context);
const ByteTrieElement *leftElement=reinterpret_cast<const ByteTrieElement *>(left);
const ByteTrieElement *rightElement=reinterpret_cast<const ByteTrieElement *>(right);
return leftElement->compareStringTo(*rightElement, *strings);
}
U_CDECL_END
StringPiece
ByteTrieBuilder::build(UErrorCode &errorCode) {
StringPiece result;
if(U_FAILURE(errorCode)) {
return result;
}
if(bytesLength>0) {
// Already built.
result.set(bytes+(bytesCapacity-bytesLength), bytesLength);
return result;
}
if(elementsLength==0) {
errorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return result;
}
uprv_sortArray(elements, elementsLength, (int32_t)sizeof(ByteTrieElement),
compareElementStrings, &strings,
FALSE, // need not be a stable sort
&errorCode);
if(U_FAILURE(errorCode)) {
return result;
}
// Duplicate strings are not allowed.
StringPiece prev=elements[0].getString(strings);
for(int32_t i=1; i<elementsLength; ++i) {
StringPiece current=elements[i].getString(strings);
if(prev==current) {
errorCode=U_ILLEGAL_ARGUMENT_ERROR;
return result;
}
prev=current;
}
// Create and byte-serialize the trie for the elements.
if(bytesCapacity<1024) {
bytesCapacity=1024;
}
bytes=reinterpret_cast<char *>(uprv_malloc(bytesCapacity));
if(bytes==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
return result;
}
makeNode(0, elementsLength, 0);
if(bytes==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
} else {
result.set(bytes+(bytesCapacity-bytesLength), bytesLength);
}
return result;
}
// Requires start<limit,
// and all strings of the [start..limit[ elements must be sorted and
// have a common prefix of length byteIndex.
void
ByteTrieBuilder::makeNode(int32_t start, int32_t limit, int32_t byteIndex) {
if(byteIndex==elements[start].getStringLength(strings)) {
// An intermediate or final value.
int32_t value=elements[start++].getValue();
UBool final= start==limit;
if(!final) {
makeNode(start, limit, byteIndex);
}
writeCompactInt(value, final);
return;
}
// Now all [start..limit[ strings are longer than byteIndex.
int32_t minByte=(uint8_t)elements[start].charAt(byteIndex, strings);
int32_t maxByte=(uint8_t)elements[limit-1].charAt(byteIndex, strings);
if(minByte==maxByte) {
// Linear-match node: All strings have the same character at byteIndex.
int32_t lastByteIndex=byteIndex;
int32_t length=0;
do {
++lastByteIndex;
++length;
} while(length<ByteTrie::kMaxLinearMatchLength &&
elements[start].getStringLength(strings)>lastByteIndex &&
elements[start].charAt(lastByteIndex, strings)==
elements[limit-1].charAt(lastByteIndex, strings));
makeNode(start, limit, lastByteIndex);
write(elements[start].getString(strings).data()+byteIndex, length);
write(ByteTrie::kMinLinearMatch+length-1);
return;
}
// Branch node.
int32_t length=0; // Number of different bytes at byteIndex.
int32_t i=start;
do {
char byte=elements[i++].charAt(byteIndex, strings);
while(i<limit && byte==elements[i].charAt(byteIndex, strings)) {
++i;
}
++length;
} while(i<limit);
// length>=2 because minByte!=maxByte.
if(length<=ByteTrie::kMaxListBranchLength) {
makeListBranchNode(start, limit, byteIndex, length);
} else {
makeThreeWayBranchNode(start, limit, byteIndex, length);
}
}
// start<limit && all strings longer than byteIndex &&
// 2..kMaxListBranchLength different bytes at byteIndex
void
ByteTrieBuilder::makeListBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length) {
// List of byte-value pairs where values are either final values
// or jumps to other parts of the trie.
int32_t starts[ByteTrie::kMaxListBranchLength-1];
UBool final[ByteTrie::kMaxListBranchLength-1];
// For each byte except the last one, find its elements array start and its value if final.
int32_t byteNumber=0;
do {
int32_t i=starts[byteNumber]=start;
char byte=elements[i++].charAt(byteIndex, strings);
while(byte==elements[i].charAt(byteIndex, strings)) {
++i;
}
final[byteNumber]= start==i-1 && byteIndex+1==elements[start].getStringLength(strings);
start=i;
} while(++byteNumber<length-1);
// byteNumber==length-1, and the maxByte elements range is [start..limit[
// Write the sub-nodes in reverse order: The jump lengths are deltas from
// after their own positions, so if we wrote the minByte sub-node first,
// then its jump delta would be larger.
// Instead we write the minByte sub-node last, for a shorter delta.
int32_t jumpTargets[ByteTrie::kMaxListBranchLength-1];
--byteNumber;  // byteNumber==length-2: the last sub-node before the maxByte one.
int32_t nextLimit=start;  // Elements limit of the current sub-node's range.
do {
if(!final[byteNumber]) {
makeNode(starts[byteNumber], nextLimit, byteIndex+1);
jumpTargets[byteNumber]=bytesLength;
}
nextLimit=starts[byteNumber];
} while(--byteNumber>=0);
// The maxByte sub-node is written as the very last one because we do
// not jump for it at all.
byteNumber=length-1;
makeNode(start, limit, byteIndex+1);
write(elements[start].charAt(byteIndex, strings));
// Write the rest of this node's byte-value pairs.
while(--byteNumber>=0) {
start=starts[byteNumber];
int32_t value;
if(final[byteNumber]) {
// Write the final value for the one string ending with this byte.
value=elements[start].getValue();
} else {
// Write the delta to the start position of the sub-node.
value=bytesLength-jumpTargets[byteNumber];
}
writeCompactInt(value, final[byteNumber]);
write(elements[start].charAt(byteIndex, strings));
}
// Write the node lead byte.
write(ByteTrie::kMinListBranch+length-2);
}
// start<limit && all strings longer than byteIndex &&
// at least three different bytes at byteIndex
void
ByteTrieBuilder::makeThreeWayBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length) {
// Three-way branch on the middle byte.
// Find the middle byte.
length/=2; // >=1
int32_t i=start;
do {
char byte=elements[i++].charAt(byteIndex, strings);
while(byte==elements[i].charAt(byteIndex, strings)) {
++i;
}
} while(--length>0);
// Encode the less-than branch first.
// Unlike in the list-branch node (see comments above) where
// all jumps are encoded in compact integers, in this node type the
// less-than jump is more efficient
// (because it is only ever a jump, with a known number of bytes)
// than the equals jump (where a jump needs to be distinguished from a final value).
makeNode(start, i, byteIndex);
int32_t leftNode=bytesLength;
// Find the elements range for the middle byte.
start=i;
char byte=elements[i++].charAt(byteIndex, strings);
while(byte==elements[i].charAt(byteIndex, strings)) {
++i;
}
// Encode the equals branch.
int32_t value;
UBool final;
if(start==i-1 && byteIndex+1==elements[start].getStringLength(strings)) {
// Store the final value for the one string ending with this byte.
value=elements[start].getValue();
final=TRUE;
} else {
// Store the start position of the sub-node.
makeNode(start, i, byteIndex+1);
value=bytesLength;
final=FALSE;
}
// Encode the greater-than branch last because we do not jump for it at all.
makeNode(i, limit, byteIndex);
// Write this node.
if(!final) {
value=bytesLength-value;
}
writeCompactInt(value, final); // equals
int32_t bytesForJump=writeFixedInt(bytesLength-leftNode); // less-than
write(byte);
write(bytesForJump-1);
}
UBool
ByteTrieBuilder::ensureCapacity(int32_t length) {
if(bytes==NULL) {
return FALSE; // previous memory allocation had failed
}
if(length>bytesCapacity) {
int32_t newCapacity=bytesCapacity;
do {
newCapacity*=2;
} while(newCapacity<=length);
char *newBytes=reinterpret_cast<char *>(uprv_malloc(newCapacity));
if(newBytes==NULL) {
// unable to allocate memory
uprv_free(bytes);
bytes=NULL;
return FALSE;
}
uprv_memcpy(newBytes+(newCapacity-bytesLength),
bytes+(bytesCapacity-bytesLength), bytesLength);
uprv_free(bytes);
bytes=newBytes;
bytesCapacity=newCapacity;
}
return TRUE;
}
void
ByteTrieBuilder::write(int32_t byte) {
int32_t newLength=bytesLength+1;
if(ensureCapacity(newLength)) {
bytesLength=newLength;
bytes[bytesCapacity-bytesLength]=(char)byte;
}
}
void
ByteTrieBuilder::write(const char *b, int32_t length) {
int32_t newLength=bytesLength+length;
if(ensureCapacity(newLength)) {
bytesLength=newLength;
uprv_memcpy(bytes+(bytesCapacity-bytesLength), b, length);
}
}
void
ByteTrieBuilder::writeCompactInt(int32_t i, UBool final) {
char intBytes[5];
int32_t length=1;
if(i<0 || i>0xffffff) {
intBytes[0]=(char)(ByteTrie::kFiveByteLead);
intBytes[1]=(char)(i>>24);
intBytes[2]=(char)(i>>16);
intBytes[3]=(char)(i>>8);
intBytes[4]=(char)(i);
length=5;
} else if(i<=ByteTrie::kMaxOneByteValue) {
intBytes[0]=(char)(ByteTrie::kMinOneByteLead+i);
} else {
if(i<=ByteTrie::kMaxTwoByteValue) {
intBytes[0]=(char)(ByteTrie::kMinTwoByteLead+(i>>8));
} else {
if(i<=ByteTrie::kMaxThreeByteValue) {
intBytes[0]=(char)(ByteTrie::kMinThreeByteLead+(i>>16));
} else {
intBytes[0]=(char)(ByteTrie::kFourByteLead);
intBytes[1]=(char)(i>>16);
length=2;
}
intBytes[length++]=(char)(i>>8);
}
intBytes[length++]=(char)(i);
}
intBytes[0]=(char)((intBytes[0]<<1)|final);
write(intBytes, length);
}
int32_t
ByteTrieBuilder::writeFixedInt(int32_t i) {
char intBytes[4];
int32_t length;
if(i<0 || i>0xffffff) {
intBytes[0]=(char)(i>>24);
intBytes[1]=(char)(i>>16);
intBytes[2]=(char)(i>>8);
length=3; // last byte below
} else {
if(i<=0xffff) {
length=0;
} else {
intBytes[0]=(char)(i>>16);
length=1;
}
if(i>0xff) {
intBytes[length++]=(char)(i>>8);
}
}
intBytes[length++]=(char)(i);
write(intBytes, length);
return length;
}
U_NAMESPACE_END
#endif // __BYTETRIEBUILDER_H__

// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
* Copyright (C) 2010, International Business Machines
* Corporation and others. All Rights Reserved.
*******************************************************************************
* file name: bytetriedemo.cpp
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2010nov05
* created by: Markus W. Scherer
*/
#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "bytetrie.h"
#include "bytetriebuilder.h"
#include "bytetrieiterator.h"
#include "denseranges.h"
#include "toolutil.h"
#define LENGTHOF(array) (int32_t)(sizeof(array)/sizeof((array)[0]))
static void
printBytes(const char *name, const StringPiece &bytes) {
printf("%18s [%3d]", name, (int)bytes.length());
for(int32_t i=0; i<bytes.length(); ++i) {
printf(" %02x", bytes.data()[i]&0xff); // TODO: Add StringPiece::operator[] const
}
puts("");
}
static void
printTrie(const StringPiece &bytes) {
IcuToolErrorCode errorCode("printTrie");
ByteTrieIterator iter(bytes.data(), errorCode);
while(iter.next(errorCode)) {
printf(" '%s': %d\n", iter.getString().data(), (int)iter.getValue());
}
}
static void printRanges(const int32_t ranges[][2], int32_t length) {
printf("ranges[%d]", (int)length);
for(int32_t i=0; i<length; ++i) {
printf(" [%ld..%ld]", (long)ranges[i][0], (long)ranges[i][1]);
}
puts("");
}
extern int main(int argc, char* argv[]) {
IcuToolErrorCode errorCode("bytetriedemo");
ByteTrieBuilder builder;
StringPiece sp=builder.add("", 0, errorCode).build(errorCode);
printBytes("empty string", sp);
ByteTrie empty(sp.data());
UBool contains=empty.contains();
printf("empty.next() %d %d\n", contains, (int)empty.getValue());
printTrie(sp);
sp=builder.clear().add("a", 1, errorCode).build(errorCode);
printBytes("a", sp);
ByteTrie a(sp.data());
contains=a.next('a') && a.contains();
printf("a.next(a) %d %d\n", contains, (int)a.getValue());
printTrie(sp);
sp=builder.clear().add("ab", -1, errorCode).build(errorCode);
printBytes("ab", sp);
ByteTrie ab(sp.data());
contains=ab.next('a') && ab.next('b') && ab.contains();
printf("ab.next(ab) %d %d\n", contains, (int)ab.getValue());
printTrie(sp);
sp=builder.clear().add("a", 1, errorCode).add("ab", 100, errorCode).build(errorCode);
printBytes("a+ab", sp);
ByteTrie a_ab(sp.data());
contains=a_ab.next('a') && a_ab.contains();
printf("a_ab.next(a) %d %d\n", contains, (int)a_ab.getValue());
contains=a_ab.next('b') && a_ab.contains();
printf("a_ab.next(b) %d %d\n", contains, (int)a_ab.getValue());
contains=a_ab.contains();
printf("a_ab.next() %d %d\n", contains, (int)a_ab.getValue());
printTrie(sp);
sp=builder.clear().add("a", 1, errorCode).add("b", 2, errorCode).add("c", 3, errorCode).build(errorCode);
printBytes("a+b+c", sp);
ByteTrie a_b_c(sp.data());
contains=a_b_c.next('a') && a_b_c.contains();
printf("a_b_c.next(a) %d %d\n", contains, (int)a_b_c.getValue());
contains=a_b_c.next('b') && a_b_c.contains();
printf("a_b_c.next(b) %d %d\n", contains, (int)a_b_c.getValue());
contains=a_b_c.reset().next('b') && a_b_c.contains();
printf("a_b_c.r.next(b) %d %d\n", contains, (int)a_b_c.getValue());
contains=a_b_c.reset().next('c') && a_b_c.contains();
printf("a_b_c.r.next(c) %d %d\n", contains, (int)a_b_c.getValue());
contains=a_b_c.reset().next('d') && a_b_c.contains();
printf("a_b_c.r.next(d) %d %d\n", contains, (int)a_b_c.getValue());
printTrie(sp);
builder.clear().add("a", 1, errorCode).add("b", 2, errorCode).add("c", 3, errorCode);
builder.add("d", 10, errorCode).add("e", 20, errorCode).add("f", 30, errorCode);
builder.add("g", 100, errorCode).add("h", 200, errorCode).add("i", 300, errorCode);
builder.add("j", 1000, errorCode).add("k", 2000, errorCode).add("l", 3000, errorCode);
sp=builder.build(errorCode);
printBytes("a-l", sp);
ByteTrie a_l(sp.data());
for(char c='`'; c<='m'; ++c) {
contains=a_l.reset().next(c) && a_l.contains();
printf("a_l.r.next(%c) %d %d\n", c, contains, (int)a_l.getValue());
}
printTrie(sp);
static const int32_t values[]={
-1, 0, 1, 2,
4, 5, 6, 7,
12, 13, 14,
24, 25, 26
};
int32_t ranges[3][2];
int32_t length;
length=uprv_makeDenseRanges(values, LENGTHOF(values), 1, ranges, LENGTHOF(ranges));
printRanges(ranges, length);
length=uprv_makeDenseRanges(values, LENGTHOF(values), 0xc0, ranges, LENGTHOF(ranges));
printRanges(ranges, length);
length=uprv_makeDenseRanges(values, LENGTHOF(values), 0xf0, ranges, LENGTHOF(ranges));
printRanges(ranges, length);
length=uprv_makeDenseRanges(values, LENGTHOF(values), 0x100, ranges, LENGTHOF(ranges));
printRanges(ranges, length);
return 0;
}

// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
* Copyright (C) 2010, International Business Machines
* Corporation and others. All Rights Reserved.
*******************************************************************************
* file name: bytetrieiterator.h
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2010nov03
* created by: Markus W. Scherer
*/
#ifndef __BYTETRIEITERATOR_H__
#define __BYTETRIEITERATOR_H__
/**
* \file
* \brief C++ API: ByteTrie iterator for all of its (byte sequence, value) pairs.
*/
// Needed if and when we change the .dat package index to a ByteTrie,
// so that icupkg can work with an input package.
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "bytetrie.h"
#include "charstr.h"
#include "uvectr32.h"
U_NAMESPACE_BEGIN
/**
* Iterator for all of the (byte sequence, value) pairs in a ByteTrie.
*/
class /*U_TOOLUTIL_API*/ ByteTrieIterator : public UMemory {
public:
ByteTrieIterator(const void *trieBytes, UErrorCode &errorCode)
: trie(trieBytes), value(0), stack(errorCode) {}
/**
* Finds the next (byte sequence, value) pair if there is one.
* @return TRUE if there is another element.
*/
UBool next(UErrorCode &errorCode);
/**
* @return TRUE if there are more elements.
*/
UBool hasNext() const { return trie.pos!=NULL || !stack.isEmpty(); }
/**
* @return the NUL-terminated byte sequence for the last successful next()
*/
const StringPiece &getString() const { return sp; }
/**
* @return the value for the last successful next()
*/
int32_t getValue() const { return value; }
private:
// The stack stores pairs of integers for backtracking to another
// outbound edge of a branch node.
// The first integer is an offset from ByteTrie.bytes.
// The second integer has the str.length() from before the node in bits 27..0,
// and the state in bits 31..28.
// Except for the following values for a three-way-branch node,
// the lower values indicate how many branches of a list-branch node
// are left to be visited.
static const int32_t kThreeWayBranchEquals=0xe;
static const int32_t kThreeWayBranchGreaterThan=0xf;
ByteTrie trie;
CharString str;
StringPiece sp;
int32_t value;
UVector32 stack;
};
UBool
ByteTrieIterator::next(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return FALSE;
}
if(trie.pos==NULL) {
if(stack.isEmpty()) {
return FALSE;
}
// Read the top of the stack and continue with the next outbound edge of
// the branch node.
// The last outbound edge causes the branch node to be popped off the stack
// and the iteration to continue from the trie.pos there.
int32_t stackSize=stack.size();
int32_t state=stack.elementAti(stackSize-1);
trie.pos=trie.bytes+stack.elementAti(stackSize-2);
str.truncate(state&0xfffffff);
state=(state>>28)&0xf;
if(state==kThreeWayBranchEquals) {
int32_t node=*trie.pos; // Known to be a three-way-branch node.
uint8_t trieByte=trie.pos[1];
trie.pos+=node+3; // Skip node, trie byte and fixed-width integer.
UBool isFinal=trie.readCompactInt();
// Rewrite the top of the stack for the greater-than branch.
stack.setElementAt((int32_t)(trie.pos-trie.bytes), stackSize-2);
stack.setElementAt((kThreeWayBranchGreaterThan<<28)|str.length(), stackSize-1);
str.append((char)trieByte, errorCode);
if(isFinal) {
value=trie.value;
trie.stop();
sp.set(str.data(), str.length());
return TRUE;
} else {
trie.pos+=trie.value;
}
} else if(state==kThreeWayBranchGreaterThan) {
// Pop the state.
stack.setSize(stackSize-2);
} else {
// Remainder of a list-branch node.
// Read the next key byte.
str.append((char)*trie.pos++, errorCode);
if(state>0) {
UBool isFinal=trie.readCompactInt();
// Rewrite the top of the stack for the next branch.
stack.setElementAt((int32_t)(trie.pos-trie.bytes), stackSize-2);
stack.setElementAt(((state-1)<<28)|(str.length()-1), stackSize-1);
if(isFinal) {
value=trie.value;
trie.stop();
sp.set(str.data(), str.length());
return TRUE;
} else {
trie.pos+=trie.value;
}
} else {
// Pop the state.
stack.setSize(stackSize-2);
}
}
}
for(;;) {
int32_t node=*trie.pos++;
if(node>=ByteTrie::kMinValueLead) {
// Deliver value for the byte sequence so far.
if(trie.readCompactInt(node)) {
value=trie.value;
trie.stop();
}
sp.set(str.data(), str.length());
return TRUE;
} else if(node<ByteTrie::kMinLinearMatch) {
// Branch node, needs to take the first outbound edge and push state for the rest.
if(node<ByteTrie::kMinListBranch) {
// Branching on a byte value,
// with a jump delta for less-than, a compact int for equals,
// and continuing for greater-than.
stack.addElement((int32_t)(trie.pos-1-trie.bytes), errorCode);
stack.addElement((kThreeWayBranchEquals<<28)|str.length(), errorCode);
// For the less-than branch, ignore the trie byte.
++trie.pos;
// Jump.
int32_t delta=trie.readFixedInt(node);
trie.pos+=delta;
} else {
// Branch node with a list of key-value pairs where
// values are compact integers: either final values or jump deltas.
int32_t length=node-ByteTrie::kMinListBranch; // Actual list length minus 2.
// Read the first (key, value) pair.
uint8_t trieByte=*trie.pos++;
UBool isFinal=trie.readCompactInt();
stack.addElement((int32_t)(trie.pos-trie.bytes), errorCode);
stack.addElement((length<<28)|str.length(), errorCode);
str.append((char)trieByte, errorCode);
if(isFinal) {
value=trie.value;
trie.stop();
sp.set(str.data(), str.length());
return TRUE;
} else {
trie.pos+=trie.value;
}
}
} else {
// Linear-match node, append length bytes to str.
int32_t length=node-ByteTrie::kMinLinearMatch+1;
str.append(reinterpret_cast<const char *>(trie.pos), length, errorCode);
trie.pos+=length;
}
}
}
U_NAMESPACE_END
#endif // __BYTETRIEITERATOR_H__

// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
* Copyright (C) 2010, International Business Machines
* Corporation and others. All Rights Reserved.
*******************************************************************************
* file name: denseranges.h
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2010sep25
* created by: Markus W. Scherer
*
* Helper code for finding a small number of dense ranges.
*/
#ifndef __DENSERANGES_H__
#define __DENSERANGES_H__
#include "unicode/utypes.h"
// Definitions in the anonymous namespace are invisible outside this file.
namespace {
/**
* Collect up to 15 range gaps, sorted by descending gap length.
*/
class LargestGaps {
public:
LargestGaps(int32_t max) : maxLength(max<=kCapacity ? max : kCapacity), length(0) {}
void add(int32_t gapStart, int64_t gapLength) {
int32_t i=length;
while(i>0 && gapLength>gapLengths[i-1]) {
--i;
}
if(i<maxLength) {
// The new gap is now one of the maxLength largest.
// Insert the new gap, moving up smaller ones of the previous
// length largest.
int32_t j= length<maxLength ? length++ : maxLength-1;
while(j>i) {
gapStarts[j]=gapStarts[j-1];
gapLengths[j]=gapLengths[j-1];
--j;
}
gapStarts[i]=gapStart;
gapLengths[i]=gapLength;
}
}
void truncate(int32_t newLength) {
if(newLength<length) {
length=newLength;
}
}
int32_t count() const { return length; }
int32_t gapStart(int32_t i) const { return gapStarts[i]; }
int64_t gapLength(int32_t i) const { return gapLengths[i]; }
int32_t firstAfter(int32_t value) const {
if(length==0) {
return -1;
}
int32_t minValue=0;
int32_t minIndex=-1;
for(int32_t i=0; i<length; ++i) {
if(value<gapStarts[i] && (minIndex<0 || gapStarts[i]<minValue)) {
minValue=gapStarts[i];
minIndex=i;
}
}
return minIndex;
}
private:
static const int32_t kCapacity=15;
int32_t maxLength;
int32_t length;
int32_t gapStarts[kCapacity];
int64_t gapLengths[kCapacity];
};
} // namespace
/**
* Does it make sense to write 1..capacity ranges?
* Returns 0 if not, otherwise the number of ranges.
* @param values Sorted array of signed-integer values.
* @param length Number of values.
* @param density Minimum average range density, in 256th. (0x100=100%=perfectly dense.)
* Should be 0x80..0x100, must be 1..0x100.
* @param ranges Output ranges array.
* @param capacity Maximum number of ranges.
* @return Minimum number of ranges (at most capacity) that have the desired density,
* or 0 if that density cannot be achieved.
*/
U_CAPI int32_t U_EXPORT2
uprv_makeDenseRanges(const int32_t values[], int32_t length,
int32_t density,
int32_t ranges[][2], int32_t capacity) {
if(length<=2) {
return 0;
}
int32_t minValue=values[0];
int32_t maxValue=values[length-1]; // Assume minValue<=maxValue.
// Use int64_t variables for intermediate-value precision and to avoid
// signed-int32_t overflow of maxValue-minValue.
int64_t maxLength=(int64_t)maxValue-(int64_t)minValue+1;
if(length>=(density*maxLength)/0x100) {
// Use one range.
ranges[0][0]=minValue;
ranges[0][1]=maxValue;
return 1;
}
if(length<=4) {
return 0;
}
// See if we can split [minValue, maxValue] into 2..capacity ranges,
// divided by the 1..(capacity-1) largest gaps.
LargestGaps gaps(capacity-1);
int32_t i;
int32_t expectedValue=minValue;
for(i=1; i<length; ++i) {
++expectedValue;
int32_t actualValue=values[i];
if(expectedValue!=actualValue) {
gaps.add(expectedValue, (int64_t)actualValue-(int64_t)expectedValue);
expectedValue=actualValue;
}
}
// We know gaps.count()>=1 because we have fewer values (length) than
// the length of the [minValue..maxValue] range (maxLength).
// (Otherwise we would have returned with the one range above.)
int32_t num;
for(i=0, num=2;; ++i, ++num) {
if(i>=gaps.count()) {
// The values are too sparse for capacity or fewer ranges
// of the requested density.
return 0;
}
maxLength-=gaps.gapLength(i);
if(length>num*2 && length>=(density*maxLength)/0x100) {
break;
}
}
// Use the num ranges with the num-1 largest gaps.
gaps.truncate(num-1);
ranges[0][0]=minValue;
for(i=0; i<=num-2; ++i) {
int32_t gapIndex=gaps.firstAfter(minValue);
int32_t gapStart=gaps.gapStart(gapIndex);
ranges[i][1]=gapStart-1;
ranges[i+1][0]=minValue=(int32_t)(gapStart+gaps.gapLength(gapIndex));
}
ranges[num-1][1]=maxValue;
return num;
}
#endif // __DENSERANGES_H__

---
layout: default
title: BytesTrie
parent: Data Structures
grand_parent: Design Docs
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# BytesTrie
This is an idea for a trie that is intended to be fairly simple but also fairly
efficient and versatile. It maps from arbitrary byte sequences to 32-bit
integers. (Small non-negative integers are stored more efficiently. Negative
integers are the least efficient.)
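The sample builder's `writeCompactInt` (in `bytetriebuilder.h` below) realizes this: a value takes 1 to 5 bytes, and the low bit of the lead byte doubles as the "final value" flag. Here is a self-contained sketch of that scheme; the lead-byte constants are made up for the example (the real ones live in `bytetrie.h` and differ), but the structure is the same:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative lead-byte constants -- not ICU's actual values. What matters
// is that every lead fits in 7 bits, leaving bit 0 of the lead byte free
// for the "final value" flag.
const int32_t kMinOneByteLead   = 0x01;
const int32_t kMaxOneByteValue  = 0x50;      // 0..0x50: one byte total
const int32_t kMinTwoByteLead   = 0x52;
const int32_t kMaxTwoByteValue  = 0x19ff;    // two bytes total
const int32_t kMinThreeByteLead = 0x6c;
const int32_t kMaxThreeByteValue= 0x11ffff;  // three bytes total
const int32_t kFourByteLead     = 0x7e;      // up to 0xffffff: four bytes
const int32_t kFiveByteLead     = 0x7f;      // negative or huge: five bytes

std::vector<uint8_t> encodeCompactInt(int32_t i, bool isFinal) {
    std::vector<uint8_t> out(1);  // placeholder for the lead byte
    int32_t lead;
    if (i < 0 || i > 0xffffff) {
        lead = kFiveByteLead;
        out.push_back((uint8_t)(i >> 24));
        out.push_back((uint8_t)(i >> 16));
        out.push_back((uint8_t)(i >> 8));
        out.push_back((uint8_t)i);
    } else if (i <= kMaxOneByteValue) {
        lead = kMinOneByteLead + i;  // value fits entirely in the lead byte
    } else if (i <= kMaxTwoByteValue) {
        lead = kMinTwoByteLead + (i >> 8);
        out.push_back((uint8_t)i);
    } else if (i <= kMaxThreeByteValue) {
        lead = kMinThreeByteLead + (i >> 16);
        out.push_back((uint8_t)(i >> 8));
        out.push_back((uint8_t)i);
    } else {
        lead = kFourByteLead;
        out.push_back((uint8_t)(i >> 16));
        out.push_back((uint8_t)(i >> 8));
        out.push_back((uint8_t)i);
    }
    out[0] = (uint8_t)((lead << 1) | (isFinal ? 1 : 0));
    return out;
}

int32_t decodeCompactInt(const std::vector<uint8_t> &b, bool &isFinal) {
    isFinal = (b[0] & 1) != 0;
    int32_t lead = b[0] >> 1;
    if (lead == kFiveByteLead) {
        return (int32_t)(((uint32_t)b[1] << 24) | ((uint32_t)b[2] << 16) |
                         ((uint32_t)b[3] << 8) | b[4]);
    }
    if (lead == kFourByteLead) {
        return (b[1] << 16) | (b[2] << 8) | b[3];
    }
    if (lead >= kMinThreeByteLead) {
        return ((lead - kMinThreeByteLead) << 16) | (b[1] << 8) | b[2];
    }
    if (lead >= kMinTwoByteLead) {
        return ((lead - kMinTwoByteLead) << 8) | b[1];
    }
    return lead - kMinOneByteLead;
}
```

One-byte encodings cover the common small values; negative values always pay the full five bytes, matching the note above that they are the least efficient.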
Input strings would be mapped to byte sequences. Invariant-character strings
could be used directly, if the trie was built for the appropriate charset
family, or we could map EBCDIC input to ASCII (while lowercasing for
case-insensitive matching).
For Thai DBBI, each of U+0E00..U+0EFF could be mapped to its low byte.
For CJK DBBI, we could use UTF-16BE or a slight variant of it. For general
Unicode strings (e.g., time zone names), we could devise a simple encoding that
maps printable ASCII to single bytes, Unihan & Hangul and some other ranges to
two bytes per character, and the rest to three bytes per character. (We could
also use this for CJK DBBI, to reduce the number of such "converters".) Or, we
use a [UCharsTrie](../ucharstrie.md) for those.
Sample code is linked below.
See the [UCharsTrie](../ucharstrie.md) sibling page for some more details. The
BytesTrie and UCharsTrie structures are nearly the same, except that the
UCharsTrie uses fewer, larger units.
See also the [diagram of a BytesTrie for a sample set of string-to-value
mappings](https://docs.google.com/drawings/edit?id=1-doZNpcByYItcDAcvKmIpwJMWFgXpYCm43GnUrbat3g).
## Design points
* The BytesTrie and UCharsTrie are designed to be
byte-serialized/UChar-serialized, for trivial platform swapping.
* Compact: Small values and jump deltas should be encoded in few bytes. This
requires variable-length encodings.
* The length of each value/delta is encoded either in a preceding node or in
its own lead unit. This makes skipping values efficient, and fewer units
need to be range-checked while reading variable-length values.
* Nodes with small values are encoded in single units.
* Linear-match nodes match a sequence of units without choice/selection.
* Branches
* Branches store relative deltas to "jump" to following nodes. Small
deltas are encoded in single units; encoding deltas is much more
efficient than encoding absolute offsets.
* Variable-width values make binary search on branch nodes infeasible.
Therefore, branches with lists of (key, value) pairs are limited to
short list lengths for linear search.
* For large branches, branch nodes contain one unit, for branching to the
left (less-than) or to the right (greater-or-equal). This encodes a
binary search into the data structure.
* Initially, I had an equals edge in split-branch sub-nodes as well,
but that slowed down matching significantly (9% in one case) without
noticeably helping with the serialized size (0.2% in that case).
* At the end of each node (except for a final-value node), matching
continues with the next node, rather than using another jump to a
different location.
* Each branch head node encodes the length of the branch (the number of
units to select from). The split-branch and list-branch sub-nodes do not
have node heads. Instead, the code tracks the remaining length of the
branch, halving it for each split-branch edge and counting down in a
list-branch sub-node.
* The maximum length of a list-branch sub-node is fixed, that is, part of
the serialized data format and cannot be changed compatibly. This
constant is used in the branching code to decide whether to split
less-than/greater-or-equal vs. walk a list of key-value pairs.
* This constant must be at least 3 so that split-branch sub-nodes have a
length of at least 4 so that the following list-branch nodes have a
length of at least 2 and can use a do-while loop rather than a while
loop. (Saving one length check.)
* I explored an alternative, with only split-branch nodes down to length 1
and then a final match unit with continuing matching after that. It was
fast but also significantly larger. A branch like this is about twice
the size of a key-value pair list. If the average list-branch length is
n, a branch has (length/n)-1 split-branch sub-nodes. This experiment
corresponds to n=1.
* API
* The API is simple and low-level. At the core, next(unit) "turns the
crank" and returns basically a 2-bit result that encodes matches() (this
unit continues a matching sequence), hasNext() (another unit can
continue a matching sequence) and hasValue() (the units so far are a
matching string).
* Higher-level functions that handle different input (e.g., normalize
units on the fly) and provide variations of functionality (e.g., longest
match, startsWith, find all matches from some point in text, ...) can be
built on top of the low-level functions without cluttering the API or
pulling in further dependencies.
* The next(unit) function stops on a value node rather than decoding the
value, saving time until the value is requested (via getValue()). The
following next(unit2) call will then skip over the value node.
* There is enough API to serve a variety of uses, including
matching/mapping whole strings, finding out if a prefix belongs only to
strings with the same value, getting all units that can continue from
some point, and getting all (string, value) pairs. This should be able
to support lookups, parsing with abbreviations, word segmentation, etc.
* The "fast" builder code is simple. The builder builds, it need not use a
trie structure until writing the serialized form, and it need not provide
any of the trie runtime API.
* There is builder code that makes a "small" trie, attempting to avoid writing
duplicate nodes. This is possible when whole trees of nodes are the same and
at least one is reached via a "jump" delta which can "jump" to the
previously written serialization of such a tree.
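The "2-bit result" of next(unit) described above can be sketched with a toy model: a sorted map stands in for the serialized trie, and next() returns a small enum whose two bits encode hasValue and hasNext. The enum names and the `ToyTrie` class are illustrative only, not ICU's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// 2-bit result of next(): bit 0 = "another unit can continue a match",
// bit 1 = "the units so far are a matching string (a value is available)".
enum Result {
    NO_MATCH = 0,
    NO_VALUE = 1,           // matches so far, no value here, can continue
    FINAL_VALUE = 2,        // value here, nothing can continue
    INTERMEDIATE_VALUE = 3  // value here, and longer matches exist
};

// Toy reference model of the matching contract; a real trie walks
// serialized nodes instead of a std::map.
class ToyTrie {
public:
    explicit ToyTrie(std::map<std::string, int32_t> m) : map_(std::move(m)) {}
    Result next(char b) {
        prefix_ += b;  // (a real trie would stay "dead" after NO_MATCH)
        bool hasValue = map_.count(prefix_) != 0;
        // hasNext: is there a key strictly longer than prefix_ extending it?
        std::map<std::string, int32_t>::const_iterator it =
            map_.upper_bound(prefix_);
        bool hasNext = it != map_.end() &&
                       it->first.compare(0, prefix_.size(), prefix_) == 0;
        if (!hasValue) { return hasNext ? NO_VALUE : NO_MATCH; }
        return hasNext ? INTERMEDIATE_VALUE : FINAL_VALUE;
    }
    // Only valid after next() returned FINAL_VALUE or INTERMEDIATE_VALUE.
    int32_t getValue() const { return map_.at(prefix_); }
    void reset() { prefix_.clear(); }
private:
    std::map<std::string, int32_t> map_;
    std::string prefix_;
};
```

With this encoding, hasValue is `result >= FINAL_VALUE` and hasNext is `result & 1`, so a matcher can branch on the two bits without decoding the value itself.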
## Sample Code
The following demo code was last updated Nov. 2010:
* [`bytetrie.h`](./bytetrie.h)
* [`bytetriebuilder.h`](./bytetriebuilder.h)
* [`bytetriedemo.cpp`](./bytetriedemo.cpp)
* [`bytetrieiterator.h`](./bytetrieiterator.h)
* [`denseranges.h`](./denseranges.h)
* [`genpname.cpp`](./genpname.cpp)
### Latest versions of source code
The latest versions of the above sample code (except for `bytetriedemo.cpp`) exist in the ICU repository, sometimes under slightly different names and reorganized:
* [icu4c/source/common/unicode/**bytestrie.h**](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/bytestrie.h)
* [icu4c/source/common/unicode/**bytestriebuilder.h**](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/bytestriebuilder.h)
* [icu4c/source/tools/toolutil/**denseranges.h**](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/toolutil/denseranges.h)
* [tools/unicode/c/genprops/**pnamesbuilder.cpp**](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/pnamesbuilder.cpp)

---
layout: default
title: ICU String Tries
parent: Data Structures
grand_parent: Design Docs
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU String Tries
We have several implementations of string tries that map strings to boolean or
integer values, currently for time zone name parsing and DBBI (dictionary-based
break iteration). Other areas might also benefit from tries: property names,
character names, UnicodeSetStringSpan, the .dat package file TOC.
We should have a small number of common map-from-string trie implementations:
fairly compact, fairly efficient, easily serializable, and well-tested.
See the subpages for ideas.
For a UnicodeSetStringSpan, we would want to find each next match starting from
some point in the text, rather than passing each unit of text and finding out if
the units so far match.
Note: In terms of whole-string-lookup performance, the fastest data structure is
a hash map. Where whole-string-lookup is the only relevant operation, we could
consider implementing an easily serialized hash map.
See also [ICU Code Point Tries](../utrie.md).
Implementations:
* [BytesTrie](./bytestrie/)
* [UCharsTrie](./ucharstrie)

---
layout: default
title: UCharsTrie
parent: Data Structures
grand_parent: Design Docs
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# UCharsTrie
Same design as a [BytesTrie](bytestrie/index.md), but mapping any UnicodeString
(any sequence of 16-bit units) to 32-bit integer values. This can use somewhat
simpler code because there are more bits to work with in each unit, and it is
probably more appropriate and faster than a BytesTrie for collation
contractions/prefixes, CJK dictionaries, and maybe for use with Unicode strings
in general when it is not known that we work with a small script or mostly with
ASCII.
The code and data structure are quite similar to the BytesTrie. In general,
larger units are used to store larger values and deltas in single units than
possible in a BytesTrie, and fewer variable-length units are needed in all
cases.
In addition, some of the bits of match-nodes (linear-match and branch nodes) are
used for intermediate values (small values or most significant bits), rather
than separate intermediate-value nodes in a BytesTrie. Larger intermediate
values have one or two units following the match node head, then followed by the
match node's contents.
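As a concrete illustration of folding an intermediate value into a match-node head, here is a hypothetical packing scheme. The constants and bit layout below are assumptions for illustration only, not the actual UCharsTrie encoding:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-bit node head: one node-type bit, a small
// intermediate value, and a linear-match length packed into a single
// unit. With 8-bit units (BytesTrie) there is no room for this, so a
// separate intermediate-value node would be needed instead.
constexpr uint16_t kLinearMatchFlag = 0x8000;  // assumed node-type bit
constexpr int kValueShift = 6;                 // assumed: low 6 bits = match length

uint16_t packHead(int matchLength, int smallValue) {  // smallValue < 512
    return static_cast<uint16_t>(
        kLinearMatchFlag | (smallValue << kValueShift) | matchLength);
}
int headMatchLength(uint16_t head) { return head & 0x3F; }
int headValue(uint16_t head) { return (head & 0x7FFF) >> kValueShift; }
```

Larger intermediate values would not fit in the spare bits and would need one or two following units, as described above.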

---
layout: default
title: ICU Code Point Tries
parent: Data Structures
grand_parent: Design Docs
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# ICU Code Point Tries
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Fast lookup in arrays
For fast lookup by code point, we store data in arrays. It costs too much space
to use a single array indexed directly by code points: There are about 1.1M of
them (max 0x10ffff, about 20.1 bits), and about 90% are unassigned or private
use code points. For some uses, there are non-default values only for a few
hundred characters.
We use a form of "trie" adapted to single code points. The bits in the code
point integer are divided into two or more parts. The first part is used as an
array offset, the value there is used as a start offset into another array. The
next code point bit field is used as an additional offset into that array, to
fetch another value. The final part yields the data for the code point.
Non-final arrays are called index arrays or tables.
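A minimal sketch of such a two-stage lookup, with field widths chosen arbitrarily for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Two-stage lookup: the high code point bits select an index entry,
// which holds the start offset of a data block; the low bits are the
// offset within that block. Identical blocks (e.g. an all-default
// block) can be shared by pointing several index entries at them.
struct TwoStageTrie {
    static constexpr int kShift = 6;                // 6 low bits per block
    static constexpr uint32_t kMask = (1u << kShift) - 1;
    std::vector<uint32_t> index;                    // data block start offsets
    std::vector<uint16_t> data;

    uint16_t get(uint32_t cp) const {
        return data[index[cp >> kShift] + (cp & kMask)];
    }
};
```

Pointing several index entries at the same data block is what makes the block sharing and compaction described below possible.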
> See also [ICU String Tries](tries/index.md).
For lookup of arbitrary code points, we need at least three successive arrays,
so that the first index table is not too large.
For all but the first index table, different blocks of code points with the same
values can overlap. A special block contains only default values and is shared
among all blocks of code points that map there.
Block sharing works better, and thus leads to smaller data structures, when the
blocks are smaller, that is, when fewer code point bits are used as intra-block
offsets.
On the other hand, shorter bit fields require more of them, and thus more
successive arrays and lookups, which adds code size and makes lookups slower.
(Until about 2001, all ICU data structures only handled BMP code points.
"Compact arrays" split 16-bit code points into fields of 9 and 7 bits.)
We tend to make compromises including additional index tables for smaller parts
of the Unicode code space, for simpler, faster lookup there.
For a general-purpose structure, we want to be able to store a unique
value for every character. This determines the number of bits needed in the last
index table. With 136,690 characters assigned in Unicode 10, we need at least 18
bits. We allocate data values in blocks aligned at multiples of 4, and we use
16-bit index words shifted left by 2 bits. This leads to a small loss in how
densely the data table can be used, and how well it can be compacted, but not
nearly as much as if we were using 32-bit index words.
## Character conversion
The ICU conversion code uses several variants of code point tries with data
values of 1, 2, 3, or 4 bytes corresponding to the number of bytes in the output
encoding.
## UTrie
The original "UTrie" structure was developed for Unicode Normalization for all
of Unicode. It was then generalized for collation, character properties, and
eventually almost every Unicode data lookup. Values are 16 or 32 bits wide.
It was designed for fast UTF-16 lookup with a special, complicated structure for
supplementary code points using custom values for lead surrogate units. This
custom data and code made this structure relatively hard to use.
11:5 bits for the BMP and effectively 5:5:5:5 bits for supplementary code points
provide for good compaction. The BMP index table is always 2<sup>11</sup> uint16_t = 4kB.
Small index blocks for the supplementary range are added as needed.
The structure stores different values for lead surrogate code *units* (for fast
moving through UTF-16) vs. code *points* (for lookup by code point).
The first 256 data table entries are a fixed-size, linear table for Latin-1 (up
to U+00FF).
## UTrie2
The "UTrie2" structure, developed in 2008, was designed to enable fast lookup
from UTF-8 without always having to assemble whole code points and to split them
again into the trie bit fields.
It retains separate lookups for lead surrogate code units vs. code points.
It retains the same 11:5 lookup for BMP code points, for good compaction and
good performance.
There is a special small index for lead bytes of two-byte UTF-8 sequences (up to
U+07FF), for 5:6 lookup. These index values are not shifted left by 2.
Lookup for three-byte UTF-8 uses the BMP index, which is clumsy.
Lookup for supplementary code points is much simpler than with UTrie, without
custom data values or code. Two index tables are used for 9:6:5 code point bits.
The first index table omits the BMP part. The structure stores a code point
(highStart) after which all code points map to the default value, and the first
index table is truncated to below that.
With the fixed BMP index table and other required structures, an empty UTrie2 is
about 5kB large.
The UTF-8 lookup was also designed for the original handling of ill-formed
UTF-8: The first 192 data table entries are a linear table for ASCII plus the 64
trail bytes, to look up "single" bytes 0..BF without further checking, with
error values for the trail bytes. Lookup of two-byte non-shortest forms (C0
80..C1 BF) also yields error values. These error values became unused in 2017
when ICU 60 changed to handling ill-formed UTF-8 compatible with the W3C
Encoding standard (substituting maximal subparts of valid sequences). C0 and C1
are no longer recognized as lead bytes, requiring full byte sequence validation
separate from the data lookup.
## Ideas
Possible goals: Simpler code, smaller data especially for sparse tries, maybe
faster UTF-8, not much slower UTF-16.
We should try to store only one set of values for surrogates. Unicode property
APIs use only by-code point lookup without special lead surrogate values.
Collation uses special lead surrogate data but does not use code point lookup.
Normalization does both, but the per-code point lookup could test for surrogate
code points first and return trivial values for all of them. UTF-16 string
lookup should map unpaired surrogates to the error value.
We should remove the special data for old handling of ill-formed UTF-8, the
error values for trail bytes and two-byte non-shortest forms.
If we use 6 bits for the last code point bit field, then we can use the same
index table for code point/UTF-16 lookup as well as UTF-8 lookup. Compaction
will be less effective, so data will grow some. This would be somewhat
compensated by the smaller BMP index table.
If we also continue to use 6 bits for the second-to-last table, that is, 8:6:6
bits, then we can simplify the code for three- and four-byte UTF-8.
If we always include the BMP in the first index table, then we can also simplify
enumeration code a bit, and use smaller code for code point lookups where code
size is more important than maximum speed.
Alternatively, we could improve compaction and speed for the BMP by using no
index shift-left for BMP indexes (and keep omitting the BMP part of the first
index table). In order to ensure that BMP data can be indexed directly with
16-bit index values, the builder would probably have to copy at least the BMP
data into a new array for compaction, before adding data for supplementary code
points. When some of the indexes are not shifted, and their data is compacted to
arbitrary offsets, then that data cannot also be addressed with uniform
double-index lookup. We may or may not store unused first-index entries. If not
the whole BMP is indexed differently, then UTF-16 and three-byte UTF-8 lookups
need another code branch. (Size vs. simplicity & speed.)
The more tries we use, the higher the total cost of the size overhead. (For
example, many of the 100 or so collation tailorings carry a UTrie2.) The less
overhead, the more we could use separate tries where we currently combine them
or avoid them. Smaller overhead would make it more attractive to offer a public
code point map structure.
Going to 10:6 bits for the BMP cuts the fixed-size index in half, to 2kB.
We could reduce the fixed-size index table much further by using two-index
lookup for some or most of the BMP, trading off data size for speed and
simplicity. The index must be at least 32 uint16_t's for two-byte UTF-8, for up
to U+07FF including Cyrillic and Arabic. We could experiment with length 64 for
U+0FFF including Indic scripts and Thai, 208 entries for U+33FF (most small
scripts and Kana), or back to 1024 entries for the whole BMP. We could configure
a different value at build time for different services (optimizing for speed vs.
size). If we use the faster lookup for three-byte UTF-8, then the boundaries
should be multiples of 0x1000 (up to U+3FFF instead of U+33FF).
## UCPTrie / CodePointTrie
Added as public API in ICU 63. Developed between the very end of 2017 and
mid-2018.
Based on many of the ideas above and experimentation.
Continued linear data array lookup for ASCII.
No more separate values for lead surrogate code points vs. code units.
* Normalization switched to UCPTrie, working around this: Storing special lead
surrogate values for UTF-16 forward iteration; for code point lookup, the
normalization code checks for lead surrogates and returns an "inert" value
for them; for code point range iteration it uses a special API that treats lead
surrogates as "inert" as well.
* Otherwise simpler API, easier to explain.
* UTF-16 string lookup maps unpaired surrogates to the error value.
For low-code point lookup, uses 6 bits for the last code point field.
* No more need for special UTF-8 2/3-byte lookup structures.
* Smaller BMP index reduces size overhead.
No more data structures for non-shortest UTF-8 sequences.
"Fast" type uses two-stage lookup for all of the BMP (10:6 bits). "Small" type
uses two-stage lookup only up to U+0FFF to trade off size vs. speed. (fastLimit
U+10000 vs. U+1000)
Continued use of highStart for the start of the last range (ending at U+10FFFF),
and highValue for the value of all of its code points.
For code points between fastLimit and highStart, a four-stage lookup is used
(compared with three stages in UTrie2), with small bit fields (6:5:5:4 bits).
"Fast" type: Only for supplementary code points below highStart, if any. "Small"
type: For all code points below highStart; this means that for U+0000..U+0FFF in
a "small" trie data can be accessed with either the two-stage or the four-stage
lookup (and for ASCII also with linear access).
Experimentation confirmed that larger bit fields, especially for the last one or
two stages, lead to poor compaction of sparse data. 6 bits for the data offset
work well for UTF-8 lookup and are a reasonable compromise for the BMP, but for
the large supplementary area which tends to have more sparse data, using a 4 bit
data offset was useful. The drawback is that then the index blocks get larger
and compact less well. (Four-byte UTF-8 lookup applies only to supplementary
code points.)
* Started with 8:6:6 bits, but some tries were 30% larger than with UTrie2.
* Went to 10:6:4 bits which saved 12% overall, with only one trie larger than
UTrie2 (by 8%).
* Experimented with a "gap", omitting parts of the index for another range
like highStart for a typically large range of code points with a single
common value. This helped reduce the data size.
* Experimented with 10:6:4 vs. 11:5:4 vs. 9:6:5 vs. 10:5:5 bits plus the gap.
\*:4 were smaller than \*:5, but the bit distribution within the index
stages had little effect. 11:5:4 yielded the smallest data, indicating that
small bit fields are useful for index stages as well.
* Replaced the gap with splitting the first index bit field into two, for a
four-stage 6:5:5:4 lookup. Just slightly smaller data than 11:5:4+gap, but
less complicated than checking for the gap and working around it; replaces
gap start/limit reads and comparisons with unconditional index array
accesses. 14% smaller overall than UTrie2.
* Added the "small" type where the two-stage lookup only extends to U+0FFF
(6:6 bits) and the four-stage lookup covers all code points below highStart.
34% smaller overall than UTrie2.
The normalization code also lazy-builds a trie with CanonicalIterator data which
is very sparse even in the BMP. With a "fast" UCPTrie it is significantly larger
than with UTrie2, with a "small" UCPTrie it is significantly smaller. Switched
the code to use a "small" trie because it is less performance-sensitive than the
trie used for normalizing strings.
In order to cover up to 256k data values, UTrie2 always shifts 16-bit data block
start offsets left by 2. UCPTrie abandons this, which simplifies two-stage
lookups slightly and improves compaction (no more granularity of 4 for data
block alignment).
* For a "fast" trie to always reach all BMP data values with 16-bit index
entries, the data array is always accessed via a separate pointer, rather
than UTrie2's sharing of the index array with 16-bit data via offsetting by
the length of the index. This also simplifies code slightly and makes access
uniform for all data value widths.
* There are now at most 64k data values for BMP code points because there is
no separate data for lead surrogates any more. The builder code writes data
blocks in code point order to ensure that low code points have low data
block offsets.
* For supplementary code points, data block offsets may need 18 bits. This is
very unusual but possible. (It currently happens only in the collation root
data with Han radical-stroke order, and in a unit test.)
* UCPTrie uses the high bit of the index-2 entry to indicate that the index-3
block stores 18-bit data block offsets rather than 16-bit ones. (This limits
somewhat the length of the index.) In this case, groups of 8 index-3 entries
(= data block start offsets) share an additional entry that stores the two
high bits of each of the eight entries. More complicated lookup, but almost
never used, and keeps BMP lookups always simple.
* A possible alternative could have used a bit per entry, or per small group
of entries, to indicate that a common data value should be returned for
"unused" parts of a sparse data block. There could have been a common value
per index-3 block, per index-2 block, or for the whole trie, etc. Rejected
as much too complicated.
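The 18-bit index-3 variant described above can be sketched as follows; the exact placement of each entry's two high bits within the extra leading entry is an assumption for illustration:

```cpp
#include <cassert>
#include <cstdint>

// A group of eight index-3 entries with 18-bit data block offsets:
// one extra leading 16-bit entry stores the two high bits of each of
// the eight entries that follow (8 entries x 2 bits = 16 bits).
// BMP lookups never take this path, so they stay simple.
uint32_t dataBlockOffset18(const uint16_t group[9], int i) {  // i in [0, 7]
    uint32_t highBits = (group[0] >> (2 * (7 - i))) & 3;
    return (highBits << 16) | group[1 + i];
}
```
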
UTrie2 stores a whole block of 64 error values for UTF-8 non-shortest-form
lookup. UCPTrie does not have this block any more; it stores the error value at
the end of the data array, at dataLength-1.
UTrie2 stores the highValue at dataLength-4. UCPTrie stores it at dataLength-2.
Comparison: [UTrie2 vs.
UCPTrie/CodePointTrie](https://docs.google.com/document/d/e/2PACX-1vTbwdDe2tVJ6pACMpOq7uKW_FgvyyjvPVdgZYsIwSoFJj-27cXR20wAO9qHVoaKOIoo-d8iHnsFOCdc/pub)
Sizes for BreakIterator & Collator tries, UTrie2 vs. UTrie3 experiments: [In
this
spreadsheet](https://docs.google.com/spreadsheets/d/e/2PACX-1vTgL260NFgmbiUAtptKj4fNf9wNm-OJ6Q0TbWzFWvhV7wVZk2Qe-gk2pbJh0pHY9XVsObZ3YaoOnb3I/pubhtml)
see the "nocid" sheet (no CanonicalIterator data).
The last columns on the "nocid" sheet, highlighted in green and blue, correspond
to the final UCPTrie/CodePointTrie. For these tries, the "fast" type (green)
yields 14% smaller data than UTrie2; the "small" type (blue) yields 34% smaller
data.
The simplenormperf sheets show performance comparison data between UTrie2 and
"fast" UCPTrie. There should be little difference for BMP characters; the
numbers are too inconsistent to show a significant difference.
UCPTrie has an option of storing 8-bit values, in addition to 16-bit and 32-bit
values that UTrie2 supports. It would be possible to add 12-bit or 64-bit values
etc. later.

---
layout: default
title: Profiling ICU4C with callgrind
grand_parent: Setup for Contributors
parent: C++ Setup
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Profiling ICU4C with callgrind
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Prerequisites
Valgrind, callgrind, and kcachegrind together provide performance profiling of
C++ code, including annotated source code with the time consumed at each line.
Prerequisites:
* Linux with the clang compiler.
* Valgrind. If not already installed, from the command line,
* `sudo apt install valgrind`
* kcachegrind. To install:
* `sudo apt install kcachegrind`
Build ICU. An optimized build with debug symbols is generally best for
profiling:
```
cd icu4c/source
./runConfigureICU --enable-debug Linux
make -j6 check
```
## Run test code
Prepare the test code you wish to measure. Valgrind is very slow, so be wary of
long running tests. Because Valgrind tracks every last machine instruction (it's
not a sampling profiler), getting good results does not require a long run.
Run the test code under valgrind with callgrind. The example below runs a test
from intltest, but that is not a requirement; valgrind will profile any
executable. The differences from a normal (non-profile) invocation are the
`valgrind` wrapper and its options, and the `LD_BIND_NOW=y` setting.
Without the `LD_BIND_NOW=y` the output is polluted by symbol lookups.
```
LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH \
  LD_BIND_NOW=y valgrind --tool=callgrind \
  --callgrind-out-file=callgrind.out \
  ./intltest translit/TransliteratorTest/TestAllCodepoints
```
The raw profiling data will be left in a callgrind.out file:
```
ls -l callgrind*
-rw------- 1 aheninger eng 325779 Oct 3 15:51 callgrind.out
```
## View in kcachegrind
Run kcachegrind to view the results.
```
kcachegrind callgrind.out
```
Explore. Lots of interesting data is available.
[kcachegrind docs](https://kcachegrind.github.io/html/Documentation.html)
For the above run, here are the top functions, ordered by cumulative time
(including calls out) spent in each.
![image](kcache-cumulative.png)
Time spent in each function, self time only. `UnicodeSet::add()` is hot.
![image](kcache-flat.png)
Annotated source for `UnicodeSet::add()`
![image](kcache-source.png)

---
layout: default
title: C++ Setup
parent: Setup for Contributors
has_children: true
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# C++ Setup
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## C/C++ workspace structure
It is best to keep the source file tree and the build-output files separate
("out-of-source build"). It keeps your source tree clean, and you can build
multiple configurations from the same source tree (e.g., debug build, release
build, build with special flags such as no-using-namespace). You could keep the
source and build trees in parallel folders.
**Important:** If you use runConfigureICU together with CXXFLAGS or similar, the
*custom flags must come before the runConfigureICU invocation*, so that they are
visible as environment variables in the runConfigureICU shell script rather than
just as option text. See the sample runConfigureICU invocations below.
See the ICU4C readme's [Recommended Build
Options](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#RecBuild).
For example:
* `~/icu/mine/`**`src`**
* source tree including icu (ICU4C) & icu4j folders
* setup: mkdir + git clone your fork (see the [Linux Tips
subpage](linux.md)) + cd to here.
* Use `git checkout <branch>` to switch between branches.
* Use `git checkout -b <newbranchname>` to create a new branch and switch
to it.
* After switching branches, remember to update your IDE's view of the
source tree.
* For C++ code, you may want to `make clean` *before* switching to a
different branch.
* `~/icu/mine/icu4c/`**`bld`**
* release build output
* not-using-namespace is always recommended
* setup: mkdir+cd to here, then something like
`CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
../../src/icu4c/source/runConfigureICU Linux
--prefix=/home/*your_user_name*/icu/mine/inst > config.out 2>&1`
* build: `make -j5 check > out.txt 2>&1`
* `~/icu/mine/icu4c/`**`dbg`**
* debug build output
* not-using-namespace is always recommended
* setup: mkdir+cd to here, then something like
`CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
../../src/icu4c/source/runConfigureICU --enable-debug
--disable-release Linux --prefix=/home/*your_user_name*/icu/mine/inst >
config.out 2>&1`
* build: `make -j5 check > out.txt 2>&1`
* Be sure to test with gcc and g++ too! `CC=gcc CXX=g++
CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
../../src/icu4c/source/runConfigureICU --enable-debug --disable-release
Linux`
* `~/icu/mine/icu4c/`**`nm_utf8`**
* not-using-namespace and default-hardcoded-UTF-8
* setup: mkdir+cd to here, then something like
`../../src/icu4c/source/configure
CXXFLAGS="-DU_USING_ICU_NAMESPACE=0" CPPFLAGS="-DU_CHARSET_IS_UTF8=1
-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
--prefix=/home/*your_user_name*/icu/mine/inst > config.out 2>&1`
* `~/icu/mine/icu4c/static`
* gcc with static linking
* setup: mkdir+cd to here, then something like
`../../src/icu4c/source/configure
CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -O2 -ffunction-sections
-fdata-sections" LDFLAGS="-Wl,--gc-sections" --enable-static
--disable-shared --prefix=/home/*your_user_name*/icu/mine/inst >
config.out 2>&1`
* `~/icu/mine/`**`inst`**
* “make install” destination (don't clobber your platform ICU during
development)
* `~/icu/msg48/src`
* Optional: You could have multiple parallel workspaces, each with their
own git clones, to reduce switching a single workspace (and the IDE
looking at it) from one branch to another.
### Run individual test suites
* `cd ~/icu/mine/icu4c/dbg/test/intltest`
* `export LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw`
* `make -j5 && ./intltest utility/ByteTrieTest utility/UCharTrieTest`
* `cd ~/icu/mine/icu4c/dbg/test/cintltst`
* same relative `LD_LIBRARY_PATH` as for intltest
* `make -j5 && ./cintltst`
## gdb pretty-printing
Shane wrote this gdb script in 2017: It pretty-prints UnicodeString in GDB.
Instead of seeing the raw internals of UnicodeString, you will see the length,
storage type, and content of the UnicodeString in your debugger. There are
installation instructions in the top comment on the file (it's a matter of
downloading the file and adding a line to `~/.gdbinit`).
<https://gist.github.com/sffc/7b3826fd67cb78057a9e66f2b350a647>
This also works in anything that wraps GDB, like CLion and Visual Studio Code.
## Linux Tips
For more Linux-specific tips see the [Linux Tips subpage](linux.md).

---
layout: default
title: C++ Setup on Linux
grand_parent: Setup for Contributors
parent: C++ Setup
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# C++ Setup on Linux
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Compiler
For ICU4C 50 or newer the `configure` script picks `clang` if it is installed,
or else `gcc`. Clang produces superior error messages and warnings.
Most Linuxes should have clang available to install. On Ubuntu or other
Debian-based systems, install it with
```
sudo apt-get install clang
```
Debug builds must use compiler option `-g` and should not optimize (`-O0` is the
default). Recent versions of `gcc` also support `-Og` as the recommended
optimization level for debugging.
Release builds can use `-O3` for best performance. See
<http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>
`clang` might even benefit from `-O4` where "whole program optimization is done
at link time". See
<http://developer.apple.com/library/mac/#documentation/Darwin/Reference/Manpages/man1/clang.1.html>
## Other build flags
On a modern Linux you can configure with `CPPFLAGS="-DU_CHARSET_IS_UTF8=1"`.
## Debugging
`gdb` should work with both out-of-source and in-source builds. If not,
double-check with "`make VERBOSE=1`" that both .c and .cpp files are compiled
with `-g` and either `-O0` or no `-O*anything*` at all.
`kdbg` is a reasonable GUI frontend for gdb. It keeps the source code in sync
and updates views of variables & memory etc.
* kdbg versions below 2.5.2 do not work with gdb 7.5; you get a message box
with "GDB: Reading symbols from..."
* As a workaround,
* Create a `~/.gdbinit` file with `set print symbol-loading off`
* Start kdbg, open `Settings/Global options` and remove the `--nx`
argument to gdb.
## Portability Testing
GitHub pull requests are automatically tested on Windows, Linux with both clang
& gcc, and Macintosh. The build results show up as check results on the status
page.
Build errors will block the pull request. It's also useful to check the build
logs for new warnings on platforms other than the one used for development.
## Clang sanitizers
Clang has built-in sanitizers to check for several classes of problems. Here are
the configure options for building ICU with the address checker:
```
CPPFLAGS=-fsanitize=address LDFLAGS=-fsanitize=address ./runConfigureICU \
  --enable-debug --disable-release Linux --disable-renaming
```
The other available sanitizers are `thread`, `memory` and `undefined` behavior.
At the time of this writing, `thread` and `address` run cleanly; the others show
warnings that have not yet been resolved.
## Heap Usage (ICU4C)
HeapTrack is a useful tool for analyzing heap usage of a test program, to check
the total heap activity of a particular function or object creation, for
example. It will show totals by line in the source, and can move up and down the
stack to see more detail.
<https://github.com/KDE/heaptrack>
To install on Linux,
```
sudo apt install heaptrack
sudo apt install heaptrack-gui
```
## Quick Scripts for small test programs
I use the following simple scripts to simplify building and debugging small
stand-alone programs against ICU, without needing to set up makefiles. They
assume a program with a single .cpp file with the same name as the directory in
which it resides.
```
b: build
r: run
d: debug
v: run under valgrind
```
You will probably need to modify them to reflect where you keep your most
commonly used ICU build, and whether you routinely use an out-of-source ICU
build.
```
$ cat `which b`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
clang++ -g -I $ICU_HOME/source/common -I $ICU_HOME/source/i18n \
    -I $ICU_HOME/source/io -L$ICU_HOME/source/lib -L$ICU_HOME/source/stubdata \
    -licuuc -licui18n -licudata -o $PROG $PROG.cpp

$ cat `which r`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
LD_LIBRARY_PATH=$ICU_HOME/source/lib:$ICU_HOME/source/stubdata \
    ICU_DATA=$ICU_HOME/source/data/out ./$PROG

$ cat `which d`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
LD_LIBRARY_PATH=$ICU_HOME/source/lib:$ICU_HOME/source/stubdata \
    ICU_DATA=$ICU_HOME/source/data/out gdb ./$PROG

$ cat `which v`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
LD_LIBRARY_PATH=$ICU_HOME/source/lib:$ICU_HOME/source/stubdata \
    ICU_DATA=$ICU_HOME/source/data/out valgrind --leak-check=full ./$PROG
```

---
layout: default
title: Configuring VS Code for ICU4C
grand_parent: Setup for Contributors
parent: C++ Setup
---
<!--- © 2020 and later: Unicode, Inc. and others. --->
<!--- License & terms of use: http://www.unicode.org/copyright.html --->
# Configuring VS Code for ICU4C
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
- Create a `.vscode` folder in icu4c/source
- Copy the [`tasks.json`](tasks.json), [`launch.json`](launch.json) and [`c_cpp_properties.json`](c_cpp_properties.json) files into
the `.vscode` folder.
- To test only specific test targets, specify them under `args` in
`launch.json`.

---
layout: default
title: Setup for Contributors
nav_order: 10000
has_children: true
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Setup for Contributors

---
layout: default
title: Ant Setup for Java
grand_parent: Setup for Contributors
parent: Java Setup
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Ant Setup for Java
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Overview
The ICU4J source layout changed after version 4.2. There are several ways to set
up the ICU4J development environment.
Get the source code by following the [Quick Start
instructions](http://site.icu-project.org/repository). Go into the `icu4j/`
directory to find the `build.xml` file. You can list the available targets with
`ant -p`.
Main targets:
* `all` Build all primary targets
* `apireport` Run API report generator tool
* `apireportOld` Run API report generator tool (Pre Java 5 Style)
* `build-tools` Build build-tool classes
* `charset` Build charset classes
* `charset-tests` Build charset tests
* `charsetCheck` Run only the charset tests
* `check` Run the standard ICU4J test suite
* `checkDeprecated` Check consistency between javadoc @deprecated and @Deprecated annotation
* `checkTest` Run only the specified tests of the specified test class or, if no arguments are given, the standard ICU4J test suite.
* `checktags` Check API tags before release
* `cldrUtil` Build Utilities for CLDR tooling
* `clean` Clean up build outputs
* `collate` Build collation classes
* `collate-tests` Build collation tests
* `collateCheck` Run only the collation tests
* `core` Build core classes
* `core-tests` Build core tests
* `coreCheck` Run only the core tests
* `coverageJaCoCo` Run the ICU4J unit tests and generate code coverage report
* `currdata` Build currency data classes
* `demos` Build demo classes
* `docs` Build API documents
* `docsStrict` Build API documents with all doclint check enabled
* `draftAPIs` Run API collector tool and generate draft API report
* `exhaustiveCheck` Run the standard ICU4J test suite in exhaustive mode
* `findbugs` Run FindBugs on all library sub projects.
* `gatherapi` Run API database generator tool
* `gatherapiOld` Run API database generator tool (Pre Java 5 style)
* `icu4jJar` Build ICU4J all-in-one core jar
* `icu4jSrcJar` Build icu4j-src.jar
* `icu4jtestsJar` Build ICU4J all-in-one test jar
* `indicIMEJar` Build indic IME 'icuindicime.jar' jar file
* `info` Display the build environment information
* `init` Initialize the environment for build and test. May require internet access.
* `jar` Build ICU4J runtime library jar files
* `jarDemos` Build ICU4J demo jar file
* `jdktzCheck` Run the standard ICU4J test suite with JDK TimeZone
* `langdata` Build language data classes
* `localespi` Build Locale SPI classes
* `localespi-tests` Build Locale SPI tests
* `localespiCheck` Run the ICU4J Locale SPI test suite
* `main` Build ICU4J runtime library classes
* `packaging-tests` Build packaging tests
* `packagingCheck` Run packaging tests
* `perf-tests` Build performance test classes
* `regiondata` Build region data classes
* `release` Build all ICU4J release files for distribution
* `releaseBinaries` Build ICU4J binary files for distribution
* `releaseCLDR` Build release files for CLDR tooling
* `releaseDocs` Build ICU4J API reference doc jar file for distribution
* `releaseSourceArchiveTgz` Build ICU4J source release archive (.tgz)
* `releaseSourceArchiveZip` Build ICU4J source release archive (.zip)
* `releaseSrcJars` Build ICU4J src jar files for distribution
* `releaseVer` Build all ICU4J release files for distribution with versioned file names
* `runTest` Run the standard ICU4J test suite without calling any other build targets
* `samples` Build sample classes
* `secure` (Deprecated) Build ICU4J API and test classes for running the ICU4J test suite with Java security manager enabled
* `secureCheck` Run the secure (applet-like) ICU4J test suite
* `test-framework` Build test framework classes
* `tests` Build ICU4J test classes
* `timeZoneCheck` Run the complete test for TimeZoneRoundTripAll
* `tools` Build tool classes
* `translit` Build translit classes
* `translit-tests` Build translit tests
* `translitCheck` Run the ICU4J Translit test suite
* `translitIMEJar` Build transliterator IME 'icutransime.jar' jar file
* `xliff` Build xliff converter tool
Default target: main
The typical usage is `ant check`, which will build main ICU4J libraries and
run the standard unit test suite.
For running ant you may need to set up some environment variables first. For
example, on Windows:
```
set ANT_HOME=C:\ant\apache-ant-1.7.1
set JAVA_HOME=C:\Program Files\Java\jdk1.5.0_07
set PATH=%JAVA_HOME%\bin;%ANT_HOME%\bin;%PATH%
```
## Test arguments and running just one test or the tests of just one test class
You can pass arguments to the test system by using the 'testclass' and
'testnames' variables and the 'checkTest' target. For example:
|Command Line|Meaning|
|------------|--------|
|`ant checkTest -Dtestclass='com.ibm.icu.dev.test.lang.TestUScript'` | Runs all the tests in test class 'TestUScript'.|
|`ant checkTest -Dtestclass='com.ibm.icu.dev.test.lang.TestUScript' -Dtestnames='TestNewCode,TestHasScript'` | Runs the tests `TestNewCode` and `TestHasScript` in test class `TestUScript`. |
|`ant checkTest -Dtestnames='TestNewCode,TestHasScript'` | Error: test class not specified.|
|`ant checkTest` | Runs the standard ICU4J test suite (same as 'ant check').|
The JUnit-generated test result reports are in out/junit-results/checkTest. Go
into the `html/` subdirectory and load `index.html` into a browser.
## Generating Test Code Coverage Report
[#10513](http://bugs.icu-project.org/trac/ticket/10513) added code coverage
target "coverageJaCoCo" in the ICU4J ant build.xml. To run the target:
1. Download JaCoCo library from [EclEmma
site](http://eclemma.org/jacoco/index.html).
2. Extract library files to your local system - e.g. `C:\jacoco-0.7.6`
3. Set environment variable JACOCO_DIR pointing to the directory where JaCoCo
files are extracted - e.g. `set JACOCO_DIR=C:\jacoco-0.7.6`
4. Set up ICU4J ant build environment.
5. Run the ant target "coverageJaCoCo" in the top-level ICU4J build.xml
The following output report files will be generated in the `out/jacoco` directory:
* report.csv
* report.xml
* report_html.zip
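Steps 3-5 above amount to a couple of shell commands on Linux/macOS; a minimal
sketch (the extraction path is an example - point `JACOCO_DIR` at wherever you
extracted JaCoCo, then run the target from the top-level ICU4J directory):

```shell
# Example path only; substitute your own JaCoCo extraction directory.
export JACOCO_DIR=/opt/jacoco-0.7.6
# From the top-level ICU4J directory, with the ant build environment set up:
# ant coverageJaCoCo
echo "JACOCO_DIR=$JACOCO_DIR"
```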
## Building ICU4J API Reference Document with JCite
Since ICU4J 49M2, JCite (Java Source Code Citation System) is integrated into
ICU4J documentation build. To build the API documentation for public release,
you must use JCite for embedding some coding examples in the API documentation.
To set up the environment:
1. Download JCite binary (you need 1.13.0+ for JDK 7 support) from <http://arrenbrecht.ch/jcite/>
* Note that JCite no longer is available for download from the official
web site, which links to Google Code, which was closed down in 2016.
* The Internet Archive has a copy of the last version of JCite found on
Google Code before it was closed down:
[jcite-1.13.0-bin.zip](https://web.archive.org/web/20160710183051/http://jcite.googlecode.com/files/jcite-1.13.0-bin.zip)
2. Extract JCite file to your local system - e.g. `C:\jcite-1.13.0`
3. Set environment variable `JCITE_DIR` pointing to the directory where JCite
files are extracted. - e.g. `set JCITE_DIR=C:\jcite-1.13.0`
4. Set up ICU4J ant build environment.
5. Run the ant target "docs" in the top-level ICU4J build.xml
6. If the build (on Linux) fails because package com.sun.javadoc is not found
then set the JAVA_HOME environment variable to point to `<path>/java/jdk`. The
Javadoc package is in `<path>/java/jdk/lib/tools.jar`.
*Note: The ant target "docs" checks if `JCITE_DIR` is defined or not. If not
defined, it will build ICU4J API docs without JCite. In this case, JCite taglet
"{@.jcite ....}" won't be resolved and the embedded tag is left unchanged in the
output files.*
## Build and test ICU4J Eclipse Plugin
Building Eclipse ICU4J plugin
1. Download and install the latest Eclipse release from
<http://www.eclipse.org/> (The latest stable milestone is desired, but the
latest official release should be OK).
2. cd to `<icu4j root>` directory, and make sure `$ ant releaseVer` runs clean.
3. cd to the `<icu4j root>/eclipse-build` directory.
4. Copy `build-local.properties.template` to `build-local.properties`, and edit
the properties file:
* eclipse.home pointing to the directory where the latest Eclipse version
is installed (the directory contains configuration, dropins, features,
p2 and others)
* java.rt - see the explanation in the properties file
5. Run the default ant target: `$ ant`. The output ICU4J plugin jar file is
`<icu4j
root>/eclipse-build/out/projects/ICU4J.com.ibm.icu/com.ibm.icu-com.ibm.icu.zip`
Plugin integration test
1. Backup Eclipse installation (if you want to keep it - just copy the entire
Eclipse installation folder)
2. Delete the ICU4J plugin included in the Eclipse installation -
`<eclipse>/plugins/com.ibm.icu_XX.Y.Z.vYYYYMMDD-HHMM.jar`, where XX.Y.Z is the
ICU version and YYYYMMDD-HHMM is the build date. For example,
com.ibm.icu_58.2.0.v20170418-1837.jar
3. Copy the new ICU4J plugin jar file built by previous steps (e.g.
com.ibm.icu_61.1.0.v20180502.jar) to the same folder.
4. Search for the text `"com.ibm.icu"` in files under `<eclipse>/features`. The RCP
feature has a dependency on the ICU plugin and its `feature.xml` (e.g.
`<eclipse>/features/org.eclipse.e4.rcp_1.6.2.v20171129-0543/feature.xml`)
contains the dependent plugin information. Replace just version attribute to
match the version built by above steps. You can leave size attributes
unchanged. The current ICU build script does not append hour/minute in
plugin jar file, so the version format is XX.Y.Z.vYYYYMMDD.
        <plugin
           id="com.ibm.icu"
           download-size="11775"
           install-size="26242"
           version="58.2.0.v20170418-1837" -> "61.1.0.v20180502"
           unpack="false"/>
5. Open
`<eclipse>/configuration/org.eclipse.equinox.simpleconfigurator/bundles.info`
in a text editor, and update the line including com.ibm.icu plugin
information.
```
com.ibm.icu,58.2.0.v20170418-1837,plugins/com.ibm.icu_58.2.0.v20170418-1837.jar,4,false
```
then becomes:
```
com.ibm.icu,61.1.0.v20180502,plugins/com.ibm.icu_61.1.0.v20180502.jar,4,false
```
6. Make sure Eclipse starts successfully with no errors. If the ICU4J plug-in
is not successfully loaded, the Eclipse IDE won't start.
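The `bundles.info` update in step 5 is a plain string substitution, so it can
also be scripted; a minimal sketch using `sed` (the version numbers are the
examples from the steps above; in practice you would run `sed -i` on the real
`bundles.info`):

```shell
OLD="58.2.0.v20170418-1837"
NEW="61.1.0.v20180502"
# The existing com.ibm.icu line from bundles.info (example content):
line="com.ibm.icu,${OLD},plugins/com.ibm.icu_${OLD}.jar,4,false"
# Replace every occurrence of the old version with the new one:
updated=$(printf '%s\n' "$line" | sed "s/${OLD}/${NEW}/g")
printf '%s\n' "$updated"
# prints: com.ibm.icu,61.1.0.v20180502,plugins/com.ibm.icu_61.1.0.v20180502.jar,4,false
```

The same substitution works for the `version` attribute in `feature.xml`
described in step 4.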
ICU4J plugin test - Note: This is currently broken
<http://bugs.icu-project.org/trac/ticket/13072>
1. Start the Eclipse (with new ICU4J plugin), and create a new workspace.
2. Import existing Eclipse project from `<icu4jroot>/eclipse-build/out/projects/com.ibm.icu.tests`
3. Run the project as JUnit Plug-in Test.
## Building ICU4J Release Files
See [Release Build](../../../processes/release/tasks/release-build.md)

---
layout: default
title: Eclipse Setup for Java Developers
grand_parent: Setup for Contributors
parent: Java Setup
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Eclipse Setup for Java Developers
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
ICU4J source layout was changed after 4.2. There are several ways to set up the ICU4J development environment.
*If you want to use Eclipse, you should create a new clean workspace first.*
## Java Language Level
Eclipse typically requires a newer Java version than what we can depend on for
ICU4J. If you don't do the following, you run the risk of calling Java library
APIs that are newer than ICU4J's minimum Java version, causing runtime
exceptions for people who use the older version.
Currently (as of 2018-aug / ICU 63), ICU4J is on Java 7 (and Eclipse 4.6
requires Java 8).
(Note: localespi/localespi-tests may use a different Java version from ICU4J
proper.)
1. Check if you already have an older JRE or a JDK for the minimum version
required for ICU4J.
* A JRE (runtime environment, no compiler) is sufficient.
* If you don't have one yet, then install one (OpenJDK or Oracle).
2. Select \[Window\] - \[Preferences\] *(On Mac, this is \[Eclipse -
Preferences\])*
3. Navigate the preferences tree to Java/Installed JREs/Execution Environments
4. On the left, Execution Environments: Select JavaSE-1.7
5. On the right, Compatible JREs, if there is no old-version Java 7 JRE:
1. Go up one tree level to Java/Installed JREs.
2. Click "Add..." and select "Standard VM" as JRE type.
3. Click "Directory..." and find the location of your old-version JRE (or
JDK) on your system
* Linux tip: When you install an OpenJDK, look for it in /usr/lib/jvm/
4. You can leave the detected settings as is - Click "Finish", then Click
"OK" in Installed JREs (or "Apply" the modified settings as you navigate
away from here).
5. Go back down in the tree to Java/Installed JREs/Execution Environments.
6. On the right, Compatible JREs, you should now see your old-version JRE
6. The matching-old-version JRE should have a "\[perfect match\]" suffix.
Select it for "JavaSE-1.7" on the left.
## Other Settings
1. ~~Turn on warnings about resource leaks. Preferences>Java>Compiler>Errors/Warnings>\[filter on leak\], set both "Resource leak" and "Potential resource leak" to "Warning".~~
(ICU project files were updated to include these settings, so this has no
effect as of 2015-03-11.)
## Import ICU4J from the file system
(Recommended)
In `<icu workspace root>/icu4j`, remember to run `ant init` first. You might run
`ant check` as well for good measure.
If you check out the ICU4J source from the repository using an external client
(usually command-line `git clone`), the instructions are not much different.
Just follow the steps below:
1. File/Import...
2. Select General/Existing Projects into Workspace
3. Select root directory: Browse to `<icu workspace root>/icu4j`, which will
show a number of projects.
4. Deselect the following projects (i.e., do not import them). These are not
needed for normal ICU development (and would require installing further
prerequisite libraries to get them to build).
* com.ibm.\* (Eclipse plug-in)
* icu4j-localespi\* (more plug-in)
* icu4j-build-tools
* icu4j-packaging-tests
5. Click Finish.
6. Wait for Eclipse to build the projects.
## Obsolete: Import ICU4J using the [Subversive](http://www.eclipse.org/subversive/) SVN plugin
Subversive is the standard SVN plugin for Eclipse 3.4+. If you have
[Subversive](http://www.eclipse.org/subversive/) installed/configured in your
Eclipse environment, you can directly check out these 8 projects from the SVN
repository directory. (It appears this does not work well with "subclipse".)
#### Installing Subversive (Eclipse 3.6 or later)
1. Select \[Help\] - \[Install New Software...\] from menu
2. Select the appropriate Eclipse update site in "Work with:" field - for
example, select "Indigo - http://download.eclipse.org/releases/indigo" for
Eclipse 3.7.x and hit enter key.
3. Expand "Collaboration" and check "Subversive SVN Team Provider
(Incubation)", then click "Next >". Confirm the item selected in the next
screen, then click "Next >" again, then accept license terms in the next
screen and click "Finish". After the installation, click "Restart Now" to
restart Eclipse.
4. Select \[Window\] - \[Preferences\] to open Preferences. Expand Team on the
left pane and click SVN. It will open "Subversive Connector Discovery".
Select one from the list. **Note: Some people (including myself) are
experiencing a problem with SVNKit 1.3.5. If you want to use SVNKit, use
1.3.3 instead (2011-10-24 yoshito)** Restart Eclipse.
### Installing Subversive (Old)
1. Go to <http://www.eclipse.org/subversive/downloads.php>
2. Go to "latest release" on that page
3. Copy the update site, eg
"<http://download.eclipse.org/technology/subversive/0.7/update-site/> "
4. Go to Eclipse, then Help > Install New Software...
5. Into "Work with...", paste the update site.
6. Set the checkbox on Plug-ins. Hit Next and Finish until you are done.
Restart Eclipse.
7. Start Eclipse. It will ask for the connectors. Select all the SVN kits and
install. Restart Eclipse.
### Importing ICU4J
1. File - Import
2. Select "Project from SVN" under "SVN", Next
3. In the General Tab, set URL to:
svn+ssh://source.icu-project.org/repos/icu/icu4j, and set your User name:
XXXXXX
4. In the SSH Settings, fill in the proper authentication information (port
922, your ssh key (e.g. icu-project-key), and passphrase).
5. If the connection is properly established, it opens next dialog "Select
Resource". Set URL to be:
svn+ssh://source.icu-project.org/repos/icu/icu4j/trunk/main - then click
Finish
6. The next dialog "Check Out As" should have 4 options indicating how to check
out. Select the second option "Find projects in the children of the selected
resource" - click Finish
7. It takes a while to locate projects. The next dialog shows a batch of
projects -
1. You may want to deselect localespi and localespi-tests.
2. Click Finish (that means, "Check out as a projects into workspace" is
selected)
8. After these projects are imported into the workspace, open Java perspective.
You might notice there is a modification marker (">") displayed for the
projects. This is caused by the build output directory created in each
project's workspace. To resolve this issue, go to Window - Preferences, then
select Team - Ignore Resources, then Add Pattern "out" (the build output
directory used by these projects). You may need to restart Eclipse after
adding the new pattern.
**Note:** With the instructions above, you may see Eclipse errors when you open
the ant build.xml in each project, such as "Target @build-all does not exist in
this project". This is because the import operation above flattens the original
SVN directory structure, so files referenced via ${share.dir} do not resolve.
To resolve the issue, you need to override the property by importing
locations-eclipse.properties globally. See the following steps to configure the
override.
1. From Eclipse menu, select \[Window\] - \[Preferences\]
2. Select "Ant" - "Runtime" on the left in the Preferences dialog
3. Open "Properties" tab
4. Under "Global property files", click "Add Files..."
5. Select icu4j-shared project in the list, then select
build/locations-eclipse.properties
6. Click OK - OK, to save the configuration.
## Another method using Eclipse SVN plugin (Subversive and Subclipse)
1. File - New - Other... then, select "Repository Location" under "SVN"
1. General Tab
1. URL - svn+ssh://source.icu-project.org/repos/icu/icu4j
2. User name: <yourname>
3. Password: <leave empty>
2. SSH Settings
1. Port: 922
2. Private key: <browse to your ssh private key>
3. Passphrase: <your passphrase>
3. Finish
2. Open SVN Repositories perspective (Window>Open Perspective>SVN Repository
Exploring) and expand the repository location you added above.
3. Navigate to trunk
4. Right click and select "Check Out" - this may take a few minutes.
5. File - Import and select "Existing Projects into Workspace" under "General"
6. Select root directory - navigate to "main" directory under your workspace
location where the source files were checked out (for example,
C:\\eclipse_ws\\icu4j\\trunk\\main)
7. You should see 10 projects including icu4j-charset, icu4j-charset-tests,
icu4j-core, etc. (the number of projects might change in the future)
1. All of them should be selected
2. **Important**: unclick "copy projects into Workspace"
3. Click Finish to import all
8. Back in the Java perspective, you should see the new projects.
9. At this point, the projects are associated with the SVN workspace. If you
see the modification marker (">") displayed for the projects, configure your
workspace to ignore the pattern "out" (see the final step in the previous
section).
10. From the command line, run "ant init" in the top level "main" (for example,
C:\\eclipse_ws\\icu4j\\trunk\\main)
## Testing & Debugging
### Run All Tests
To run all of the main tests, do the following:
**58 or later**
* Run `ant check` from the command line.
**53-57**
* Select icu4j-testall project in package explorer
* Right Click > Run As > Java Application
**52 or before**
* In icu4j-test-testframework, open com.ibm.icu.dev.test.TestAll
* RightClick>Run As>Java Application...
* It will fail, but create a Run Configuration
* RightClick>Run Configuration...
* Change the name to "TestAll - ICU4J"
* Click on Arguments, and set to "-n -t"
* Click on Classpath>User Entries>Add Projects...
* Select all of your ICU projects **except icu4j-localespi and
icu4j-localespi-test**, and Add, e.g.:
* icu4j-charset
* icu4j-charset-tests
* ...
* Now Run.
### Run specific tests
#### 58 or later
* Right click on a test package (for example `com.ibm.icu.dev.test.rbbi` in
the **icu4j-core-tests** project), or an entire test source directory (such
as src in the **icu4j-core-tests** project) and choose **Run As->JUnit
Test**
* For test coverage, install EclEmma (below) and use **Coverage As** instead
of **Run As**.
### Test in Eclipse with ICU4J from jar files
You can manually create an Eclipse Run Configuration that doesn't include any of
the directories but all of the JAR files:
<http://stackoverflow.com/questions/1732259/eclipse-how-to-debug-a-java-program-as-a-jar-file>
### Test Coverage (53 or later)
* Install EclEmma plug-in. The installation instruction is found in [the
EclEmma site page](http://www.eclemma.org/installation.html).
* Run all tests once as described in the section above.
* For the menu, select "Run" > "Coverage..." to open "Coverage Configurations"
window.
* Go to "Coverage" tab and uncheck all test projects (icu4j-\*-tests,
icu4j-test-framework, icu4j-testall) to exclude test codes from coverage
analysis.
* Click "Coverage" to run all the tests with coverage analysis enabled. After
the test execution, the coverage report is displayed in the "Coverage" view.
* After running coverage, source lines are highlighted in different colors
depending on coverage level. To remove the highlights, click the "Remove All
Sessions" icon below (which also deletes the coverage results).
![image](Capture.png)
* If you want to run coverage again, you can just right click on icu4j-testall
project and select "Coverage As" > "Java Application"
## Branching
* // Needs review
To Create the Branch
* Modify
* To merge, use Team>Merge. Pick Start from Copy.
To Merge a Branch
* ...

---
layout: default
title: Java Setup
parent: Setup for Contributors
has_children: true
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Java Setup

---
layout: default
title: Java Profiling and Monitoring tools
grand_parent: Setup for Contributors
parent: Java Setup
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Java Profiling and Monitoring tools
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
There are many Java development tools available for analyzing Java application
runtime performance. Eclipse has a set of plug-ins called TPTP which provides a
Java application profiling/monitoring framework. However, TPTP is very slow,
and I experienced frequent crashes while profiling ICU4J code. For ICU4J
development, I recommend the tools described below.
## VisualVM
VisualVM is available as a separate download since JDK 9. You can download the latest
version from here - <https://visualvm.github.io/download.html>
There is an Eclipse plug-in which allows you to launch VisualVM when you run a
Java app in Eclipse. You can monitor the Java app's CPU usage, memory usage
(heap/permgen), loaded classes, etc. in a GUI. You can also get basic profiling
information, such as CPU usage by class and memory allocations, and you can
generate a heap dump, force GC, etc.

---
layout: default
title: Local tooling configs for git and Github
grand_parent: Setup for Contributors
parent: Source Code Setup
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Local tooling configs for git and Github
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## git difftool & mergetool
The `git diff` command prints changes to stdout, normally to the terminal
screen.
Set up a visual diff and merge program for use with `git difftool` and `git
mergetool`.
Changes in binary files do not show well in common diff tools and can take a
long time for them to compute visual diffs.
This is easily avoided using the -d option: `git difftool -d`
This shows all changed files in the diff program, and you can view and skip
files there as appropriate.
### Linux example
[stackoverflow/.../setting-up-and-using-meld-as-your-git-difftool-and-mergetool](https://stackoverflow.com/questions/34119866/setting-up-and-using-meld-as-your-git-difftool-and-mergetool)
#### Linux meld
`gedit ~/.gitconfig`
```
[diff]
    tool = meld
[difftool]
    prompt = false
[difftool "meld"]
    cmd = meld "$LOCAL" "$REMOTE"
[merge]
    tool = meld
[mergetool "meld"]
    cmd = meld "$LOCAL" "$MERGED" "$REMOTE" --output "$MERGED"
```
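The same settings can be applied with `git config` commands instead of editing
the file by hand; a sketch (written to a throwaway config file here for
illustration - drop `--file "$cfg"` and use `--global` to write your real
`~/.gitconfig`):

```shell
cfg=$(mktemp)
git config --file "$cfg" diff.tool meld
git config --file "$cfg" difftool.prompt false
git config --file "$cfg" difftool.meld.cmd 'meld "$LOCAL" "$REMOTE"'
git config --file "$cfg" merge.tool meld
git config --file "$cfg" mergetool.meld.cmd 'meld "$LOCAL" "$MERGED" "$REMOTE" --output "$MERGED"'
git config --file "$cfg" --get diff.tool   # prints: meld
```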
## Auto-link from GitHub to Jira tickets
GitHub itself does not linkify text like "ICU-23456" to point to the Jira
ticket. You can get links via browser extensions.
### Chrome Jira HotLinker
Install the [Jira
HotLinker](https://chrome.google.com/webstore/detail/jira-hotlinker/lbifpcpomdegljfpfhgfcjdabbeallhk)
from the Chrome Web Store.
Configuration Options:
* Jira instance url: https://unicode-org.atlassian.net/
* Locations: https://github.com/
### Safari extension from SRL
<https://github.com/unicode-org/icu-jira-safari>
### Firefox extension from JefGen
Install from the Mozilla Firefox Add-ons site:
<https://addons.mozilla.org/en-US/firefox/addon/github-jira-issue-linkifier/>
Source:
<https://github.com/jefgen/github-jira-linkifier-webextension>

---
layout: default
title: Source Code Setup
parent: Setup for Contributors
has_children: true
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Source Code Setup
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
> Announcement 07/16/2018: The ICU source code repository has been migrated from
> Subversion to Git, and is now hosted on GitHub.
## Quick Start
You can view ICU source code online: <https://github.com/unicode-org/icu>
***Make sure you have git lfs installed.*** See the following section.
For read-only usage, create a local clone:
```
git clone https://github.com/unicode-org/icu.git
```
or
```
git clone git@github.com:unicode-org/icu.git
```
This will check out a new directory `icu` which contains **icu4c** and
**icu4j** subdirectories as detailed below.
*For ICU development*, do *not* work directly with the Unicode ICU `main` branch!
See the [git for ICU Developers](../../userguide/dev/gitdev) page instead.
For cloning from your own fork, replace `unicode-org` with your GitHub user
name.
**For fetching just the files for an ICU release tag**, you can use a shallow
clone:
```
git clone https://github.com/unicode-org/icu.git --depth=1 --branch=release-63-1
```
If you already have a clone of the ICU repository, you can add and extract
release files like this:
```
mkdir /tmp/extracted-icu  # or wherever you want to extract to
cd local-git-repo-top-level-dir
git fetch upstream
git tag --list "*63*"  # List tags relevant to ICU 63, e.g., release-63-1
git archive release-63-1 | tar -x -C /tmp/extracted-icu
```
## Detailed Instructions
### Prerequisites: Git and Git LFS
(Note: you do not need a [GitHub](http://github.com) *account* to download the
ICU source code. However, you might want such an account to be able to
contribute to ICU.)
* Install a **git client**
* <https://git-scm.com/downloads>
* Linux: `sudo apt install git`
* Install **git-lfs** if your git client does not already have LFS support
(ICU uses git Large File Storage to store large binary content such as
\*.jar files.)
* <https://git-lfs.github.com/>
* See also
<https://help.github.com/articles/installing-git-large-file-storage/>
* Linux: `sudo apt install git-lfs`
* MacOS: Consider using Homebrew or MacPorts.
* The command `git lfs version` will indicate if LFS is installed.
* Setup git LFS for your local user account once on each machine:
* `git lfs install --skip-repo`
### Working with git
There are many resources available to help you work with git; here are a few:
* <https://git-scm.com/> - the homepage of the git project
* <https://help.github.com/> - GitHub's help page
* <https://try.github.io/> - Resources to learn Git
Want to contribute back to ICU? See
[How to contribute](../../userguide/processes/contribute.md).
## Repository Layout
The top level
[README.md](https://github.com/unicode-org/icu#international-components-for-unicode)
contains the latest information about the repository's layout. Currently:
* **icu4c**/ ICU for C/C++
* **icu4j**/ ICU for Java
* **tools**/ Tools
* **vendor**/ Vendor dependencies (copied here for reference)
### Tags and Branches
The repository is **tagged** with different release versions of ICU.
For example,
[release-55-1](https://github.com/unicode-org/icu/tree/release-55-1) is the tag
which corresponds to version 55.1 of ICU (for both C and J).
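The tag naming is mechanical (`release-X-Y` corresponds to version X.Y), so the
version string can be derived from a tag name in a one-liner; a small sketch:

```shell
tag="release-55-1"
# Strip the "release-" prefix and turn the remaining dashes into dots:
ver=$(printf '%s' "$tag" | sed 's/^release-//; s/-/./g')
echo "$ver"   # prints: 55.1
```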
Branches in the main fork are used for maintenance branches of ICU.
For example,
[maint/maint-61](https://github.com/unicode-org/icu/tree/maint/maint-61) is a
branch containing the latest maintenance work on the 61.x line of ICU.
There are other tags and branches which may be cleaned up/deleted at any time.
* branches/tags/releases from [before the icu4c and icu4j trees were
merged](https://unicode-org.atlassian.net/browse/ICU-12800) - items prefixed
with "icu-" are for icu4c, and "icu4j-" for icu4j, etc.
* old personal work branches (with a person's username, such as **andy/6910**)
* long running shared feature branches (In general, feature work is done on
personal forks of the repository.)
See also the [Tips (for developers)](repository/tips/index.md) subpage.
## A Bit of History
ICU was first open sourced in 1999 using CVS and Jitterbug. The source files
were imported from other source control systems internal to IBM at that time.
The ICU project moved to a Subversion source code repository and a Trac bug
database on Nov 30, 2006, replacing the original CVS source code repository and
Jitterbug bug database. All history from the older systems was migrated into
the new ones, so there should normally be no need to refer back to Jitterbug or
CVS.
In July 2018, the ICU project [moved
again](http://blog.unicode.org/2018/07/icu-moves-to-github-and-jira.html), this
time from svn to git on GitHub, and from trac to Atlassian Cloud Jira. Many
tools and much effort were involved in the migration and testing. There is a
[detailed blog post](https://srl295.github.io/2018/07/02/icu-infra/) on the
topic (not an official ICU-TC document!) for those interested in the technical
details of this move.

includes details that go beyond the C, C++, and Java API docs (and avoids some d
This is the new home of the User Guide (since 2020 August).
## ICU Site
The official ICU Site is located at <https://icu.unicode.org>.
It is the official landing page for the ICU project.
Some of the pages from the ICU Site have been migrated here.
The migrated sections and pages from the ICU Site are visible in the navigation bar of this site below the "ICU Site" section heading.
## ICU team member pages
Other documentation pages here are written by and for team members.

---
layout: default
title: Maintenance Release Procedure
parent: Release & Milestone Tasks
grand_parent: Contributors
nav_order: 75
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Maintenance Release Procedure
When a critical problem is found in the ICU libraries, we try to fix the
problem in the latest development stream first. If there is a demand for the
fix in a past release, an ICU project developer may escalate the fix to the ICU
project management committee for integration into that release. Once the
committee approves merging the fix into a back-level stream, the developer can
merge the bug fix back to the past release suggested by the committee. This
merge activity must be tracked by maintenance release place holder tickets, and
the developer should provide the original ticket number and description as the
response in each maintenance ticket. These fixes are automatically included in
a future ICU maintenance release.
## Place Holder Ticket
Once a major version of the ICU library is released, we create maintenance
release place holder tickets for the major release (one for C, one for J). The
ticket subject should be "ICU4\[C|J\] m.n.X". For example, after the ICU 4.8
release, we create two tickets - "ICU4C 4.8.X" and "ICU4J 4.8.X". These tickets
must use the target milestone "maintenance-release".
## Maintenance Release
When the ICU project committee agrees to release a new maintenance release, the
corresponding placeholder ticket is promoted to a real maintenance release
task ticket. This is done by the following steps:
* Create the new actual maintenance release milestone (e.g., 4.8.1)
* Change the placeholder ticket's subject to the actual version (e.g., "ICU4C
4.8.X" -> "ICU4C 4.8.1")
* Retarget the placeholder ticket to the actual release (e.g.,
"maintenance-release" -> "4.8.1")
* Create a new placeholder ticket for a future release (e.g., a new ticket
"ICU4C 4.8.X" with milestone "maintenance-release")

View file

@ -2,7 +2,6 @@
layout: default
title: Release & Milestone Tasks
parent: Contributors
nav_order: 10
has_children: true
---

View file

@ -1,7 +1,6 @@
---
layout: default
title: Coding Guidelines
nav_order: 1
parent: Contributors
---
<!--

View file

@ -1,35 +0,0 @@
---
layout: default
title: Contributions
nav_order: 4
parent: Contributors
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Contributions to the ICU library
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
## Why Contribute?
ICU is an open source library that is a de-facto industry standard for
internationalization libraries. Our goal is to provide top of the line i18n
support on all widely used platforms. By contributing your code to the ICU
library, you will get the benefit of continuing improvement by the ICU team and
the community, as well as testing and multi-platform portability. In addition,
it saves you from having to re-merge your own additions into ICU each time you
upgrade to a new ICU release.
## Current Process
See [CONTRIBUTING.md](https://github.com/unicode-org/icu/blob/main/CONTRIBUTING.md)

View file

@ -1,7 +1,6 @@
---
layout: default
title: User Guide Editing
nav_order: 5
parent: Contributors
---
<!--

View file

@ -1,22 +1,37 @@
---
layout: default
title: Developing Fuzzer Targets for ICU APIs
parent: Contributors
---
# Developing Fuzzer Targets for ICU APIs
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
This document describes how to develop a [fuzzer](https://opensource.google.com/projects/oss-fuzz)
target for an ICU API and its integration into the ICU build process.
## Directory and naming conventions
Fuzzer targets are exclusively in directory
[`source/test/fuzzer/`](https://github.com/unicode-org/icu/tree/main/icu4c/source/test/fuzzer)
and end with `_fuzzer.cpp`. Only files with such ending are recognized and executed as fuzzer
targets by the OSS-Fuzz system.
## General structure of a fuzzer target
As a minimum, a fuzzer target contains the function
@ -69,7 +84,7 @@ constructor. The code interprets the fuzzer data as UnicodeString and passes it
And that is all. Specific error handling or return value verification is not required because the
fuzzer will detect all memory issues by means of memory/address sanitizer findings.
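The minimal structure described above can be sketched as follows. This is a self-contained sketch: `ApiUnderTest` is a hypothetical stand-in, not a real ICU call; an actual target would build an `icu::UnicodeString` from the bytes and pass it to the API under test.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for an ICU API taking a UTF-16 string; a real target
// would construct an icu::UnicodeString from the data and call the real API.
static void ApiUnderTest(const std::u16string& s) {
    volatile std::size_t n = s.size();  // touch the input so it is not optimized away
    (void)n;
}

// Entry point that libFuzzer/OSS-Fuzz looks for. It must accept arbitrary
// bytes and return 0; bugs surface as crashes or sanitizer findings.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
    // Interpret the raw fuzzer bytes as UTF-16 code units (any odd trailing
    // byte is dropped).
    std::u16string input(size / 2, u'\0');
    if (!input.empty()) {
        std::memcpy(&input[0], data, input.size() * 2);
    }
    ApiUnderTest(input);
    return 0;
}
```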
## Makefile.in changes
ICU fuzzer targets are built and executed by the OSS-Fuzz project. On side of ICU they are compiled
to assure that the code is syntactically correct and, as a sanity check, executed in the most basic
@ -81,14 +96,14 @@ The new fuzzer target will then be built and executed as part of a normal ICU4C
that each fuzzer target becomes executable on its own. As such it is linked with the code in
`fuzzer_driver.cpp`, which contains the `main()` function.
## Fuzzer seed corpus
Any fuzzer seed data for a fuzzer target goes into a file named `<fuzzer_target>_seed_corpus.txt`.
In many cases the input parameter of the ICU API under test is of type `UnicodeString`, in which
case the seed data should be in UTF-16 format. As an example, see
[collator_rulebased_fuzzer_seed_corpus.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/test/fuzzer/collator_rulebased_fuzzer_seed_corpus.txt).
## Guidelines and tips
* Leave all randomness to the fuzzer. If a random selection of any kind is needed (e.g., of a
locale), then use bytes from the fuzzer data to make the selection
@ -97,7 +112,7 @@ of which the seed data should be in UTF-16 format. As an example,see
under test requires a Unicode string then make sure that the seed data is in UTF-16 encoding.
This can be achieved with e.g. the 'iconv' command or using an editor that saves text in UTF-16.
## How to locally reproduce fuzzer findings
At this time reproduction of fuzzer findings requires Docker installed on the local machine and the
OSS-Fuzz project downloaded in a local git client.

View file

@ -0,0 +1,606 @@
---
layout: default
title: git and Github for ICU Developers
parent: Contributors
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# git and Github for ICU Developers
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
For git & git lfs installation see the [Source Code Setup](../../devsetup/source/)
page.
For setup with language compilers and IDEs, see the [Setup for Contributors](../../devsetup/source/) page
and its subpages.
## Overview
ICU development is on GitHub, in the **main** branch of the git repository.
<https://github.com/unicode-org/icu>
In preparation for a release, we create a maintenance branch, such as
[maint/maint-62](https://github.com/unicode-org/icu/tree/maint/maint-62) for ICU
62 and its maintenance releases.
For each release we create a release tag.
[releases/tag/release-62-1](https://github.com/unicode-org/icu/releases/tag/release-62-1)
(GitHub project page > Releases > Tags > select one; a Release is a Tag with
metadata.)
There are additional branches that you can ignore. Some are old development
branches.
Also, when you edit a file directly on the GitHub source browser (for docs: API
comments, or .md/.html/.txt), it creates a branch for your pull request. Make
sure to delete this branch when you are done.
## Development
We do *not* develop directly on the main repository. Do *not* clone from
there to commit and push back into the main repository.
Instead, use the GitHub UI (top right) to create a fork of the repository in
your own GitHub account. Then clone that to your local machine. You need only
one fork for all of your ICU work.
```
mkdir -p icu/mine/src
git clone git@github.com:markusicu/icu.git icu/mine/src
cd icu/mine/src
```
You should be in the **main** branch of your fork's clone.
Do *not* do any development in your own **main** branch either! That would
lead to messy merging with the upstream **main** branch.
Instead, create a new branch in your local clone for each piece of work. You
need a separate branch for each pull request. More on that later.
```
git checkout -b mybranchname
```
Now you are in a new development branch in your local git repo. Confirm with
`git status`. Change stuff. Do `git status` again, use `git add` for staging,
and `git commit -m 'ICU-23456 what I changed'` to commit. Use
`git commit -a -m 'ICU-23456 what I changed'` if you want to commit everything
that `git status` shows as changed.
For looking at changes, you should set up a visual diff program for use with
`git difftool`. See the [Setup: git difftool & mergetool](../../devsetup/source/gittooling.md) page.
For new files: Remember to add the appropriate copyright lines. Copy from a file
of the same type, and set the copyright year to the current year (that is, the
year you are creating the file).
You should have a Jira ticket for each line of work. (See [Submitting ICU Bugs and Feature Requests](https://icu.unicode.org/bugs) and [ICU Ticket Life cycle](https://icu.unicode.org/processes/ticket-lifecycle).) You can have multiple pull
requests per ticket. Each pull request needs a ticket in Accepted state.
Always prefix your commit messages with the Jira ticket number using this
pattern (including the space after the number; note: no colon):
`ICU-23456 what I changed`
Local commits are only on your local machine. If your local disk crashes, your
changes are gone. `git push` your commits to your GitHub fork.
**Tips for Branches**
Shane
[recommends](https://blog.sffc.xyz/post/185195398930/why-you-should-use-git-pull-ff-only)
setting the default behavior of `git pull` to `--ff-only`. Shane also
[prevents](https://stackoverflow.com/a/40465455/1407170) local commits to the
**main** branch via *.git/hooks/pre-commit*. These two measures make it easier
to do the right thing in Git.
## Trivial changes
For trivial changes, such as small fixes in API docs or text files, it is ok to
edit the file in the GitHub GUI, in the main unicode-org/icu repository.
You still need a Jira ticket.
Once you are done editing, the GUI lets you create a branch and a commit right
in the main repository. Use the usual `ICU-23456 what I changed` pattern
for the commit message.
Pull request, review, merge as usual, see the next section.
*Remember to delete your branch after merging.*
## Review & commit to Unicode main
When you are ready for code review, go to your GitHub page and your ICU fork.
Select your dev branch (Branch drop-down on the left, search for your branch).
Click "New pull request" next to the Branch button, or "Pull request" on the
right near "Compare". *Make sure it compares with unicode-org/icu main on the
left and your own fork's dev branch on the right*.
Prefix the title of your pull request with the Jira ticket number, same format
as for a commit.
Follow the rest of the checklist in the PR template.
Set the PR assignee to your main reviewer. You may add more people as reviewers,
but there is normally just one assignee. Be somewhat judicious with additional
reviewers: Don't just add them because they were recommended by GitHub.
Nice to have: Optionally set the Jira ticket's reviewer field for documentation
purposes. It is still possible to close the ticket if the field is empty.
Watch the PR status for build failures and other issues.
A PR reviewer (at least the assignee) should look to see if the PR does what the
ticket says.
Respond to review feedback. Make changes on your local machine, commit, push to
your fork. The GitHub PR will update automatically for your additional commits.
Try to not rebase, squash, or force-push until the reviewer gives you a green
light.
*You should normally squash multiple commits into one in your fork before
merging (after the reviewer is satisfied)*. For multiple commits, the reviewer
should first respond with something like "lgtm please squash" but not yet
GitHub-approve; after squashing, they should check that the changes are the
same, and then GitHub-approve. A bot will respond to the PR confirming whether
the squash succeeded without changing the file contents.
If you squash, since you are rewriting the commit message anyway, please append
the pull request number to the first line of the updated message, using the
format "` (#199)`".
When you squash, please keep the parent hash (sha) the same so that the squash
is nothing more than a squash. If you change the parent hash, you may also be
pulling in other people's changes, and it may be harder for the reviewer to
verify that the squash was done correctly.
### Options on how to squash
#### Option 1: Use the online PR commit checker bot
Note that this makes the change in your remote branch but not in your local
branch. Click the "Details" link in the GitHub status, which brings you to a
page with a summary of your PR. Find the "Squash..." button. Sign in using your
GitHub account, and follow the flow to squash your branch.
Warning: do not `git pull` after you use the remote tool! If you subsequently need
to update your local branch to the squash commit, you need to fetch and reset:
```
git fetch origin BRANCHNAME
git checkout BRANCHNAME
git reset origin/BRANCHNAME
```
#### Option 2: Use git rebase
This works as long as you have no merge commits with
conflicts in your history. Plenty of examples:
* <https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History#_squashing>
* <https://github.com/todotxt/todo.txt-android/wiki/Squash-All-Commits-Related-to-a-Single-Issue-into-a-Single-Commit>
* <https://blog.carbonfive.com/2017/08/28/always-squash-and-rebase-your-git-commits/>
* <http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html>
* <https://medium.com/@slamflipstrom/a-beginners-guide-to-squashing-commits-with-git-rebase-8185cf6e62ec>
* Several other options:
<https://stackoverflow.com/questions/5189560/squash-my-last-x-commits-together-using-git>
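As a non-interactive alternative to `git rebase -i`, a soft reset to the merge base achieves the same squash. The following is a self-contained sketch in a throwaway repository; the branch name, file name, and ticket numbers are made up:

```
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email dev@example.com
git config user.name Dev

# A base commit on the default branch, then two commits on a feature branch.
echo one > file.txt
git add file.txt
git commit -qm "ICU-00000 base commit"
base=$(git branch --show-current)
git checkout -qb ICU-23456-demo
echo two >> file.txt
git commit -qam "ICU-23456 step 1"
echo three >> file.txt
git commit -qam "ICU-23456 step 2"

# Squash: move the branch pointer back to the merge base, keeping the
# working tree and index, then create a single replacement commit.
git reset -q --soft "$(git merge-base "$base" HEAD)"
git commit -qm "ICU-23456 what I changed (#199)"
git log --oneline
```

On a real PR branch, a force-push (`git push -f`) is then needed to update your fork, since history was rewritten.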
#### Option 3: Use git merge
This is a little trickier but works even if you have
merge commits with conflicts. Assuming your feature branch is called BRANCHNAME:
```
# Make sure your branch is up-to-date with main and that the tests pass:
git checkout BRANCHNAME
git merge main
git push
# At this point, wait for an LGTM from a reviewer before proceeding.
# Once confirmed, make your squash commit in a new temp branch.
# NOTE: In the first line, make sure to checkout the same sha as
# you most recently merged into your branch!
git checkout main
git checkout -b temp
git merge --squash BRANCHNAME
git commit
# Point your branch to the squash commit, and there should be no dirty files:
git checkout BRANCHNAME
git reset temp
git status   # should be empty! If it's not, you didn't check out the right sha.
# Push your squash commit and clean up:
git push -f
git branch -d temp
```
#### Option 4: Amend a small commit
When making code review changes on a small PR, you can amend your previous
commit rather than making a new commit. Instead of running `git commit`, run
`git commit --amend`. You will need to force-push. The PR bot will post a link
for the reviewer to see the changes from your old commit to your new commit.
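The amend flow can be sketched in a throwaway repository (the file name and ticket number are made up):

```
set -e
cd "$(mktemp -d)"
git init -q
git config user.email dev@example.com
git config user.name Dev
echo v1 > api.txt
git add api.txt
git commit -qm "ICU-23456 add api"

# Review feedback arrives: fix the file, then fold the fix into the
# previous commit instead of creating a new one.
echo v2 > api.txt
git commit -aq --amend -m "ICU-23456 add api"
# On a real PR branch you would now force-push: git push -f
```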
Once the reviewer(s) has/have approved your (squashed) changes:
* If you are an ICU team member with main repo write access:
* Merge your commits into the Unicode main.
* We almost always want to "rebase and merge" the commits. We normally
want them pre-squashed for a simple, clean change history. We rarely
want to permanently keep intermediate commits.
* (For ICU 63 we used "squash merge" but ended up with some ill-formed
commit messages. "Rebase and merge" lets us review the commit
messages before merging.)
* After you click the Merge button, if you don't use "rebase and merge"
(although normally you should...), make sure that the commit message
includes the "ICU-23456 " prefix, and add a suffix like " (#65)" with
the pull request number (if it's not there already).
* Known limitation: We won't have the PR number in the commit
message(s) when using the recommended "rebase and merge" -- unless
you manually amend the commit message(s) and add it.
* You should probably check the box for deleting your dev branch after
merging.
* Remember one branch per PR. You can create multiple branches & PRs per
ticket.
* If this was the last commit to finish work on the ticket, then go to
Jira and close the ticket as Fixed.
* You can optionally have someone (probably the same person as your PR
assignee) review the ticket as well, but that's not normally necessary.
* (We normally use ticket reviews for non-code changes, such as a
non-coding task or a web site update for the User Guide etc.)
* Otherwise:
* The PR assignee should be an ICU team member, and they are responsible
both for reviewing and for merging your PR, and then also for closing
the ticket.
## Merge conflicts
When someone else has made changes that conflict with yours, then you can't
merge as is. (The GitHub pull request page will tell you if there is a
conflict.)
You need to update your fork's **main** via your local clone, rebase your local
dev branch with that, resolve conflicts as you go, and force-push to your fork.
As easy as it is in GitHub to *create* a fork, you would think that it would be
a simple button-click to *update* your fork's **main** with commits on the
Unicode **main**. If you find a way to do this, please update this section.
Switch to your local **main**.
```
git checkout main
```
### Pull from upstream
Pull updates from the Unicode main (rather than doing a vanilla `git pull`,
which pulls from your out-of-date fork), then push to your fork's main.
*Norbert's version:*
```
git pull git@github.com:unicode-org/icu.git
git push
```
*Andy's version:*
Once per local git repo, set up an additional "remote". Something like the
following, but this may be incomplete!
```
git remote add upstream https://github.com/unicode-org/icu.git
git pull upstream main
git push origin main
```
*Andy's Version, take 2:*
Set the local main to track the upstream (unicode-org) main instead of your
fork's main (origin). Your fork's main is effectively out of the loop.
```
# one time setup
git branch -u upstream/main
# subsequent pulls from upstream (unicode.org) main
git pull
```
### Resolve conflicts
There are two ways to do this. You can rebase, or you can create a merge commit.
The advantage of rebase is that it makes it somewhat easier to squash later on.
The advantage of creating a merge commit is that you don't have to force-push,
so it makes it easier to work across different workstations, you are less likely
to get something wrong, and it makes it easier for the reviewer because GitHub
keeps track of comment history better when shas don't change.
#### Option 1: Merge
Switch to your dev branch, then merge in main. I like to use
the --no-commit option:
```
git checkout mybranchname
git merge main --no-commit
```
If you have conflicts, resolve them. Then, review the merge commit. It should
have all changes from main that were not yet on your branch. If it looks good,
commit the merge. You can push the merge commit without having to use -f.
```
git commit
git push
```
#### Option 2: Rebase
First switch back to your dev branch (without the -b option
which is for creating a new branch).
```
git checkout mybranchname
```
Then rebase, which reapplies your branch changes on top of the new main commits.
```
git rebase main
```
Sometimes you need to manually resolve conflicts. Follow the instructions git
prints or look for help...
If it had stopped and you are done resolving conflicts, continue rebasing.
```
git rebase --continue
```
You might get conflicts at several stages; resolve & continue until done.
When done, push to your GitHub fork. You need to force-push after rebasing.
```
git push -f
```
## Update your fork
Once in a while, you should update your fork's main with changes from the
Unicode main, so that you don't fall too far behind and your new changes don't
create unnecessary merge conflicts.
Go to your local main, pull commits from the Unicode main, and push to your
GitHub fork. See the "Merge conflicts" section above for details. If you don't
have a current dev branch, you can skip the rebasing.
## Committing to Maintenance Branch
Follow these steps for adding a commit to a maintenance branch.
The process is different between when we are between RC and GA and when we are
after GA.
### Between RC and GA
When working on a commit that you know at the time of
authorship to be a candidate for the maintenance branch, write the commit and
send the PR directly against the maintenance branch. All commits on the maint
branch will be merged *from maint to main* as a BRS task (see the next section).
Check out the current maint branch:
```
git fetch upstream maint/maint-64
git checkout maint/maint-64
```
Next, make a local branch off of the maint branch. For example, to use the
branch name "ICU-12345-maint-64", you can do:
```
git checkout -b ICU-12345-maint-64
```
Now, write your change and send it for review. Open your PR against the maint
branch.
### After GA
Write the commit against the main branch, and send your own
cherry-pick commits to put it on the desired maint branches.
Update your local main from the Unicode main (see above). Otherwise your git
workspace won't recognize the commits you are trying to cherry-pick.
Make a note of the SHA hash/ID of your commit on the main branch. You will use
this later when cherry-picking into the maint branch.
* The commit ID is listed on the pull request page.
* You can use git log to see the SHA once your change is on main.
* You can look at the commit history on GitHub too.
Next, check out the maintenance branch locally. For example, for the ICU 64
maintenance branch:
```
git fetch upstream maint/maint-64
git checkout maint/maint-64
```
Next, make a local branch off of the maint branch. This new branch will be used
for your cherry-pick.
For example, to use the branch name "ICU-12345-maint-64", you can do:
```
git checkout -b ICU-12345-maint-64
```
Next, cherry-pick the commit(s) you want to apply to the maintenance branch.
(Note: if you have only one commit to merge to the maint branch, you will have
only one command below.)
```
git cherry-pick 7d99ba4
git cherry-pick e578f3f
...
```
This creates **new** commits directly onto your local branch.
Look at the output from each of these commands to double-check that you got the
intended commits.
Finally, push your branch to your fork (should be "origin"), and open a PR into
the Unicode ICU branch maint/maint-64.
```
git push -u your-fork ICU-12345-maint-64
```
The reviewer of the PR has the following special responsibilities:
1. Don't approve the PR unless ICU-TC has agreed that this should be a
maintenance fix.
2. Make sure that the PR is targeting the correct branch in the Unicode ICU
repo (e.g., maint/maint-64).
3. Make sure that the PR includes all commits associated with the fix, which
was already approved for main.
4. Use "Rebase and merge".
## Checking for Missing Commits (BRS Task)
It is not hard to accidentally make a commit against main that should have been
against maint. As a BRS task before tagging, you should check the list of
commits that are on main but not on maint and make sure none of them belong on
maint.
To get the list, run:
```
git fetch upstream
git cherry -v upstream/maint/maint-64 upstream/main
```
Commits prefixed with "+" are on main but not on the specified maint branch.
Commits prefixed with "-" are present on both branches.
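For example, the output might look like the following (the hashes and subjects are made up, and real output shows full 40-character hashes):

```
+ 4f3c9e1 ICU-22222 Fix word break for emoji sequences
- 9b82a7c ICU-22333 Fix collation crash in strength comparison
```

Here the first commit needs a decision: either it belongs on maint and should be cherry-picked, or it is main-only work.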
Send the list to the team and discuss in the weekly meeting if there are any
problems.
## Merging from Maint to Main (BRS Task)
Merging from the maint branch to main might be as easy as opening a pull
request, without having to touch the command line. However, if there are merge
conflicts, more work will need to be done.
**The Easy Way (No Merge Conflicts):** Open a pull request on GitHub from the
maint branch to main. If it says there are no merge conflicts, congratulations!
Use a new ticket number for the PR (it is suggested to NOT use the main BRS
ticket). The new ticket should have the next release as its fix version, because
the merge commit used to pull the commits from maint to main will be in the next
release but not the current release.
You may need to add `DISABLE_JIRA_ISSUE_MATCH=true` and/or
`ALLOW_MANY_COMMITS=true` to the PR description to silence errors coming from
the Unicode bot.
You should use a MERGE COMMIT to merge from maint to main, NOT REBASE MERGE as
is normally recommended. You will need to go into the admin panel on GitHub,
enable merge commits, perform your merge commit, and then disable merge commits
again from the admin panel. When making your merge commit, remember to use the
correct commit message syntax: prefix the merge commit message with ICU-#####,
the new ticket number you created above.
**The Hard Way (Merge Conflicts):** At the end of the day, the goal is that main
should share the maint branch's history. This is done using merge commits. What
follows is an example of how to create merge commits that retain full branch
history.
Create a new branch based on the tag you want to merge:
```
git fetch upstream
git checkout main
git checkout -b 64-merge-branch  # use any name you like
```
*If you already have this branch from a previous release tag*, you could either
use a new branch, or merge the latest main into your branch:
```
git checkout 64-merge-branch
# DANGER: Please make sure your workspace is clean before proceeding!
# If it's not, you might sneak in unreviewed changes.
git merge --no-commit main
git commit -am "ICU-##### Merge branch 'main' into 64-merge-branch"
```
Now, merge in maint:
```
# DANGER: Please make sure your workspace is clean before proceeding!
# If it's not, you might sneak in unreviewed changes.
git merge --no-commit upstream/maint/maint-64
```
After running the final line, you will have the opportunity to resolve merge
conflicts. If the conflict is in a large binary file like the ICU4J data jar
files, you may need to re-generate them.
Remember to prefix your commit message with the ticket number:
```
git commit -am "ICU-##### Merge branch 'maint/maint-64' into 64-merge-branch"
git push -u origin 64-merge-branch
```
As in the Easy Way, you may need to add `DISABLE_JIRA_ISSUE_MATCH=true` and/or
`ALLOW_MANY_COMMITS=true` to the PR description to silence errors coming from
the Unicode bot.
Send the PR off for review. As in the Easy Way, **you should use the MERGE COMMIT option in GitHub to land the PR!!**
## Requesting an Exhaustive Test run on a Pull-Request (PR)
The ICU4C and ICU4J Exhaustive Tests run on the main branch after a pull-request
has been submitted. They do not run on pull-requests by default as they take 1-2
hours to run.
However, you can manually request the CI builds to run the exhaustive tests on a
PR by commenting with the following text:
```
/azp run CI-Exhaustive
```
This will trigger the test run on the PR. This is covered more in a separate
[document](https://docs.google.com/document/d/1kmcFFUozpWah_y7dk_Inlw_BIq3vG3-ZR2A28tIiXJc/edit?usp=sharing).

View file

@ -1,7 +1,7 @@
---
layout: default
title: Contributors
nav_order: 9000
has_children: true
---
<!--

View file

@ -0,0 +1,68 @@
---
layout: default
title: Skipping Known Test Failures
parent: Contributors
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Skipping Known Test Failures (logKnownIssue)
If you need a test to be disabled temporarily, call `logKnownIssue`. The method
is defined as below:
```java
/**
 * Log the known issue.
 * This method returns true unless -prop:logKnownIssue=no is specified
 * in the argument list.
 *
 * @param ticket A ticket number string. For an ICU ticket, use numeric
 *        characters only, such as "10245". For a CLDR ticket, use the
 *        prefix "cldrbug:" followed by the ticket number, such as
 *        "cldrbug:5013".
 * @param comment Additional comment, or null
 * @return true unless -prop:logKnownIssue=no is specified in the test
 *         command line argument.
 */
public boolean logKnownIssue(String ticket, String comment)
```
Below is an example:
```java
if (logKnownIssue("1234", "New data is not integrated yet.")) {
return;
}
// test code below
```
By default, `logKnownIssue` returns true and emits a log line that includes a
link to the ticket and the comment.
When `-prop:logKnownIssue=no` is specified as a command line argument,
`logKnownIssue()` returns false, so you can temporarily enable test code
skipped by logKnownIssue.
Before ICU4J 52, we used the isICUVersionBefore() method, as shown below. The
method is still available in the trunk, but developers should use
logKnownIssue() instead.
```java
if (isICUVersionBefore(50,0,2)) {
return;
}
```
Before ICU4J 49M2, we used the style below:
```java
if(skipIfBeforeICU(4, 5, 2)) {
return;
}
```

View file

@ -1,18 +1,33 @@
---
layout: default
title: Updating ICU's built-in Break Iterator rules
parent: Contributors
---
# Updating ICU's built-in Break Iterator rules
{: .no_toc }
## Contents
{: .no_toc .text-delta }
1. TOC
{:toc}
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
Here are instructions for updating ICU's built-in break iterator rules for Grapheme, Word, Line, and Sentence breaks.
The ICU rules implement the boundary behavior from Unicode [UAX-14](https://www.unicode.org/reports/tr14/) and [UAX-29](https://www.unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rule updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.
This is not a cookbook process. Familiarity with ICU break iterator behavior and rules is needed. Sets of break rules often interact in subtle and difficult-to-understand ways. Expect some bumps.
## Have clear specifications for the change.
The changes will typically come from a proposed update to Unicode UAX 29 or UAX 14,
or from CLDR based tailorings to these specifications.
@ -21,7 +36,7 @@ As an example, see [CLDR proposal for Extended Indic Grapheme Clusters](https://
Often ICU will implement draft versions of proposed specification updates, to check that they are complete and consistent, and to identify any issues before they are released.
## Files that typically will need to be updated:
| File | Contents |
@ -40,7 +55,7 @@ Often ICU will implement draft versions of proposed specification updates, to ch
| .../main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBITestMonkey.java | Monkey test w rules as code. Port from ICU4C.
## ICU4C
The rule updates are done first for ICU4C, and then ported (code changes) or moved (data changes) to ICU4J. This order is easiest because the break rule source files are part of the ICU4C project, as is the rule builder.
@ -225,7 +240,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov
As with the main rules, after everything appears to be working, run the rule based monkey test for an extended period of time (with loop=-1).
## ICU4J
1. **Copy the Data Driven Test File to ICU4J**

View file

@ -1,7 +1,6 @@
---
layout: default
title: Custom ICU4C Synchronization
nav_order: 3
parent: Contributors
---
<!--

View file

@ -1,7 +1,6 @@
---
layout: default
title: Synchronization
nav_order: 2
parent: Contributors
---
<!--

View file

@ -0,0 +1,98 @@
---
layout: default
title: Why Use ICU4J?
nav_order: 100
parent: ICU4J
---
<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->
# Why Use ICU4J?
## Summary
* Fully implements current standards
* Unicode collation, normalization, break iteration
* Updated more frequently than Java
* Full CLDR Locale data
* Improved performance
## Details
* Normalization
* Addresses lack of Unicode normalization support in Java 5
* Addresses outdated Unicode normalization support in Java 6
* Up-To-Date Unicode version
* Java 5 & 6 are Unicode 4.0, while ICU 4.0 is Unicode 5.1
* Characters added after Unicode 4.0 do not have character properties in
Java
* IDNA and StringPrep
* Addresses lack of Internationalized Domain Name support in Java 5
* Addresses generic stringprep (RFC3454) support. stringprep is required
for supporting various internet protocols (NFS, LDAP...)
* Collation
* Provides Unicode standard compliant collation support
* ICU Collator fully implements UTR#10, while the Java implementation is
outdated and not compatible.
* Provides ICU UnicodeSet for easy character range validation
* much more flexible and convenient for validating identifiers/text tokens
with a given syntax
* full boolean operations (union, intersection, difference)
* all Unicode properties supported
* Locales
* BCP47 (language tag) support in locale class (supporting "script",
3-letter language codes, 3-digit region codes)
* Locale data coverage - much better, many more locales, up-to-date
* Broader charset converter coverage
* Since ICU4J 4.2, also output charset selection
* Custom fallback in charset converter
* Other features missing in the JDK
* Dates:
* Many more date formats: month+day, year+month,...
* Date interval formats: "Dec 15-17, 2009"
* APIs for returning time zone transitions
* Other formatting
* Plural formatting, including units: "1 hour" / "2 hours"
* Rule based number format ("three thousand two hundred")
* Extensive Non-Gregorian calendar support
* Transliterator (for flexible text/script transformations)
* Collation-sensitive string search
* Same data as ICU4C, allowing same behavior across programming languages
* All Unicode character properties: over 80, while Java provides access to
only about 10
* Thai wordbreak
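As a minimal sketch of the normalization support mentioned above (assuming ICU4J is on the classpath; `Normalizer2` is in `com.ibm.icu.text`):

```java
import com.ibm.icu.text.Normalizer2;

public class NormalizeDemo {
    public static void main(String[] args) {
        // NFC composes "e" + U+0301 (combining acute accent)
        // into the single code point U+00E9 ("é").
        Normalizer2 nfc = Normalizer2.getNFCInstance();
        String composed = nfc.normalize("e\u0301");
        System.out.println(composed.equals("\u00E9")); // true
    }
}
```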
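The `UnicodeSet` boolean operations and property support described above can be sketched like this (property names follow the standard `[:...:]` pattern syntax):

```java
import com.ibm.icu.text.UnicodeSet;

public class UnicodeSetDemo {
    public static void main(String[] args) {
        // Build sets from Unicode properties, then intersect them.
        UnicodeSet letters = new UnicodeSet("[:Letter:]");
        UnicodeSet greek = new UnicodeSet("[:Greek:]");
        UnicodeSet greekLetters = new UnicodeSet(letters).retainAll(greek);

        System.out.println(greekLetters.containsAll("αβγ")); // true
        System.out.println(letters.containsAll("abc123"));   // false: digits are not letters
    }
}
```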
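The rule-based number formatting feature ("three thousand two hundred") looks like this in practice, a minimal sketch using the spellout rules for US English:

```java
import com.ibm.icu.text.RuleBasedNumberFormat;
import java.util.Locale;

public class SpelloutDemo {
    public static void main(String[] args) {
        // SPELLOUT selects the locale's number-to-words rule set.
        RuleBasedNumberFormat spellout =
            new RuleBasedNumberFormat(Locale.US, RuleBasedNumberFormat.SPELLOUT);
        System.out.println(spellout.format(3200)); // "three thousand two hundred"
    }
}
```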
## Performance & Size
* Instantiation times are comparable
* Common instantiate and reuse model
* ICU4J and Java both use caches to limit impact
* Collation performance *many times* faster
* sorting: 2 to 20 times faster
* sort key generation: 1.5 to 4 times faster
* sort key length: 2/3 to 1/4 the length of Java sort keys
* Property access much faster (isLetter, isWhitespace,...)
* Can easily produce scaled-down version (removing data)
## API
* Subclasses of JDK classes where possible
* Drop-in (change of import) if not
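For example, ICU4J's `Collator` is a drop-in for `java.text.Collator`: only the import changes, while the API shape stays familiar (a sketch, assuming ICU4J is on the classpath):

```java
// Only this import differs from java.text.Collator.
import com.ibm.icu.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);

        // Secondary (accent) difference: unaccented sorts first.
        System.out.println(c.compare("cote", "côte") < 0);  // true

        // At PRIMARY strength, accents are ignored entirely.
        c.setStrength(Collator.PRIMARY);
        System.out.println(c.compare("cote", "côte") == 0); // true
    }
}
```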
## Summary
* **ICU4J is not for you if**
* you have tight size constraints
* you require the Java runtime behavior
* **ICU4J is for you if**
* you need full compliance with current standards
* you need current or additional locale and property data
* you need customizability
* you need features missing from Java (normalization, collation,...)
* you need better performance