Mirror of https://github.com/unicode-org/icu.git, synced 2025-04-04 13:05:31 +00:00

ICU-21697 Convert ICU Site pages to markdown for Github Pages

See #1785

This commit is contained in:
parent de26ea8c6a
commit 5435007e6a

47 changed files with 5950 additions and 59 deletions
docs/demos/index.md (new file, +74 lines)
---
layout: default
title: Demos
nav_order: 350
description: Demos
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Demos

## ICU4C Demos

[List of ICU Demonstrations](https://icu4c-demos.unicode.org/icu-bin/icudemos)

## ICU4J Demos

### Server Side Demos

#### Web Demos

These demos run on the ICU server and are implemented as Java Servlets and JSP pages.

* [Browse the Demos](http://demo.icu-project.org/icu4jweb/)
* [View Demo Source](https://github.com/unicode-org/icu-demos/tree/master/icu4jweb/)

### Client Side Demos

#### To build the client side samples:

1. Download the ICU4J source code (see [Source Code Setup](../devsetup/source))
2. Run `ant jar` to build the ICU4J jar
3. Run `ant jarDemos` to build the demos
4. Run `cp icu4j.jar demos/out/lib`
5. Finally, run `java -jar demos/out/lib/icu4j-demos.jar` to launch the demos

**CalendarApp** This demo compares two calendars against each other. Choose the two calendar types, and the display language, from the pop-up menus. Navigate by days using the < and > buttons, or by years using the << and >> buttons.

**Translit** This demonstration shows ICU Transliteration. The transliteration mode chosen in the menu is used as you type.

**HolidayCalendarDemo** This demo displays holidays from a certain locale, localized into the display language of your choice. Navigate by days using the < and > buttons, or by years using the << and >> buttons.

**RbnfDemo** This demo shows Rule Based Number Formatting. Please expand the window to show the entire demo. A number may be entered in the top left corner, or the navigation buttons may be used. The pop-up menus in the top right corner pick the rule and the variant used.

**DetectingViewer** By opening a document using the Open file or Open URL menu items, this demo statistically detects the probable file encoding of a file. Use the DetectedEncodings menu to see which encodings were detected.

*Note:* Due to security constraints, you must use the Downloadable Demo Jar in order to use these demos with files on your local disk. The Java Web Start application will not have permission to read local files.

---

### ICU Introduction Applets

#### About the Applets

This is a paper introducing ICU calendars, with live applets throughout the text to demonstrate various features.

The paper is now archived; see <https://github.com/unicode-org/icu-demos/pull/5>
docs/design/index.md (new file, +14 lines)
---
layout: default
title: Design Docs
nav_order: 8000
has_children: true
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Design Docs
docs/design/props/ppucd.md (new file, +333 lines)
---
layout: default
title: Preparsed UCD
parent: Design Docs
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Preparsed UCD

## What

A text file with preparsed UCD ([Unicode Character Database](http://www.unicode.org/ucd/)) data.

* Preparser script: [tools/unicode/py/**preparseucd.py**](https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py)
* ppucd.txt output: [icu4c/source/data/unidata/**ppucd.txt**](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt) ([raw text version](https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/unidata/ppucd.txt))
* Parser for ppucd.txt: [icu4c/source/tools/toolutil/**ppucd.h**](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.h) & [.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.cpp)
* genprops tool rewritten to use it: [tools/unicode/c/**genprops**](https://github.com/unicode-org/icu/tree/master/tools/unicode/c/genprops)

## Syntax

```
# Preparsed UCD generated by ICU preparseucd.py
```

Only whole-line comments starting with #; no inline comments.

```
ucd;10.0.0
```

Data lines start with a type keyword. Data fields are semicolon-separated. The number of fields per line is highly variable.

The ucd line should be the first data line. It provides the Unicode version number.

```
property;Binary;Alpha;Alphabetic
property;Enumerated;bc;Bidi_Class
```

Property lines define properties with a type and two or more aliases.

```
binary;N;No;F;False
binary;Y;Yes;T;True
value;bc;ON;Other_Neutral
```

Property value lines define the values of enumerated and catalog properties, with the property short name and two or more aliases for each value.

There is only one shared definition of the values and aliases for binary properties.

```
defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX
```

After the version, property, and property value lines, and before other data lines, the defaults line defines default values for all code points (corresponding to @missing data in the UCD). Any properties not mentioned here default to null values according to their type, such as False or the empty string.

The general syntax of this line is the same as for the following data lines:

1. Line type keyword.
2. Code point or start..end range (inclusive end).
3. Zero or more property values.
    * Binary values are given by their property name alone if True ("Alpha"), or with a minus sign prepended ("-Alpha").
    * Other values are given as "pname=value" pairs, where pname is the property name.
    * In the ppucd.txt file, short names of properties and values are used, but parsers should be prepared to accept any of the aliases according to the earlier sections of the file.
    * In the ppucd.txt file, properties are listed in sorted order, but this is not required by the syntax.
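The line-oriented syntax above can be sketched with a minimal splitter (an illustrative sketch, not ICU's preparseucd.py or the toolutil parser; the sample lines are taken from this page):

```python
# Minimal sketch of splitting ppucd.txt data lines.
# Data lines start with a type keyword; fields are semicolon-separated;
# only whole-line comments starting with "#" are allowed.

def parse_line(line):
    """Return (type_keyword, fields) for a data line, or None for comments/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split(";")
    return fields[0], fields[1:]

sample = [
    "# Preparsed UCD generated by ICU preparseucd.py",
    "ucd;10.0.0",
    "property;Enumerated;bc;Bidi_Class",
    "cp;20001;nt=Nu;nv=7",
]
for parsed in filter(None, map(parse_line, sample)):
    print(parsed)
```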
```
block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS
# 20000..2A6D6 CJK Unified Ideographs Extension B
algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-
cp;20001;nt=Nu;nv=7
cp;20064;nt=Nu;nv=4
unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U
# No block
unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U
algnamesrange;AC00..D7A3;hangul
```

Block lines specify a Unicode Block and provide an opportunity for compact data lines for ranges inside the block, by listing common property values once for the whole block. Block properties override the defaults for cp and unassigned lines with code point ranges inside the block. The file syntax and parser do not require the presence of block lines.

cp lines provide the data for a code point or range. They override the default+block properties. Properties that are not mentioned fall back to the block, then to the defaults.
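This cp -> block -> defaults fallback can be pictured as layered dictionaries (an illustrative sketch, not ICU code; the sample values are abbreviated from the example lines above):

```python
from collections import ChainMap

# Property lookup layers: a cp line overrides its block line,
# which in turn overrides the all-Unicode defaults.
defaults = {"gc": "Cn", "nt": "None", "lb": "XX", "ea": "N"}  # from the defaults line
block = {"gc": "Lo", "ea": "W", "lb": "ID"}                   # from a block line
cp = {"nt": "Nu", "nv": "7"}                                  # from a cp line

props = ChainMap(cp, block, defaults)
print(props["nt"])  # Nu (from the cp line)
print(props["gc"])  # Lo (falls back to the block)
print(props["lb"])  # ID
```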
Unassigned lines (new in ICU 60 for Unicode 10) provide the data for an unassigned code point or range (gc=Cn). They override only the default properties, except for the blk=Block property (if the range is inside a block). Properties that are not mentioned fall back to the defaults, except that the blk=Block property applies to unassigned lines as well.

A range is considered inside a block if it is fully inside the range of the last defined block. Otherwise it is considered outside a block and falls back only to the defaults. This is the case even if the range is inside an earlier block, to simplify parsing & processing (such data lines should be avoided).

A range inside the block for which there is no data line inherits all of the default+block properties (see Han blocks). Note that this is very different from the behavior of an unassigned line, in particular since such blocks typically default to gc!=Cn.

Non-default properties for unassigned ranges inside and outside of blocks are typically for [complex defaults](http://www.unicode.org/reports/tr44/#Default_Values_Table) and for noncharacters.

ppucd.txt data lines are in code point order, although this should not be strictly required.

Assigned characters normally have their unique na=Name property value. For Hangul syllables with their algorithmically computed names, the entire range is covered by the line "algnamesrange;AC00..D7A3;hangul". For ranges of ideographic characters, a line like "algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-" provides a Name prefix which is to be followed by the code point (in hex, as with %04lX).
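Composing an algorithmic name from such a prefix can be sketched as (illustrative only, not an ICU API):

```python
# Build an algorithmic character name: prefix + code point in uppercase hex,
# zero-padded to at least four digits (the %04lX convention).
def algorithmic_name(prefix, cp):
    return "%s%04X" % (prefix, cp)

print(algorithmic_name("CJK UNIFIED IDEOGRAPH-", 0x20001))  # CJK UNIFIED IDEOGRAPH-20001
print(algorithmic_name("CJK UNIFIED IDEOGRAPH-", 0x4E00))   # CJK UNIFIED IDEOGRAPH-4E00
```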
## Why not UCD .txt files?

See [UAX #44 "Unicode Character Database"](http://www.unicode.org/reports/tr44/).

Nontrivial parsing:

* The UCD has grown from a couple of semicolon-delimited files plus an informative "Property dump" (early PropList.txt) to a collection of dozens of files with a variety of (now more regular) formats.
* Related properties are scattered over several files.
* Full information for Numeric_Value and Numeric_Type requires parsing two files.
* Default values are "hidden" in comments.
* The UCD folder structure (which file goes where) has changed over time.
* UCD filenames change during each Unicode beta period. (A detailed version number is inserted into each filename.)
* Many files are bloated with comments that show the General Category and name of each character or range start/end; if the data were combined into a single file, then all properties for a character or range would be listed together, without need for such comments.

Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires adding data in many of the UCD files.

ICU already preprocesses some of the UCD .txt files. We strip comments from some files (because they are huge) and in some files merge adjacent same-property code points into ranges.

Some changes are manual, such as updating and adding ranges of algorithmic character names.

Then we run several tools, most of them twice, to parse different sets of .txt files and write several output files. We use several Python and shell scripts, and a "log" (unidata/changes.txt) with details of what was changed and run in each Unicode version upgrade.

Markus has done ICU Unicode updates since about 2002. Someone else might have a hard time picking this up for maintenance and future Unicode version updates.

### Why not UCD XML files?

See [UAX #42 "Unicode Character Database in XML"](http://www.unicode.org/reports/tr42/).

Good: The UCD XML file format stores all properties in a single file with a relatively simple structure, with property values as XML attributes.

Issues:

* **Missing data** which is needed for ICU
    * Name_Alias was added in UCD 5.0 but is missing in UCD XML as of the UCD 6.1 beta.
    * Script_Extensions was added in UCD 6.0 but not "blessed" as a Unicode property as of UCD 6.1. Useful, used in ICU, but not available in UCD XML.
    * Adopting UCD XML would require either still parsing some UCD .txt files or writing another tool to merge more data into the XML.
* Dependency on a third party
    * Lag time between UCD .txt vs. XML availability during beta.
    * Unable to fix/update/extend the XML generator tools.
    * For new properties, need to wait for standardization (UAX #42), tool update, and XML publication.
    * Will not support custom/nonstandard data.
* Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in C++ (we have a "poor man's" XML parser), but not as easy as `line.split(";")`.
    * There is no need for complex structure for the UCD.
* Could be easier to read for humans: By not storing defaults for all of Unicode in one place, each `<group>` carries them, making it hard to see which values are specific to each group. "Fluffy" XML makes for longer text lines and more horizontal scrolling.
* Hard to diff: The XML format can be used in different ways, and Unicode publishes different forms of the same data. Also, the precise XML text depends on the XML formatting code used.
    * For diffing, a special tool needs to be run to parse old & new XML data, compare values, and generate a diff report. Unicode publishes some of those too.
* Some data still requires nontrivial parsing.
    * For algorithmic character names, the range needs to be determined by collecting a contiguous sequence of elements with a shared name pattern. There is not even any special notation for the algorithmic names of Hangul syllables.
* Minor: Unnecessary data (for ICU)
    * Precomputed Hangul syllable names
    * Irrelevant contributory properties like "Other_Xyz"
    * Properties not used by ICU
* Minor, just awkward: Blocks are treated as auxiliary data, rather than as a core means to organize and store the data. On the other hand, the "grouped" XML files also use them as the basis for the `<group>` elements and associated compaction. (The "flat" files don't.)

## Goals

* Single file with all data relevant for ICU.
* Very easy to parse and use the data in C/C++ tools.
* Easily human readable.
* Easy-to-read diffs from standard diff tools.
* Compact file format.
* Conversion tool easy to write, maintain, extend.
* Convert from UCD .txt files because those are maintained directly by the UTC & editorial committee. No waiting for a third party to convert the files.
* Able to extend for new kinds of data.
* Easy format for manual data fixes/additions (e.g., PUA or proposed/draft).
* Move much of the parsing from scattered C code into one Python script.

## Details

* All-Unicode defaults in one place, but only list non-null default values. (`blk=No_Block, cf=<code point>, ...`)
* Line-oriented, always semicolon-separated, with the type of line in the first field.
* Block properties override defaults; only for the few properties where code points in the block have common, non-default values.
    * Effective because blocks represent the actual allocation & organization of Unicode. Maintained by the UTC.
* Code point/range properties override default+block properties.
* Algorithmic names stored as ranges with type & shared name prefixes (for CJK).
* No gratuitous white space or syntax characters.
* Mostly key=value, with a simpler format for binary properties. Easy to read.
* Comment lines with headings from NamesList.txt further improve readability. (There are few of them, so no significant size bloat.)
* Simple, stable file generation allows diffing.
    * E.g., list properties in sorted order of property names.
* No need to implement/store properties that are not used in ICU. (But the format & tool are easy to extend.)

## Plan

* (done) Write a Python tool to preparse UCD .txt files and generate one output ppucd.txt file.
    * (done) Subsume the existing ucdcopy.py.
* (done) Write a toolutil C++ parser for ppucd.txt; add ppucd.txt to the unidata folder.
* (done) Merge genbidi, gencase, gennames, gennorm into genprops.
    * Replace the scattered many-.txt parsers with calls to the toolutil ppucd.txt parser.
    * Generate all output files in one genprops invocation.
    * Update makeprops.sh (delete half of it) & changes.txt.
* (done) Make preparseucd.py also parse uchar.h & uscript.h and write the property names data header file. (was: ~~Change genpname/preparse.pl to read ppucd.txt rather than Property\[Value\]Aliases.txt.~~)
* (done) Consider changing pnames_data.h so that minor changes don't change most of the file contents.
* (done) Write wiki/Markus/ReviewTicket8972 with diff links.
    * 2019-sep-27: The old Trac server is going away. I copied the wiki page contents into a comment on [ICU-8972](https://unicode-org.atlassian.net/browse/ICU-8972).
* Move UCD tests from cintltst to intltest; change them to use the toolutil ppucd.txt parser. ([ticket #9041](https://unicode-org.atlassian.net/browse/ICU-9041))
* Change the Java UCD tests to parse & use ppucd.txt. (ticket #9041)
* (partially done) Change the Python preparser to no longer copy input UCD .txt files; delete them from unidata & Java. (ticket #9041)

## Other tool improvements

**Bad**: Until **ICU 4.8**, the process was:

build & install ICU -> build Unicode tools -> run genpname -> build & install ICU (now with updated property names) -> build Unicode tools -> run UCD parsers -> build & install ICU (now also with case properties & normalization etc.) -> build Unicode tools -> run genuca -> build & install ICU

It should be possible to

1. merge the Unicode tools into one binary,
2. parameterize the relevant properties code (property name lookup, case & some other properties, NFC), and
3. inject newly built data into the common library for the next part of the merged Unicode tool's processing.

**ICU 49**:

build & install ICU -> build Unicode tools -> run genprops -> build & install ICU (now with updated properties) -> build Unicode tools -> run genuca -> build & install ICU

genprops builds the property (value) names data and injects it into the live ppucd.txt parser for further processing.

**Goal**:

build & install ICU -> build Unicode tool -> run it -> build & install ICU (now with all updated Unicode data)

Requires [ticket #9040](https://unicode-org.atlassian.net/browse/ICU-9040); could be "hard".
docs/design/struct/index.md (new file, +20 lines)
---
layout: default
title: Data Structures
parent: Design Docs
has_children: true
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Data Structures

## Subpage Listing

* [ICU Code Point Tries](./utrie)
* [ICU String Tries](./tries/)
* [BytesTrie](./tries/bytestrie/)
* [UCharsTrie](./tries/ucharstrie)
docs/design/struct/tries/bytestrie/bytetrie.h (new file, +358 lines)
|
|||
// © 2016 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
/*
|
||||
*******************************************************************************
|
||||
* Copyright (C) 2010, International Business Machines
|
||||
* Corporation and others. All Rights Reserved.
|
||||
*******************************************************************************
|
||||
* file name: bytetrie.h
|
||||
* encoding: US-ASCII
|
||||
* tab size: 8 (not used)
|
||||
* indentation:4
|
||||
*
|
||||
* created on: 2010sep25
|
||||
* created by: Markus W. Scherer
|
||||
*/
|
||||
|
||||
#ifndef __BYTETRIE_H__
|
||||
#define __BYTETRIE_H__
|
||||
|
||||
/**
|
||||
* \file
|
||||
* \brief C++ API: Dictionary trie for mapping arbitrary byte sequences
|
||||
* to integer values.
|
||||
*/
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/uobject.h"
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
||||
class ByteTrieBuilder;
|
||||
class ByteTrieIterator;
|
||||
|
||||
/**
|
||||
* Light-weight, non-const reader class for a ByteTrie.
|
||||
* Traverses a byte-serialized data structure with minimal state,
|
||||
* for mapping byte sequences to non-negative integer values.
|
||||
*/
|
||||
class /*U_COMMON_API*/ ByteTrie : public UMemory {
|
||||
public:
|
||||
ByteTrie(const void *trieBytes)
|
||||
: bytes(reinterpret_cast<const uint8_t *>(trieBytes)),
|
||||
pos(bytes), remainingMatchLength(-1), value(0) {}
|
||||
|
||||
ByteTrie &reset() {
|
||||
pos=bytes;
|
||||
remainingMatchLength=-1;
|
||||
return *this;
|
||||
}
|
||||
|
||||
/**
|
||||
* Traverses the trie from the current state for this input byte.
|
||||
* @return TRUE if the byte continues a matching byte sequence.
|
||||
*/
|
||||
UBool next(int inByte);
|
||||
|
||||
/**
|
||||
* @return TRUE if the trie contains the byte sequence so far.
|
||||
* In this case, an immediately following call to getValue()
|
||||
* returns the byte sequence's value.
|
||||
*/
|
||||
UBool contains();
|
||||
|
||||
/**
|
||||
* Traverses the trie from the current state for this byte sequence,
|
||||
* calls next(b) for each byte b in the sequence,
|
||||
* and calls contains() at the end.
|
||||
*/
|
||||
UBool containsNext(const char *s, int32_t length);
|
||||
|
||||
/**
|
||||
* Returns a byte sequence's value if called immediately after contains()
|
||||
* returned TRUE. Otherwise undefined.
|
||||
*/
|
||||
int32_t getValue() const { return value; }
|
||||
|
||||
// TODO: For startsWith() functionality, add
|
||||
// UBool getRemainder(ByteSink *remainingBytes, &value);
|
||||
// Returns TRUE if exactly one byte sequence can be reached from the current iterator state.
|
||||
// The remainingBytes sink will receive the remaining bytes of that one sequence.
|
||||
// It might receive some bytes even when the function returns FALSE.
|
||||
|
||||
private:
|
||||
friend class ByteTrieBuilder;
|
||||
friend class ByteTrieIterator;
|
||||
|
||||
inline void stop() {
|
||||
pos=NULL;
|
||||
}
|
||||
|
||||
// Reads a compact 32-bit integer and post-increments pos.
|
||||
// pos is already after the leadByte.
|
||||
// Returns TRUE if the integer is a final value.
|
||||
inline UBool readCompactInt(int32_t leadByte);
|
||||
inline UBool readCompactInt() {
|
||||
int32_t leadByte=*pos++;
|
||||
return readCompactInt(leadByte);
|
||||
}
|
||||
|
||||
// pos is on the leadByte.
|
||||
inline void skipCompactInt(int32_t leadByte);
|
||||
inline void skipCompactInt() { skipCompactInt(*pos); }
|
||||
|
||||
// Reads a fixed-width integer and post-increments pos.
|
||||
inline int32_t readFixedInt(int32_t bytesPerValue);
|
||||
|
||||
// Node lead byte values.
|
||||
|
||||
// 0..3: Branch node with one comparison byte, 1..4 bytes for less-than jump delta,
|
||||
// and compact int for equality.
|
||||
|
||||
// 04..0b: Branch node with a list of 2..9 bytes comparison bytes, each except last one
|
||||
// followed by compact int as final value or jump delta.
|
||||
static const int32_t kMinListBranch=4;
|
||||
// 0c..1f: Node with 1..20 bytes to match.
|
||||
static const int32_t kMinLinearMatch=0xc;
|
||||
// 20..ff: Intermediate value or jump delta, or final value, with 0..4 bytes following.
|
||||
static const int32_t kMinValueLead=0x20;
|
||||
// It is a final value if bit 0 is set.
|
||||
static const int32_t kValueIsFinal=1;
|
||||
// Compact int: After testing bit 0, shift right by 1 and then use the following thresholds.
|
||||
static const int32_t kMinOneByteLead=0x10;
|
||||
static const int32_t kMinTwoByteLead=0x51;
|
||||
static const int32_t kMinThreeByteLead=0x6d;
|
||||
static const int32_t kFourByteLead=0x7e;
|
||||
static const int32_t kFiveByteLead=0x7f;
|
||||
|
||||
static const int32_t kMaxOneByteValue=0x40; // At least 6 bits in the first byte.
|
||||
static const int32_t kMaxTwoByteValue=0x1bff;
|
||||
static const int32_t kMaxThreeByteValue=0x11ffff; // A little more than Unicode code points.
|
||||
|
||||
static const int32_t kMaxListBranchLength=kMinLinearMatch-kMinListBranch+1; // 9
|
||||
static const int32_t kMaxLinearMatchLength=kMinValueLead-kMinLinearMatch; // 20
|
||||
|
||||
// Map a shifted-right compact-int lead byte to its number of bytes.
|
||||
static const int8_t bytesPerLead[kFiveByteLead+1];
|
||||
|
||||
// Fixed value referencing the ByteTrie bytes.
|
||||
const uint8_t *bytes;
|
||||
|
||||
// Iterator variables.
|
||||
|
||||
// Pointer to next trie byte to read. NULL if no more matches.
|
||||
const uint8_t *pos;
|
||||
// Remaining length of a linear-match node, minus 1. Negative if not in such a node.
|
||||
int32_t remainingMatchLength;
|
||||
// Value for a match, after contains() returned TRUE.
|
||||
int32_t value;
|
||||
};
|
||||
|
||||
const int8_t ByteTrie::bytesPerLead[kFiveByteLead+1]={
|
||||
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
|
||||
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
||||
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
||||
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
||||
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
||||
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
|
||||
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
|
||||
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 5
|
||||
};
|
||||
|
||||
UBool
|
||||
ByteTrie::readCompactInt(int32_t leadByte) {
|
||||
UBool isFinal=(UBool)(leadByte&kValueIsFinal);
|
||||
leadByte>>=1;
|
||||
int numBytes=bytesPerLead[leadByte]-1; // -1: lead byte was already consumed.
|
||||
switch(numBytes) {
|
||||
case 0:
|
||||
value=leadByte-kMinOneByteLead;
|
||||
break;
|
||||
case 1:
|
||||
value=((leadByte-kMinTwoByteLead)<<8)|*pos;
|
||||
break;
|
||||
case 2:
|
||||
value=((leadByte-kMinThreeByteLead)<<16)|(pos[0]<<8)|pos[1];
|
||||
break;
|
||||
case 3:
|
||||
value=(pos[0]<<16)|(pos[1]<<8)|pos[2];
|
||||
break;
|
||||
case 4:
|
||||
value=(pos[0]<<24)|(pos[1]<<16)|(pos[2]<<8)|pos[3];
|
||||
break;
|
||||
}
|
||||
pos+=numBytes;
|
||||
return isFinal;
|
||||
}
|
||||
|
||||
void
|
||||
ByteTrie::skipCompactInt(int32_t leadByte) {
|
||||
pos+=bytesPerLead[leadByte>>1];
|
||||
}
|
||||
|
||||
int32_t
|
||||
ByteTrie::readFixedInt(int32_t bytesPerValue) {
|
||||
int32_t fixedInt;
|
||||
switch(bytesPerValue) { // Actually number of bytes minus 1.
|
||||
case 0:
|
||||
fixedInt=*pos;
|
||||
break;
|
||||
case 1:
|
||||
fixedInt=(pos[0]<<8)|pos[1];
|
||||
break;
|
||||
case 2:
|
||||
fixedInt=(pos[0]<<16)|(pos[1]<<8)|pos[2];
|
||||
break;
|
||||
case 3:
|
||||
fixedInt=(pos[0]<<24)|(pos[1]<<16)|(pos[2]<<8)|pos[3];
|
||||
break;
|
||||
}
|
||||
pos+=bytesPerValue+1;
|
||||
return fixedInt;
|
||||
}
|
||||
|
||||
UBool
|
||||
ByteTrie::next(int inByte) {
|
||||
if(pos==NULL) {
|
||||
return FALSE;
|
||||
}
|
||||
int32_t length=remainingMatchLength; // Actual remaining match length minus 1.
|
||||
if(length>=0) {
|
||||
// Remaining part of a linear-match node.
|
||||
if(inByte==*pos) {
|
||||
remainingMatchLength=length-1;
|
||||
++pos;
|
||||
return TRUE;
|
||||
} else {
|
||||
// No match.
|
||||
stop();
|
||||
return FALSE;
|
||||
}
|
||||
}
|
||||
int32_t node=*pos++;
|
||||
if(node>=kMinValueLead) {
|
||||
if(node&kValueIsFinal) {
|
||||
// No further matching bytes.
|
||||
stop();
|
||||
return FALSE;
|
||||
} else {
|
||||
// Skip intermediate value.
|
||||
skipCompactInt(node);
|
||||
// The next node must not also be a value node.
|
||||
node=*pos++;
|
||||
// TODO: U_ASSERT(node<kMinValueLead);
|
||||
}
|
||||
}
|
||||
if(node<kMinLinearMatch) {
|
||||
// Branch according to the current byte.
|
||||
while(node<kMinListBranch) {
|
||||
// Branching on a byte value,
|
||||
// with a jump delta for less-than, a compact int for equals,
|
||||
// and continuing for greater-than.
|
||||
// The less-than and greater-than branches must lead to branch nodes again.
|
||||
uint8_t trieByte=*pos++;
|
||||
if(inByte<trieByte) {
|
||||
int32_t delta=readFixedInt(node);
|
||||
pos+=delta;
|
||||
} else {
|
||||
pos+=node+1; // Skip fixed-width integer.
|
||||
node=*pos;
|
||||
if(inByte==trieByte) {
|
||||
// TODO: U_ASSERT(node>=KMinValueLead);
|
||||
if(node&kValueIsFinal) {
|
||||
// Leave the final value for contains() to read.
|
||||
} else {
|
||||
// Use the non-final value as the jump delta.
|
||||
++pos;
|
||||
readCompactInt(node);
|
||||
pos+=value;
|
||||
}
|
||||
return TRUE;
|
||||
} else { // inByte>trieByte
|
||||
skipCompactInt(node);
|
||||
}
|
||||
}
|
||||
node=*pos++;
|
||||
// TODO: U_ASSERT(node<kMinLinearMatch);
|
||||
}
|
||||
// Branch node with a list of key-value pairs where
|
||||
// values are compact integers: either final values or jump deltas.
|
||||
// If the last key byte matches, just continue after it rather
|
||||
// than jumping.
|
||||
length=node-(kMinListBranch-1); // Actual list length minus 1.
|
||||
for(;;) {
|
||||
uint8_t trieByte=*pos++;
|
||||
// U_ASSERT(listLength==0 || *pos>=KMinValueLead);
|
||||
if(inByte==trieByte) {
|
||||
if(length>0) {
|
||||
node=*pos;
|
||||
if(node&kValueIsFinal) {
|
||||
// Leave the final value for contains() to read.
|
||||
} else {
|
||||
// Use the non-final value as the jump delta.
|
||||
++pos;
|
||||
readCompactInt(node);
|
||||
pos+=value;
|
||||
}
|
||||
}
|
||||
return TRUE;
|
||||
}
|
||||
if(inByte<trieByte || length--==0) {
|
||||
stop();
|
||||
return FALSE;
|
||||
}
|
||||
skipCompactInt();
|
||||
}
|
||||
} else {
|
||||
// Match the first of length+1 bytes.
|
||||
length=node-kMinLinearMatch; // Actual match length minus 1.
|
||||
if(inByte==*pos) {
|
||||
remainingMatchLength=length-1;
|
||||
++pos;
|
||||
return TRUE;
|
||||
} else {
|
||||
// No match.
|
||||
stop();
|
||||
return FALSE;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
UBool
ByteTrie::contains() {
    int32_t node;
    if(pos!=NULL && remainingMatchLength<0 && (node=*pos)>=kMinValueLead) {
        // Deliver value for the matching bytes.
        ++pos;
        if(readCompactInt(node)) {
            stop();
        }
        return TRUE;
    }
    return FALSE;
}

UBool
ByteTrie::containsNext(const char *s, int32_t length) {
    if(length<0) {
        // NUL-terminated
        int b;
        while((b=(uint8_t)*s++)!=0) {
            if(!next(b)) {
                return FALSE;
            }
        }
    } else {
        while(length>0) {
            if(!next((uint8_t)*s++)) {
                return FALSE;
            }
            --length;
        }
    }
    return contains();
}

U_NAMESPACE_END

#endif  // __BYTETRIE_H__
docs/design/struct/tries/bytestrie/bytetriebuilder.h (new file, 536 lines)
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
*   Copyright (C) 2010, International Business Machines
*   Corporation and others.  All Rights Reserved.
*******************************************************************************
*   file name:  bytetriebuilder.h
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2010sep25
*   created by: Markus W. Scherer
*
*   Builder class for ByteTrie dictionary trie.
*/

#ifndef __BYTETRIEBUILDER_H__
#define __BYTETRIEBUILDER_H__

#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "bytetrie.h"
#include "charstr.h"
#include "cmemory.h"
#include "uarrsort.h"

U_NAMESPACE_BEGIN

class ByteTrieElement;

class /*U_TOOLUTIL_API*/ ByteTrieBuilder : public UMemory {
public:
    ByteTrieBuilder()
            : elements(NULL), elementsCapacity(0), elementsLength(0),
              bytes(NULL), bytesCapacity(0), bytesLength(0) {}
    ~ByteTrieBuilder();

    ByteTrieBuilder &add(const StringPiece &s, int32_t value, UErrorCode &errorCode);

    StringPiece build(UErrorCode &errorCode);

    ByteTrieBuilder &clear() {
        strings.clear();
        elementsLength=0;
        bytesLength=0;
        return *this;
    }

private:
    void makeNode(int32_t start, int32_t limit, int32_t byteIndex);
    void makeListBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length);
    void makeThreeWayBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length);

    UBool ensureCapacity(int32_t length);
    void write(int32_t byte);
    void write(const char *b, int32_t length);
    void writeCompactInt(int32_t i, UBool final);
    int32_t writeFixedInt(int32_t i);  // Returns number of bytes.

    CharString strings;
    ByteTrieElement *elements;
    int32_t elementsCapacity;
    int32_t elementsLength;

    // Byte serialization of the trie.
    // Grows from the back: bytesLength measures from the end of the buffer!
    char *bytes;
    int32_t bytesCapacity;
    int32_t bytesLength;
};

/*
 * Note: This builder implementation stores (bytes, value) pairs with full copies
 * of the byte sequences, until the ByteTrie is built.
 * It might(!) take less memory if we collected the data in a temporary, dynamic trie.
 */

class ByteTrieElement : public UMemory {
public:
    // Use compiler's default constructor, initializes nothing.

    void setTo(const StringPiece &s, int32_t val, CharString &strings, UErrorCode &errorCode);

    StringPiece getString(const CharString &strings) const {
        int32_t offset=stringOffset;
        int32_t length;
        if(offset>=0) {
            length=(uint8_t)strings[offset++];
        } else {
            offset=~offset;
            length=((int32_t)(uint8_t)strings[offset]<<8)|(uint8_t)strings[offset+1];
            offset+=2;
        }
        return StringPiece(strings.data()+offset, length);
    }
    int32_t getStringLength(const CharString &strings) const {
        int32_t offset=stringOffset;
        if(offset>=0) {
            return (uint8_t)strings[offset];
        } else {
            offset=~offset;
            return ((int32_t)(uint8_t)strings[offset]<<8)|(uint8_t)strings[offset+1];
        }
    }

    char charAt(int32_t index, const CharString &strings) const { return data(strings)[index]; }

    int32_t getValue() const { return value; }

    int32_t compareStringTo(const ByteTrieElement &o, const CharString &strings) const;

private:
    const char *data(const CharString &strings) const {
        int32_t offset=stringOffset;
        if(offset>=0) {
            ++offset;
        } else {
            offset=~offset+2;
        }
        return strings.data()+offset;
    }

    // If the stringOffset is non-negative, then the first strings byte contains
    // the string length.
    // If the stringOffset is negative, then the first two strings bytes contain
    // the string length (big-endian), and the offset needs to be bit-inverted.
    // (Compared with a stringLength field here, this saves 3 bytes per string for most strings.)
    int32_t stringOffset;
    int32_t value;
};

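The `stringOffset` convention above (a 1-byte length prefix for short strings, a 2-byte big-endian prefix signaled by a bit-inverted offset) can be exercised in isolation. The following is a minimal standalone sketch with hypothetical helper names (`appendToPool`, `readFromPool`) that are not part of the ICU sources; it mirrors the `setTo()`/`getString()` logic using a `std::string` in place of `CharString`:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append s to the pool with a length prefix, mirroring ByteTrieElement::setTo():
// lengths <=0xff get a 1-byte prefix and a non-negative offset; longer strings
// get a 2-byte big-endian prefix and a bit-inverted (negative) offset.
int32_t appendToPool(std::string &pool, const std::string &s) {
    int32_t offset=(int32_t)pool.size();
    int32_t length=(int32_t)s.size();
    if(length>0xff) {
        offset=~offset;  // Negative offset marks the 2-byte length form.
        pool.push_back((char)(length>>8));
    }
    pool.push_back((char)length);
    pool+=s;
    return offset;
}

// Recover the string, mirroring ByteTrieElement::getString().
std::string readFromPool(const std::string &pool, int32_t offset) {
    int32_t length;
    if(offset>=0) {
        length=(uint8_t)pool[offset++];
    } else {
        offset=~offset;
        length=((int32_t)(uint8_t)pool[offset]<<8)|(uint8_t)pool[offset+1];
        offset+=2;
    }
    return pool.substr(offset, length);
}
```

Compared with a per-element length field, this costs only one extra pool byte for strings longer than 255 bytes, which is what the comment's "saves 3 bytes per string for most strings" refers to.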
void
ByteTrieElement::setTo(const StringPiece &s, int32_t val,
                       CharString &strings, UErrorCode &errorCode) {
    if(U_FAILURE(errorCode)) {
        return;
    }
    int32_t length=s.length();
    if(length>0xffff) {
        // Too long: We store the length in 1 or 2 bytes.
        errorCode=U_INDEX_OUTOFBOUNDS_ERROR;
        return;
    }
    int32_t offset=strings.length();
    if(length>0xff) {
        offset=~offset;
        strings.append((char)(length>>8), errorCode);
    }
    strings.append((char)length, errorCode);
    stringOffset=offset;
    value=val;
    strings.append(s, errorCode);
}

int32_t
ByteTrieElement::compareStringTo(const ByteTrieElement &other, const CharString &strings) const {
    // TODO: add StringPiece.compareTo()
    StringPiece thisString=getString(strings);
    StringPiece otherString=other.getString(strings);
    int32_t lengthDiff=thisString.length()-otherString.length();
    int32_t commonLength;
    if(lengthDiff<=0) {
        commonLength=thisString.length();
    } else {
        commonLength=otherString.length();
    }
    int32_t diff=uprv_memcmp(thisString.data(), otherString.data(), commonLength);
    return diff!=0 ? diff : lengthDiff;
}

ByteTrieBuilder::~ByteTrieBuilder() {
    delete[] elements;
    uprv_free(bytes);
}

ByteTrieBuilder &
ByteTrieBuilder::add(const StringPiece &s, int32_t value, UErrorCode &errorCode) {
    if(U_FAILURE(errorCode)) {
        return *this;
    }
    if(bytesLength>0) {
        // Cannot add elements after building.
        errorCode=U_NO_WRITE_PERMISSION;
        return *this;
    }
    bytesCapacity+=s.length()+1;  // Crude bytes preallocation estimate.
    if(elementsLength==elementsCapacity) {
        int32_t newCapacity;
        if(elementsCapacity==0) {
            newCapacity=1024;
        } else {
            newCapacity=4*elementsCapacity;
        }
        ByteTrieElement *newElements=new ByteTrieElement[newCapacity];
        if(newElements==NULL) {
            errorCode=U_MEMORY_ALLOCATION_ERROR;
            return *this;
        }
        if(elementsLength>0) {
            uprv_memcpy(newElements, elements, elementsLength*sizeof(ByteTrieElement));
        }
        delete[] elements;
        elements=newElements;
        elementsCapacity=newCapacity;
    }
    elements[elementsLength++].setTo(s, value, strings, errorCode);
    return *this;
}

U_CDECL_BEGIN

static int32_t U_CALLCONV
compareElementStrings(const void *context, const void *left, const void *right) {
    const CharString *strings=reinterpret_cast<const CharString *>(context);
    const ByteTrieElement *leftElement=reinterpret_cast<const ByteTrieElement *>(left);
    const ByteTrieElement *rightElement=reinterpret_cast<const ByteTrieElement *>(right);
    return leftElement->compareStringTo(*rightElement, *strings);
}

U_CDECL_END

StringPiece
ByteTrieBuilder::build(UErrorCode &errorCode) {
    StringPiece result;
    if(U_FAILURE(errorCode)) {
        return result;
    }
    if(bytesLength>0) {
        // Already built.
        result.set(bytes+(bytesCapacity-bytesLength), bytesLength);
        return result;
    }
    if(elementsLength==0) {
        errorCode=U_INDEX_OUTOFBOUNDS_ERROR;
        return result;
    }
    uprv_sortArray(elements, elementsLength, (int32_t)sizeof(ByteTrieElement),
                   compareElementStrings, &strings,
                   FALSE,  // need not be a stable sort
                   &errorCode);
    if(U_FAILURE(errorCode)) {
        return result;
    }
    // Duplicate strings are not allowed.
    StringPiece prev=elements[0].getString(strings);
    for(int32_t i=1; i<elementsLength; ++i) {
        StringPiece current=elements[i].getString(strings);
        if(prev==current) {
            errorCode=U_ILLEGAL_ARGUMENT_ERROR;
            return result;
        }
        prev=current;
    }
    // Create and byte-serialize the trie for the elements.
    if(bytesCapacity<1024) {
        bytesCapacity=1024;
    }
    bytes=reinterpret_cast<char *>(uprv_malloc(bytesCapacity));
    if(bytes==NULL) {
        errorCode=U_MEMORY_ALLOCATION_ERROR;
        return result;
    }
    makeNode(0, elementsLength, 0);
    if(bytes==NULL) {
        errorCode=U_MEMORY_ALLOCATION_ERROR;
    } else {
        result.set(bytes+(bytesCapacity-bytesLength), bytesLength);
    }
    return result;
}

// Requires start<limit,
// and all strings of the [start..limit[ elements must be sorted and
// have a common prefix of length byteIndex.
void
ByteTrieBuilder::makeNode(int32_t start, int32_t limit, int32_t byteIndex) {
    if(byteIndex==elements[start].getStringLength(strings)) {
        // An intermediate or final value.
        int32_t value=elements[start++].getValue();
        UBool final= start==limit;
        if(!final) {
            makeNode(start, limit, byteIndex);
        }
        writeCompactInt(value, final);
        return;
    }
    // Now all [start..limit[ strings are longer than byteIndex.
    int32_t minByte=(uint8_t)elements[start].charAt(byteIndex, strings);
    int32_t maxByte=(uint8_t)elements[limit-1].charAt(byteIndex, strings);
    if(minByte==maxByte) {
        // Linear-match node: All strings have the same character at byteIndex.
        int32_t lastByteIndex=byteIndex;
        int32_t length=0;
        do {
            ++lastByteIndex;
            ++length;
        } while(length<ByteTrie::kMaxLinearMatchLength &&
                elements[start].getStringLength(strings)>lastByteIndex &&
                elements[start].charAt(lastByteIndex, strings)==
                    elements[limit-1].charAt(lastByteIndex, strings));
        makeNode(start, limit, lastByteIndex);
        write(elements[start].getString(strings).data()+byteIndex, length);
        write(ByteTrie::kMinLinearMatch+length-1);
        return;
    }
    // Branch node.
    int32_t length=0;  // Number of different bytes at byteIndex.
    int32_t i=start;
    do {
        char byte=elements[i++].charAt(byteIndex, strings);
        while(i<limit && byte==elements[i].charAt(byteIndex, strings)) {
            ++i;
        }
        ++length;
    } while(i<limit);
    // length>=2 because minByte!=maxByte.
    if(length<=ByteTrie::kMaxListBranchLength) {
        makeListBranchNode(start, limit, byteIndex, length);
    } else {
        makeThreeWayBranchNode(start, limit, byteIndex, length);
    }
}

// start<limit && all strings longer than byteIndex &&
// 2..kMaxListBranchLength different bytes at byteIndex
void
ByteTrieBuilder::makeListBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length) {
    // List of byte-value pairs where values are either final values
    // or jumps to other parts of the trie.
    int32_t starts[ByteTrie::kMaxListBranchLength-1];
    UBool final[ByteTrie::kMaxListBranchLength-1];
    // For each byte except the last one, find its elements array start and its value if final.
    int32_t byteNumber=0;
    do {
        int32_t i=starts[byteNumber]=start;
        char byte=elements[i++].charAt(byteIndex, strings);
        while(byte==elements[i].charAt(byteIndex, strings)) {
            ++i;
        }
        final[byteNumber]= start==i-1 && byteIndex+1==elements[start].getStringLength(strings);
        start=i;
    } while(++byteNumber<length-1);
    // byteNumber==length-1, and the maxByte elements range is [start..limit[

    // Write the sub-nodes in reverse order: The jump lengths are deltas from
    // after their own positions, so if we wrote the minByte sub-node first,
    // then its jump delta would be larger.
    // Instead we write the minByte sub-node last, for a shorter delta.
    int32_t jumpTargets[ByteTrie::kMaxListBranchLength-1];
    while(--byteNumber>=0) {
        if(!final[byteNumber]) {
            makeNode(starts[byteNumber],
                     byteNumber==length-2 ? start : starts[byteNumber+1],
                     byteIndex+1);
            jumpTargets[byteNumber]=bytesLength;
        }
    }
    // The maxByte sub-node is written as the very last one because we do
    // not jump for it at all.
    byteNumber=length-1;
    makeNode(start, limit, byteIndex+1);
    write(elements[start].charAt(byteIndex, strings));
    // Write the rest of this node's byte-value pairs.
    while(--byteNumber>=0) {
        start=starts[byteNumber];
        int32_t value;
        if(final[byteNumber]) {
            // Write the final value for the one string ending with this byte.
            value=elements[start].getValue();
        } else {
            // Write the delta to the start position of the sub-node.
            value=bytesLength-jumpTargets[byteNumber];
        }
        writeCompactInt(value, final[byteNumber]);
        write(elements[start].charAt(byteIndex, strings));
    }
    // Write the node lead byte.
    write(ByteTrie::kMinListBranch+length-2);
}

// start<limit && all strings longer than byteIndex &&
// at least three different bytes at byteIndex
void
ByteTrieBuilder::makeThreeWayBranchNode(int32_t start, int32_t limit, int32_t byteIndex, int32_t length) {
    // Three-way branch on the middle byte.
    // Find the middle byte.
    length/=2;  // >=1
    int32_t i=start;
    do {
        char byte=elements[i++].charAt(byteIndex, strings);
        while(byte==elements[i].charAt(byteIndex, strings)) {
            ++i;
        }
    } while(--length>0);
    // Encode the less-than branch first.
    // Unlike in the list-branch node (see comments above) where
    // all jumps are encoded in compact integers, in this node type the
    // less-than jump is more efficient
    // (because it is only ever a jump, with a known number of bytes)
    // than the equals jump (where a jump needs to be distinguished from a final value).
    makeNode(start, i, byteIndex);
    int32_t leftNode=bytesLength;
    // Find the elements range for the middle byte.
    start=i;
    char byte=elements[i++].charAt(byteIndex, strings);
    while(byte==elements[i].charAt(byteIndex, strings)) {
        ++i;
    }
    // Encode the equals branch.
    int32_t value;
    UBool final;
    if(start==i-1 && byteIndex+1==elements[start].getStringLength(strings)) {
        // Store the final value for the one string ending with this byte.
        value=elements[start].getValue();
        final=TRUE;
    } else {
        // Store the start position of the sub-node.
        makeNode(start, i, byteIndex+1);
        value=bytesLength;
        final=FALSE;
    }
    // Encode the greater-than branch last because we do not jump for it at all.
    makeNode(i, limit, byteIndex);
    // Write this node.
    if(!final) {
        value=bytesLength-value;
    }
    writeCompactInt(value, final);  // equals
    int32_t bytesForJump=writeFixedInt(bytesLength-leftNode);  // less-than
    write(byte);
    write(bytesForJump-1);
}

UBool
ByteTrieBuilder::ensureCapacity(int32_t length) {
    if(bytes==NULL) {
        return FALSE;  // previous memory allocation had failed
    }
    if(length>bytesCapacity) {
        int32_t newCapacity=bytesCapacity;
        do {
            newCapacity*=2;
        } while(newCapacity<=length);
        char *newBytes=reinterpret_cast<char *>(uprv_malloc(newCapacity));
        if(newBytes==NULL) {
            // unable to allocate memory
            uprv_free(bytes);
            bytes=NULL;
            return FALSE;
        }
        uprv_memcpy(newBytes+(newCapacity-bytesLength),
                    bytes+(bytesCapacity-bytesLength), bytesLength);
        uprv_free(bytes);
        bytes=newBytes;
        bytesCapacity=newCapacity;
    }
    return TRUE;
}

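Because jump deltas point forward past material that has already been serialized, the buffer grows from the back, as the comment on `bytesLength` notes: nodes are written leaf-first, yet the root ends up at the start of the output. A standalone sketch of that buffer discipline (the `BackBuffer` name is illustrative, not ICU API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A buffer that is filled from its end toward its start; length is measured
// from the end of the allocation, like ByteTrieBuilder::bytes/bytesLength.
class BackBuffer {
public:
    BackBuffer() : bytes(nullptr), capacity(0), length(0) {}
    ~BackBuffer() { delete[] bytes; }
    void write(uint8_t b) {
        ensureCapacity(length+1);
        ++length;
        bytes[capacity-length]=b;
    }
    const uint8_t *data() const { return bytes+(capacity-length); }
    int32_t size() const { return length; }
private:
    void ensureCapacity(int32_t newLength) {
        if(newLength<=capacity) { return; }
        int32_t newCapacity = capacity==0 ? 16 : capacity;
        while(newCapacity<newLength) { newCapacity*=2; }
        uint8_t *newBytes=new uint8_t[newCapacity];
        if(length>0) {
            // Keep existing content anchored at the end of the new allocation.
            std::memcpy(newBytes+(newCapacity-length), bytes+(capacity-length), length);
        }
        delete[] bytes;
        bytes=newBytes;
        capacity=newCapacity;
    }
    uint8_t *bytes;
    int32_t capacity;
    int32_t length;
};
```

Bytes written later appear earlier in `data()`, which is why the builder emits each sub-node before the node that jumps to it.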
void
ByteTrieBuilder::write(int32_t byte) {
    int32_t newLength=bytesLength+1;
    if(ensureCapacity(newLength)) {
        bytesLength=newLength;
        bytes[bytesCapacity-bytesLength]=(char)byte;
    }
}

void
ByteTrieBuilder::write(const char *b, int32_t length) {
    int32_t newLength=bytesLength+length;
    if(ensureCapacity(newLength)) {
        bytesLength=newLength;
        uprv_memcpy(bytes+(bytesCapacity-bytesLength), b, length);
    }
}

void
ByteTrieBuilder::writeCompactInt(int32_t i, UBool final) {
    char intBytes[5];
    int32_t length=1;
    if(i<0 || i>0xffffff) {
        intBytes[0]=(char)(ByteTrie::kFiveByteLead);
        intBytes[1]=(char)(i>>24);
        intBytes[2]=(char)(i>>16);
        intBytes[3]=(char)(i>>8);
        intBytes[4]=(char)(i);
        length=5;
    } else if(i<=ByteTrie::kMaxOneByteValue) {
        intBytes[0]=(char)(ByteTrie::kMinOneByteLead+i);
    } else {
        if(i<=ByteTrie::kMaxTwoByteValue) {
            intBytes[0]=(char)(ByteTrie::kMinTwoByteLead+(i>>8));
        } else {
            if(i<=ByteTrie::kMaxThreeByteValue) {
                intBytes[0]=(char)(ByteTrie::kMinThreeByteLead+(i>>16));
            } else {
                intBytes[0]=(char)(ByteTrie::kFourByteLead);
                intBytes[1]=(char)(i>>16);
                length=2;
            }
            intBytes[length++]=(char)(i>>8);
        }
        intBytes[length++]=(char)(i);
    }
    intBytes[0]=(char)((intBytes[0]<<1)|final);
    write(intBytes, length);
}

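writeCompactInt() packs a value and a final-flag into 1–5 bytes: the lead byte is shifted left once and its low bit carries the flag. The lead-byte constants live in bytetrie.h and are not reproduced in this document, so the sketch below picks hypothetical values (`kMinOneByteLead=0x10` and so on) purely to make the round trip testable; `writeCompact`/`readCompact` are illustrative names, not ICU API:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical lead-byte layout for illustration only; the real constants
// are defined in bytetrie.h.
enum {
    kMinOneByteLead=0x10,   kMaxOneByteValue=0x0f,
    kMinTwoByteLead=0x20,   kMaxTwoByteValue=0xfff,
    kMinThreeByteLead=0x30, kMaxThreeByteValue=0xdffff,
    kFourByteLead=0x3e,
    kFiveByteLead=0x3f
};

// Encode (i, isFinal); returns the number of bytes written.
// Mirrors the structure of ByteTrieBuilder::writeCompactInt().
int32_t writeCompact(uint8_t *p, int32_t i, bool isFinal) {
    int32_t length=1;
    if(i<0 || i>0xffffff) {
        p[0]=kFiveByteLead;
        p[1]=(uint8_t)(i>>24); p[2]=(uint8_t)(i>>16);
        p[3]=(uint8_t)(i>>8); p[4]=(uint8_t)i;
        length=5;
    } else if(i<=kMaxOneByteValue) {
        p[0]=(uint8_t)(kMinOneByteLead+i);
    } else {
        if(i<=kMaxTwoByteValue) {
            p[0]=(uint8_t)(kMinTwoByteLead+(i>>8));
        } else {
            if(i<=kMaxThreeByteValue) {
                p[0]=(uint8_t)(kMinThreeByteLead+(i>>16));
            } else {
                p[0]=kFourByteLead;
                p[1]=(uint8_t)(i>>16);
                length=2;
            }
            p[length++]=(uint8_t)(i>>8);
        }
        p[length++]=(uint8_t)i;
    }
    p[0]=(uint8_t)((p[0]<<1)|(isFinal ? 1 : 0));  // Low bit = final flag.
    return length;
}

// Decode; the reverse of writeCompact(). Returns the number of bytes read.
int32_t readCompact(const uint8_t *p, int32_t &value, bool &isFinal) {
    int32_t lead=p[0];
    isFinal=(lead&1)!=0;
    lead>>=1;
    if(lead<kMinTwoByteLead) {
        value=lead-kMinOneByteLead;
        return 1;
    } else if(lead<kMinThreeByteLead) {
        value=((lead-kMinTwoByteLead)<<8)|p[1];
        return 2;
    } else if(lead<kFourByteLead) {
        value=((lead-kMinThreeByteLead)<<16)|((int32_t)p[1]<<8)|p[2];
        return 3;
    } else if(lead==kFourByteLead) {
        value=((int32_t)p[1]<<16)|((int32_t)p[2]<<8)|p[3];
        return 4;
    } else {
        value=(int32_t)(((uint32_t)p[1]<<24)|((uint32_t)p[2]<<16)|
                        ((uint32_t)p[3]<<8)|p[4]);
        return 5;
    }
}
```

The left shift is why all lead constants must fit in 7 bits; the decoder undoes it before classifying the lead byte.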
int32_t
ByteTrieBuilder::writeFixedInt(int32_t i) {
    char intBytes[4];
    int32_t length;
    if(i<0 || i>0xffffff) {
        intBytes[0]=(char)(i>>24);
        intBytes[1]=(char)(i>>16);
        intBytes[2]=(char)(i>>8);
        length=3;  // last byte below
    } else {
        if(i<=0xffff) {
            length=0;
        } else {
            intBytes[0]=(char)(i>>16);
            length=1;
        }
        if(i>0xff) {
            intBytes[length++]=(char)(i>>8);
        }
    }
    intBytes[length++]=(char)(i);
    write(intBytes, length);
    return length;
}

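writeFixedInt() is the companion for the less-than jump in three-way branch nodes: a plain big-endian integer of 1–4 bytes whose byte count is stored separately in the node (as `bytesForJump-1`) rather than in a lead byte. A standalone round-trip sketch (`writeFixed`/`readFixed` are illustrative names, not ICU API):

```cpp
#include <cassert>
#include <cstdint>

// Big-endian, 1-4 bytes depending on magnitude; returns the byte count.
// Mirrors ByteTrieBuilder::writeFixedInt().
int32_t writeFixed(uint8_t *p, int32_t i) {
    int32_t length;
    if(i<0 || i>0xffffff) {
        p[0]=(uint8_t)(i>>24);
        p[1]=(uint8_t)(i>>16);
        p[2]=(uint8_t)(i>>8);
        length=3;  // Low byte appended below.
    } else {
        length=0;
        if(i>0xffff) {
            p[length++]=(uint8_t)(i>>16);
        }
        if(i>0xff) {
            p[length++]=(uint8_t)(i>>8);
        }
    }
    p[length++]=(uint8_t)i;
    return length;
}

// The reader must be told the byte count, which the trie encodes separately.
int32_t readFixed(const uint8_t *p, int32_t numBytes) {
    uint32_t i=0;  // Unsigned accumulator avoids signed-shift overflow.
    for(int32_t k=0; k<numBytes; ++k) {
        i=(i<<8)|p[k];
    }
    return (int32_t)i;
}
```

Keeping the byte count out of the encoded integer is what makes the less-than jump cheaper than the equals jump, as the comment in makeThreeWayBranchNode() explains.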
U_NAMESPACE_END

#endif  // __BYTETRIEBUILDER_H__
docs/design/struct/tries/bytestrie/bytetriedemo.cpp (new file, 137 lines)
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
*   Copyright (C) 2010, International Business Machines
*   Corporation and others.  All Rights Reserved.
*******************************************************************************
*   file name:  bytetriedemo.cpp
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2010nov05
*   created by: Markus W. Scherer
*/

#include <stdio.h>

#include "unicode/utypes.h"
#include "unicode/stringpiece.h"

#include "bytetrie.h"
#include "bytetriebuilder.h"
#include "bytetrieiterator.h"
#include "denseranges.h"
#include "toolutil.h"

#define LENGTHOF(array) (int32_t)(sizeof(array)/sizeof((array)[0]))

static void
printBytes(const char *name, const StringPiece &bytes) {
    printf("%18s  [%3d]", name, (int)bytes.length());
    for(int32_t i=0; i<bytes.length(); ++i) {
        printf(" %02x", bytes.data()[i]&0xff);  // TODO: Add StringPiece::operator[] const
    }
    puts("");
}

static void
printTrie(const StringPiece &bytes) {
    IcuToolErrorCode errorCode("printTrie");
    ByteTrieIterator iter(bytes.data(), errorCode);
    while(iter.next(errorCode)) {
        printf("    '%s': %d\n", iter.getString().data(), (int)iter.getValue());
    }
}

static void printRanges(const int32_t ranges[][2], int32_t length) {
    printf("ranges[%d]", (int)length);
    for(int32_t i=0; i<length; ++i) {
        printf("  [%ld..%ld]", (long)ranges[i][0], (long)ranges[i][1]);
    }
    puts("");
}

extern int main(int argc, char* argv[]) {
    IcuToolErrorCode errorCode("bytetriedemo");
    ByteTrieBuilder builder;
    StringPiece sp=builder.add("", 0, errorCode).build(errorCode);
    printBytes("empty string", sp);
    ByteTrie empty(sp.data());
    UBool contains=empty.contains();
    printf("empty.next() %d %d\n", contains, (int)empty.getValue());
    printTrie(sp);

    sp=builder.clear().add("a", 1, errorCode).build(errorCode);
    printBytes("a", sp);
    ByteTrie a(sp.data());
    contains=a.next('a') && a.contains();
    printf("a.next(a) %d %d\n", contains, (int)a.getValue());
    printTrie(sp);

    sp=builder.clear().add("ab", -1, errorCode).build(errorCode);
    printBytes("ab", sp);
    ByteTrie ab(sp.data());
    contains=ab.next('a') && ab.next('b') && ab.contains();
    printf("ab.next(ab) %d %d\n", contains, (int)ab.getValue());
    printTrie(sp);

    sp=builder.clear().add("a", 1, errorCode).add("ab", 100, errorCode).build(errorCode);
    printBytes("a+ab", sp);
    ByteTrie a_ab(sp.data());
    contains=a_ab.next('a') && a_ab.contains();
    printf("a_ab.next(a) %d %d\n", contains, (int)a_ab.getValue());
    contains=a_ab.next('b') && a_ab.contains();
    printf("a_ab.next(b) %d %d\n", contains, (int)a_ab.getValue());
    contains=a_ab.contains();
    printf("a_ab.next() %d %d\n", contains, (int)a_ab.getValue());
    printTrie(sp);

    sp=builder.clear().add("a", 1, errorCode).add("b", 2, errorCode).add("c", 3, errorCode).build(errorCode);
    printBytes("a+b+c", sp);
    ByteTrie a_b_c(sp.data());
    contains=a_b_c.next('a') && a_b_c.contains();
    printf("a_b_c.next(a) %d %d\n", contains, (int)a_b_c.getValue());
    contains=a_b_c.next('b') && a_b_c.contains();
    printf("a_b_c.next(b) %d %d\n", contains, (int)a_b_c.getValue());
    contains=a_b_c.reset().next('b') && a_b_c.contains();
    printf("a_b_c.r.next(b) %d %d\n", contains, (int)a_b_c.getValue());
    contains=a_b_c.reset().next('c') && a_b_c.contains();
    printf("a_b_c.r.next(c) %d %d\n", contains, (int)a_b_c.getValue());
    contains=a_b_c.reset().next('d') && a_b_c.contains();
    printf("a_b_c.r.next(d) %d %d\n", contains, (int)a_b_c.getValue());
    printTrie(sp);

    builder.clear().add("a", 1, errorCode).add("b", 2, errorCode).add("c", 3, errorCode);
    builder.add("d", 10, errorCode).add("e", 20, errorCode).add("f", 30, errorCode);
    builder.add("g", 100, errorCode).add("h", 200, errorCode).add("i", 300, errorCode);
    builder.add("j", 1000, errorCode).add("k", 2000, errorCode).add("l", 3000, errorCode);
    sp=builder.build(errorCode);
    printBytes("a-l", sp);
    ByteTrie a_l(sp.data());
    for(char c='`'; c<='m'; ++c) {
        contains=a_l.reset().next(c) && a_l.contains();
        printf("a_l.r.next(%c) %d %d\n", c, contains, (int)a_l.getValue());
    }
    printTrie(sp);

    static const int32_t values[]={
        -1, 0, 1, 2,
        4, 5, 6, 7,
        12, 13, 14,
        24, 25, 26
    };
    int32_t ranges[3][2];
    int32_t length;
    length=uprv_makeDenseRanges(values, LENGTHOF(values), 1, ranges, LENGTHOF(ranges));
    printRanges(ranges, length);
    length=uprv_makeDenseRanges(values, LENGTHOF(values), 0xc0, ranges, LENGTHOF(ranges));
    printRanges(ranges, length);
    length=uprv_makeDenseRanges(values, LENGTHOF(values), 0xf0, ranges, LENGTHOF(ranges));
    printRanges(ranges, length);
    length=uprv_makeDenseRanges(values, LENGTHOF(values), 0x100, ranges, LENGTHOF(ranges));
    printRanges(ranges, length);

    return 0;
}
docs/design/struct/tries/bytestrie/bytetrieiterator.h (new file, 199 lines)
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
*   Copyright (C) 2010, International Business Machines
*   Corporation and others.  All Rights Reserved.
*******************************************************************************
*   file name:  bytetrieiterator.h
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2010nov03
*   created by: Markus W. Scherer
*/

#ifndef __BYTETRIEITERATOR_H__
#define __BYTETRIEITERATOR_H__

/**
 * \file
 * \brief C++ API: ByteTrie iterator for all of its (byte sequence, value) pairs.
 */

// Needed if and when we change the .dat package index to a ByteTrie,
// so that icupkg can work with an input package.

#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "bytetrie.h"
#include "charstr.h"
#include "uvectr32.h"

U_NAMESPACE_BEGIN

/**
 * Iterator for all of the (byte sequence, value) pairs in a ByteTrie.
 */
class /*U_TOOLUTIL_API*/ ByteTrieIterator : public UMemory {
public:
    ByteTrieIterator(const void *trieBytes, UErrorCode &errorCode)
            : trie(trieBytes), value(0), stack(errorCode) {}

    /**
     * Finds the next (byte sequence, value) pair if there is one.
     * @return TRUE if there is another element.
     */
    UBool next(UErrorCode &errorCode);

    /**
     * @return TRUE if there are more elements.
     */
    UBool hasNext() const { return trie.pos!=NULL || !stack.isEmpty(); }

    /**
     * @return the NUL-terminated byte sequence for the last successful next()
     */
    const StringPiece &getString() const { return sp; }
    /**
     * @return the value for the last successful next()
     */
    int32_t getValue() const { return value; }

private:
    // The stack stores pairs of integers for backtracking to another
    // outbound edge of a branch node.
    // The first integer is an offset from ByteTrie.bytes.
    // The second integer has the str.length() from before the node in bits 27..0,
    // and the state in bits 31..28.
    // Except for the following values for a three-way-branch node,
    // the lower values indicate how many branches of a list-branch node
    // are left to be visited.
    static const int32_t kThreeWayBranchEquals=0xe;
    static const int32_t kThreeWayBranchGreaterThan=0xf;

    ByteTrie trie;

    CharString str;
    StringPiece sp;
    int32_t value;

    UVector32 stack;
};

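The comment above describes how each stack entry's second integer packs the pre-branch string length into bits 27..0 and a 4-bit state into bits 31..28. That packing can be sketched in isolation (helper names `packState`, `stateOf`, `lengthOf` are illustrative, not ICU API; the unsigned cast avoids signed-shift overflow for states >= 8):

```cpp
#include <cassert>
#include <cstdint>

// Pack a 4-bit state and a 28-bit string length into one int32_t,
// as ByteTrieIterator stores them on its UVector32 stack.
int32_t packState(int32_t state, int32_t strLength) {
    return (int32_t)(((uint32_t)state<<28)|(uint32_t)strLength);
}

int32_t stateOf(int32_t packed) { return (packed>>28)&0xf; }

int32_t lengthOf(int32_t packed) { return packed&0xfffffff; }
```

States 0..0xd count the remaining branches of a list-branch node, while 0xe and 0xf are reserved for the two pending edges of a three-way branch, which is why 4 bits suffice.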
UBool
ByteTrieIterator::next(UErrorCode &errorCode) {
    if(U_FAILURE(errorCode)) {
        return FALSE;
    }
    if(trie.pos==NULL) {
        if(stack.isEmpty()) {
            return FALSE;
        }
        // Read the top of the stack and continue with the next outbound edge of
        // the branch node.
        // The last outbound edge causes the branch node to be popped off the stack
        // and the iteration to continue from the trie.pos there.
        int32_t stackSize=stack.size();
        int32_t state=stack.elementAti(stackSize-1);
        trie.pos=trie.bytes+stack.elementAti(stackSize-2);
        str.truncate(state&0xfffffff);
        state=(state>>28)&0xf;
        if(state==kThreeWayBranchEquals) {
            int32_t node=*trie.pos;  // Known to be a three-way-branch node.
            uint8_t trieByte=trie.pos[1];
            trie.pos+=node+3;  // Skip node, trie byte and fixed-width integer.
            UBool isFinal=trie.readCompactInt();
            // Rewrite the top of the stack for the greater-than branch.
            stack.setElementAt((int32_t)(trie.pos-trie.bytes), stackSize-2);
            stack.setElementAt((kThreeWayBranchGreaterThan<<28)|str.length(), stackSize-1);
            str.append((char)trieByte, errorCode);
            if(isFinal) {
                value=trie.value;
                trie.stop();
                sp.set(str.data(), str.length());
                return TRUE;
            } else {
                trie.pos+=trie.value;
            }
        } else if(state==kThreeWayBranchGreaterThan) {
            // Pop the state.
            stack.setSize(stackSize-2);
        } else {
            // Remainder of a list-branch node.
            // Read the next key byte.
            str.append((char)*trie.pos++, errorCode);
            if(state>0) {
                UBool isFinal=trie.readCompactInt();
                // Rewrite the top of the stack for the next branch.
                stack.setElementAt((int32_t)(trie.pos-trie.bytes), stackSize-2);
                stack.setElementAt(((state-1)<<28)|(str.length()-1), stackSize-1);
                if(isFinal) {
                    value=trie.value;
                    trie.stop();
                    sp.set(str.data(), str.length());
                    return TRUE;
                } else {
                    trie.pos+=trie.value;
                }
            } else {
                // Pop the state.
                stack.setSize(stackSize-2);
            }
        }
    }
    for(;;) {
        int32_t node=*trie.pos++;
        if(node>=ByteTrie::kMinValueLead) {
            // Deliver value for the byte sequence so far.
            UBool isFinal=trie.readCompactInt(node);
            value=trie.value;
            if(isFinal) {
                trie.stop();
            }
            sp.set(str.data(), str.length());
            return TRUE;
        } else if(node<ByteTrie::kMinLinearMatch) {
            // Branch node, needs to take the first outbound edge and push state for the rest.
            if(node<ByteTrie::kMinListBranch) {
                // Branching on a byte value,
                // with a jump delta for less-than, a compact int for equals,
                // and continuing for greater-than.
                stack.addElement((int32_t)(trie.pos-1-trie.bytes), errorCode);
                stack.addElement((kThreeWayBranchEquals<<28)|str.length(), errorCode);
                // For the less-than branch, ignore the trie byte.
                ++trie.pos;
                // Jump.
                int32_t delta=trie.readFixedInt(node);
                trie.pos+=delta;
            } else {
                // Branch node with a list of key-value pairs where
                // values are compact integers: either final values or jump deltas.
                int32_t length=node-ByteTrie::kMinListBranch;  // Actual list length minus 2.
                // Read the first (key, value) pair.
                uint8_t trieByte=*trie.pos++;
                UBool isFinal=trie.readCompactInt();
                stack.addElement((int32_t)(trie.pos-trie.bytes), errorCode);
                stack.addElement((length<<28)|str.length(), errorCode);
                str.append((char)trieByte, errorCode);
                if(isFinal) {
                    value=trie.value;
                    trie.stop();
                    sp.set(str.data(), str.length());
                    return TRUE;
                } else {
                    trie.pos+=trie.value;
                }
            }
        } else {
            // Linear-match node, append length bytes to str.
            int32_t length=node-ByteTrie::kMinLinearMatch+1;
            str.append(reinterpret_cast<const char *>(trie.pos), length, errorCode);
            trie.pos+=length;
        }
    }
}

U_NAMESPACE_END
|
||||
|
||||
#endif // __BYTETRIEITERATOR_H__
164 docs/design/struct/tries/bytestrie/denseranges.h Normal file
@@ -0,0 +1,164 @@
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
/*
*******************************************************************************
*   Copyright (C) 2010, International Business Machines
*   Corporation and others.  All Rights Reserved.
*******************************************************************************
*   file name:  denseranges.h
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2010sep25
*   created by: Markus W. Scherer
*
*   Helper code for finding a small number of dense ranges.
*/

#ifndef __DENSERANGES_H__
#define __DENSERANGES_H__

#include "unicode/utypes.h"

// Definitions in the anonymous namespace are invisible outside this file.
namespace {

/**
 * Collect up to 15 range gaps and sort them by ascending gap size.
 */
class LargestGaps {
public:
    LargestGaps(int32_t max) : maxLength(max<=kCapacity ? max : kCapacity), length(0) {}

    void add(int32_t gapStart, int64_t gapLength) {
        int32_t i=length;
        while(i>0 && gapLength>gapLengths[i-1]) {
            --i;
        }
        if(i<maxLength) {
            // The new gap is now one of the maxLength largest.
            // Insert the new gap, moving up smaller ones of the previous
            // length largest.
            int32_t j= length<maxLength ? length++ : maxLength-1;
            while(j>i) {
                gapStarts[j]=gapStarts[j-1];
                gapLengths[j]=gapLengths[j-1];
                --j;
            }
            gapStarts[i]=gapStart;
            gapLengths[i]=gapLength;
        }
    }

    void truncate(int32_t newLength) {
        if(newLength<length) {
            length=newLength;
        }
    }

    int32_t count() const { return length; }
    int32_t gapStart(int32_t i) const { return gapStarts[i]; }
    int64_t gapLength(int32_t i) const { return gapLengths[i]; }

    int32_t firstAfter(int32_t value) const {
        if(length==0) {
            return -1;
        }
        int32_t minValue=0;
        int32_t minIndex=-1;
        for(int32_t i=0; i<length; ++i) {
            if(value<gapStarts[i] && (minIndex<0 || gapStarts[i]<minValue)) {
                minValue=gapStarts[i];
                minIndex=i;
            }
        }
        return minIndex;
    }

private:
    static const int32_t kCapacity=15;

    int32_t maxLength;
    int32_t length;
    int32_t gapStarts[kCapacity];
    int64_t gapLengths[kCapacity];
};

}  // namespace

/**
 * Does it make sense to write 1..capacity ranges?
 * Returns 0 if not, otherwise the number of ranges.
 * @param values Sorted array of signed-integer values.
 * @param length Number of values.
 * @param density Minimum average range density, in 256th. (0x100=100%=perfectly dense.)
 *                Should be 0x80..0x100, must be 1..0x100.
 * @param ranges Output ranges array.
 * @param capacity Maximum number of ranges.
 * @return Minimum number of ranges (at most capacity) that have the desired density,
 *         or 0 if that density cannot be achieved.
 */
U_CAPI int32_t U_EXPORT2
uprv_makeDenseRanges(const int32_t values[], int32_t length,
                     int32_t density,
                     int32_t ranges[][2], int32_t capacity) {
    if(length<=2) {
        return 0;
    }
    int32_t minValue=values[0];
    int32_t maxValue=values[length-1];  // Assume minValue<=maxValue.
    // Use int64_t variables for intermediate-value precision and to avoid
    // signed-int32_t overflow of maxValue-minValue.
    int64_t maxLength=(int64_t)maxValue-(int64_t)minValue+1;
    if(length>=(density*maxLength)/0x100) {
        // Use one range.
        ranges[0][0]=minValue;
        ranges[0][1]=maxValue;
        return 1;
    }
    if(length<=4) {
        return 0;
    }
    // See if we can split [minValue, maxValue] into 2..capacity ranges,
    // divided by the 1..(capacity-1) largest gaps.
    LargestGaps gaps(capacity-1);
    int32_t i;
    int32_t expectedValue=minValue;
    for(i=1; i<length; ++i) {
        ++expectedValue;
        int32_t actualValue=values[i];
        if(expectedValue!=actualValue) {
            gaps.add(expectedValue, (int64_t)actualValue-(int64_t)expectedValue);
            expectedValue=actualValue;
        }
    }
    // We know gaps.count()>=1 because we have fewer values (length) than
    // the length of the [minValue..maxValue] range (maxLength).
    // (Otherwise we would have returned with the one range above.)
    int32_t num;
    for(i=0, num=2;; ++i, ++num) {
        if(i>=gaps.count()) {
            // The values are too sparse for capacity or fewer ranges
            // of the requested density.
            return 0;
        }
        maxLength-=gaps.gapLength(i);
        if(length>num*2 && length>=(density*maxLength)/0x100) {
            break;
        }
    }
    // Use the num ranges with the num-1 largest gaps.
    gaps.truncate(num-1);
    ranges[0][0]=minValue;
    for(i=0; i<=num-2; ++i) {
        int32_t gapIndex=gaps.firstAfter(minValue);
        int32_t gapStart=gaps.gapStart(gapIndex);
        ranges[i][1]=gapStart-1;
        ranges[i+1][0]=minValue=(int32_t)(gapStart+gaps.gapLength(gapIndex));
    }
    ranges[num-1][1]=maxValue;
    return num;
}

#endif  // __DENSERANGES_H__
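The one-range decision in `uprv_makeDenseRanges()` above can be exercised in isolation. The sketch below (plain C++, no ICU headers; `meetsDensity` is a hypothetical helper name, not ICU API) reproduces the test `length >= (density*maxLength)/0x100`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the one-range test in uprv_makeDenseRanges():
// returns true if the single range [values[0], values[length-1]] is at least
// density/256 full of actual values.
bool meetsDensity(const int32_t values[], int32_t length, int32_t density) {
    if (length <= 2) { return false; }
    // int64_t avoids signed overflow of maxValue-minValue, as in the original.
    int64_t maxLength = (int64_t)values[length - 1] - (int64_t)values[0] + 1;
    return length >= (density * maxLength) / 0x100;
}
```

For example, `{1,2,3,4,10}` spans 10 values with 5 present (50% dense), so it passes at density 0x80, while a set of 5 values spanning 100 does not.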
1451 docs/design/struct/tries/bytestrie/genpname.cpp Normal file
File diff suppressed because it is too large.
135 docs/design/struct/tries/bytestrie/index.md Normal file
@@ -0,0 +1,135 @@
---
layout: default
title: BytesTrie
parent: Data Structures
grand_parent: Design Docs
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# BytesTrie

This is an idea for a trie that is intended to be fairly simple but also fairly
efficient and versatile. It maps from arbitrary byte sequences to 32-bit
integers. (Small non-negative integers are stored more efficiently. Negative
integers are the least efficient.)

Input strings would be mapped to byte sequences. Invariant-character strings
could be used directly, if the trie was built for the appropriate charset
family, or we could map EBCDIC input to ASCII (while lowercasing for
case-insensitive matching).

For Thai DBBI, each of U+0E00..U+0EFF could be mapped to its low byte.

For CJK DBBI, we could use UTF-16BE or a slight variant of it. For general
Unicode strings (e.g., time zone names), we could devise a simple encoding that
maps printable ASCII to single bytes, Unihan & Hangul and some other ranges to
two bytes per character, and the rest to three bytes per character. (We could
also use this for CJK DBBI, to reduce the number of such "converters".) Or, we
use a [UCharsTrie](../ucharstrie.md) for those.

Sample code is linked below.

See the [UCharsTrie](../ucharstrie.md) sibling page for some more details. The
BytesTrie and UCharsTrie structures are nearly the same, except that the
UCharsTrie uses fewer, larger units.

See also the [diagram of a BytesTrie for a sample set of string-to-value
mappings](https://docs.google.com/drawings/edit?id=1-doZNpcByYItcDAcvKmIpwJMWFgXpYCm43GnUrbat3g).

## Design points

* The BytesTrie and UCharsTrie are designed to be
  byte-serialized/UChar-serialized, for trivial platform swapping.
* Compact: Small values and jump deltas should be encoded in few bytes. This
  requires variable-length encodings.
* The length of each value/delta is encoded either in a preceding node or in
  its own lead unit. This makes skipping values efficient, and fewer units
  need to be range-checked while reading variable-length values.
* Nodes with small values are encoded in single units.
* Linear-match nodes match a sequence of units without choice/selection.
* Branches
    * Branches store relative deltas to "jump" to following nodes. Small
      deltas are encoded in single units; encoding deltas is much more
      efficient than encoding absolute offsets.
    * Variable-width values make binary search on branch nodes infeasible.
      Therefore, branches with lists of (key, value) pairs are limited to
      short list lengths for linear search.
    * For large branches, branch nodes contain one unit, for branching to the
      left (less-than) or to the right (greater-or-equal). This encodes a
      binary search into the data structure.
        * Initially, I had an equals edge in split-branch sub-nodes as well,
          but that slowed down matching significantly (9% in one case) without
          noticeably helping with the serialized size (0.2% in that case).
    * At the end of each node (except for a final-value node), matching
      continues with the next node, rather than using another jump to a
      different location.
    * Each branch head node encodes the length of the branch (the number of
      units to select from). The split-branch and list-branch sub-nodes do not
      have node heads. Instead, the code tracks the remaining length of the
      branch, halving it for each split-branch edge and counting down in a
      list-branch sub-node.
    * The maximum length of a list-branch sub-node is fixed, that is, part of
      the serialized data format and cannot be changed compatibly. This
      constant is used in the branching code to decide whether to split
      less-than/greater-or-equal vs. walk a list of key-value pairs.
    * This constant must be at least 3 so that split-branch sub-nodes have a
      length of at least 4 so that the following list-branch nodes have a
      length of at least 2 and can use a do-while loop rather than a while
      loop. (Saving one length check.)
    * I explored an alternative, with only split-branch nodes down to length 1
      and then a final match unit with continuing matching after that. It was
      fast but also significantly larger. A branch like this is about twice
      the size of a key-value pair list. If the average list-branch length is
      n, a branch has (length/n)-1 split-branch sub-nodes. This experiment
      corresponds to n=1.
* API
    * The API is simple and low-level. At the core, next(unit) "turns the
      crank" and returns basically a 2-bit result that encodes matches() (this
      unit continues a matching sequence), hasNext() (another unit can
      continue a matching sequence) and hasValue() (the units so far are a
      matching string).
    * Higher-level functions that handle different input (e.g., normalize
      units on the fly) and provide variations of functionality (e.g., longest
      match, startsWith, find all matches from some point in text, ...) can be
      built on top of the low-level functions without cluttering the API or
      pulling in further dependencies.
    * The next(unit) function stops on a value node rather than decoding the
      value, saving time until the value is requested (via getValue()). The
      following next(unit2) call will then skip over the value node.
    * There is enough API to serve a variety of uses, including
      matching/mapping whole strings, finding out if a prefix belongs only to
      strings with the same value, getting all units that can continue from
      some point, and getting all (string, value) pairs. This should be able
      to support lookups, parsing with abbreviations, word segmentation, etc.
* The "fast" builder code is simple. The builder builds, it need not use a
  trie structure until writing the serialized form, and it need not provide
  any of the trie runtime API.
* There is builder code that makes a "small" trie, attempting to avoid writing
  duplicate nodes. This is possible when whole trees of nodes are the same and
  at least one is reached via a "jump" delta which can "jump" to the
  previously written serialization of such a tree.
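The variable-length value idea from the design points above can be illustrated with a toy scheme. This is not the actual BytesTrie byte layout; it only demonstrates the lead-unit principle: the first byte alone determines the total width, so a reader can skip a value without decoding it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy variable-length integer encoding (NOT the real BytesTrie format).
// The two high bits of the lead byte select the total width: 1, 2, or 4 bytes.
void appendCompact(std::vector<uint8_t> &out, uint32_t value) {
    if (value < 0x40) {            // 6 bits: one byte, lead 00xxxxxx
        out.push_back((uint8_t)value);
    } else if (value < 0x4000) {   // 14 bits: two bytes, lead 01xxxxxx
        out.push_back((uint8_t)(0x40 | (value >> 8)));
        out.push_back((uint8_t)value);
    } else {                       // up to 30 bits: four bytes, lead 10xxxxxx
        out.push_back((uint8_t)(0x80 | (value >> 24)));
        out.push_back((uint8_t)(value >> 16));
        out.push_back((uint8_t)(value >> 8));
        out.push_back((uint8_t)value);
    }
}

// Width of the value starting at this lead byte: skipping a value needs to
// range-check only the lead unit, as described in the design points.
int32_t compactLength(uint8_t lead) {
    return lead < 0x40 ? 1 : lead < 0x80 ? 2 : 4;
}

uint32_t readCompact(const uint8_t *p) {
    if (p[0] < 0x40) { return p[0]; }
    if (p[0] < 0x80) { return ((uint32_t)(p[0] & 0x3F) << 8) | p[1]; }
    return ((uint32_t)(p[0] & 0x3F) << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | p[3];
}
```

Small values cost one byte, and larger jump deltas grow only as needed, which is the compactness property the design points aim for.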

## Sample Code

The following demo code was last updated Nov. 2010:

* [`bytetrie.h`](./bytetrie.h)
* [`bytetriebuilder.h`](./bytetriebuilder.h)
* [`bytetriedemo.cpp`](./bytetriedemo.cpp)
* [`bytetrieiterator.h`](./bytetrieiterator.h)
* [`denseranges.h`](./denseranges.h)
* [`genpname.cpp`](./genpname.cpp)

### Latest versions of source code

The latest versions of the above sample code (except for `bytetriedemo.cpp`) exist in the ICU repository, sometimes under slightly different names and reorganized:

* [icu4c/source/common/unicode/**bytestrie.h**](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/bytestrie.h)
* [icu4c/source/common/unicode/**bytestriebuilder.h**](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/bytestriebuilder.h)
* [icu4c/source/tools/toolutil/**denseranges.h**](https://github.com/unicode-org/icu/blob/main/icu4c/source/tools/toolutil/denseranges.h)
* [tools/unicode/c/genprops/**pnamesbuilder.cpp**](https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/pnamesbuilder.cpp)
38 docs/design/struct/tries/index.md Normal file
@@ -0,0 +1,38 @@
---
layout: default
title: ICU String Tries
parent: Data Structures
grand_parent: Design Docs
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU String Tries

We have several implementations of string tries, mapping strings to boolean or
integer values: Currently for time zone name parsing and DBBI. Other areas might
also benefit from tries: Property names, character names, UnicodeSetStringSpan,
.dat package file TOC.

We should have a small number of common map-from-string trie implementations;
fairly compact, fairly efficient, easily serializable, and well-tested.

See the subpages for ideas.

For a UnicodeSetStringSpan, we would want to find each next match starting from
some point in the text, rather than passing each unit of text and finding out if
the units so far match.

Note: In terms of whole-string-lookup performance, the fastest data structure is
a hash map. Where whole-string-lookup is the only relevant operation, we could
consider implementing an easily serialized hash map.

See also [ICU Code Point Tries](../utrie.md).

Implementations:

* [BytesTrie](./bytestrie/)
* [UCharsTrie](./ucharstrie)
32 docs/design/struct/tries/ucharstrie.md Normal file
@@ -0,0 +1,32 @@
---
layout: default
title: UCharsTrie
parent: Data Structures
grand_parent: Design Docs
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UCharsTrie

Same design as a [BytesTrie](bytestrie/index.md), but mapping any UnicodeString
(any sequence of 16-bit units) to 32-bit integer values. This can use somewhat
simpler code because there are more bits to work with in each unit, and it is
probably more appropriate and faster than a BytesTrie for collation
contractions/prefixes, CJK dictionaries, and maybe for use with Unicode strings
in general when it is not known that we work with a small script or mostly with
ASCII.

The code and data structure are quite similar to the BytesTrie. In general,
larger units are used to store larger values and deltas in single units than
possible in a BytesTrie, and fewer variable-length units are needed in all
cases.

In addition, some of the bits of match-nodes (linear-match and branch nodes) are
used for intermediate values (small values or most significant bits), rather
than separate intermediate-value nodes in a BytesTrie. Larger intermediate
values have one or two units following the match node head, then followed by the
match node's contents.
312 docs/design/struct/utrie.md Normal file
@@ -0,0 +1,312 @@
---
layout: default
title: ICU Code Point Tries
parent: Data Structures
grand_parent: Design Docs
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Code Point Tries
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Fast lookup in arrays

For fast lookup by code point, we store data in arrays. It costs too much space
to use a single array indexed directly by code points: There are about 1.1M of
them (max 0x10ffff, about 20.1 bits), and about 90% are unassigned or private
use code points. For some uses, there are non-default values only for a few
hundred characters.

We use a form of "trie" adapted to single code points. The bits in the code
point integer are divided into two or more parts. The first part is used as an
array offset, the value there is used as a start offset into another array. The
next code point bit field is used as an additional offset into that array, to
fetch another value. The final part yields the data for the code point.
Non-final arrays are called index arrays or tables.
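The split-bit-field lookup described above can be sketched as a minimal two-stage trie. The 5-bit last field and the struct below are illustrative only, not any specific ICU structure:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal two-stage code point lookup (illustrative bit split, not a real
// ICU trie): the high bits index into `index`, which yields the start of a
// 32-entry data block; the low 5 bits select the value within the block.
struct TwoStageTrie {
    std::vector<uint16_t> index;  // one entry per 32 code points
    std::vector<uint16_t> data;   // 32-entry blocks; blocks can be shared

    uint16_t get(uint32_t c) const {
        uint32_t blockStart = index[c >> 5];  // first stage: block start
        return data[blockStart + (c & 0x1F)]; // second stage: offset in block
    }
};
```

Because many `index` entries can point at the same all-default data block, ranges of code points with identical values cost almost nothing, which is the block-sharing property discussed below.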

> See also [ICU String Tries](tries/index.md).

For lookup of arbitrary code points, we need at least three successive arrays,
so that the first index table is not too large.

For all but the first index table, different blocks of code points with the same
values can overlap. A special block contains only default values and is shared
among all blocks of code points that map there.

Block sharing works better, and thus leads to smaller data structures, the
smaller the blocks are, that is, the fewer bits in the code point bit fields
used as intra-block offsets.

On the other hand, shorter bit fields require more bit fields and more
successive arrays and lookups, which adds code size and makes lookups slower.

(Until about 2001, all ICU data structures only handled BMP code points.
"Compact arrays" split 16-bit code points into fields of 9 and 7 bits.)

We tend to make compromises including additional index tables for smaller parts
of the Unicode code space, for simpler, faster lookup there.

For a general-purpose structure, we want to be able to store a unique
value for every character. This determines the number of bits needed in the last
index table. With 136,690 characters assigned in Unicode 10, we need at least 18
bits. We allocate data values in blocks aligned at multiples of 4, and we use
16-bit index words shifted left by 2 bits. This leads to a small loss in how
densely the data table can be used, and how well it can be compacted, but not
nearly as much as if we were using 32-bit index words.

## Character conversion

The ICU conversion code uses several variants of code point tries with data
values of 1, 2, 3, or 4 bytes corresponding to the number of bytes in the output
encoding.

## UTrie

The original "UTrie" structure was developed for Unicode Normalization for all
of Unicode. It was then generalized for collation, character properties, and
eventually almost every Unicode data lookup. Values are 16 or 32 bits wide.

It was designed for fast UTF-16 lookup with a special, complicated structure for
supplementary code points using custom values for lead surrogate units. This
custom data and code made this structure relatively hard to use.

11:5 bits for the BMP and effectively 5:5:5:5 bits for supplementary code points
provide for good compaction. The BMP index table is always 2<sup>11</sup> uint16_t = 4kB.
Small index blocks for the supplementary range are added as needed.

The structure stores different values for lead surrogate code *units* (for fast
moving through UTF-16) vs. code *points* (for lookup by code point).

The first 256 data table entries are a fixed-size, linear table for Latin-1 (up
to U+00FF).

## UTrie2

The "UTrie2" structure, developed in 2008, was designed to enable fast lookup
from UTF-8 without always having to assemble whole code points and to split them
again into the trie bit fields.

It retains separate lookups for lead surrogate code units vs. code points.

It retains the same 11:5 lookup for BMP code points, for good compaction and
good performance.

There is a special small index for lead bytes of two-byte UTF-8 sequences (up to
U+07FF), for 5:6 lookup. These index values are not shifted left by 2.
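That 5:6 lookup can be sketched as follows. This models only the idea (indexing directly from the lead byte's five payload bits and the trail byte's six bits, without assembling a code point); it is not UTrie2's actual serialized layout, and the struct name is made up:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative 5:6 lookup for two-byte UTF-8 (not UTrie2's real layout):
// a lead byte C2..DF carries 5 usable index bits, a trail byte 80..BF
// carries 6, so a value can be fetched without assembling the code point.
struct TwoByteUtf8Index {
    uint16_t index2[32];    // one entry per lead byte C0..DF
    uint16_t data[64 * 2];  // 64-entry blocks addressed by the trail byte

    uint16_t lookup(uint8_t lead, uint8_t trail) const {
        // lead is 0xC2..0xDF, trail is 0x80..0xBF.
        return data[index2[lead & 0x1F] + (trail & 0x3F)];
    }
};
```

For example, U+00E9 is `C3 A9` in UTF-8; the lookup uses `0xC3 & 0x1F` as the first-stage index and `0xA9 & 0x3F` as the offset within the data block.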
|
||||
|
||||
Lookup for three-byte UTF-8 uses the BMP index, which is clumsy.
|
||||
|
||||
Lookup for supplementary code points is much simpler than with UTrie, without
|
||||
custom data values or code. Two index tables are used for 9:6:5 code point bits.
|
||||
The first index table omits the BMP part. The structure stores the a code point
|
||||
after which every one maps to the default value, and the first index is
|
||||
truncated to below that.
|
||||
|
||||
With the fixed BMP index table and other required structures, an empty UTrie2 is
|
||||
about 5kB large.
|
||||
|
||||
The UTF-8 lookup was also designed for the original handling of ill-formed
|
||||
UTF-8: The first 192 data table entries are a linear table for ASCII plus the 64
|
||||
trail bytes, to look up "single" bytes 0..BF without further checking, with
|
||||
error values for the trail bytes. Lookup of two-byte non-shortest forms (C0
|
||||
80..C1 BF) also yields error values. These error values became unused in 2017
|
||||
when ICU 60 changed to handling ill-formed UTF-8 compatible with the W3C
|
||||
Encoding standard (substituting maximal subparts of valid sequences). C0 and C1
|
||||
are no longer recognized as lead bytes, requiring full byte sequence validation
|
||||
separate from the data lookup.
|
||||
|
||||
## Ideas
|
||||
|
||||
Possible goals: Simpler code, smaller data especially for sparse tries, maybe
|
||||
faster UTF-8, not much slower UTF-16.
|
||||
|
||||
We should try to store only one set of values for surrogates. Unicode property
|
||||
APIs use only by-code point lookup without special lead surrogate values.
|
||||
Collation uses special lead surrogate data but does not use code point lookup.
|
||||
Normalization does both, but the per-code point lookup could test for surrogate
|
||||
code points first and return trivial values for all of them. UTF-16 string
|
||||
lookup should map unpaired surrogates to the error value.
|
||||
|
||||
We should remove the special data for old handling of ill-formed UTF-8, the
|
||||
error values for trail bytes and two-byte non-shortest forms.
|
||||
|
||||
If we use 6 bits for the last code point bit field, then we can use the same
|
||||
index table for code point/UTF-16 lookup as well as UTF-8 lookup. Compaction
|
||||
will be less effective, so data will grow some. This would be somewhat
|
||||
compensated by the smaller BMP index table.
|
||||
|
||||
If we also continue to use 6 bits for the second-to-last table, that is, 8:6:6
|
||||
bits, then we can simplify the code for three- and four-byte UTF-8.
|
||||
|
||||
If we always include the BMP in the first index table, then we can also simplify
|
||||
enumeration code a bit, and use smaller code for code point lookups where code
|
||||
size is more important than maximum speed.
|
||||
|
||||
Alternatively, we could improve compaction and speed for the BMP by using no
|
||||
index shift-left for BMP indexes (and keep omitting the BMP part of the first
|
||||
index table). In order to ensure that BMP data can be indexed directly with
|
||||
16-bit index values, the builder would probably have to copy at least the BMP
|
||||
data into a new array for compaction, before adding data for supplementary code
|
||||
points. When some of the indexes are not shifted, and their data is compacted to
|
||||
arbitrary offsets, then that data cannot also be addressed with uniform
|
||||
double-index lookup. We may or may not store unused first-index entries. If not
|
||||
the whole BMP is indexed differently, then UTF-16 and three-byte UTF-8 lookups
|
||||
need another code branch. (Size vs. simplicity & speed.)
|
||||
|
||||
The more tries we use, the higher the total cost of the size overhead. (For
|
||||
example, many of the 100 or so collation tailorings carry a UTrie2.) The less
|
||||
overhead, the more we could use separate tries where we currently combine them
|
||||
or avoid them. Smaller overhead would make it more attractive to offer a public
|
||||
code point map structure.
|
||||
|
||||
Going to 10:6 bits for the BMP cuts the fixed-size index in half, to 2kB.
|
||||
|
||||
We could reduce the fixed-size index table much further by using two-index
|
||||
lookup for some or most of the BMP, trading off data size for speed and
|
||||
simplicity. The index must be at least 32 uint16_t's for two-byte UTF-8, for up
|
||||
to U+07FF including Cyrillic and Arabic. We could experiment with length 64 for
|
||||
U+0FFF including Indic scripts and Thai, 208 entries for U+33FF (most small
|
||||
scripts and Kana), or back to 1024 entries for the whole BMP. We could configure
|
||||
a different value at build time for different services (optimizing for speed vs.
|
||||
size). If we use the faster lookup for three-byte UTF-8, then the boundaries
|
||||
should be multiples of 0x1000 (up to U+3FFF instead of U+33FF).
|
||||
|
||||
## UCPTrie / CodePointTrie
|
||||
|
||||
Added as public API in ICU 63. Developed between the very end of 2017 and
|
||||
mid-2018.
|
||||
|
||||
Based on many of the ideas above and experimentation.
|
||||
|
||||
Continued linear data array lookup for ASCII.
|
||||
|
||||
No more separate values for lead surrogate code points vs. code units.
|
||||
|
||||
* Normalization switched to UCPTrie, working around this: Storing special lead
|
||||
surrogate values for UTF-16 forward iteration; for code point lookup, the
|
||||
normalization code checks for lead surrogates and returns an "inert" value
|
||||
for them; for code point range iteration uses special API that treats lead
|
||||
surrogates as "inert" as well.
|
||||
* Otherwise simpler API, easier to explain.
|
||||
* UTF-16 string lookup maps unpaired surrogates to the error value.
|
||||
|
||||
For low-code point lookup, uses 6 bits for the last code point field.
|
||||
|
||||
* No more need for special UTF-8 2/3-byte lookup structures.
|
||||
* Smaller BMP index reduces size overhead.
|
||||
|
||||
No more data structures for non-shortest UTF-8 sequences.
|
||||
|
||||
"Fast" type uses two-stage lookup for all of the BMP (10:6 bits). "Small" type
|
||||
uses two-stage lookup only up to U+0FFF to trade off size vs. speed. (fastLimit
|
||||
U+10000 vs. U+1000)
|
||||
|
||||
Continued use of highStart for the start of the last range (ending at U+10FFFF),
|
||||
and highValue for the value of all of its code points.
|
||||
|
||||
For code points between fastLimit and highStart, a four-stage lookup is used
|
||||
(compared with three stages in UTrie2), with small bit fields (6:5:5:4 bits).
|
||||
"Fast" type: Only for supplementary code points below highStart, if any. "Small"
|
||||
type: For all code points below highStart; this means that for U+0000..U+0FFF in
|
||||
a "small" trie data can be accessed with either the two-stage or the four-stage
|
||||
lookup (and for ASCII also with linear access).

Experimentation confirmed that larger bit fields, especially for the last one or
two stages, lead to poor compaction of sparse data. 6 bits for the data offset
work well for UTF-8 lookup and are a reasonable compromise for the BMP, but for
the large supplementary area, which tends to have more sparse data, a 4-bit
data offset was useful. The drawback is that the index blocks then get larger
and compact less well. Four-byte UTF-8 lookup (for supplementary

* Started with 8:6:6 bits, but some tries were 30% larger than with UTrie2.
* Went to 10:6:4 bits, which saved 12% overall, with only one trie larger than
  UTrie2 (by 8%).
* Experimented with a "gap": omitting parts of the index for another range
  like highStart, for a typically large range of code points with a single
  common value. This helped.
* Experimented with 10:6:4 vs. 11:5:4 vs. 9:6:5 vs. 10:5:5 bits plus the gap.
  \*:4 were smaller than \*:5, but the bit distribution within the index
  stages had little effect. 11:5:4 yielded the smallest data, indicating that
  small bit fields are useful for index stages as well.
* Replaced the gap with splitting the first index bit field into two, for a
  four-stage 6:5:5:4 lookup. Just slightly smaller data than 11:5:4+gap, but
  less complicated than checking for the gap and working around it; replaces
  gap start/limit reads and comparisons with unconditional index array
  accesses. 14% smaller overall than UTrie2.
* Added the "small" type where the two-stage lookup only extends to U+0FFF
  (6:6 bits) and the four-stage lookup covers all code points below highStart.
  34% smaller overall than UTrie2.

The normalization code also lazy-builds a trie with CanonicalIterator data which
is very sparse even in the BMP. With a "fast" UCPTrie it is significantly larger
than with UTrie2; with a "small" UCPTrie it is significantly smaller. Switched
the code to use a "small" trie because it is less performance-sensitive than the
trie used for normalizing strings.

In order to cover up to 256k data values, UTrie2 always shifts 16-bit data block
start offsets left by 2. UCPTrie abandons this, which simplifies two-stage
lookups slightly and improves compaction (no more granularity of 4 for data
block alignment).
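To make the multi-stage scheme concrete, here is a minimal sketch of a two-stage lookup. This is not the real UCPTrie bit layout or API; the shift width, block length, and arrays are toy values chosen for illustration. The first index stage maps `cp >> SHIFT` to the start of a data block, and the low bits select the value within it; identical blocks (here the all-zero default block) share storage, which is where the compaction discussed above comes from.

```c
#include <stdint.h>

/* Toy parameters: 2-bit data offset, so each data block holds 4 values. */
enum { SHIFT = 2, MASK = (1 << SHIFT) - 1 };

static const uint16_t data[] = {
    0, 0, 0, 0,   /* block at offset 0: shared default block */
    5, 6, 7, 8,   /* block at offset 4: values for cp 4..7 */
};

/* index1[i] is the data offset for code points [i << SHIFT, (i+1) << SHIFT).
 * Ranges 0, 2, and 3 all share the default block at offset 0. */
static const uint16_t index1[] = { 0, 4, 0, 0 };

static uint16_t trie_get(uint32_t cp) {
    return data[index1[cp >> SHIFT] + (cp & MASK)];
}
```

A real trie adds more index stages for the supplementary range and a `highStart` fast path, but every stage is this same "index entry + masked offset" step.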

* For a "fast" trie to always reach all BMP data values with 16-bit index
  entries, the data array is always accessed via a separate pointer, rather
  than UTrie2's sharing of the index array with 16-bit data via offsetting by
  the length of the index. This also simplifies code slightly and makes access
  uniform for all data value widths.
* There are now at most 64k data values for BMP code points because there is
  no longer separate data for lead surrogates. The builder code writes data
  blocks in code point order to ensure that low code points have low data
  block offsets.
* For supplementary code points, data block offsets may need 18 bits. This is
  very unusual but possible. (It currently happens only in the collation root
  data with Han radical-stroke order, and in a unit test.)
* UCPTrie uses the high bit of the index-2 entry to indicate that the index-3
  block stores 18-bit data block offsets rather than 16-bit ones. (This
  somewhat limits the length of the index.) In this case, groups of 8 index-3
  entries (= data block start offsets) share an additional entry that stores
  the two high bits of each of the eight entries. More complicated lookup, but
  almost never used, and keeps BMP lookups always simple.
* A possible alternative could have used a bit per entry, or per small group
  of entries, to indicate that a common data value should be returned for
  "unused" parts of a sparse data block. There could have been a common value
  per index-3 block, per index-2 block, or for the whole trie, etc. Rejected
  as much too complicated.
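The 18-bit packing described above can be sketched as follows. This is an illustration of the packing idea, not the exact UCPTrie code: assume `group[0]` is the shared extra entry holding the packed high bits (entry 0's two high bits in bits 15..14, entry 1's in bits 13..12, and so on), and `group[1..8]` hold the low 16 bits of the eight data block offsets.

```c
#include <stdint.h>

/* Reassemble the 18-bit data block offset for entry i (0..7) of a group.
 * group[0]: packed high bits; group[1 + i]: low 16 bits of entry i. */
static uint32_t get18(const uint16_t *group, int i) {
    /* Shift entry i's two high bits up into bits 17..16, mask off the rest. */
    uint32_t high = ((uint32_t)group[0] << (2 + 2 * i)) & 0x30000;
    return high | group[1 + i];
}
```

The cost is one extra array read and a shift, but only on the rare 18-bit path; BMP lookups never see it.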

UTrie2 stores a whole block of 64 error values for UTF-8 non-shortest-form
lookup. UCPTrie no longer has this block; it stores the error value at
the end of the data array, at dataLength-1.

UTrie2 stores the highValue at dataLength-4. UCPTrie stores it at dataLength-2.

Comparison: [UTrie2 vs.
UCPTrie/CodePointTrie](https://docs.google.com/document/d/e/2PACX-1vTbwdDe2tVJ6pACMpOq7uKW_FgvyyjvPVdgZYsIwSoFJj-27cXR20wAO9qHVoaKOIoo-d8iHnsFOCdc/pub)

Sizes for BreakIterator & Collator tries, UTrie2 vs. UTrie3 experiments: see
the "nocid" sheet (no CanonicalIterator data) in [this
spreadsheet](https://docs.google.com/spreadsheets/d/e/2PACX-1vTgL260NFgmbiUAtptKj4fNf9wNm-OJ6Q0TbWzFWvhV7wVZk2Qe-gk2pbJh0pHY9XVsObZ3YaoOnb3I/pubhtml).

The last columns on the "nocid" sheet, highlighted in green and blue, correspond
to the final UCPTrie/CodePointTrie. For these tries, the "fast" type (green)
yields 14% smaller data than UTrie2; the "small" type (blue) yields 34% smaller
data.

The simplenormperf sheets show performance comparison data between UTrie2 and
"fast" UCPTrie. There should be little difference for BMP characters; the
numbers are too inconsistent to show a significant difference.

UCPTrie has an option of storing 8-bit values, in addition to the 16-bit and
32-bit values that UTrie2 supports. It would be possible to add 12-bit or
64-bit values etc. later.
96
docs/devsetup/callgrind/index.md
Normal file
@ -0,0 +1,96 @@
---
layout: default
title: Profiling ICU4C with callgrind
grand_parent: Setup for Contributors
parent: C++ Setup
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Profiling ICU4C with callgrind
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Prerequisites

Valgrind, callgrind and kcachegrind together provide performance profiling of C++
code, including annotated source code with time consumption at each line.

Prerequisites:

* Linux with the clang compiler.
* Valgrind. If not already installed, from the command line:
  * `sudo apt install valgrind`
* kcachegrind. To install:
  * `sudo apt install kcachegrind`

Build ICU. An optimized build with debug symbols is generally best for
profiling:

```
cd icu4c/source
./runConfigureICU --enable-debug Linux
make -j6 check
```

## Run test code

Prepare the test code you wish to measure. Valgrind is very slow, so be wary of
long-running tests. Because Valgrind tracks every last machine instruction (it is
not a sampling profiler), getting good results does not require a long run.

Run the test code under valgrind with callgrind. The example below runs a test
from intltest, but that is not a requirement; valgrind will profile any
executable. The differences from a normal (non-profile) invocation are
highlighted.

Without `LD_BIND_NOW=y` the output is polluted by symbol lookups.

```
LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH
LD_BIND_NOW=y valgrind --tool=callgrind
--callgrind-out-file=callgrind.out ./intltest
translit/TransliteratorTest/TestAllCodepoints
```

The raw profiling data will be left in a callgrind.out file:

```
ls -l callgrind*
-rw------- 1 aheninger eng 325779 Oct  3 15:51 callgrind.out
```

## View in kcachegrind

Run kcachegrind to view the results.

```
kcachegrind callgrind.out
```

Explore. Lots of interesting data is available.

[kcachegrind docs](https://kcachegrind.github.io/html/Documentation.html)

For the above run, here are the top functions, ordered by cumulative time
(including calls out) spent in each.

![kcachegrind: functions by cumulative time](kcache-cumulative.png)

Time spent in each function, self time only. `UnicodeSet::add()` is hot.

![kcachegrind: functions by self time](kcache-flat.png)

Annotated source for `UnicodeSet::add()`

![kcachegrind: annotated source](kcache-source.png)
BIN
docs/devsetup/callgrind/kcache-cumulative.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 266 KiB |
BIN
docs/devsetup/callgrind/kcache-flat.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 254 KiB |
BIN
docs/devsetup/callgrind/kcache-source.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 232 KiB |
@ -1,3 +1,5 @@
// © 2016 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
{
"configurations": [
{
125
docs/devsetup/cpp/index.md
Normal file
@ -0,0 +1,125 @@
---
layout: default
title: C++ Setup
parent: Setup for Contributors
has_children: true
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# C++ Setup
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## C/C++ workspace structure

It is best to keep the source file tree and the build-output files separate
("out-of-source build"). It keeps your source tree clean, and you can build
multiple configurations from the same source tree (e.g., debug build, release
build, build with special flags such as no-using-namespace). You could keep the
source and build trees in parallel folders.

**Important:** If you use runConfigureICU together with CXXFLAGS or similar, the
*custom flags must come before the runConfigureICU invocation* (so that they
are visible as environment variables in the runConfigureICU shell script, rather
than just options text). See the sample runConfigureICU invocations below.

See the ICU4C readme's [Recommended Build
Options](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#RecBuild).

For example:

* `~/icu/mine/**src**`
  * source tree including icu (ICU4C) & icu4j folders
  * setup: mkdir + git clone your fork (see the [Linux Tips
    subpage](linux.md)) + cd to here.
  * Use `git checkout <branch>` to switch between branches.
  * Use `git checkout -b <newbranchname>` to create a new branch and switch
    to it.
  * After switching branches, remember to update your IDE's view of the
    source tree.
  * For C++ code, you may want to `make clean` *before* switching to a
    different branch.
* `~/icu/mine/icu4c/**bld**`
  * release build output
  * not-using-namespace is always recommended
  * setup: mkdir+cd to here, then something like
    `CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
    CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
    ../../src/icu4c/source/**runConfigureICU** Linux
    --prefix=/home/*your_user_name*/icu/mine/inst > config.out 2>&1`
  * build: `make -j5 check > out.txt 2>&1`
* `~/icu/mine/icu4c/**dbg**`
  * debug build output
  * not-using-namespace is always recommended
  * setup: mkdir+cd to here, then something like
    `CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
    CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
    ../../src/icu4c/source/**runConfigureICU** --enable-debug
    --disable-release Linux --prefix=/home/*your_user_name*/icu/mine/inst >
    config.out 2>&1`
  * build: `make -j5 check > out.txt 2>&1`
  * Be sure to test with gcc and g++ too! `CC=gcc CXX=g++
    CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
    CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
    ../../src/icu4c/source/runConfigureICU --enable-debug --disable-release
    Linux`
* `~/icu/mine/icu4c/**nm_utf8**`
  * not-using-namespace and default-hardcoded-UTF-8
  * setup: mkdir+cd to here, then something like
    `../../src/icu4c/source/**configure**
    CXXFLAGS="-DU_USING_ICU_NAMESPACE=0" CPPFLAGS="-DU_CHARSET_IS_UTF8=1
    -DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
    --prefix=/home/*your_user_name*/icu/mine/inst > config.out 2>&1`
* `~/icu/mine/icu4c/**static**`
  * gcc with static linking
  * setup: mkdir+cd to here, then something like
    `../../src/icu4c/source/**configure**
    CXXFLAGS="-DU_USING_ICU_NAMESPACE=0"
    CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -O2 -ffunction-sections
    -fdata-sections" LDFLAGS="-Wl,--gc-sections" --enable-static
    --disable-shared --prefix=/home/*your_user_name*/icu/mine/inst >
    config.out 2>&1`
* `~/icu/mine/`**`inst`**
  * “make install” destination (don’t clobber your platform ICU during
    development)
* `~/icu/**msg48**/src`
  * Optional: You could have multiple parallel workspaces, each with their
    own git clones, to reduce switching a single workspace (and the IDE
    looking at it) from one branch to another.

### Run individual test suites

* `cd ~/icu/mine/icu4c/dbg/test/intltest`
* `export LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw`
* `make -j5 && ./intltest utility/ByteTrieTest utility/UCharTrieTest`
* `cd ~/icu/mine/icu4c/dbg/test/cintltst`
* same relative `LD_LIBRARY_PATH` as for intltest
* `make -j5 && ./cintltst`

## gdb pretty-printing

Shane wrote this gdb script in 2017: It pretty-prints UnicodeString in GDB.
Instead of seeing the raw internals of UnicodeString, you will see the length,
storage type, and content of the UnicodeString in your debugger. There are
installation instructions in the top comment of the file (it's a matter of
downloading the file and adding a line to `~/.gdbinit`).

<https://gist.github.com/sffc/7b3826fd67cb78057a9e66f2b350a647>

This also works in anything that wraps GDB, like CLion and Visual Studio Code.

## Linux Tips

For more Linux-specific tips see the [Linux Tips subpage](linux.md).
178
docs/devsetup/cpp/linux.md
Normal file
@ -0,0 +1,178 @@
---
layout: default
title: C++ Setup on Linux
grand_parent: Setup for Contributors
parent: C++ Setup
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# C++ Setup on Linux
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Compiler

For ICU4C 50 or newer the `configure` script picks `clang` if it is installed,
or else `gcc`. Clang produces superior error messages and warnings.

Most Linux distributions have clang available to install. On Ubuntu or other
Debian-based systems, install it with

```
sudo apt-get install clang
```

Debug builds must use compiler option `-g` and should not optimize (`-O0` is the
default). A future version of `gcc` might support `-Og` as the recommended
optimization level for debugging.

Release builds can use `-O3` for best performance. See
<http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html>

`clang` might even benefit from `-O4`, where "whole program optimization is done
at link time". See
<http://developer.apple.com/library/mac/#documentation/Darwin/Reference/Manpages/man1/clang.1.html>

## Other build flags

On a modern Linux you can configure with `CPPFLAGS="-DU_CHARSET_IS_UTF8=1"`.

## Debugging

`gdb` should work with both out-of-source and in-source builds. If not,
double-check with `make VERBOSE=1` that both .c and .cpp files are compiled
with `-g` and either `-O0` or no `-O` option at all.

`kdbg` is a reasonable GUI frontend for gdb. It keeps the source code in sync
and updates views of variables & memory etc.

* kdbg versions below 2.5.2 do not work with gdb 7.5; you get a message box
  with "GDB: Reading symbols from..."
* As a workaround:
  * Create a `~/.gdbinit` file with `set print symbol-loading off`
  * Start kdbg, open `Settings/Global options` and remove the `--nx`
    argument to gdb.

## Portability Testing

GitHub pull requests are automatically tested on Windows, Linux with both clang
& gcc, and Macintosh. The build results show up as check results on the status
page.

Build errors will block the pull request. It's also useful to check the build
logs for new warnings on platforms other than the one used for development.

## Clang sanitizers

Clang has built-in sanitizers to check for several classes of problems. Here are
the configure options for building ICU with the address checker:

```
CPPFLAGS=-fsanitize=address LDFLAGS=-fsanitize=address ./runConfigureICU
--enable-debug --disable-release Linux --disable-renaming
```

The other available sanitizers are `thread`, `memory` and `undefined` behavior.
At the time of this writing, thread and address run cleanly; the others show
warnings that have not yet been resolved.

## Heap Usage (ICU4C)

HeapTrack is a useful tool for analyzing heap usage of a test program, to check
the total heap activity of a particular function or object creation, for
example. It will show totals by line in the source, and can move up and down the
stack to see more detail.

<https://github.com/KDE/heaptrack>

To install on Linux:

```
sudo apt install heaptrack
sudo apt install heaptrack-gui
```

## Quick Scripts for small test programs

I use the following simple scripts to simplify building and debugging small
stand-alone programs against ICU, without needing to set up makefiles. They
assume a program with a single .cpp file with the same name as the directory in
which it resides.

```
b: build
r: run
d: debug
v: run under valgrind
```

You will probably need to modify them to reflect where you keep your most
commonly used ICU build, and whether you routinely use an out-of-source ICU
build.

```
$ cat `which b`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
clang++ -g -I $ICU_HOME/source/common -I $ICU_HOME/source/i18n -I
$ICU_HOME/source/io -L$ICU_HOME/source/lib -L$ICU_HOME/source/stubdata -licuuc
-licui18n -licudata -o $PROG $PROG.cpp

$ cat `which r`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
LD_LIBRARY_PATH=$ICU_HOME/source/lib:$ICU_HOME/source/stubdata
ICU_DATA=$ICU_HOME/source/data/out ./$PROG

$ cat `which d`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
LD_LIBRARY_PATH=$ICU_HOME/source/lib:$ICU_HOME/source/stubdata
ICU_DATA=$ICU_HOME/source/data/out gdb ./$PROG

$ cat `which v`
#! /bin/sh
if [[ -z "${ICU_HOME}" ]] ; then
    ICU_HOME=$HOME/icu/icu/icu4c
fi
DIR=`pwd`
PROG=`basename $DIR`
LD_LIBRARY_PATH=$ICU_HOME/source/lib:$ICU_HOME/source/stubdata
ICU_DATA=$ICU_HOME/source/data/out valgrind --leak-check=full ./$PROG
```
@ -1,10 +1,28 @@
---
layout: default
title: Configuring VS Code for ICU4C
grand_parent: Setup for Contributors
parent: C++ Setup
---

<!--- © 2020 and later: Unicode, Inc. and others. --->
<!--- License & terms of use: http://www.unicode.org/copyright.html --->

# Configuring VS Code for ICU4C
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

- Create a `.vscode` folder in icu4c/source
- Copy the [`tasks.json`](tasks.json), [`launch.json`](launch.json) and [`c_cpp_properties.json`](c_cpp_properties.json) files into
  the `.vscode` folder.
- To test only specific test targets, specify them under `args` in
  `launch.json`.
14
docs/devsetup/index.md
Normal file
@ -0,0 +1,14 @@
---
layout: default
title: Setup for Contributors
nav_order: 10000
has_children: true
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Setup for Contributors
BIN
docs/devsetup/java/ant/Capture.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 69 KiB |
246
docs/devsetup/java/ant/index.md
Normal file
@ -0,0 +1,246 @@
---
|
||||
layout: default
|
||||
title: Ant Setup for Java
|
||||
grand_parent: Setup for Contributors
|
||||
parent: Java Setup
|
||||
---
|
||||
|
||||
<!--
|
||||
© 2016 and later: Unicode, Inc. and others.
|
||||
License & terms of use: http://www.unicode.org/copyright.html
|
||||
-->
|
||||
|
||||
# Ant Setup for Java
|
||||
{: .no_toc }
|
||||
|
||||
## Contents
|
||||
{: .no_toc .text-delta }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
---
|
||||
|
||||
|
||||
## Overview
|
||||
|
||||
ICU4J source layout was changed after 4.2. There are several ways to set up the
|
||||
ICU4J development environment.
|
||||
|
||||
Get the source code by following the [Quick Start
|
||||
instruction](http://site.icu-project.org/repository). Go into the icu4j/
|
||||
directory to see the build.xml file. You can run targets displayed by `ant -p`.
|
||||
|
||||
Main targets:
|
||||
|
||||
* `all` Build all primary targets
|
||||
* `apireport` Run API report generator tool
|
||||
* `apireportOld` Run API report generator tool (Pre Java 5 Style)
|
||||
* `build-tools` Build build-tool classes
|
||||
* `charset` Build charset classes
|
||||
* `charset-tests` Build charset tests
|
||||
* `charsetCheck` Run only the charset tests
|
||||
* `check` Run the standard ICU4J test suite
|
||||
* `checkDeprecated` Check consistency between javadoc @deprecated and @Deprecated annotation
|
||||
* `checkTest` Run only the specified tests of the specified test class or, if no arguments are given, the standard ICU4J test suite.
|
||||
* `checktags` Check API tags before release
|
||||
* `cldrUtil` Build Utilities for CLDR tooling
|
||||
* `clean` Clean up build outputs
|
||||
* `collate` Build collation classes
|
||||
* `collate-tests` Build core tests
|
||||
* `collateCheck` Run only the collation tests
|
||||
* `core` Build core classes
|
||||
* `core-tests` Build core tests
|
||||
* `coreCheck` Run only the core tests
|
||||
* `coverageJaCoCo` Run the ICU4J unit tests and generate code coverage report
|
||||
* `currdata` Build currency data classes
|
||||
* `demos` Build demo classes
|
||||
* `docs` Build API documents
|
||||
* `docsStrict` Build API documents with all doclint check enabled
|
||||
* `draftAPIs` Run API collector tool and generate draft API report
|
||||
* `exhaustiveCheck` Run the standard ICU4J test suite in exhaustive mode
|
||||
* `findbugs` Run FindBugs on all library sub projects.
|
||||
* `gatherapi` Run API database generator tool
|
||||
* `gatherapiOld` Run API database generator tool (Pre Java 5 style)
|
||||
* `icu4jJar` Build ICU4J all-in-one core jar
|
||||
* `icu4jSrcJar` Build icu4j-src.jar
|
||||
* `icu4jtestsJar` Build ICU4J all-in-one test jar
|
||||
* `indicIMEJar` Build indic IME 'icuindicime.jar' jar file
|
||||
* `info` Display the build environment information
|
||||
* `init` Initialize the environment for build and test. May require internet access.
|
||||
* `jar` Build ICU4J runtime library jar files
|
||||
* `jarDemos` Build ICU4J demo jar file
|
||||
* `jdktzCheck` Run the standard ICU4J test suite with JDK TimeZone
|
||||
* `langdata` Build language data classes
|
||||
* `localespi` Build Locale SPI classes
|
||||
* `localespi-tests` Build Locale SPI tests
|
||||
* `localespiCheck` Run the ICU4J Locale SPI test suite
|
||||
* `main` Build ICU4J runtime library classes
|
||||
* `packaging-tests` Build packaging tests
|
||||
* `packagingCheck` Run packaging tests
|
||||
* `perf-tests` Build performance test classes
|
||||
* `regiondata` Build region data classes
|
||||
* `release` Build all ICU4J release files for distribution
|
||||
* `releaseBinaries` Build ICU4J binary files for distribution
|
||||
* `releaseCLDR` Build release files for CLDR tooling
|
||||
* `releaseDocs` Build ICU4J API reference doc jar file for distribution
|
||||
* `releaseSourceArchiveTgz` Build ICU4J source release archive (.tgz)
|
||||
* `releaseSourceArchiveZip` Build ICU4J source release archive (.zip)
|
||||
* `releaseSrcJars` Build ICU4J src jar files for distribution
|
||||
* `releaseVer` Build all ICU4J release files for distribution with versioned file names
|
||||
* `runTest` Run the standard ICU4J test suite without calling any other build targets
|
||||
* `samples` Build sample classes
|
||||
* `secure` (Deprecated)Build ICU4J API and test classes for running the ICU4J test suite with Java security manager enabled
|
||||
* `secureCheck` Run the secure (applet-like) ICU4J test suite
|
||||
* `test-framework` Build test framework classes
|
||||
* `tests` Build ICU4J test classes
|
||||
* `timeZoneCheck` Run the complete test for TimeZoneRoundTripAll
|
||||
* `tools` Build tool classes
|
||||
* `translit` Build translit classes
|
||||
* `translit-tests` Build translit tests
|
||||
* `translitCheck` Run the ICU4J Translit test suite
|
||||
* `translitIMEJar` Build transliterator IME 'icutransime.jar' jar file
|
||||
* `xliff` Build xliff converter tool
|
||||
|
||||
Default target: main
|
||||
The typical usage is `ant check`, which will build main ICU4J libraries and
|
||||
run the standard unit test suite.
|
||||
|
||||
For running ant you may need to set up some environment variables first. For
|
||||
example, on Windows:
|
||||
|
||||
```
|
||||
set ANT_HOME=C:\\ant\\apache-ant-1.7.1
|
||||
|
||||
set JAVA_HOME=C:\\Program Files\\Java\\jdk1.5.0_07
|
||||
|
||||
set PATH=%JAVA_HOME%\\bin;%ANT_HOME%\\bin;%PATH%
|
||||
```
|
||||
|
||||
## Test arguments and running just one test or the tests of just one test class
|
||||
|
||||
You can pass arguments to the test system by using the 'testclass' and
|
||||
'testnames' variables and the 'checkTest' target. For example:
|
||||
|
||||
|Command Line|Meaning|
|
||||
|------------|--------|
|
||||
|`ant checkTest -Dtestclass='com.ibm.icu.dev.test.lang.TestUScript'` | Runs all the tests in test class 'TestUScript'.|
|
||||
|`ant checkTest -Dtestclass='com.ibm.icu.dev.test.lang.TestUScript' -Dtestnames='TestNewCode,TestHasScript'` | Runs the tests `TestNewCode` and `TestHasScript` in test class `TestUScript`. |
|
||||
|`ant checkTest -Dtestnames='TestNewCode,TestHasScript'` | Error: test class not specified.|
|
||||
|`ant checkTest` | Runs the standard ICU4J test suite (same as 'ant check').|
|
||||
|
||||
The JUnit-generated test result reports are in out/junit-results/checkTest. Go
|
||||
into the `html/` subdirectory and load `index.html` into a browser.
|
||||
|
||||
## Generating Test Code Coverage Report
|
||||
|
||||
[#10513](http://bugs.icu-project.org/trac/ticket/10513) added code coverage
|
||||
target "coverageJaCoCo" in the ICU4J ant build.xml. To run the target:
|
||||
|
||||
1. Download JaCoCo library from [EclEmma
|
||||
site](http://eclemma.org/jacoco/index.html).
|
||||
2. Extract library files to your local system - e.g. `C:\jacoco-0.7.6`
|
||||
3. Set environment variable JACOCO_DIR pointing to the directory where JaCoCo
|
||||
files are extracted - e.g. `set JACOCO_DIR=C:\jacoco-0.7.6`
|
||||
4. Set up ICU4J ant build environment.
|
||||
5. Run the ant target "coverageJaCoCo" in the top-level ICU4J build.xml
|
||||
|
||||
Following output report files will be generated in /out/jacoco directory.
|
||||
|
||||
* report.csv
|
||||
* report.xml
|
||||
* report_html.zip
|
||||
|
||||
## Building ICU4J API Reference Document with JCite
|
||||
|
||||
Since ICU4J 49M2, JCite (Java Source Code Citation System) is integrated into
|
||||
ICU4J documentation build. To build the API documentation for public release,
|
||||
you must use JCite for embedding some coding examples in the API documentation.
|
||||
To set up the environment:
|
||||
|
||||
1. <http://arrenbrecht.ch/jcite/>Download JCite binary (you need 1.13.0+ for JDK 7 support) from
|
||||
* Note that JCite no longer is available for download from the official
|
||||
web site, which links to Google Code, which was closed down in 2016.
|
||||
* The Internet Archive has a copy of the last version of JCite found on
|
||||
Google Code before it was closed down:
|
||||
[jcite-1.13.0-bin.zip](https://web.archive.org/web/20160710183051/http://jcite.googlecode.com/files/jcite-1.13.0-bin.zip)
|
||||
2. Extract JCite file to your local system - e.g. `C:\jcite-1.13.0`
|
||||
3. Set environment variable `JCITE_DIR` pointing to the directory where JCite
|
||||
files are extracted. - e.g. `set JCITE_DIR=C:\jcite-1.13.0`
|
||||
4. Set up ICU4J ant build environment.
|
||||
5. Run the ant target "docs" in the top-level ICU4J build.xml
|
||||
6. If the build (on Linux) fails because package com.sun.javadoc is not found
|
||||
then set the JAVA_HOME environment variable to point to `<path>/java/jdk`. The
|
||||
Javadoc package is in `<path>/java/jdk/lib/tools.jar`.
|
||||
|
||||
*Note: The ant target "docs" checks if `JCITE_DIR` is defined or not. If not
|
||||
defined, it will build ICU4J API docs without JCite. In this case, JCite taglet
|
||||
"{@.jcite ....}" won't be resolved and the embedded tag is left unchanged in the
|
||||
output files.*
|
||||
|
||||
## Build and test ICU4J Eclipse Plugin
|
||||
|
||||
Building Eclipse ICU4J plugin
|
||||
|
||||
1. Download and install the latest Eclipse release from
|
||||
<http://www.eclipse.org/> (The latest stable milestone is desired, but the
|
||||
latest official release should be OK).
|
||||
2. cd to `<icu4j root>` directory, and make sure `$ ant releaseVer` runs clean.
|
||||
3. cd to` <icu4j root>/eclipse-build` directory.
|
||||
4. Copy `build-local.properties.template` to `build-local.properties`, edit the
|
||||
properties files
|
||||
* eclipse.home pointing to the directory where the latest Eclipse version
|
||||
is installed (the directory contains configuration, dropins, features,
|
||||
p2 and others)
|
||||
* java.rt - see the explanation in the properties file
|
||||
5. Run the default ant target - $ ant The output ICU4J plugin jar file is
|
||||
included in `<icu4j
|
||||
root>/eclipse-build/out/projects/ICU4J.com.ibm.icu/com.ibm.icu-com.ibm.icu.zip`

## Plugin integration test

1. Back up the Eclipse installation (if you want to keep it, just copy the entire
   Eclipse installation folder).
2. Delete the ICU4J plugin included in the Eclipse installation -
   `<eclipse>/plugins/com.ibm.icu_XX.Y.Z.vYYYYMMDD-HHMM.jar`, where XX.Y.Z is the ICU
   version and YYYYMMDD-HHMM is the build date. For example,
   com.ibm.icu_58.2.0.v20170418-1837.jar
3. Copy the new ICU4J plugin jar file built by the previous steps (e.g.
   com.ibm.icu_61.1.0.v20180502.jar) to the same folder.
4. Search for the text "com.ibm.icu" in files under `<eclipse>/features`. The RCP
   feature has a dependency on the ICU plugin, and its `feature.xml` (e.g.
   `<eclipse>/features/org.eclipse.e4.rcp_1.6.2.v20171129-0543/feature.xml`)
   contains the dependent plugin information. Replace just the version attribute to
   match the version built by the steps above. You can leave the size attributes
   unchanged. The current ICU build script does not append hour/minute to the
   plugin jar file name, so the version format is XX.Y.Z.vYYYYMMDD.
   ```
   <plugin
      id="com.ibm.icu"
      download-size="11775"
      install-size="26242"
      version="58.2.0.v20170418-1837" -> "61.1.0.v20180502"
      unpack="false"/>
   ```
5. Open
   `<eclipse>/configuration/org.eclipse.equinox.simpleconfigurator/bundles.info`
   in a text editor, and update the line containing the com.ibm.icu plugin
   information:
   ```
   com.ibm.icu,58.2.0.v20170418-1837,plugins/com.ibm.icu_58.2.0.v20170418-1837.jar,4,false
   ```
   then becomes
   ```
   com.ibm.icu,61.1.0.v20180502,plugins/com.ibm.icu_61.1.0.v20180502.jar,4,false
   ```
6. Make sure Eclipse starts successfully with no errors. If the ICU4J plug-in
   is not successfully loaded, the Eclipse IDE won't start.
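The bundles.info edit in step 5 can also be scripted. The sketch below is illustrative, not part of the official procedure: it creates a sample bundles.info under /tmp and rewrites the com.ibm.icu line with sed. The version strings and the /tmp path are assumptions; point BUNDLES_INFO at your real `<eclipse>/configuration/org.eclipse.equinox.simpleconfigurator/bundles.info` and use your actual build's version.

```shell
# Illustrative versions; substitute the ones from your build.
OLD_VER="58.2.0.v20170418-1837"
NEW_VER="61.1.0.v20180502"
BUNDLES_INFO="/tmp/bundles.info"   # point this at your real bundles.info

# Sample line, as it would appear in a real bundles.info:
echo "com.ibm.icu,${OLD_VER},plugins/com.ibm.icu_${OLD_VER}.jar,4,false" > "$BUNDLES_INFO"

# Replace both occurrences of the old version on the com.ibm.icu line:
sed -i "/^com\.ibm\.icu,/s/${OLD_VER}/${NEW_VER}/g" "$BUNDLES_INFO"

cat "$BUNDLES_INFO"
```

(GNU sed syntax; on macOS use `sed -i ''`.)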
## ICU4J plugin test - Note: This is currently broken
<http://bugs.icu-project.org/trac/ticket/13072>

1. Start Eclipse (with the new ICU4J plugin), and create a new workspace.
2. Import the existing Eclipse project from `<icu4j root>/eclipse-build/out/projects/com.ibm.icu.tests`
3. Run the project as a JUnit Plug-in Test.

## Building ICU4J Release Files

See [Release Build](../../../processes/release/tasks/release-build.md)
BIN docs/devsetup/java/eclipse-setup-for-java-developers/Capture.png (new binary file, not shown; after: 3 KiB)

290 docs/devsetup/java/eclipse-setup-for-java-developers/index.md (new file)

@@ -0,0 +1,290 @@
---
layout: default
title: Eclipse Setup for Java Developers
grand_parent: Setup for Contributors
parent: Java Setup
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Eclipse Setup for Java Developers
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

The ICU4J source layout was changed after 4.2. There are several ways to set up the ICU4J development environment.

*If you want to use Eclipse, you should create a new clean workspace first.*

## Java Language Level

Eclipse typically requires a newer Java version than what we can depend on for
ICU4J. If you don't do the following, you run the risk of calling Java library
APIs that are newer than ICU4J's Java version, and you cause runtime exceptions
for people who use the older version.

Currently (as of August 2018 / ICU 63), ICU4J is on Java 7 (and Eclipse 4.6
requires Java 8).

(Note: localespi/localespi-tests may use a different Java version from ICU4J
proper.)

1. Check if you already have an older JRE or a JDK for the minimum version
   required for ICU4J.
    * A JRE (runtime environment, no compiler) is sufficient.
    * If you don't have one yet, then install one (OpenJDK or Oracle).
2. Select \[Window\] - \[Preferences\] *(on Mac, this is \[Eclipse -
   Preferences\])*.
3. Navigate the preferences tree to Java/Installed JREs/Execution Environments.
4. On the left, Execution Environments: select JavaSE-1.7.
5. On the right, Compatible JREs, if there is no old-version Java 7 JRE:
    1. Go up one tree level to Java/Installed JREs.
    2. Click "Add..." and select "Standard VM" as the JRE type.
    3. Click "Directory..." and find the location of your old-version JRE (or
       JDK) on your system.
        * Linux tip: When you install an OpenJDK, look for it in /usr/lib/jvm/
    4. You can leave the detected settings as is - click "Finish", then click
       "OK" in Installed JREs (or "Apply" the modified settings as you navigate
       away from here).
    5. Go back down in the tree to Java/Installed JREs/Execution Environments.
    6. On the right, Compatible JREs, you should now see your old-version JRE.
6. The matching old-version JRE should have a "\[perfect match\]" suffix.
   Select it for "JavaSE-1.7" on the left.
## Other Settings

1. ~~Turn on warnings about resource leaks. Preferences>Java>Compiler>Errors/Warnings>\[filter on leak\], set both "Resource leak" and "Potential resource leak" to "Warning".~~
   (ICU project files were updated to include these settings, so this has no
   effect. 2015-03-11)
## Import ICU4J from the file system

(Recommended)

In `<icu workspace root>/icu4j`, remember to run "ant init" first. You might run
"ant check" as well for good measure.

If you check out the ICU4J source from the repository using an external client
(usually a command-line git clone), the instructions are not much different. Just
follow the steps below:

1. File/Import...
2. Select General/Existing Projects into Workspace
3. Select root directory: browse to `<icu workspace root>/icu4j`, which will
   show a number of projects.
4. Deselect the following projects (i.e., do not import them). These are not
   needed for normal ICU development (and would require installing further
   prerequisite libraries to get them to build).
    * com.ibm.\* (Eclipse plug-in)
    * icu4j-localespi\* (more plug-in)
    * icu4j-build-tools
    * icu4j-packaging-tests
5. Click Finish.
6. Wait for Eclipse to build the projects.
## Obsolete: Import ICU4J using the [Subversive](http://www.eclipse.org/subversive/) SVN plugin

Subversive is the standard SVN plugin for Eclipse 3.4+. If you have
[Subversive](http://www.eclipse.org/subversive/) installed/configured in your
Eclipse environment, you can directly check out these 8 projects from the SVN
repository directory. (It seems this does not work well with "subclipse".)

### Installing Subversive (Eclipse 3.6 or later)

1. Select \[Help\] - \[Install New Software...\] from the menu.
2. Select the appropriate Eclipse update site in the "Work with:" field - for
   example, select "Indigo - http://download.eclipse.org/releases/indigo" for
   Eclipse 3.7.x and hit the enter key.
3. Expand "Collaboration" and check "Subversive SVN Team Provider
   (Incubation)", then click "Next >". Confirm the item selected in the next
   screen, then click "Next >" again, then accept the license terms in the next
   screen and click "Finish". After the installation, click "Restart Now" to
   restart Eclipse.
4. Select \[Window\] - \[Preferences\] to open Preferences. Expand Team on the
   left pane and click SVN. It will open "Subversive Connector Discovery".
   Select one from the list. **Note: Some people (including myself) are
   experiencing a problem with SVNKit 1.3.5. If you want to use SVNKit, use
   1.3.3 instead (2011-10-24 yoshito)** Restart Eclipse.
### Installing Subversive (Old)

1. Go to <http://www.eclipse.org/subversive/downloads.php>
2. Go to "latest release" on that page.
3. Copy the update site, e.g.
   "<http://download.eclipse.org/technology/subversive/0.7/update-site/>"
4. Go to Eclipse, then Help > Install New Software...
5. Into "Work with...", paste the update site.
6. Set the checkbox on Plug-ins. Hit Next and Finish until you are done.
   Restart Eclipse.
7. Start Eclipse. It will ask for the connectors. Select all the SVN kits and
   install. Restart Eclipse.
### Importing ICU4J

1. File - Import
2. Select "Project from SVN" under "SVN", Next.
3. In the General Tab, set URL to
   svn+ssh://source.icu-project.org/repos/icu/icu4j, and set your user name:
   XXXXXX
4. In the SSH Settings, fill in the proper authentication information (port 922,
   your ssh key (e.g. icu-project-key) and passphrase).
5. If the connection is properly established, it opens the next dialog, "Select
   Resource". Set URL to
   svn+ssh://source.icu-project.org/repos/icu/icu4j/trunk/main - then click
   Finish.
6. The next dialog, "Check Out As", should have 4 options indicating how to check
   out. Select the second option, "Find projects in the children of the selected
   resource" - click Finish.
7. It takes a while to locate projects. The next dialog shows a batch of
   projects -
    1. You may want to deselect localespi and localespi-tests.
    2. Click Finish (that means "Check out as projects into workspace" is
       selected).
8. After these projects are imported into the workspace, open the Java perspective.
   You might notice there are modification markers (">") displayed for the
   projects. This is caused by the build output directory created in each project's
   workspace. To resolve this issue, you can go to Window - Preferences, then
   select Team - Ignore Resources, then Add Pattern "out" (which is the build
   output directory used by these projects). You may need to restart Eclipse
   after adding the new pattern.
**Note:** With the instructions above, you may see Eclipse errors when you open
the ant build.xml in each project, such as "Target @build-all does not exist in this
project". This is because the import operation above flattens the original SVN
directory structure, and files referenced via ${share.dir} do not work well. To
resolve the issue, you need to override the property by importing
locations-eclipse.properties globally. See the following steps to configure the
override.

1. From the Eclipse menu, select \[Window\] - \[Preferences\]
2. Select "Ant" - "Runtime" on the left in the Preferences dialog.
3. Open the "Properties" tab.
4. Under "Global property files", click "Add Files..."
5. Select the icu4j-shared project in the list, then select
   build/locations-eclipse.properties
6. Click OK - OK to save the configuration.
## Another method using an Eclipse SVN plugin (Subversive or Subclipse)

1. File - New - Other... then select "Repository Location" under "SVN"
    1. General Tab
        1. URL - svn+ssh://source.icu-project.org/repos/icu/icu4j
        2. User name: <yourname>
        3. Password: <leave empty>
    2. SSH Settings
        1. Port: 922
        2. Private key: <browse to your ssh private key>
        3. Passphrase: <your passphrase>
    3. Finish
2. Open the SVN Repositories perspective (Window>Open Perspective>SVN Repository
   Exploring) and expand the repository location you added above.
3. Navigate to trunk.
4. Right click and select "Check Out" - this may take a few minutes.
5. File - Import and select "Existing Projects into Workspace" under "General".
6. Select root directory - navigate to the "main" directory under your workspace
   location where the source files were checked out (for example,
   C:\\eclipse_ws\\icu4j\\trunk\\main)
7. You should see 10 projects including icu4j-charset, icu4j-charset-tests,
   icu4j-core, ... (the number of projects might change in the future)
    1. All of them should be selected.
    2. **Important**: uncheck "Copy projects into workspace".
    3. Click Finish to import all.
8. Back in the Java perspective, you should see the new projects.
9. This time, the projects are associated with the SVN repository. If you see the
   modification marker (">") displayed for the projects, configure your
   workspace to ignore the pattern "out" (see step 8 in the "Importing ICU4J"
   section above).
10. From the command line, run "ant init" in the top level "main" directory (for
    example, C:\\eclipse_ws\\icu4j\\trunk\\main)
## Testing & Debugging

### Run All Tests

To run all of the main tests, do the following:

**58 or later**

* Run "ant check" from the command line.

**53-57**

* Select the icu4j-testall project in the package explorer
* Right Click > Run As > Java Application

**52 or before**

* In icu4j-test-framework, open com.ibm.icu.dev.test.TestAll
* Right Click > Run As > Java Application...
* It will fail, but create a Run Configuration.
* Right Click > Run Configuration...
* Change the name to "TestAll - ICU4J"
* Click on Arguments, and set to "-n -t"
* Click on Classpath > User Entries > Add Projects...
* Select all of your ICU projects **except icu4j-localespi and
  icu4j-localespi-tests**, and Add, e.g.:
    * icu4j-charset
    * icu4j-charset-tests
    * ...
* Now Run.
### Run specific tests

#### 58 or later

* Right click on a test package (for example `com.ibm.icu.dev.test.rbbi` in
  the **icu4j-core-tests** project), or an entire test source directory (such
  as src in the **icu4j-core-tests** project) and choose **Run As->JUnit
  Test**.
* For test coverage, install EclEmma (below) and use **Coverage As** instead
  of **Run As**.

### Test in Eclipse with ICU4J from jar files

You can manually create an Eclipse Run Configuration that doesn't include any of
the directories but all of the JAR files:
<http://stackoverflow.com/questions/1732259/eclipse-how-to-debug-a-java-program-as-a-jar-file>
### Test Coverage (53 or later)

* Install the EclEmma plug-in. The installation instructions are found on [the
  EclEmma site](http://www.eclemma.org/installation.html).
* Run all tests once, as described in the section above.
* From the menu, select "Run" > "Coverage..." to open the "Coverage Configurations"
  window.
* Go to the "Coverage" tab and uncheck all test projects (icu4j-\*-tests,
  icu4j-test-framework, icu4j-testall) to exclude test code from coverage
  analysis.
* Click "Coverage" to run all the tests with coverage analysis enabled. After
  the test execution, the coverage report is displayed in the "Coverage" view.
* After running coverage, source lines are highlighted in different colors
  depending on coverage level. To remove the highlights, click the "Remove All
  Sessions" icon below (which also deletes the coverage results).

![Coverage view](Capture.png)

* If you want to run coverage again, you can just right click on the icu4j-testall
  project and select "Coverage As" > "Java Application".
## Branching

* // Needs review

To Create the Branch

* Modify
* To merge, use Team>Merge. Pick Start from Copy.

To Merge a Branch

* ...
15 docs/devsetup/java/index.md (new file)

@@ -0,0 +1,15 @@
---
layout: default
title: Java Setup
parent: Setup for Contributors
has_children: true
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Java Setup
39 docs/devsetup/java/java-profiling-and-monitoring-tools.md (new file)

@@ -0,0 +1,39 @@
---
layout: default
title: Java Profiling and Monitoring tools
grand_parent: Setup for Contributors
parent: Java Setup
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Java Profiling and Monitoring tools
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

There are many Java development tools available for analyzing Java application
runtime performance. Eclipse has a set of plug-ins called TPTP which provides a
Java application profiling/monitoring framework. However, TPTP is very slow, and
I experienced frequent crashes while profiling ICU4J code. For ICU4J development,
I recommend the tools described below.

## VisualVM

VisualVM is available as a separate download since JDK 9. You can download the latest
version from here - <https://visualvm.github.io/download.html>
There is an Eclipse plug-in which allows you to launch VisualVM when you run a
Java app in Eclipse. You can monitor CPU usage of the Java app, memory usage
(heap/permgen), classes loaded, etc. in a GUI. You can also get basic profiling
information, such as CPU usage by class and memory allocations, and generate heap
dumps, force GC, etc.
87 docs/devsetup/source/gittooling.md (new file)

@@ -0,0 +1,87 @@
---
layout: default
title: Local tooling configs for git and GitHub
grand_parent: Setup for Contributors
parent: Source Code Setup
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Local tooling configs for git and GitHub
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## git difftool & mergetool

The `git diff` command prints changes to stdout, normally to the terminal
screen.

Set up a visual diff and merge program for use with `git difftool` and `git
mergetool`.

Changes in binary files do not show well in common diff tools, and it can take a
long time for them to compute visual diffs.

This is easily avoided using the -d option: `git difftool -d`

This shows all changed files in the diff program, and you can view and skip
files there as appropriate.

### Linux example

[stackoverflow/.../setting-up-and-using-meld-as-your-git-difftool-and-mergetool](https://stackoverflow.com/questions/34119866/setting-up-and-using-meld-as-your-git-difftool-and-mergetool)

#### Linux meld

`gedit ~/.gitconfig` →

```
[diff]
    tool = meld
[difftool]
    prompt = false
[difftool "meld"]
    cmd = meld "$LOCAL" "$REMOTE"
[merge]
    tool = meld
[mergetool "meld"]
    cmd = meld "$LOCAL" "$MERGED" "$REMOTE" --output "$MERGED"
```

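The same settings can be written with `git config` commands instead of editing `~/.gitconfig` by hand. This is a sketch: it targets a throwaway config file via `--file` so it does not touch your real configuration; use `--global` instead to apply it for real.

```shell
# Scratch config file for demonstration; use --global for your real ~/.gitconfig.
CFG=/tmp/demo-gitconfig
rm -f "$CFG"

git config --file "$CFG" diff.tool meld
git config --file "$CFG" difftool.prompt false
git config --file "$CFG" difftool.meld.cmd 'meld "$LOCAL" "$REMOTE"'
git config --file "$CFG" merge.tool meld
git config --file "$CFG" mergetool.meld.cmd 'meld "$LOCAL" "$MERGED" "$REMOTE" --output "$MERGED"'

# Inspect the result; it matches the hand-written fragment above.
cat "$CFG"
```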
## Auto-link from GitHub to Jira tickets

GitHub itself does not linkify text like "ICU-23456" to point to the Jira
ticket. You can get links via browser extensions.

### Chrome Jira HotLinker

Install the [Jira HotLinker](https://chrome.google.com/webstore/detail/jira-hotlinker/lbifpcpomdegljfpfhgfcjdabbeallhk)
from the Chrome Web Store.

Configuration Options:

* Jira instance url: https://unicode-org.atlassian.net/
* Locations: https://github.com/

### Safari extension from SRL

<https://github.com/unicode-org/icu-jira-safari>

### Firefox extension from JefGen

Install from the Mozilla Firefox Add-ons site:
<https://addons.mozilla.org/en-US/firefox/addon/github-jira-issue-linkifier/>

Source:
<https://github.com/jefgen/github-jira-linkifier-webextension>
161 docs/devsetup/source/index.md (new file)

@@ -0,0 +1,161 @@
---
layout: default
title: Source Code Setup
parent: Setup for Contributors
has_children: true
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Source Code Setup
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

> Announcement 07/16/2018: The ICU source code repository has been migrated from
> Subversion to Git, and is now hosted on GitHub.

## Quick Start

You can view the ICU source code online: <https://github.com/unicode-org/icu>

***Make sure you have git lfs installed.*** See the following section.

For read-only usage, create a local clone:

```
git clone https://github.com/unicode-org/icu.git
```

or

```
git clone git@github.com:unicode-org/icu.git
```

This will check out a new directory `icu` which contains **icu4c** and
**icu4j** subdirectories as detailed below.

*For ICU development*, do *not* work directly with the Unicode ICU `main` branch!
See the [git for ICU Developers](../../userguide/dev/gitdev) page instead.

For cloning from your own fork, replace `unicode-org` with your GitHub user
name.

**For fetching just the files for an ICU release tag**, you can use a shallow
clone:

```
git clone https://github.com/unicode-org/icu.git --depth=1 --branch=release-63-1
```

If you already have a clone of the ICU repository, you can add and extract
release files like this:

```
mkdir /tmp/extracted-icu  # or wherever you want to extract to
cd local-git-repo-top-level-dir
git fetch upstream
git tag --list "*63*"  # List tags relevant to ICU 63, e.g., release-63-1
git archive release-63-1 | tar -x -C /tmp/extracted-icu
```
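The extraction recipe above can be tried end to end without a full ICU clone. The sketch below is a self-contained demonstration using a throwaway repository, a made-up file, and a stand-in tag under /tmp; substitute your ICU clone and a real release tag such as `release-63-1`.

```shell
# Build a throwaway repo with one tagged commit, then extract the tag's files.
rm -rf /tmp/demo-repo /tmp/extracted-icu
mkdir -p /tmp/demo-repo /tmp/extracted-icu
cd /tmp/demo-repo
git init -q
echo "hello" > readme.txt
git add readme.txt
git -c user.email=demo@example.com -c user.name=demo commit -q -m "add readme"
git tag release-63-1   # stand-in for a real ICU release tag

# git archive writes the tagged tree (no .git metadata) as a tar stream:
git archive release-63-1 | tar -x -C /tmp/extracted-icu
ls /tmp/extracted-icu
```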
## Detailed Instructions

### Prerequisites: Git and Git LFS

(Note: you do not need a [GitHub](http://github.com) *account* to download the
ICU source code. However, you might want such an account to be able to
contribute to ICU.)

* Install a **git client**
    * <https://git-scm.com/downloads>
    * Linux: `sudo apt install git`
* Install **git-lfs** if your git client does not already have LFS support
  (ICU uses git Large File Storage to store large binary content such as
  \*.jar files.)
    * <https://git-lfs.github.com/>
    * See also
      <https://help.github.com/articles/installing-git-large-file-storage/>
    * Linux: `sudo apt install git-lfs`
    * macOS: Consider using Homebrew or MacPorts.
    * The command `git lfs version` will indicate if LFS is installed.
* Set up git LFS for your local user account once on each machine:
    * `git lfs install --skip-repo`

### Working with git

There are many resources available to help you work with git; here are a few:

* <https://git-scm.com/> - the homepage of the git project
* <https://help.github.com/> - GitHub’s help page
* <https://try.github.io/> - Resources to learn Git

Want to contribute back to ICU? See
[How to contribute](../../userguide/processes/contribute.md).

## Repository Layout

The top level
[README.md](https://github.com/unicode-org/icu#international-components-for-unicode)
contains the latest information about the repository’s layout. Currently:

* **icu4c**/ ICU for C/C++
* **icu4j**/ ICU for Java
* **tools**/ Tools
* **vendor**/ Vendor dependencies (copied here for reference)

### Tags and Branches

The repository is **tagged** with the different release versions of ICU.

For example,
[release-55-1](https://github.com/unicode-org/icu/tree/release-55-1) is the tag
which corresponds to version 55.1 of ICU (for both C and J).

Branches in the main fork are used for maintenance branches of ICU.

For example,
[maint/maint-61](https://github.com/unicode-org/icu/tree/maint/maint-61) is a
branch containing the latest maintenance work on the 61.x line of ICU.

There are other tags and branches which may be cleaned up/deleted at any time:

* branches/tags/releases from [before the icu4c and icu4j trees were
  merged](https://unicode-org.atlassian.net/browse/ICU-12800) - items prefixed
  with "icu-" are for icu4c, and "icu4j-" for icu4j, etc.
* old personal work branches (with a person's username, such as **andy/6910**)
* long-running shared feature branches (in general, feature work is done on
  personal forks of the repository)

See also the [Tips (for developers)](repository/tips/index.md) subpage.

## A Bit of History

ICU was first open sourced in 1999 using CVS and Jitterbug. The source files
were imported from other source control systems internal to IBM at that time.

The ICU project moved to using a Subversion source code repository and a Trac
bug database on Nov 30, 2006. These replaced our original CVS source code
repository and Jitterbug bug database. All history from the older systems has
been migrated into the new, so there should normally be no need to refer back to
Jitterbug or CVS.

In July 2018, the ICU project [moved
again](http://blog.unicode.org/2018/07/icu-moves-to-github-and-jira.html), this
time from svn to git on GitHub, and from trac to Atlassian Cloud Jira. Many
tools and much effort were involved in the migration and testing. There is a
[detailed blog post](https://srl295.github.io/2018/07/02/icu-infra/) on the
topic (not an official ICU-TC document!) for those interested in the technical
details of this move.
@@ -21,6 +21,14 @@ includes details that go beyond the C, C++, and Java API docs (and avoids some d

This is the new home of the User Guide (since August 2020).

## ICU Site

The official ICU Site is located at <https://icu.unicode.org>.
It is the official landing page for the ICU project.

Some of the pages from the ICU Site have been migrated here.
The migrated sections and pages from the ICU Site are visible in the navigation bar of this site below the "ICU Site" section heading.

## ICU team member pages

Other documentation pages here are written by and for team members.
47 docs/processes/release/maintenance-releases.md (new file)

@@ -0,0 +1,47 @@
---
layout: default
title: Maintenance Release Procedure
parent: Release & Milestone Tasks
grand_parent: Contributors
nav_order: 75
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Maintenance Release Procedure

When a critical problem is found in the ICU libraries, we try to fix the problem in
the latest development stream first. If there is demand for the fix in a past
release, an ICU project developer may escalate integrating the fix into that
release to the ICU project management committee. Once the committee approves
merging the fix into a back-level stream, the developer can merge the bug fix back
into the past release suggested by the committee. This merge activity must be
tracked by maintenance release placeholder tickets, and the developer should
provide the original ticket number and description as the response in each
maintenance ticket. These fixes are automatically included in a future ICU
maintenance release.

## Placeholder Ticket

Once a major version of the ICU library is released, we create maintenance release
placeholder tickets for the major release (one for C, one for J). The ticket
should have the subject "ICU4\[C|J\] m.n.X". For example, after the ICU 4.8 release,
we create two tickets - "ICU4C 4.8.X" and "ICU4J 4.8.X". These tickets must use the
target milestone "maintenance-release".

## Maintenance Release

When the ICU project committee agrees on releasing a new maintenance release, the
corresponding placeholder ticket is promoted to a real maintenance release
task ticket. This is done by the following steps:

* Create the new actual maintenance release milestone (e.g. 4.8.1)
* Change the placeholder ticket's subject to the actual version (e.g. "ICU4C
  4.8.X" -> "ICU4C 4.8.1")
* Retarget the placeholder ticket to the actual release (e.g.
  "maintenance-release" -> "4.8.1")
* Create a new placeholder ticket for a future release (e.g. new ticket "ICU4C
  4.8.X", milestone: "maintenance-release")
@@ -2,7 +2,6 @@
layout: default
title: Release & Milestone Tasks
parent: Contributors
nav_order: 10
has_children: true
---
@@ -1,7 +1,6 @@
---
layout: default
title: Coding Guidelines
nav_order: 1
parent: Contributors
---
<!--
@@ -1,35 +0,0 @@
---
layout: default
title: Contributions
nav_order: 4
parent: Contributors
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Contributions to the ICU library
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Why Contribute?

ICU is an open source library that is a de-facto industry standard for
internationalization libraries. Our goal is to provide top of the line i18n
support on all widely used platforms. By contributing your code to the ICU
library, you will get the benefit of continuing improvement by the ICU team and
the community, as well as testing and multi-platform portability. In addition,
it saves you from having to re-merge your own additions into ICU each time you
upgrade to a new ICU release.

## Current Process

See [CONTRIBUTING.md](https://github.com/unicode-org/icu/blob/main/CONTRIBUTING.md)
@@ -1,7 +1,6 @@
---
layout: default
title: User Guide Editing
nav_order: 5
parent: Contributors
---
<!--
@@ -1,22 +1,37 @@
---
layout: default
title: Developing Fuzzer Targets for ICU APIs
parent: Contributors
---

# Developing Fuzzer Targets for ICU APIs
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

Developing Fuzzer Targets for ICU APIs
======================================

This document describes how to develop a [fuzzer](https://opensource.google.com/projects/oss-fuzz)
target for an ICU API and its integration into the ICU build process.

### Directory and naming conventions
## Directory and naming conventions

Fuzzer targets are exclusively in the directory
[`source/test/fuzzer/`](https://github.com/unicode-org/icu/tree/main/icu4c/source/test/fuzzer)
and end with `_fuzzer.cpp`. Only files with such an ending are recognized and executed as fuzzer
targets by the OSS-Fuzz system.

### General structure of a fuzzer target
## General structure of a fuzzer target

As a minimum, a fuzzer target contains the function

@@ -69,7 +84,7 @@ constructor. The code interprets the fuzzer data as UnicodeString and passes it
And that is all. Specific error handling or return value verification is not required because the
fuzzer will detect all memory issues by means of memory/address sanitizer findings.

### Makefile.in changes
## Makefile.in changes

ICU fuzzer targets are built and executed by the OSS-Fuzz project. On the ICU side they are compiled
to ensure that the code is syntactically correct and, as a sanity check, executed in the most basic

@@ -81,14 +96,14 @@ The new fuzzer target will then be built and executed as part of a normal ICU4C
that each fuzzer target becomes executable on its own. As such it is linked with the code in
`fuzzer_driver.cpp`, which contains the `main()` function.

### Fuzzer seed corpus
## Fuzzer seed corpus

Any fuzzer seed data for a fuzzer target goes into a file with the name `<fuzzer_target>_seed_corpus.txt`.
In many cases the input parameter of the ICU API under test is of type `UnicodeString`, in which
case the seed data should be in UTF-16 format. As an example, see
[collator_rulebased_fuzzer_seed_corpus.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/test/fuzzer/collator_rulebased_fuzzer_seed_corpus.txt).

### Guidelines and tips
## Guidelines and tips

* Leave all randomness to the fuzzer. If a random selection of any kind is needed (e.g., of a
  locale), then use bytes from the fuzzer data to make the selection

@@ -97,7 +112,7 @@ of which the seed data should be in UTF-16 format. As an example, see
under test requires a Unicode string then make sure that the seed data is in UTF-16 encoding.
This can be achieved with e.g. the `iconv` command or using an editor that saves text in UTF-16.
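
For example, assuming a UTF-8 seed file, one possible `iconv` invocation looks like this. The file names and the little-endian choice are illustrative only; match the byte order your target actually expects:

```
# Convert a UTF-8 seed file to UTF-16 (little-endian, no BOM).
# Replace the file names with your target's <fuzzer_target>_seed_corpus.txt.
iconv -f UTF-8 -t UTF-16LE seed_utf8.txt > my_fuzzer_seed_corpus.txt
```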

### How to locally reproduce fuzzer findings
## How to locally reproduce fuzzer findings

At this time, reproduction of fuzzer findings requires Docker installed on the local machine and the
OSS-Fuzz project downloaded in a local git client.
606
docs/userguide/dev/gitdev.md
Normal file
@@ -0,0 +1,606 @@
---
layout: default
title: git and Github for ICU Developers
parent: Contributors
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# git and Github for ICU Developers
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

For git & git lfs installation see the [Source Code Setup](../../devsetup/source/)
page.

For setup with language compilers and IDEs, see the [Setup for Contributors](../../devsetup/source/) page
and its subpages.

## Overview

ICU development is on GitHub, in the **main** branch of the git repository:
<https://github.com/unicode-org/icu>

In preparation for a release, we create a maintenance branch, such as
[maint/maint-62](https://github.com/unicode-org/icu/tree/maint/maint-62) for ICU
62 and its maintenance releases.

For each release we create a release tag, e.g.
[releases/tag/release-62-1](https://github.com/unicode-org/icu/releases/tag/release-62-1)
(GitHub project page > Releases > Tags > select one; a Release is a Tag with
metadata).

There are additional branches that you can ignore. Some are old development
branches.

Also, when you edit a file directly in the GitHub source browser (for docs: API
comments, or .md/.html/.txt), it creates a branch for your pull request. Make
sure to delete this branch when you are done.

## Development

We do *not* develop directly on the main repository. Do *not* clone from
there to commit and push back into the main repository.

Instead, use the GitHub UI (top right) to create a fork of the repository in
your own GitHub account. Then clone that to your local machine. You need only
one fork for all of your ICU work.

```
mkdir -p icu/mine/src
git clone git@github.com:markusicu/icu.git icu/mine/src
cd icu/mine/src
```

You should be in the **main** branch of your fork's clone.

Do *not* do any development in your own **main** branch either! That would
lead to messy merging with the upstream **main** branch.

Instead, create a new branch in your local clone for each piece of work. You
need a separate branch for each pull request. More on that later.

```
git checkout -b mybranchname
```

Now you are in a new development branch in your local git repo. Confirm with
`git status`. Change stuff. Do `git status` again, use `git add` for staging and
`git commit -m 'ICU-23456 what I changed'` to commit, or use
`git commit -a -m 'ICU-23456 what I changed'` if you want to commit everything that `git status`
shows as changed.

For looking at changes, you should set up a visual diff program for use with
`git difftool`. See the [Setup: git difftool & mergetool](../../devsetup/source/gittooling.md) page.

For new files: Remember to add the appropriate copyright lines. Copy from a file
of the same type, and set the copyright year to the current year (that is, the
year you are creating the file).

You should have a Jira ticket for each line of work. (See [Submitting ICU Bugs and Feature Requests](https://icu.unicode.org/bugs) and [ICU Ticket Life cycle](https://icu.unicode.org/processes/ticket-lifecycle).) You can have multiple pull
requests per ticket. Each pull request needs a ticket in Accepted state.

Always prefix your commit statements with the Jira ticket number using this
pattern (including the space after the number; note: no colon):
`ICU-23456 what I changed`

Local commits are only on your local machine. If your local disk crashes, your
changes are gone. `git push` your commits to your GitHub fork.

**Tips for Branches**

Shane
[recommends](https://blog.sffc.xyz/post/185195398930/why-you-should-use-git-pull-ff-only)
setting the default behavior of `git pull` to `--ff-only`. Shane also
[prevents](https://stackoverflow.com/a/40465455/1407170) local commits to the
**main** branch via *.git/hooks/pre-commit*. These two measures make it easier
to do the right thing in Git.

## Trivial changes

For trivial changes, such as small fixes in API docs or text files, it is ok to
edit the file in the GitHub GUI, in the main unicode-org/icu repository.

You still need a Jira ticket.

Once you are done editing, the GUI lets you create a branch and a commit right
in the main repository. Use the usual `ICU-23456 what I changed` pattern
for the commit message.

Pull request, review, merge as usual; see the next section.

*Remember to delete your branch after merging.*

## Review & commit to Unicode main

When you are ready for code review, go to your GitHub page and your ICU fork.

Select your dev branch (Branch drop-down on the left, search for your branch).

Click "New pull request" next to the Branch button, or "Pull request" on the
right near "Compare". *Make sure it compares with unicode-org/icu main on the
left and your own fork's dev branch on the right*.

Prefix the title of your pull request with the Jira ticket number, same format
as for a commit.

Follow the rest of the checklist in the PR template.

Set the PR assignee to your main reviewer. You may add more people as reviewers,
but there is normally just one assignee. Be somewhat judicious with additional
reviewers: Don't just add them because they were recommended by GitHub.

Nice to have: Optionally set the Jira ticket reviewer field for documentation.
It is still possible to close the ticket if the field is empty.

Watch the PR status for build failures and other issues.

A PR reviewer (at least the assignee) should look to see if the PR does what the
ticket says.

Respond to review feedback. Make changes on your local machine, commit, push to
your fork. The GitHub PR will update automatically for your additional commits.

Try not to rebase, squash, or force-push until the reviewer gives you a green
light.

*You should normally squash multiple commits into one in your fork before
merging (after the reviewer is satisfied)*. For multiple commits, the reviewer
should first respond with something like "lgtm please squash" but not yet
GitHub-approve; after squashing, they should check that the changes are the
same, and then GitHub-approve. A bot will respond to the PR confirming whether
the squash succeeded without changing the file contents.

If you squash, since you are rewriting the commit message anyway, please append
the pull request number to the first line of the updated message, using the
format "` (#199)`".

When you squash, please keep the parent hash (sha) the same so that the squash
is nothing more than a squash. If you change the parent hash, you may also be
pulling in other people's changes, and it may be harder for the reviewer to
verify that the squash was done correctly.

### Options on how to squash

#### Option 1: Use the online PR commit checker bot

Please note: this makes the
change in your remote branch but not in your local branch. Click the "Details"
link in the GitHub status, which brings you to a page with a summary of your PR.
Find the "Squash..." button. Sign in using your GitHub account, and follow the
flow to squash your branch.

Warning: do not `git pull` after you use the remote tool! If you subsequently need
to update your local branch to the squash commit, you need to fetch and reset:

```
git fetch origin BRANCHNAME
git checkout BRANCHNAME
git reset origin/BRANCHNAME
```

#### Option 2: Use git rebase

This works as long as you have no merge commits with
conflicts in your history. Plenty of examples:

* <https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History#_squashing>
* <https://github.com/todotxt/todo.txt-android/wiki/Squash-All-Commits-Related-to-a-Single-Issue-into-a-Single-Commit>
* <https://blog.carbonfive.com/2017/08/28/always-squash-and-rebase-your-git-commits/>
* <http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html>
* <https://medium.com/@slamflipstrom/a-beginners-guide-to-squashing-commits-with-git-rebase-8185cf6e62ec>
* Several other options:
  <https://stackoverflow.com/questions/5189560/squash-my-last-x-commits-together-using-git>

#### Option 3: Use git merge

This is a little trickier but works even if you have
merge commits with conflicts. Assuming your feature branch is called BRANCHNAME:

```
# Make sure your branch is up-to-date with main and that the tests pass:

git checkout BRANCHNAME
git merge main
git push

# At this point, wait for an LGTM from a reviewer before proceeding.
# Once confirmed, make your squash commit in a new temp branch.
# NOTE: In the first line, make sure to checkout the same sha as
# you most recently merged into your branch!

git checkout main
git checkout -b temp
git merge --squash BRANCHNAME
git commit

# Point your branch to the squash commit, and there should be no dirty files:

git checkout BRANCHNAME
git reset temp
git status # should be empty! If it's not, you didn't check out the right sha.

# Push your squash commit and clean up:

git push -f
git branch -d temp
```

#### Option 4: Amend a small commit

When making code review changes on a small PR, you can amend your
previous commit rather than making a new commit. Instead of running
`git commit`, just run `git commit --amend`. You will need to force-push. The PR bot
will post a link for the reviewer to see the changes from your old commit to
your new commit.

Once the reviewer(s) has/have approved your (squashed) changes:

* If you are an ICU team member with main repo write access:
  * Merge your commits into the Unicode main.
    * We almost always want to "rebase and merge" the commits. We normally
      want them pre-squashed for a simple, clean change history. We rarely
      want to permanently keep intermediate commits.
    * (For ICU 63 we used "squash merge" but ended up with some ill-formed
      commit messages. "Rebase and merge" lets us review the commit
      messages before merging.)
    * After you click the Merge button, if you don't use "rebase and merge"
      (although normally you should...), make sure that the commit message
      includes the "ICU-23456 " prefix, and add a suffix like " (#65)" with
      the pull request number (if it's not there already).
    * Known limitation: We won't have the PR number in the commit
      message(s) when using the recommended "rebase and merge" -- unless
      you manually amend the commit message(s) and add it.
  * You should probably check the box for deleting your dev branch after
    merging.
  * Remember one branch per PR. You can create multiple branches & PRs per
    ticket.
  * If this was the last commit to finish work on the ticket, then go to
    Jira and close the ticket as Fixed.
    * You can optionally have someone (probably the same person as your PR
      assignee) review the ticket as well, but that's not normally necessary.
    * (We normally use ticket reviews for non-code changes, such as a
      non-coding task or a web site update for the User Guide etc.)
* Otherwise:
  * The PR assignee should be an ICU team member, and they are responsible
    both for reviewing and for merging your PR, and then also for closing
    the ticket.

## Merge conflicts

When someone else has made changes that conflict with yours, then you can't
merge as is. (The GitHub pull request page will tell you if there is a
conflict.)

You need to update your fork's **main** via your local clone, rebase your local
dev branch with that, resolve conflicts as you go, and force-push to your fork.

As easy as it is in GitHub to *create* a fork, you would think that it would be
a simple button-click to *update* your fork's **main** with commits on the
Unicode **main**. If you find a way to do this, please update this section.

Switch to your local **main**.

```
git checkout main
```

### Pull from upstream

Pull updates from the Unicode main (rather than a vanilla `git pull`, which
pulls from your out-of-date fork), then push to your fork's main.

*Norbert's version:*

```
git pull git@github.com:unicode-org/icu.git
git push
```

*Andy's version:*

Once per local git repo, set up an additional "remote". Something like the
following, but this may be incomplete!

```
git remote add upstream https://github.com/unicode-org/icu.git
git pull upstream main
git push origin main
```

*Andy's version, take 2:*

Set the local main to track the upstream (unicode-org) main instead of your
fork's main (origin). Your fork's main is effectively out of the loop.

```
# one time setup
git branch -u upstream/main
# subsequent pulls from upstream (unicode-org) main
git pull
```

### Resolve conflicts

There are two ways to do this. You can rebase, or you can create a merge commit.
The advantage of rebase is that it makes it somewhat easier to squash later on.
The advantage of creating a merge commit is that you don't have to force-push,
so it makes it easier to work across different workstations, you are less likely
to get something wrong, and it makes it easier for the reviewer because GitHub
keeps track of comment history better when shas don't change.

#### Option 1: Merge

Switch to your dev branch, then merge in main. I like to use
the --no-commit option:

```
git checkout mybranchname
git merge main --no-commit
```

If you have conflicts, resolve them. Then, review the merge commit. It should
have all changes from main that were not yet on your branch. If it looks good,
commit the merge. You can push the merge commit without having to use -f.

```
git commit
git push
```

#### Option 2: Rebase

First switch back to your dev branch (without the -b option,
which is for creating a new branch).

```
git checkout mybranchname
```

Then rebase, which reapplies your branch changes on top of the new main commits.

```
git rebase main
```

Sometimes you need to manually resolve conflicts. Follow the instructions git
prints or look for help...

If it had stopped and you are done resolving conflicts, continue rebasing.

```
git rebase --continue
```

You might get conflicts at several stages; resolve & continue until done.

When done, push to your GitHub fork. You need to force-push after rebasing.

```
git push -f
```

## Update your fork

Once in a while, you should update your fork's main with changes from the
Unicode main, so that you don't fall too far behind and your new changes don't
create unnecessary merge conflicts.

Go to your local main, pull commits from the Unicode main, and push to your
GitHub fork. See the "Merge conflicts" section above for details. If you don't
have a current dev branch, you can skip the rebasing.
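
Assuming a remote named `upstream` pointing at unicode-org/icu and `origin` pointing at your fork (as set up in the "Pull from upstream" section above), the periodic update can be sketched as:

```
# Bring the local main up to date from unicode-org/icu,
# then mirror it into your fork.
git checkout main
git pull upstream main
git push origin main
```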

## Committing to Maintenance Branch

Follow these steps for adding a commit to a maintenance branch.

The process differs depending on whether we are between RC and GA or after GA.

### Between RC and GA

When working on a commit that you know at the time of
authorship to be a candidate for the maintenance branch, write the commit and
send the PR directly against the maintenance branch. All commits on the maint
branch will be merged *from maint to main* as a BRS task (see the next section).

Check out the current maint branch:

```
git fetch upstream maint/maint-64
git checkout maint/maint-64
```

Next, make a local branch off of the maint branch. For example, to use the
branch name "ICU-12345-maint-64", you can do:

```
git checkout -b ICU-12345-maint-64
```

Now, write your change and send it for review. Open your PR against the maint
branch.

### After GA

Write the commit against the main branch, and send your own
cherry-pick commits to put it on the desired maint branches.

Update your local main from the Unicode main (see above). Otherwise your git
workspace won't recognize the commits you are trying to cherry-pick.

Make a note of the SHA hash/ID of your commit on the main branch. You will use
this later when cherry-picking into the maint branch.

* The commit ID is listed on the pull request page.
* You can use git log to see the SHA once your change is on main.
* You can look at the commit history on GitHub too.
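
For example, the SHA can be read off the first column of `git log` output (the branch name and count here are illustrative):

```
# Show the most recent commits on main; the short hash in the first
# column is what you will pass to `git cherry-pick` below.
git log --oneline -n 5 main
```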

Next, checkout the maintenance branch locally. For example, for the ICU 64
maintenance branch:

```
git fetch upstream maint/maint-64
git checkout maint/maint-64
```

Next, make a local branch off of the maint branch. This new branch will be used
for your cherry-pick.

For example, to use the branch name "ICU-12345-maint-64", you can do:

```
git checkout -b ICU-12345-maint-64
```

Next, cherry-pick the commit(s) you want to apply to the maintenance branch.
(Note: If you only have one commit to merge to the maint branch then you would
only have one command below.)

```
git cherry-pick 7d99ba4
git cherry-pick e578f3f
...
```

This creates **new** commits directly onto your local branch.

Look at the output from each of these commands to double-check that you got the
intended commits.

Finally, push your branch to your fork (should be "origin"), and open a PR into
the Unicode ICU branch maint/maint-64.

```
git push -u your-fork ICU-12345-maint-64
```

The reviewer of the PR has the following special responsibilities:

1. Don't approve the PR unless ICU-TC has agreed that this should be a
   maintenance fix.
2. Make sure that the PR is targeting the correct branch in the Unicode ICU
   repo (e.g., maint/maint-64).
3. Make sure that the PR includes all commits associated with the fix, which
   was already approved for main.
4. Use "Rebase and merge".

## Checking for Missing Commits (BRS Task)

It is not hard to accidentally make a commit against main that should have been
against maint. As a BRS task before tagging, you should check the list of
commits that are on main but not maint and make sure none of them belong on
maint.

To get the list, run:

```
git fetch upstream
git cherry -v upstream/maint/maint-64 upstream/main
```

Commits prefixed with "+" are on main but not on the specified maint branch.
Commits prefixed with "-" are present on both branches.
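
If the list is long, the "+" commits can be filtered out with `grep` (a convenience sketch, not a required step; the branch names are the same as above):

```
# List only commits that are on main but absent from the maint branch.
git cherry -v upstream/maint/maint-64 upstream/main | grep '^+'
```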

Send the list to the team and discuss in the weekly meeting if there are any
problems.

## Merging from Maint to Main (BRS Task)

Merging from the maint branch to main might be as easy as opening a pull
request, without having to touch the command line. However, if there are merge
conflicts, more work will need to be done.

**The Easy Way (No Merge Conflicts):** Open a pull request on GitHub from the
maint branch to main. If it says there are no merge conflicts, congratulations!
Use a new ticket number for the PR (it is suggested to NOT use the main BRS
ticket). The new ticket should have the next release as its fix version, because
the merge commit used to pull the commits from maint to main will be in the next
release but not the current release.

You may need to add "DISABLE_JIRA_ISSUE_MATCH=true" and/or
"ALLOW_MANY_COMMITS=true" to the PR description to silence errors coming from
the Unicode bot.

You should use a MERGE COMMIT to merge from maint to main, NOT the REBASE MERGE
that is normally recommended. You will need to go into the admin panel on GitHub,
enable merge commits, perform your merge commit, and then disable merge commits
again from the admin panel. When making your merge commit, remember to use the
correct commit message syntax: prefix the merge commit message with ICU-#####,
the new ticket number you created above.

**The Hard Way (Merge Conflicts):** At the end of the day, the goal is that main
should share the maint branch's history. This is done using merge commits. What
follows is an example of how to create merge commits that retain full branch
history.

Create a new branch based on the tag you want to merge:

```
git fetch upstream
git checkout main
git checkout -b 64-merge-branch # use any name you like
```

*If you already have this branch from a previous release tag*, you could either
use a new branch, or merge the latest main into your branch:

```
git checkout 64-merge-branch
# DANGER: Please make sure your workspace is clean before proceeding!
# If it's not, you might sneak in unreviewed changes.
git merge --no-commit main
git commit -am "ICU-##### Merge tag 'main' into 64-merge-branch"
```

Now, merge in maint:

```
# DANGER: Please make sure your workspace is clean before proceeding!
# If it's not, you might sneak in unreviewed changes.
git merge --no-commit upstream/maint/maint-39
```

After running the final line, you will have the opportunity to resolve merge
conflicts. If the conflict is in a large binary file like the ICU4J data jar
files, you may need to re-generate them.

Remember to prefix your commit message with the ticket number:

```
git commit -am "ICU-##### Merge branch 'maint/maint-39' into 64-merge-branch"
git push -u origin 64-merge-branch
```

As in the Easy Way, you may need to add `DISABLE_JIRA_ISSUE_MATCH=true` and/or
`ALLOW_MANY_COMMITS=true` to the PR description to silence errors coming from
the Unicode bot.

Send the PR off for review. As in the Easy Way, **you should use the MERGE COMMIT option in GitHub to land the PR!!**

## Requesting an Exhaustive Test run on a Pull-Request (PR)

The ICU4C and ICU4J Exhaustive Tests run on the main branch after a pull-request
has been submitted. They do not run on pull-requests by default as they take 1-2
hours to run.

However, you can manually request the CI builds to run the exhaustive tests on a
PR by commenting with the following text:

```
/azp run CI-Exhaustive
```

This will trigger the test run on the PR. This is covered more in a separate
[document](https://docs.google.com/document/d/1kmcFFUozpWah_y7dk_Inlw_BIq3vG3-ZR2A28tIiXJc/edit?usp=sharing).
@@ -1,7 +1,7 @@
---
layout: default
title: Contributors
nav_order: 1800
nav_order: 9000
has_children: true
---
<!--
68
docs/userguide/dev/logknownissue.md
Normal file
@@ -0,0 +1,68 @@
---
layout: default
title: Skipping Known Test Failures
parent: Contributors
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Skipping Known Test Failures (logKnownIssue)

If you need a test to be disabled temporarily, call `logKnownIssue`. The method
is defined as below:

```java
/**
 * Log the known issue.
 * This method returns true unless -prop:logKnownIssue=no is specified
 * in the argument list.
 *
 * @param ticket A ticket number string. For an ICU ticket, use numeric characters only,
 * such as "10245". For a CLDR ticket, use the prefix "cldrbug:" followed by the ticket number,
 * such as "cldrbug:5013".
 * @param comment Additional comment, or null
 * @return true unless -prop:logKnownIssue=no is specified in the test command line argument.
 */
public boolean logKnownIssue(String ticket, String comment)
```

Below is an example:

```java
if (logKnownIssue("1234", "New data is not integrated yet.")) {
    return;
}

// test code below
```

By default, logKnownIssue returns true and emits a log line including a link to
the ticket and the comment.

When `-prop:logKnownIssue=no` is specified as a command line argument,
`logKnownIssue()` returns false, so you can temporarily enable test code skipped
by logKnownIssue.

Before ICU4J 52, we used the isICUVersionBefore() method like below. The method
is still available in the trunk, but developers are encouraged to use
logKnownIssue() instead.

```java
if (isICUVersionBefore(50, 0, 2)) {
    return;
}
```

Before ICU4J 49M2, we used the style below:

```java
if (skipIfBeforeICU(4, 5, 2)) {
    return;
}
```
@ -1,18 +1,33 @@
|
|||
---
|
||||
layout: default
|
||||
title: Updating ICU's built-in Break Iterator rules
|
||||
parent: Contributors
|
||||
---
|
||||
|
||||
# Updating ICU's built-in Break Iterator rules
|
||||
{: .no_toc }
|
||||
|
||||
## Contents
|
||||
{: .no_toc .text-delta }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
---
|
||||
|
||||
|
||||
<!--
|
||||
Copyright (C) 2016 and later: Unicode, Inc. and others.
|
||||
© 2016 and later: Unicode, Inc. and others.
|
||||
License & terms of use: http://www.unicode.org/copyright.html
|
||||
-->
|
||||
|
||||
Updating ICU's built-in Break Iterator rules
|
||||
============================================
|
||||
|
||||
Here are instructions for updating ICU's built-in break iterator rules, for Grapheme, Word, Line and Sentence breaks.
|
||||
|
||||
The ICU rules implement the boundary behavior from Unicode [UAX-14](https://www.unicode.org/reports/tr14/) and [UAX-29](https://www.unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rules updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.
|
||||
|
||||
This is not a cook book process. Familiarity with ICU break iterator behavior and rules is needed. Sets of break rules often interact in subtle and difficult to understand ways. Expect some bumps.
|
||||
|
||||
## Have clear specifications for the change.

The changes will typically come from a proposed update to Unicode UAX #29 or UAX #14, or from CLDR-based tailorings to these specifications.

@@ -21,7 +36,7 @@ As an example, see [CLDR proposal for Extended Indic Grapheme Clusters](https://

Often ICU will implement draft versions of proposed specification updates, to check that they are complete and consistent, and to identify any issues before they are released.
## Files that typically will need to be updated:

| File | Contents |
@@ -40,7 +55,7 @@ Often ICU will implement draft versions of proposed specification updates, to ch
| .../main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBITestMonkey.java | Monkey test with rules as code. Port from ICU4C. |
## ICU4C
The rule updates are done first for ICU4C, and then ported (code changes) or moved (data changes) to ICU4J. This order is easiest because the break rule source files are part of the ICU4C project, as is the rule builder.

@@ -225,7 +240,7 @@ The rule updates are done first for ICU4C, and then ported (code changes) or mov

As with the main rules, after everything appears to be working, run the rule-based monkey test for an extended period of time (with `loop=-1`).
## ICU4J
1. **Copy the Data Driven Test File to ICU4J**
@@ -1,7 +1,6 @@
---
layout: default
title: Custom ICU4C Synchronization
nav_order: 3
parent: Contributors
---
<!--

@@ -1,7 +1,6 @@
---
layout: default
title: Synchronization
nav_order: 2
parent: Contributors
---
<!--

98
docs/userguide/icu4j/why-use-icu4j.md
Normal file

@@ -0,0 +1,98 @@
---
layout: default
title: Why Use ICU4J?
nav_order: 100
parent: ICU4J
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Why Use ICU4J?

## Summary

* Fully implements current standards
    * Unicode collation, normalization, break iteration
* Updated more frequently than Java
* Full CLDR locale data
* Improved performance

## Details

* Normalization
    * Addresses lack of Unicode normalization support in Java 5
    * Addresses outdated Unicode normalization support in Java 6
* Up-to-date Unicode version
    * Java 5 & 6 are Unicode 4.0, while ICU 4.0 is Unicode 5.1
    * Characters added after Unicode 4.0 do not have character properties in Java
* IDNA and StringPrep
    * Addresses lack of Internationalized Domain Name support in Java 5
    * Adds generic stringprep (RFC 3454) support; stringprep is required for supporting various internet protocols (NFS, LDAP, ...)
* Collation
    * Provides Unicode standard compliant collation support
    * ICU Collator fully implements UTS #10, while the Java implementation is outdated and not compatible
* Provides ICU UnicodeSet for easy character range validation
    * Much more flexible and convenient for validating identifiers/text tokens with a given syntax
    * Full boolean operations (union, intersection, difference)
    * All Unicode properties supported
* Locales
    * BCP 47 (language tag) support in the locale class (supporting "script", 3-letter language codes, 3-digit region codes)
    * Locale data coverage: much better, many more locales, up to date
* Broader charset converter coverage
    * In ICU4J 4.2, also output charset selection
    * Custom fallback in charset converter
* Other features missing in the JDK
    * Dates:
        * Many more date formats: month+day, year+month, ...
        * Date interval formats: "Dec 15-17, 2009"
        * APIs for returning time zone transitions
    * Other formatting:
        * Plural formatting, including units: "1 hour" / "2 hours"
        * Rule-based number format ("three thousand two hundred")
    * Extensive non-Gregorian calendar support
    * Transliterator (for flexible text/script transformations)
    * Collation-sensitive string search
* Same data as ICU4C, allowing the same behavior across programming languages
* All Unicode character properties: over 80, while Java provides access to only about 10
* Thai word break

## Performance & Size

* Instantiation times are comparable
    * Common instantiate-and-reuse model
    * ICU4J and Java both use caches to limit impact
* Collation performance *many times* faster
    * Sorting: 2 to 20 times faster
    * Sort key generation: 1.5 to 4 times faster
    * Sort key length: 2/3 to 1/4 the length of Java sort keys
* Property access much faster (isLetter, isWhitespace, ...)
* Can easily produce a scaled-down version (by removing data)

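The instantiate-and-reuse model mentioned above can be illustrated with the JDK's own `java.text.Collator`; the same pattern applies to `com.ibm.icu.text.Collator`. A minimal sketch:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class ReuseSketch {
    public static void main(String[] args) {
        String[] data = {"pear", "apple", "orange"};

        // Preferred: instantiate the collator once and reuse it for the
        // whole sort, so any internal caching pays off.
        Collator collator = Collator.getInstance(Locale.US);
        Arrays.sort(data, collator);

        // Anti-pattern (shown for contrast): creating a fresh Collator per
        // comparison defeats reuse.
        // Arrays.sort(data, (a, b) -> Collator.getInstance(Locale.US).compare(a, b));

        System.out.println(Arrays.asList(data)); // [apple, orange, pear]
    }
}
```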
## API
* Subclasses of JDK classes where possible
* Drop-in (a change of import) if not

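"Drop-in" means code like the following keeps working when the import changes from `java.text` to `com.ibm.icu.text`. A sketch using the JDK classes (both `Collator` classes expose `getInstance(Locale)` and act as comparators, so only the import differs):

```java
import java.text.Collator;   // drop-in swap: com.ibm.icu.text.Collator
import java.util.Arrays;
import java.util.Locale;

public class DropInSketch {
    public static void main(String[] args) {
        // Accent-aware French sorting; with the ICU4J Collator the code
        // body is identical, only the import line changes.
        Collator collator = Collator.getInstance(Locale.FRENCH);
        String[] words = {"côte", "cote", "côté", "coté"};
        Arrays.sort(words, collator);
        System.out.println(Arrays.asList(words));
    }
}
```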
## Summary

* **ICU4J is not for you if**
    * you have tight size constraints
    * you require the Java runtime behavior

* **ICU4J is for you if**
    * you need full compliance with current standards
    * you need current or additional locale and property data
    * you need customizability
    * you need features missing from Java (normalization, collation, ...)
    * you need better performance