ICU-13530 add UCPTrie/CodePointTrie, switch normalization to use it (#48)

* ICU-13530 copy C/C++ files UTrie2 -> UTrie3

X-SVN-Rev: 40754

* ICU-13530 UTrie3 new files copied from UTrie2: rename types/functions/macros

X-SVN-Rev: 40755

* ICU-13530 debug-print building each UTrie2

X-SVN-Rev: 40756

* ICU-13530 remove two-byte-UTF-8 errorValue block; move highValue from end of data array into header; add errorValue to header

X-SVN-Rev: 40762

* ICU-13530 UTrie3 U16_NEXT/PREV: errorValue for unpaired surrogates

X-SVN-Rev: 40763

* ICU-13530 no more separate values for lead surrogate code units

X-SVN-Rev: 40764

* ICU-13530 change from 11:5 trie bits to 10:6 for simpler UTF-8 code

X-SVN-Rev: 40766

* ICU-13530 UTrie2 build UTrie3 as well, print sizes

X-SVN-Rev: 40767

* ICU-13530 debug-print countSame, sumOverlaps, countInitial

X-SVN-Rev: 40768

* ICU-13530 debug-print whether trie is for CanonIterData

X-SVN-Rev: 40769

* ICU-13530 no index-shift for BMP data, no separate index-2 for 2-byte UTF-8; builder changes incomplete

X-SVN-Rev: 40777

* ICU-13530 remove errorValue and highStart from UNewTrie3

X-SVN-Rev: 40778

* ICU-13530 rewrite UTrie3 builder code

X-SVN-Rev: 40783

* ICU-13530 UTrie3 bug fixes

X-SVN-Rev: 40788

* ICU-13530 fully re-inline _UTRIE3_U8_NEXT()

X-SVN-Rev: 40790

* ICU-13530 find most common all-same data block for dataNullBlock and initialValue

X-SVN-Rev: 40792

* ICU-13530 UTrie3 iterator functions take start and return the end of a range, rather than callback call for each range

X-SVN-Rev: 40800

* ICU-13530 mask off unused data value bits before building a UTrie3 with values less than 32 bits wide

X-SVN-Rev: 40803

* ICU-13530 split utrie3builder.h out of utrie3.h

X-SVN-Rev: 40804

* ICU-13530 separate types UTrie3 vs. UTrie3Builder, implement builder as wrapper over C++ class Trie3Builder in .cpp

X-SVN-Rev: 40809

* ICU-13530 function to make a UTrie3Builder from a UTrie3

X-SVN-Rev: 40810

* ICU-13530 debug-print some data; some cleanup

X-SVN-Rev: 40865

* ICU-13530 BMP 10:6 but supplementary 10:6:4

X-SVN-Rev: 40984

* ICU-13530 move errorValue & highValue to the end of the data table, minimal padding to 4 bytes

X-SVN-Rev: 41011

* ICU-13530 index-1 table gap of index-2 null blocks

X-SVN-Rev: 41018

* ICU-13530 test with more than 128k compacted data

X-SVN-Rev: 41034

* ICU-13530 supplementary bits 11:5:4 saves a little space

X-SVN-Rev: 41039

* ICU-13530 supplementary bits 6:5:5:4 instead of gap: about same size but simpler

X-SVN-Rev: 41050

* ICU-13530 remove unnecessary utrie3_clone(built trie)

X-SVN-Rev: 41058

* ICU-13530 remove unnecessary UTrie3StringIterator

X-SVN-Rev: 41059

* ICU-13530 back to UTRIE3_GET...() macros *returning* data values

X-SVN-Rev: 41060

* ICU-13530 fast vs. small

X-SVN-Rev: 41066

* ICU-13530 always load NFC data, add simple normalization performance test

X-SVN-Rev: 41110

* ICU-13530 change normalization main trie to UTrie3 with special values for lead surrogates; forbid non-inert surrogate code *points* because unable to store values different from code *units*; runtime code work around that for code point lookup and iteration; adjust UTS 46 for normalization no longer mapping unpaired surrogates to U+FFFD

X-SVN-Rev: 41122

* ICU-13530 simplenormperf bug fix and NFC base line

X-SVN-Rev: 41126

* ICU-13530 move normalization getRange skipping lead surrogates to API getRangeSkipLead()

X-SVN-Rev: 41182

* ICU-13530 switch CanonIterData and gennorm2 Norms to UTrie3

X-SVN-Rev: 41183

* ICU-13530 remove unused overwrite parameter from setRange()

X-SVN-Rev: 41184

* ICU-13530 getRange skip lead -> fixed surrogates

X-SVN-Rev: 41219

* ICU-13530 minor cleanup

X-SVN-Rev: 41221

* ICU-13530 UTS 46 code map unpaired surrogates to U+FFFD before normalization

X-SVN-Rev: 41224

* ICU-13530 minor internal-docs cleanup

X-SVN-Rev: 41225

* ICU-13530 rename UTrie3 to UCPTrie, and other name changes

X-SVN-Rev: 41226

* ICU-13530 add 8-bit data option; add type-any & valueBits-any for fromBinary(); macros consistently source type then data width

X-SVN-Rev: 41234

* ICU-13530 scrub the API docs for the proposal

X-SVN-Rev: 41319

* ICU-13530 tag internal definitions as such, or move them to an internal header

X-SVN-Rev: 41320

* ICU-13530 Java API skeleton

X-SVN-Rev: 41326

* ICU-13530 API feedback: ValueWidth, MutableCodePointTrie, base CodePointMap, ...

X-SVN-Rev: 41382

* ICU-13530 add UCPTrie valueWidth field and padding, and combine data pointers into a union

X-SVN-Rev: 41408

* ICU-13530 switch some macros to using dataAccess parameter: separate index vs. data lookups, no macro variant for each value width

X-SVN-Rev: 41409

* ICU-13530 StringIterator is no longer a java.util.Iterator (bad fit)

X-SVN-Rev: 41455

* ICU-13530 CodePointTrie.java code complete

X-SVN-Rev: 41518

* ICU-13530 finish Java port incl test; keep C++ parallel

* ICU-13530 adjust API for feedback: rename HandleValue to FilterValue, change getRange+getRangeFixedSurr(bool allSurr) to enum RangeOption+getRange(enum option); change remaining C macros to use dataAccess for 16/32/8-bit value widths; fix/clarify some API docs

* ICU-13530 add javadoc

* ICU-13530 document UCPTrie binary data format

* ICU-13530 update .nrm formatVersion 3->4, document change in surrogate handling with new trie

* ICU-13530 re-hardcode NFC data

* move trie swapper code into new file; add new files to Windows project files; turn off trie debugging

* ICU-13530 minor cleanup

* ICU-13530 test more range starts; fix a C test leak

* ICU-13530 regenerate Java data from scratch

* ICU-13530 review feedback changes: API docs typos, more @internal, C++11 field initializers, fix potential leak in MutableCodePointTrie::fromUCPTrie()

* ICU-13530 rename interface FilterValue to ValueFilter
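
Illustration of the "10:6" split mentioned in the list above: a minimal, self-contained sketch of a two-stage table in which the upper 10 bits of a BMP code point select an index entry and the lower 6 bits select a value inside a 64-entry data block. The names `ToyBmpTrie` and `buildToyBmpTrie` are hypothetical and the block sharing is deliberately naive; this is not the UCPTrie data structure or its builder, and it covers BMP code points only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr int kShift = 6;                  // lower 6 bits index into a block
constexpr int kBlockLength = 1 << kShift;  // 64 values per data block
constexpr int kMask = kBlockLength - 1;    // 0x3f

// Toy two-stage table for U+0000..U+FFFF (hypothetical type, not ICU's).
struct ToyBmpTrie {
    std::vector<uint16_t> index;  // 1024 entries: data offset of each block
    std::vector<uint16_t> data;   // concatenated unique blocks

    uint16_t get(uint16_t c) const {
        return data[index[c >> kShift] + (c & kMask)];
    }
};

// Builds the toy table from a flat 65536-entry array, storing each distinct
// 64-value block only once.
ToyBmpTrie buildToyBmpTrie(const std::vector<uint16_t> &values) {
    assert(values.size() == 0x10000);
    ToyBmpTrie trie;
    std::map<std::vector<uint16_t>, uint16_t> seen;  // block contents -> offset
    for (size_t start = 0; start < values.size(); start += kBlockLength) {
        std::vector<uint16_t> block(values.begin() + start,
                                    values.begin() + start + kBlockLength);
        auto it = seen.find(block);
        if (it == seen.end()) {
            uint16_t offset = static_cast<uint16_t>(trie.data.size());
            trie.data.insert(trie.data.end(), block.begin(), block.end());
            it = seen.emplace(block, offset).first;
        }
        trie.index.push_back(it->second);
    }
    return trie;
}
```

With a 6-bit block size, a lookup for a BMP code point needs one shift, one mask, and two array reads.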
Markus Scherer 2018-08-14 14:04:10 -07:00 committed by Shane Carr
parent 8a52f44951
commit fe3eb3ed5c
60 changed files with 11129 additions and 1486 deletions

@ -81,7 +81,7 @@ LIBS = $(LIBICUDT) $(DEFAULT_LIBS)
OBJECTS = errorcode.o putil.o umath.o utypes.o uinvchar.o umutex.o ucln_cmn.o \
uinit.o uobject.o cmemory.o charstr.o cstr.o \
udata.o ucmndata.o udatamem.o umapfile.o udataswp.o ucol_swp.o utrace.o \
udata.o ucmndata.o udatamem.o umapfile.o udataswp.o utrie_swap.o ucol_swp.o utrace.o \
uhash.o uhash_us.o uenum.o ustrenum.o uvector.o ustack.o uvectr32.o uvectr64.o \
ucnv.o ucnv_bld.o ucnv_cnv.o ucnv_io.o ucnv_cb.o ucnv_err.o ucnvlat1.o \
ucnv_u7.o ucnv_u8.o ucnv_u16.o ucnv_u32.o ucnvscsu.o ucnvbocu.o \
@ -102,7 +102,8 @@ normalizer2impl.o normalizer2.o filterednormalizer2.o normlzr.o unorm.o unormcmp
chariter.o schriter.o uchriter.o uiter.o \
patternprops.o uchar.o uprops.o ucase.o propname.o ubidi_props.o ubidi.o ubidiwrt.o ubidiln.o ushape.o \
uscript.o uscript_props.o usc_impl.o unames.o \
utrie.o utrie2.o utrie2_builder.o bmpset.o unisetspan.o uset_props.o uniset_props.o uniset_closure.o uset.o uniset.o usetiter.o ruleiter.o caniter.o unifilt.o unifunct.o \
utrie.o utrie2.o utrie2_builder.o ucptrie.o umutablecptrie.o \
bmpset.o unisetspan.o uset_props.o uniset_props.o uniset_closure.o uset.o uniset.o usetiter.o ruleiter.o caniter.o unifilt.o unifunct.o \
uarrsort.o brkiter.o ubrk.o brkeng.o dictbe.o filteredbrk.o \
rbbi.o rbbidata.o rbbinode.o rbbirb.o rbbiscan.o rbbisetb.o rbbistbl.o rbbitblb.o rbbi_cache.o \
serv.o servnotf.o servls.o servlk.o servlkf.o servrbf.o servslkf.o \

@ -181,6 +181,7 @@
<ClCompile Include="ustack.cpp" />
<ClCompile Include="ustrenum.cpp" />
<ClCompile Include="utrie.cpp" />
<ClCompile Include="utrie_swap.cpp" />
<ClCompile Include="utrie2.cpp" />
<ClCompile Include="utrie2_builder.cpp" />
<ClCompile Include="uvector.cpp" />
@ -314,8 +315,10 @@
<ClCompile Include="ucharstriebuilder.cpp" />
<ClCompile Include="ucharstrieiterator.cpp" />
<ClCompile Include="uchriter.cpp" />
<ClCompile Include="ucptrie.cpp" />
<ClCompile Include="uinvchar.cpp" />
<ClCompile Include="uiter.cpp" />
<ClCompile Include="umutablecptrie.cpp" />
<ClCompile Include="unistr.cpp" />
<ClCompile Include="unistr_case.cpp" />
<ClCompile Include="unistr_case_locale.cpp" />

@ -139,6 +139,9 @@
<ClCompile Include="utrie.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="utrie_swap.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="utrie2.cpp">
<Filter>collections</Filter>
</ClCompile>
@ -589,6 +592,12 @@
<ClCompile Include="ucharstrieiterator.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="ucptrie.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="umutablecptrie.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="patternprops.cpp">
<Filter>properties &amp; sets</Filter>
</ClCompile>
@ -1204,6 +1213,12 @@
<CustomBuild Include="unicode\ucharstriebuilder.h">
<Filter>collections</Filter>
</CustomBuild>
<CustomBuild Include="unicode\ucptrie.h">
<Filter>collections</Filter>
</CustomBuild>
<CustomBuild Include="unicode\umutablecptrie.h">
<Filter>collections</Filter>
</CustomBuild>
<CustomBuild Include="unicode\enumset.h">
<Filter>data &amp; memory</Filter>
</CustomBuild>
@ -1217,4 +1232,4 @@
<Filter>strings</Filter>
</CustomBuild>
</ItemGroup>
</Project>
</Project>

@ -304,6 +304,7 @@
<ClCompile Include="ustack.cpp" />
<ClCompile Include="ustrenum.cpp" />
<ClCompile Include="utrie.cpp" />
<ClCompile Include="utrie_swap.cpp" />
<ClCompile Include="utrie2.cpp" />
<ClCompile Include="utrie2_builder.cpp" />
<ClCompile Include="uvector.cpp" />
@ -439,9 +440,11 @@
<ClCompile Include="ucharstrie.cpp" />
<ClCompile Include="ucharstriebuilder.cpp" />
<ClCompile Include="ucharstrieiterator.cpp" />
<ClCompile Include="ucptrie.cpp" />
<ClCompile Include="uchriter.cpp" />
<ClCompile Include="uinvchar.cpp" />
<ClCompile Include="uiter.cpp" />
<ClCompile Include="umutablecptrie.cpp" />
<ClCompile Include="unistr.cpp" />
<ClCompile Include="unistr_case.cpp" />
<ClCompile Include="unistr_case_locale.cpp" />

@ -18,6 +18,7 @@
#include "unicode/udata.h"
#include "unicode/localpointer.h"
#include "unicode/normalizer2.h"
#include "unicode/ucptrie.h"
#include "unicode/unistr.h"
#include "unicode/unorm.h"
#include "cstring.h"
@ -42,12 +43,12 @@ private:
isAcceptable(void *context, const char *type, const char *name, const UDataInfo *pInfo);
UDataMemory *memory;
UTrie2 *ownedTrie;
UCPTrie *ownedTrie;
};
LoadedNormalizer2Impl::~LoadedNormalizer2Impl() {
udata_close(memory);
utrie2_close(ownedTrie);
ucptrie_close(ownedTrie);
}
UBool U_CALLCONV
@ -62,7 +63,7 @@ LoadedNormalizer2Impl::isAcceptable(void * /*context*/,
pInfo->dataFormat[1]==0x72 &&
pInfo->dataFormat[2]==0x6d &&
pInfo->dataFormat[3]==0x32 &&
pInfo->formatVersion[0]==3
pInfo->formatVersion[0]==4
) {
// Normalizer2Impl *me=(Normalizer2Impl *)context;
// uprv_memcpy(me->dataVersion, pInfo->dataVersion, 4);
@ -91,9 +92,9 @@ LoadedNormalizer2Impl::load(const char *packageName, const char *name, UErrorCod
int32_t offset=inIndexes[IX_NORM_TRIE_OFFSET];
int32_t nextOffset=inIndexes[IX_EXTRA_DATA_OFFSET];
ownedTrie=utrie2_openFromSerialized(UTRIE2_16_VALUE_BITS,
inBytes+offset, nextOffset-offset, NULL,
&errorCode);
ownedTrie=ucptrie_openFromBinary(UCPTRIE_TYPE_FAST, UCPTRIE_VALUE_BITS_16,
inBytes+offset, nextOffset-offset, NULL,
&errorCode);
if(U_FAILURE(errorCode)) {
return;
}
@ -131,15 +132,26 @@ U_CDECL_BEGIN
static UBool U_CALLCONV uprv_loaded_normalizer2_cleanup();
U_CDECL_END
static Norm2AllModes *nfkcSingleton;
static Norm2AllModes *nfkc_cfSingleton;
static UHashtable *cache=NULL;
#if !NORM2_HARDCODE_NFC_DATA
static Norm2AllModes *nfcSingleton;
static icu::UInitOnce nfcInitOnce = U_INITONCE_INITIALIZER;
#endif
static Norm2AllModes *nfkcSingleton;
static icu::UInitOnce nfkcInitOnce = U_INITONCE_INITIALIZER;
static Norm2AllModes *nfkc_cfSingleton;
static icu::UInitOnce nfkc_cfInitOnce = U_INITONCE_INITIALIZER;
static UHashtable *cache=NULL;
// UInitOnce singleton initialization function
static void U_CALLCONV initSingletons(const char *what, UErrorCode &errorCode) {
#if !NORM2_HARDCODE_NFC_DATA
if (uprv_strcmp(what, "nfc") == 0) {
nfcSingleton = Norm2AllModes::createInstance(NULL, "nfc", errorCode);
} else
#endif
if (uprv_strcmp(what, "nfkc") == 0) {
nfkcSingleton = Norm2AllModes::createInstance(NULL, "nfkc", errorCode);
} else if (uprv_strcmp(what, "nfkc_cf") == 0) {
@ -157,19 +169,36 @@ static void U_CALLCONV deleteNorm2AllModes(void *allModes) {
}
static UBool U_CALLCONV uprv_loaded_normalizer2_cleanup() {
#if !NORM2_HARDCODE_NFC_DATA
delete nfcSingleton;
nfcSingleton = NULL;
nfcInitOnce.reset();
#endif
delete nfkcSingleton;
nfkcSingleton = NULL;
nfkcInitOnce.reset();
delete nfkc_cfSingleton;
nfkc_cfSingleton = NULL;
nfkc_cfInitOnce.reset();
uhash_close(cache);
cache=NULL;
nfkcInitOnce.reset();
nfkc_cfInitOnce.reset();
return TRUE;
}
U_CDECL_END
#if !NORM2_HARDCODE_NFC_DATA
const Norm2AllModes *
Norm2AllModes::getNFCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
umtx_initOnce(nfcInitOnce, &initSingletons, "nfc", errorCode);
return nfcSingleton;
}
#endif
const Norm2AllModes *
Norm2AllModes::getNFKCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
@ -184,6 +213,36 @@ Norm2AllModes::getNFKC_CFInstance(UErrorCode &errorCode) {
return nfkc_cfSingleton;
}
#if !NORM2_HARDCODE_NFC_DATA
const Normalizer2 *
Normalizer2::getNFCInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->comp : NULL;
}
const Normalizer2 *
Normalizer2::getNFDInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->decomp : NULL;
}
const Normalizer2 *Normalizer2Factory::getFCDInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->fcd : NULL;
}
const Normalizer2 *Normalizer2Factory::getFCCInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->fcc : NULL;
}
const Normalizer2Impl *
Normalizer2Factory::getNFCImpl(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? allModes->impl : NULL;
}
#endif
const Normalizer2 *
Normalizer2::getNFKCInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFKCInstance(errorCode);

File diff suppressed because it is too large

@ -34,9 +34,11 @@
using icu::Normalizer2Impl;
#if NORM2_HARDCODE_NFC_DATA
// NFC/NFD data machine-generated by gennorm2 --csource
#define INCLUDED_FROM_NORMALIZER2_CPP
#include "norm2_nfc_data.h"
#endif
U_NAMESPACE_BEGIN
@ -176,6 +178,36 @@ FCDNormalizer2::~FCDNormalizer2() {}
// instance cache ---------------------------------------------------------- ***
U_CDECL_BEGIN
static UBool U_CALLCONV uprv_normalizer2_cleanup();
U_CDECL_END
static Normalizer2 *noopSingleton;
static icu::UInitOnce noopInitOnce = U_INITONCE_INITIALIZER;
static void U_CALLCONV initNoopSingleton(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return;
}
noopSingleton=new NoopNormalizer2;
if(noopSingleton==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
return;
}
ucln_common_registerCleanup(UCLN_COMMON_NORMALIZER2, uprv_normalizer2_cleanup);
}
const Normalizer2 *Normalizer2Factory::getNoopInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
umtx_initOnce(noopInitOnce, &initNoopSingleton, errorCode);
return noopSingleton;
}
const Normalizer2Impl *
Normalizer2Factory::getImpl(const Normalizer2 *norm2) {
return &((Normalizer2WithImpl *)norm2)->impl;
}
Norm2AllModes::~Norm2AllModes() {
delete impl;
}
@ -195,6 +227,7 @@ Norm2AllModes::createInstance(Normalizer2Impl *impl, UErrorCode &errorCode) {
return allModes;
}
#if NORM2_HARDCODE_NFC_DATA
Norm2AllModes *
Norm2AllModes::createNFCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
@ -210,48 +243,15 @@ Norm2AllModes::createNFCInstance(UErrorCode &errorCode) {
return createInstance(impl, errorCode);
}
U_CDECL_BEGIN
static UBool U_CALLCONV uprv_normalizer2_cleanup();
U_CDECL_END
static Norm2AllModes *nfcSingleton;
static Normalizer2 *noopSingleton;
static icu::UInitOnce nfcInitOnce = U_INITONCE_INITIALIZER;
static icu::UInitOnce noopInitOnce = U_INITONCE_INITIALIZER;
// UInitOnce singleton initialization functions
static void U_CALLCONV initNFCSingleton(UErrorCode &errorCode) {
nfcSingleton=Norm2AllModes::createNFCInstance(errorCode);
ucln_common_registerCleanup(UCLN_COMMON_NORMALIZER2, uprv_normalizer2_cleanup);
}
static void U_CALLCONV initNoopSingleton(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return;
}
noopSingleton=new NoopNormalizer2;
if(noopSingleton==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
return;
}
ucln_common_registerCleanup(UCLN_COMMON_NORMALIZER2, uprv_normalizer2_cleanup);
}
U_CDECL_BEGIN
static UBool U_CALLCONV uprv_normalizer2_cleanup() {
delete nfcSingleton;
nfcSingleton = NULL;
delete noopSingleton;
noopSingleton = NULL;
nfcInitOnce.reset();
noopInitOnce.reset();
return TRUE;
}
U_CDECL_END
const Norm2AllModes *
Norm2AllModes::getNFCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
@ -281,23 +281,29 @@ const Normalizer2 *Normalizer2Factory::getFCCInstance(UErrorCode &errorCode) {
return allModes!=NULL ? &allModes->fcc : NULL;
}
const Normalizer2 *Normalizer2Factory::getNoopInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
umtx_initOnce(noopInitOnce, &initNoopSingleton, errorCode);
return noopSingleton;
}
const Normalizer2Impl *
Normalizer2Factory::getNFCImpl(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? allModes->impl : NULL;
}
#endif // NORM2_HARDCODE_NFC_DATA
const Normalizer2Impl *
Normalizer2Factory::getImpl(const Normalizer2 *norm2) {
return &((Normalizer2WithImpl *)norm2)->impl;
U_CDECL_BEGIN
static UBool U_CALLCONV uprv_normalizer2_cleanup() {
delete noopSingleton;
noopSingleton = NULL;
noopInitOnce.reset();
#if NORM2_HARDCODE_NFC_DATA
delete nfcSingleton;
nfcSingleton = NULL;
nfcInitOnce.reset();
#endif
return TRUE;
}
U_CDECL_END
U_NAMESPACE_END
// C API ------------------------------------------------------------------- ***

@ -16,6 +16,8 @@
* created by: Markus W. Scherer
*/
// #define UCPTRIE_DEBUG
#include "unicode/utypes.h"
#if !UCONFIG_NO_NORMALIZATION
@ -24,7 +26,9 @@
#include "unicode/edits.h"
#include "unicode/normalizer2.h"
#include "unicode/stringoptions.h"
#include "unicode/ucptrie.h"
#include "unicode/udata.h"
#include "unicode/umutablecptrie.h"
#include "unicode/ustring.h"
#include "unicode/utf16.h"
#include "unicode/utf8.h"
@ -34,8 +38,8 @@
#include "normalizer2impl.h"
#include "putilimp.h"
#include "uassert.h"
#include "ucptrie_impl.h"
#include "uset_imp.h"
#include "utrie2.h"
#include "uvector.h"
U_NAMESPACE_BEGIN
@ -62,7 +66,7 @@ inline uint8_t leadByteForCP(UChar32 c) {
* Returns the code point from one single well-formed UTF-8 byte sequence
* between cpStart and cpLimit.
*
* UTrie2 UTF-8 macros do not assemble whole code points (for efficiency).
* Trie UTF-8 macros do not assemble whole code points (for efficiency).
* When we do need the code point, we call this function.
* We should not need it for normalization-inert data (norm16==0).
* Illegal sequences yield the error value norm16==0 just like real normalization-inert code points.
@ -253,7 +257,7 @@ UBool ReorderingBuffer::appendSupplementary(UChar32 c, uint8_t cc, UErrorCode &e
return TRUE;
}
UBool ReorderingBuffer::append(const UChar *s, int32_t length,
UBool ReorderingBuffer::append(const UChar *s, int32_t length, UBool isNFD,
uint8_t leadCC, uint8_t trailCC,
UErrorCode &errorCode) {
if(length==0) {
@ -280,8 +284,11 @@ UBool ReorderingBuffer::append(const UChar *s, int32_t length,
while(i<length) {
U16_NEXT(s, i, length, c);
if(i<length) {
// s must be in NFD, otherwise we need to use getCC().
leadCC=Normalizer2Impl::getCCFromYesOrMaybe(impl.getNorm16(c));
if (isNFD) {
leadCC = Normalizer2Impl::getCCFromYesOrMaybe(impl.getRawNorm16(c));
} else {
leadCC = impl.getCC(impl.getNorm16(c));
}
} else {
leadCC=trailCC;
}
@ -411,7 +418,8 @@ struct CanonIterData : public UMemory {
CanonIterData(UErrorCode &errorCode);
~CanonIterData();
void addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode &errorCode);
UTrie2 *trie;
UMutableCPTrie *mutableTrie;
UCPTrie *trie;
UVector canonStartSets; // contains UnicodeSet *
};
@ -420,7 +428,7 @@ Normalizer2Impl::~Normalizer2Impl() {
}
void
Normalizer2Impl::init(const int32_t *inIndexes, const UTrie2 *inTrie,
Normalizer2Impl::init(const int32_t *inIndexes, const UCPTrie *inTrie,
const uint16_t *inExtraData, const uint8_t *inSmallFCD) {
minDecompNoCP = static_cast<UChar>(inIndexes[IX_MIN_DECOMP_NO_CP]);
minCompNoMaybeCP = static_cast<UChar>(inIndexes[IX_MIN_COMP_NO_MAYBE_CP]);
@ -445,75 +453,8 @@ Normalizer2Impl::init(const int32_t *inIndexes, const UTrie2 *inTrie,
smallFCD=inSmallFCD;
}
class LcccContext {
public:
LcccContext(const Normalizer2Impl &ni, UnicodeSet &s) : impl(ni), set(s) {}
void handleRange(UChar32 start, UChar32 end, uint16_t norm16) {
if (norm16 > Normalizer2Impl::MIN_NORMAL_MAYBE_YES &&
norm16 != Normalizer2Impl::JAMO_VT) {
set.add(start, end);
} else if (impl.minNoNoCompNoMaybeCC <= norm16 && norm16 < impl.limitNoNo) {
uint16_t fcd16=impl.getFCD16(start);
if(fcd16>0xff) { set.add(start, end); }
}
}
private:
const Normalizer2Impl &impl;
UnicodeSet &set;
};
namespace {
struct PropertyStartsContext {
PropertyStartsContext(const Normalizer2Impl &ni, const USetAdder *adder)
: impl(ni), sa(adder) {}
const Normalizer2Impl &impl;
const USetAdder *sa;
};
} // namespace
U_CDECL_BEGIN
static UBool U_CALLCONV
enumLcccRange(const void *context, UChar32 start, UChar32 end, uint32_t value) {
((LcccContext *)context)->handleRange(start, end, (uint16_t)value);
return TRUE;
}
static UBool U_CALLCONV
enumNorm16PropertyStartsRange(const void *context, UChar32 start, UChar32 end, uint32_t value) {
/* add the start code point to the USet */
const PropertyStartsContext *ctx=(const PropertyStartsContext *)context;
const USetAdder *sa=ctx->sa;
sa->add(sa->set, start);
if (start != end && ctx->impl.isAlgorithmicNoNo((uint16_t)value) &&
(value & Normalizer2Impl::DELTA_TCCC_MASK) > Normalizer2Impl::DELTA_TCCC_1) {
// Range of code points with same-norm16-value algorithmic decompositions.
// They might have different non-zero FCD16 values.
uint16_t prevFCD16=ctx->impl.getFCD16(start);
while(++start<=end) {
uint16_t fcd16=ctx->impl.getFCD16(start);
if(fcd16!=prevFCD16) {
sa->add(sa->set, start);
prevFCD16=fcd16;
}
}
}
return TRUE;
}
static UBool U_CALLCONV
enumPropertyStartsRange(const void *context, UChar32 start, UChar32 /*end*/, uint32_t /*value*/) {
/* add the start code point to the USet */
const USetAdder *sa=(const USetAdder *)context;
sa->add(sa->set, start);
return TRUE;
}
static uint32_t U_CALLCONV
segmentStarterMapper(const void * /*context*/, uint32_t value) {
return value&CANON_NOT_SEGMENT_STARTER;
@ -523,15 +464,44 @@ U_CDECL_END
void
Normalizer2Impl::addLcccChars(UnicodeSet &set) const {
LcccContext context(*this, set);
utrie2_enum(normTrie, NULL, enumLcccRange, &context);
UChar32 start = 0, end;
uint32_t norm16;
while ((end = ucptrie_getRange(normTrie, start, UCPTRIE_RANGE_FIXED_LEAD_SURROGATES, INERT,
nullptr, nullptr, &norm16)) >= 0) {
if (norm16 > Normalizer2Impl::MIN_NORMAL_MAYBE_YES &&
norm16 != Normalizer2Impl::JAMO_VT) {
set.add(start, end);
} else if (minNoNoCompNoMaybeCC <= norm16 && norm16 < limitNoNo) {
uint16_t fcd16 = getFCD16(start);
if (fcd16 > 0xff) { set.add(start, end); }
}
start = end + 1;
}
}
void
Normalizer2Impl::addPropertyStarts(const USetAdder *sa, UErrorCode & /*errorCode*/) const {
/* add the start code point of each same-value range of each trie */
PropertyStartsContext context(*this, sa);
utrie2_enum(normTrie, NULL, enumNorm16PropertyStartsRange, &context);
// Add the start code point of each same-value range of the trie.
UChar32 start = 0, end;
uint32_t value;
while ((end = ucptrie_getRange(normTrie, start, UCPTRIE_RANGE_FIXED_LEAD_SURROGATES, INERT,
nullptr, nullptr, &value)) >= 0) {
sa->add(sa->set, start);
if (start != end && isAlgorithmicNoNo((uint16_t)value) &&
(value & Normalizer2Impl::DELTA_TCCC_MASK) > Normalizer2Impl::DELTA_TCCC_1) {
// Range of code points with same-norm16-value algorithmic decompositions.
// They might have different non-zero FCD16 values.
uint16_t prevFCD16 = getFCD16(start);
while (++start <= end) {
uint16_t fcd16 = getFCD16(start);
if (fcd16 != prevFCD16) {
sa->add(sa->set, start);
prevFCD16 = fcd16;
}
}
}
start = end + 1;
}
/* add Hangul LV syllables and LV+1 because of skippables */
for(UChar c=Hangul::HANGUL_BASE; c<Hangul::HANGUL_LIMIT; c+=Hangul::JAMO_T_COUNT) {
@ -543,10 +513,15 @@ Normalizer2Impl::addPropertyStarts(const USetAdder *sa, UErrorCode & /*errorCode
void
Normalizer2Impl::addCanonIterPropertyStarts(const USetAdder *sa, UErrorCode &errorCode) const {
/* add the start code point of each same-value range of the canonical iterator data trie */
if(ensureCanonIterData(errorCode)) {
// currently only used for the SEGMENT_STARTER property
utrie2_enum(fCanonIterData->trie, segmentStarterMapper, enumPropertyStartsRange, sa);
// Add the start code point of each same-value range of the canonical iterator data trie.
if (!ensureCanonIterData(errorCode)) { return; }
// Currently only used for the SEGMENT_STARTER property.
UChar32 start = 0, end;
uint32_t value;
while ((end = ucptrie_getRange(fCanonIterData->trie, start, UCPTRIE_RANGE_NORMAL, 0,
segmentStarterMapper, nullptr, &value)) >= 0) {
sa->add(sa->set, start);
start = end + 1;
}
}
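
The hunks above replace callback-based enumeration with a loop that asks for the end of the range starting at `start` and then continues at `end + 1`. The following is a minimal sketch of that loop shape over a toy map. `toyGetRange` and `collectRangeStarts` are hypothetical helpers written for this illustration; they are not `ucptrie_getRange()` and take none of its option, surrogate-value, or filter parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy map: one value per code point in [0, values.size()).
// Returns the last code point of the same-value range that begins at start,
// or -1 if start is out of range. Stores the range value in *pValue.
int32_t toyGetRange(const std::vector<uint32_t> &values, int32_t start,
                    uint32_t *pValue) {
    const int32_t limit = static_cast<int32_t>(values.size());
    if (start < 0 || start >= limit) { return -1; }
    const uint32_t value = values[start];
    int32_t end = start;
    while (end + 1 < limit && values[end + 1] == value) { ++end; }
    if (pValue != nullptr) { *pValue = value; }
    return end;
}

// Collects the first code point of each same-value range, using the
// "take start, return end" loop instead of a per-range callback.
std::vector<int32_t> collectRangeStarts(const std::vector<uint32_t> &values) {
    std::vector<int32_t> starts;
    int32_t start = 0, end;
    uint32_t value;
    while ((end = toyGetRange(values, start, &value)) >= 0) {
        starts.push_back(start);
        start = end + 1;
    }
    return starts;
}
```

Because the caller owns the loop, per-range state can live in local variables, which is why the context structs and `U_CALLCONV` callbacks could be dropped in the hunks above.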
@ -633,27 +608,23 @@ Normalizer2Impl::decompose(const UChar *src, const UChar *limit,
// count code units below the minimum or with irrelevant data for the quick check
for(prevSrc=src; src!=limit;) {
if( (c=*src)<minNoCP ||
isMostDecompYesAndZeroCC(norm16=UTRIE2_GET16_FROM_U16_SINGLE_LEAD(normTrie, c))
isMostDecompYesAndZeroCC(norm16=UCPTRIE_FAST_BMP_GET(normTrie, UCPTRIE_16, c))
) {
++src;
} else if(!U16_IS_SURROGATE(c)) {
} else if(!U16_IS_LEAD(c)) {
break;
} else {
UChar c2;
if(U16_IS_SURROGATE_LEAD(c)) {
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
c=U16_GET_SUPPLEMENTARY(c, c2);
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
c=U16_GET_SUPPLEMENTARY(c, c2);
norm16=UCPTRIE_FAST_SUPP_GET(normTrie, UCPTRIE_16, c);
if(isMostDecompYesAndZeroCC(norm16)) {
src+=2;
} else {
break;
}
} else /* trail surrogate */ {
if(prevSrc<src && U16_IS_LEAD(c2=*(src-1))) {
--src;
c=U16_GET_SUPPLEMENTARY(c2, c);
}
}
if(isMostDecompYesAndZeroCC(norm16=getNorm16(c))) {
src+=U16_LENGTH(c);
} else {
break;
++src; // unpaired lead surrogate: inert
}
}
}
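
The fast-path loop in the hunk above can be read as: look up non-lead code units directly, combine a lead surrogate with a following trail surrogate and look up the supplementary code point, and step over an unpaired lead surrogate as inert. The sketch below shows that control flow on its own. The predicate `isInertCodePoint` and the function `skipInert` are hypothetical, and the "non-inert" ranges are arbitrary choices made for the example, not normalization data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical property for illustration: only U+0300..U+036F and
// U+1D165..U+1D169 need work; every other code point is inert.
static bool isInertCodePoint(char32_t c) {
    return !((0x300 <= c && c <= 0x36F) || (0x1D165 <= c && c <= 0x1D169));
}

static bool isLead(char16_t u) { return (u & 0xFC00) == 0xD800; }
static bool isTrail(char16_t u) { return (u & 0xFC00) == 0xDC00; }

// Returns the index of the first code unit that starts a non-inert code
// point, or s.size() if the whole string is inert.
size_t skipInert(const std::u16string &s) {
    size_t i = 0;
    while (i < s.size()) {
        const char16_t u = s[i];
        if (!isLead(u)) {
            // BMP code point or unpaired trail surrogate: use the unit itself.
            if (!isInertCodePoint(u)) { break; }
            ++i;
        } else if (i + 1 < s.size() && isTrail(s[i + 1])) {
            // Surrogate pair: assemble the supplementary code point.
            const char32_t c = 0x10000 + ((char32_t(u) - 0xD800) << 10) +
                               (s[i + 1] - 0xDC00);
            if (!isInertCodePoint(c)) { break; }
            i += 2;
        } else {
            ++i;  // unpaired lead surrogate: inert
        }
    }
    return i;
}
```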
@ -713,7 +684,7 @@ Normalizer2Impl::decomposeShort(const UChar *src, const UChar *limit,
const UChar *prevSrc = src;
UChar32 c;
uint16_t norm16;
UTRIE2_U16_NEXT16(normTrie, src, limit, c, norm16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, src, limit, c, norm16);
if (stopAtCompBoundary && norm16HasCompBoundaryBefore(norm16)) {
return prevSrc;
}
@ -737,7 +708,7 @@ UBool Normalizer2Impl::decompose(UChar32 c, uint16_t norm16,
}
// Maps to an isCompYesAndZeroCC.
c=mapAlgorithmic(c, norm16);
norm16=getNorm16(c);
norm16=getRawNorm16(c);
}
if (norm16 < minYesNo) {
// c does not decompose
@ -758,7 +729,7 @@ UBool Normalizer2Impl::decompose(UChar32 c, uint16_t norm16,
} else {
leadCC=0;
}
return buffer.append((const UChar *)mapping+1, length, leadCC, trailCC, errorCode);
return buffer.append((const UChar *)mapping+1, length, TRUE, leadCC, trailCC, errorCode);
}
const uint8_t *
@ -771,7 +742,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
while (src < limit) {
const uint8_t *prevSrc = src;
uint16_t norm16;
UTRIE2_U8_NEXT16(normTrie, src, limit, norm16);
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, src, limit, norm16);
// Get the decomposition and the lead and trail cc's.
UChar32 c = U_SENTINEL;
if (norm16 >= limitNoNo) {
@ -789,7 +760,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
}
c = codePointFromValidUTF8(prevSrc, src);
c = mapAlgorithmic(c, norm16);
norm16 = getNorm16(c);
norm16 = getRawNorm16(c);
} else if (stopAtCompBoundary && norm16 < minNoNoCompNoMaybeCC) {
return prevSrc;
}
@ -828,7 +799,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
} else {
leadCC = 0;
}
if (!buffer.append((const char16_t *)mapping+1, length, leadCC, trailCC, errorCode)) {
if (!buffer.append((const char16_t *)mapping+1, length, TRUE, leadCC, trailCC, errorCode)) {
return nullptr;
}
}
@ -854,7 +825,7 @@ Normalizer2Impl::getDecomposition(UChar32 c, UChar buffer[4], int32_t &length) c
length=0;
U16_APPEND_UNSAFE(buffer, length, c);
// The mapping might decompose further.
norm16 = getNorm16(c);
norm16 = getRawNorm16(c);
}
if (norm16 < minYesNo) {
return decomp;
@ -926,19 +897,30 @@ void Normalizer2Impl::decomposeAndAppend(const UChar *src, const UChar *limit,
return;
}
// Just merge the strings at the boundary.
ForwardUTrie2StringIterator iter(normTrie, src, limit);
uint8_t firstCC, prevCC, cc;
firstCC=prevCC=cc=getCC(iter.next16());
while(cc!=0) {
prevCC=cc;
cc=getCC(iter.next16());
};
bool isFirst = true;
uint8_t firstCC = 0, prevCC = 0, cc;
const UChar *p = src;
while (p != limit) {
const UChar *codePointStart = p;
UChar32 c;
uint16_t norm16;
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
if ((cc = getCC(norm16)) == 0) {
p = codePointStart;
break;
}
if (isFirst) {
firstCC = cc;
isFirst = false;
}
prevCC = cc;
}
if(limit==NULL) { // appendZeroCC() needs limit!=NULL
limit=u_strchr(iter.codePointStart, 0);
limit=u_strchr(p, 0);
}
if (buffer.append(src, (int32_t)(iter.codePointStart-src), firstCC, prevCC, errorCode)) {
buffer.appendZeroCC(iter.codePointStart, limit, errorCode);
if (buffer.append(src, (int32_t)(p - src), FALSE, firstCC, prevCC, errorCode)) {
buffer.appendZeroCC(p, limit, errorCode);
}
}
@ -1085,7 +1067,7 @@ void Normalizer2Impl::addComposites(const uint16_t *list, UnicodeSet &set) const
}
UChar32 composite=compositeAndFwd>>1;
if((compositeAndFwd&1)!=0) {
addComposites(getCompositionsListForComposite(getNorm16(composite)), set);
addComposites(getCompositionsListForComposite(getRawNorm16(composite)), set);
}
set.add(composite);
} while((firstUnit&COMP_1_LAST_TUPLE)==0);
@ -1124,7 +1106,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
prevCC=0;
for(;;) {
UTRIE2_U16_NEXT16(normTrie, p, limit, c, norm16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
cc=getCCFromYesOrMaybe(norm16);
if( // this character combines backward and
isMaybe(norm16) &&
@ -1229,7 +1211,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
// Is the composite a starter that combines forward?
if(compositeAndFwd&1) {
compositionsList=
getCompositionsListForComposite(getNorm16(composite));
getCompositionsListForComposite(getRawNorm16(composite));
} else {
compositionsList=NULL;
}
@ -1268,7 +1250,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
UChar32
Normalizer2Impl::composePair(UChar32 a, UChar32 b) const {
uint16_t norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16=0
uint16_t norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16
const uint16_t *list;
if(isInert(norm16)) {
return U_SENTINEL;
@ -1359,28 +1341,22 @@ Normalizer2Impl::compose(const UChar *src, const UChar *limit,
return TRUE;
}
if( (c=*src)<minNoMaybeCP ||
isCompYesAndZeroCC(norm16=UTRIE2_GET16_FROM_U16_SINGLE_LEAD(normTrie, c))
isCompYesAndZeroCC(norm16=UCPTRIE_FAST_BMP_GET(normTrie, UCPTRIE_16, c))
) {
++src;
} else {
prevSrc = src++;
if(!U16_IS_SURROGATE(c)) {
if(!U16_IS_LEAD(c)) {
break;
} else {
UChar c2;
if(U16_IS_SURROGATE_LEAD(c)) {
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
++src;
c=U16_GET_SUPPLEMENTARY(c, c2);
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
++src;
c=U16_GET_SUPPLEMENTARY(c, c2);
norm16=UCPTRIE_FAST_SUPP_GET(normTrie, UCPTRIE_16, c);
if(!isCompYesAndZeroCC(norm16)) {
break;
}
} else /* trail surrogate */ {
if(prevBoundary<prevSrc && U16_IS_LEAD(c2=*(prevSrc-1))) {
--prevSrc;
c=U16_GET_SUPPLEMENTARY(c2, c);
}
}
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
break;
}
}
}
@ -1529,7 +1505,7 @@ Normalizer2Impl::compose(const UChar *src, const UChar *limit,
}
uint8_t prevCC = cc;
nextSrc = src;
UTRIE2_U16_NEXT16(normTrie, nextSrc, limit, c, n16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, c, n16);
if (n16 >= MIN_YES_YES_WITH_CC) {
cc = getCCFromNormalYesOrMaybe(n16);
if (prevCC > cc) {
@ -1559,7 +1535,7 @@ Normalizer2Impl::compose(const UChar *src, const UChar *limit,
// decompose and recompose.
if (prevBoundary != prevSrc && !norm16HasCompBoundaryBefore(norm16)) {
const UChar *p = prevSrc;
UTRIE2_U16_PREV16(normTrie, prevBoundary, p, c, norm16);
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, prevBoundary, p, c, norm16);
if (!norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
prevSrc = p;
}
@ -1626,28 +1602,22 @@ Normalizer2Impl::composeQuickCheck(const UChar *src, const UChar *limit,
return src;
}
if( (c=*src)<minNoMaybeCP ||
isCompYesAndZeroCC(norm16=UTRIE2_GET16_FROM_U16_SINGLE_LEAD(normTrie, c))
isCompYesAndZeroCC(norm16=UCPTRIE_FAST_BMP_GET(normTrie, UCPTRIE_16, c))
) {
++src;
} else {
prevSrc = src++;
if(!U16_IS_SURROGATE(c)) {
if(!U16_IS_LEAD(c)) {
break;
} else {
UChar c2;
if(U16_IS_SURROGATE_LEAD(c)) {
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
++src;
c=U16_GET_SUPPLEMENTARY(c, c2);
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
++src;
c=U16_GET_SUPPLEMENTARY(c, c2);
norm16=UCPTRIE_FAST_SUPP_GET(normTrie, UCPTRIE_16, c);
if(!isCompYesAndZeroCC(norm16)) {
break;
}
} else /* trail surrogate */ {
if(prevBoundary<prevSrc && U16_IS_LEAD(c2=*(prevSrc-1))) {
--prevSrc;
c=U16_GET_SUPPLEMENTARY(c2, c);
}
}
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
break;
}
}
}
@ -1665,7 +1635,7 @@ Normalizer2Impl::composeQuickCheck(const UChar *src, const UChar *limit,
} else {
const UChar *p = prevSrc;
uint16_t n16;
UTRIE2_U16_PREV16(normTrie, prevBoundary, p, c, n16);
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, prevBoundary, p, c, n16);
if (norm16HasCompBoundaryAfter(n16, onlyContiguous)) {
prevBoundary = prevSrc;
} else {
@ -1699,7 +1669,7 @@ Normalizer2Impl::composeQuickCheck(const UChar *src, const UChar *limit,
}
uint8_t prevCC = cc;
nextSrc = src;
UTRIE2_U16_NEXT16(normTrie, nextSrc, limit, c, norm16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, c, norm16);
if (isMaybeOrNonZeroCC(norm16)) {
cc = getCCFromYesOrMaybe(norm16);
if (!(prevCC <= cc || cc == 0)) {
@ -1786,7 +1756,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
++src;
} else {
prevSrc = src;
UTRIE2_U8_NEXT16(normTrie, src, limit, norm16);
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, src, limit, norm16);
if (!isCompYesAndZeroCC(norm16)) {
break;
}
@ -1945,7 +1915,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
}
uint8_t prevCC = cc;
nextSrc = src;
UTRIE2_U8_NEXT16(normTrie, nextSrc, limit, n16);
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, n16);
if (n16 >= MIN_YES_YES_WITH_CC) {
cc = getCCFromNormalYesOrMaybe(n16);
if (prevCC > cc) {
@ -1975,7 +1945,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
// decompose and recompose.
if (prevBoundary != prevSrc && !norm16HasCompBoundaryBefore(norm16)) {
const uint8_t *p = prevSrc;
UTRIE2_U8_PREV16(normTrie, prevBoundary, p, norm16);
UCPTRIE_FAST_U8_PREV(normTrie, UCPTRIE_16, prevBoundary, p, norm16);
if (!norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
prevSrc = p;
}
@ -2023,7 +1993,7 @@ UBool Normalizer2Impl::hasCompBoundaryBefore(const UChar *src, const UChar *limi
}
UChar32 c;
uint16_t norm16;
UTRIE2_U16_NEXT16(normTrie, src, limit, c, norm16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, src, limit, c, norm16);
return norm16HasCompBoundaryBefore(norm16);
}
@ -2032,7 +2002,7 @@ UBool Normalizer2Impl::hasCompBoundaryBefore(const uint8_t *src, const uint8_t *
return TRUE;
}
uint16_t norm16;
UTRIE2_U8_NEXT16(normTrie, src, limit, norm16);
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, src, limit, norm16);
return norm16HasCompBoundaryBefore(norm16);
}
@ -2043,7 +2013,7 @@ UBool Normalizer2Impl::hasCompBoundaryAfter(const UChar *start, const UChar *p,
}
UChar32 c;
uint16_t norm16;
UTRIE2_U16_PREV16(normTrie, start, p, c, norm16);
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, start, p, c, norm16);
return norm16HasCompBoundaryAfter(norm16, onlyContiguous);
}
@ -2053,36 +2023,42 @@ UBool Normalizer2Impl::hasCompBoundaryAfter(const uint8_t *start, const uint8_t
return TRUE;
}
uint16_t norm16;
UTRIE2_U8_PREV16(normTrie, start, p, norm16);
UCPTRIE_FAST_U8_PREV(normTrie, UCPTRIE_16, start, p, norm16);
return norm16HasCompBoundaryAfter(norm16, onlyContiguous);
}
const UChar *Normalizer2Impl::findPreviousCompBoundary(const UChar *start, const UChar *p,
UBool onlyContiguous) const {
BackwardUTrie2StringIterator iter(normTrie, start, p);
for(;;) {
uint16_t norm16=iter.previous16();
while (p != start) {
const UChar *codePointLimit = p;
UChar32 c;
uint16_t norm16;
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, start, p, c, norm16);
if (norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
return iter.codePointLimit;
return codePointLimit;
}
if (hasCompBoundaryBefore(iter.codePoint, norm16)) {
return iter.codePointStart;
if (hasCompBoundaryBefore(c, norm16)) {
return p;
}
}
return p;
}
const UChar *Normalizer2Impl::findNextCompBoundary(const UChar *p, const UChar *limit,
UBool onlyContiguous) const {
ForwardUTrie2StringIterator iter(normTrie, p, limit);
for(;;) {
uint16_t norm16=iter.next16();
if (hasCompBoundaryBefore(iter.codePoint, norm16)) {
return iter.codePointStart;
while (p != limit) {
const UChar *codePointStart = p;
UChar32 c;
uint16_t norm16;
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
if (hasCompBoundaryBefore(c, norm16)) {
return codePointStart;
}
if (norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
return iter.codePointLimit;
return p;
}
}
return p;
}
uint8_t Normalizer2Impl::getPreviousTrailCC(const UChar *start, const UChar *p) const {
@ -2130,7 +2106,7 @@ uint16_t Normalizer2Impl::getFCD16FromNormData(UChar32 c) const {
}
// Maps to an isCompYesAndZeroCC.
c=mapAlgorithmic(c, norm16);
norm16=getNorm16(c);
norm16=getRawNorm16(c);
}
}
if(norm16<=minYesNo || isHangulLVT(norm16)) {
@ -2195,17 +2171,10 @@ Normalizer2Impl::makeFCD(const UChar *src, const UChar *limit,
prevFCD16=0;
++src;
} else {
if(U16_IS_SURROGATE(c)) {
if(U16_IS_LEAD(c)) {
UChar c2;
if(U16_IS_SURROGATE_LEAD(c)) {
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
c=U16_GET_SUPPLEMENTARY(c, c2);
}
} else /* trail surrogate */ {
if(prevSrc<src && U16_IS_LEAD(c2=*(src-1))) {
--src;
c=U16_GET_SUPPLEMENTARY(c2, c);
}
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
c=U16_GET_SUPPLEMENTARY(c, c2);
}
}
if((fcd16=getFCD16FromNormData(c))<=0xff) {
@ -2336,7 +2305,7 @@ const UChar *Normalizer2Impl::findPreviousFCDBoundary(const UChar *start, const
const UChar *codePointLimit = p;
UChar32 c;
uint16_t norm16;
UTRIE2_U16_PREV16(normTrie, start, p, c, norm16);
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, start, p, c, norm16);
if (c < minDecompNoCP || norm16HasDecompBoundaryAfter(norm16)) {
return codePointLimit;
}
@ -2352,7 +2321,7 @@ const UChar *Normalizer2Impl::findNextFCDBoundary(const UChar *p, const UChar *l
const UChar *codePointStart=p;
UChar32 c;
uint16_t norm16;
UTRIE2_U16_NEXT16(normTrie, p, limit, c, norm16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
if (c < minLcccCP || norm16HasDecompBoundaryBefore(norm16)) {
return codePointStart;
}
@ -2366,19 +2335,20 @@ const UChar *Normalizer2Impl::findNextFCDBoundary(const UChar *p, const UChar *l
// CanonicalIterator data -------------------------------------------------- ***
CanonIterData::CanonIterData(UErrorCode &errorCode) :
trie(utrie2_open(0, 0, &errorCode)),
mutableTrie(umutablecptrie_open(0, 0, &errorCode)), trie(nullptr),
canonStartSets(uprv_deleteUObject, NULL, errorCode) {}
CanonIterData::~CanonIterData() {
utrie2_close(trie);
umutablecptrie_close(mutableTrie);
ucptrie_close(trie);
}
void CanonIterData::addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode &errorCode) {
uint32_t canonValue=utrie2_get32(trie, decompLead);
uint32_t canonValue = umutablecptrie_get(mutableTrie, decompLead);
if((canonValue&(CANON_HAS_SET|CANON_VALUE_MASK))==0 && origin!=0) {
// origin is the first character whose decomposition starts with
// the character for which we are setting the value.
utrie2_set32(trie, decompLead, canonValue|origin, &errorCode);
umutablecptrie_set(mutableTrie, decompLead, canonValue|origin, &errorCode);
} else {
// origin is not the first character, or it is U+0000.
UnicodeSet *set;
@ -2390,7 +2360,7 @@ void CanonIterData::addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode
}
UChar32 firstOrigin=(UChar32)(canonValue&CANON_VALUE_MASK);
canonValue=(canonValue&~CANON_VALUE_MASK)|CANON_HAS_SET|(uint32_t)canonStartSets.size();
utrie2_set32(trie, decompLead, canonValue, &errorCode);
umutablecptrie_set(mutableTrie, decompLead, canonValue, &errorCode);
canonStartSets.addElement(set, errorCode);
if(firstOrigin!=0) {
set->add(firstOrigin);
@ -2406,7 +2376,6 @@ void CanonIterData::addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode
class InitCanonIterData {
public:
static void doInit(Normalizer2Impl *impl, UErrorCode &errorCode);
static void handleRange(Normalizer2Impl *impl, UChar32 start, UChar32 end, uint16_t value, UErrorCode &errorCode);
};
U_CDECL_BEGIN
@ -2417,18 +2386,6 @@ initCanonIterData(Normalizer2Impl *impl, UErrorCode &errorCode) {
InitCanonIterData::doInit(impl, errorCode);
}
// Call Normalizer2Impl::makeCanonIterDataFromNorm16() for a range of same-norm16 characters.
// context: the Normalizer2Impl
static UBool U_CALLCONV
enumCIDRangeHandler(const void *context, UChar32 start, UChar32 end, uint32_t value) {
UErrorCode errorCode = U_ZERO_ERROR;
if (value != Normalizer2Impl::INERT) {
Normalizer2Impl *impl = (Normalizer2Impl *)context;
InitCanonIterData::handleRange(impl, start, end, (uint16_t)value, errorCode);
}
return U_SUCCESS(errorCode);
}
U_CDECL_END
void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
@ -2438,8 +2395,24 @@ void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
}
if (U_SUCCESS(errorCode)) {
utrie2_enum(impl->normTrie, NULL, enumCIDRangeHandler, impl);
utrie2_freeze(impl->fCanonIterData->trie, UTRIE2_32_VALUE_BITS, &errorCode);
UChar32 start = 0, end;
uint32_t value;
while ((end = ucptrie_getRange(impl->normTrie, start,
UCPTRIE_RANGE_FIXED_LEAD_SURROGATES, Normalizer2Impl::INERT,
nullptr, nullptr, &value)) >= 0) {
// Call Normalizer2Impl::makeCanonIterDataFromNorm16() for a range of same-norm16 characters.
if (value != Normalizer2Impl::INERT) {
impl->makeCanonIterDataFromNorm16(start, end, value, *impl->fCanonIterData, errorCode);
}
start = end + 1;
}
#ifdef UCPTRIE_DEBUG
umutablecptrie_setName(impl->fCanonIterData->mutableTrie, "CanonIterData");
#endif
impl->fCanonIterData->trie = umutablecptrie_buildImmutable(
impl->fCanonIterData->mutableTrie, UCPTRIE_TYPE_SMALL, UCPTRIE_VALUE_BITS_32, &errorCode);
umutablecptrie_close(impl->fCanonIterData->mutableTrie);
impl->fCanonIterData->mutableTrie = nullptr;
}
if (U_FAILURE(errorCode)) {
delete impl->fCanonIterData;
@ -2447,11 +2420,6 @@ void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
}
}
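
The rewritten doInit() loop above replaces the utrie2_enum() callback with repeated calls that take a start and return the end of a range. The following is a minimal sketch of that calling pattern only. It does not use ICU: toyGetRange(), ToyRange and collectRanges() are hypothetical names over a plain vector, and the real ucptrie_getRange() takes additional option and filter arguments.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a "get range" call (not the ICU API):
// returns the last index of the run of equal values that begins at start,
// or -1 once start is outside the data.
int32_t toyGetRange(const std::vector<uint32_t> &data, int32_t start, uint32_t *pValue) {
    if (start < 0 || static_cast<size_t>(start) >= data.size()) {
        return -1;
    }
    uint32_t value = data[start];
    int32_t end = start;
    while (static_cast<size_t>(end) + 1 < data.size() && data[end + 1] == value) {
        ++end;
    }
    if (pValue != nullptr) {
        *pValue = value;
    }
    return end;
}

struct ToyRange {
    int32_t start;
    int32_t end;
    uint32_t value;
};

// Walks all ranges in order and keeps those whose value differs from skipValue,
// in the same start/end/start=end+1 shape as the loop in the diff.
std::vector<ToyRange> collectRanges(const std::vector<uint32_t> &data, uint32_t skipValue) {
    std::vector<ToyRange> ranges;
    int32_t start = 0, end;
    uint32_t value;
    while ((end = toyGetRange(data, start, &value)) >= 0) {
        if (value != skipValue) {
            ranges.push_back({start, end, value});
        }
        start = end + 1;
    }
    return ranges;
}
```

Compared with a per-range callback, the caller keeps its own loop state and error handling, which is presumably why the separate handleRange() helper could be dropped.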
void InitCanonIterData::handleRange(
Normalizer2Impl *impl, UChar32 start, UChar32 end, uint16_t value, UErrorCode &errorCode) {
impl->makeCanonIterDataFromNorm16(start, end, value, *impl->fCanonIterData, errorCode);
}
void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, const uint16_t norm16,
CanonIterData &newData,
UErrorCode &errorCode) const {
@ -2465,7 +2433,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
return;
}
for(UChar32 c=start; c<=end; ++c) {
uint32_t oldValue=utrie2_get32(newData.trie, c);
uint32_t oldValue = umutablecptrie_get(newData.mutableTrie, c);
uint32_t newValue=oldValue;
if(isMaybeOrNonZeroCC(norm16)) {
// not a segment starter if it occurs in a decomposition or has cc!=0
@ -2483,7 +2451,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
if (isDecompNoAlgorithmic(norm16_2)) {
// Maps to an isCompYesAndZeroCC.
c2 = mapAlgorithmic(c2, norm16_2);
norm16_2 = getNorm16(c2);
norm16_2 = getRawNorm16(c2);
// No compatibility mappings for the CanonicalIterator.
U_ASSERT(!(isHangulLV(norm16_2) || isHangulLVT(norm16_2)));
}
@ -2510,10 +2478,10 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
if(norm16_2>=minNoNo) {
while(i<length) {
U16_NEXT_UNSAFE(mapping, i, c2);
uint32_t c2Value=utrie2_get32(newData.trie, c2);
uint32_t c2Value = umutablecptrie_get(newData.mutableTrie, c2);
if((c2Value&CANON_NOT_SEGMENT_STARTER)==0) {
utrie2_set32(newData.trie, c2, c2Value|CANON_NOT_SEGMENT_STARTER,
&errorCode);
umutablecptrie_set(newData.mutableTrie, c2,
c2Value|CANON_NOT_SEGMENT_STARTER, &errorCode);
}
}
}
@ -2524,7 +2492,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
}
}
if(newValue!=oldValue) {
utrie2_set32(newData.trie, c, newValue, &errorCode);
umutablecptrie_set(newData.mutableTrie, c, newValue, &errorCode);
}
}
}
@ -2537,7 +2505,7 @@ UBool Normalizer2Impl::ensureCanonIterData(UErrorCode &errorCode) const {
}
int32_t Normalizer2Impl::getCanonValue(UChar32 c) const {
return (int32_t)utrie2_get32(fCanonIterData->trie, c);
return (int32_t)ucptrie_get(fCanonIterData->trie, c);
}
const UnicodeSet &Normalizer2Impl::getCanonStartSet(int32_t n) const {
@ -2561,7 +2529,7 @@ UBool Normalizer2Impl::getCanonStartSet(UChar32 c, UnicodeSet &set) const {
set.add(value);
}
if((canonValue&CANON_HAS_COMPOSITIONS)!=0) {
uint16_t norm16=getNorm16(c);
uint16_t norm16=getRawNorm16(c);
if(norm16==JAMO_L) {
UChar32 syllable=
(UChar32)(Hangul::HANGUL_BASE+(c-Hangul::JAMO_L_BASE)*Hangul::JAMO_VT_COUNT);
@ -2608,7 +2576,7 @@ unorm2_swap(const UDataSwapper *ds,
pInfo->dataFormat[1]==0x72 &&
pInfo->dataFormat[2]==0x6d &&
pInfo->dataFormat[3]==0x32 &&
(1<=formatVersion0 && formatVersion0<=3)
(1<=formatVersion0 && formatVersion0<=4)
)) {
udata_printError(ds, "unorm2_swap(): data format %02x.%02x.%02x.%02x (format version %02x) is not recognized as Normalizer2 data\n",
pInfo->dataFormat[0], pInfo->dataFormat[1],
@ -2669,9 +2637,9 @@ unorm2_swap(const UDataSwapper *ds,
ds->swapArray32(ds, inBytes, nextOffset-offset, outBytes, pErrorCode);
offset=nextOffset;
/* swap the UTrie2 */
/* swap the trie */
nextOffset=indexes[Normalizer2Impl::IX_EXTRA_DATA_OFFSET];
utrie2_swap(ds, inBytes+offset, nextOffset-offset, outBytes+offset, pErrorCode);
utrie_swapAnyVersion(ds, inBytes+offset, nextOffset-offset, outBytes+offset, pErrorCode);
offset=nextOffset;
/* swap the uint16_t extraData[] */


@ -24,12 +24,19 @@
#if !UCONFIG_NO_NORMALIZATION
#include "unicode/normalizer2.h"
#include "unicode/ucptrie.h"
#include "unicode/unistr.h"
#include "unicode/unorm.h"
#include "unicode/utf.h"
#include "unicode/utf16.h"
#include "mutex.h"
#include "uset_imp.h"
#include "utrie2.h"
// When the nfc.nrm data is *not* hardcoded into the common library
// (with this constant set to 0),
// then it needs to be built into the data package:
// Add nfc.nrm to icu4c/source/data/Makefile.in DAT_FILES_SHORT
#define NORM2_HARDCODE_NFC_DATA 1
U_NAMESPACE_BEGIN
@ -158,8 +165,7 @@ public:
appendBMP((UChar)c, cc, errorCode) :
appendSupplementary(c, cc, errorCode);
}
// s must be in NFD, otherwise change the implementation.
UBool append(const UChar *s, int32_t length,
UBool append(const UChar *s, int32_t length, UBool isNFD,
uint8_t leadCC, uint8_t trailCC,
UErrorCode &errorCode);
UBool appendBMP(UChar c, uint8_t cc, UErrorCode &errorCode) {
@ -243,7 +249,7 @@ public:
}
virtual ~Normalizer2Impl();
void init(const int32_t *inIndexes, const UTrie2 *inTrie,
void init(const int32_t *inIndexes, const UCPTrie *inTrie,
const uint16_t *inExtraData, const uint8_t *inSmallFCD);
void addLcccChars(UnicodeSet &set) const;
@ -254,7 +260,12 @@ public:
UBool ensureCanonIterData(UErrorCode &errorCode) const;
uint16_t getNorm16(UChar32 c) const { return UTRIE2_GET16(normTrie, c); }
// The trie stores values for lead surrogate code *units*.
// Surrogate code *points* are inert.
uint16_t getNorm16(UChar32 c) const {
return U_IS_LEAD(c) ? INERT : UCPTRIE_FAST_GET(normTrie, UCPTRIE_16, c);
}
uint16_t getRawNorm16(UChar32 c) const { return UCPTRIE_FAST_GET(normTrie, UCPTRIE_16, c); }
UNormalizationCheckResult getCompQuickCheck(uint16_t norm16) const {
if(norm16<minNoNo || MIN_YES_YES_WITH_CC<=norm16) {
@ -704,7 +715,7 @@ private:
uint16_t centerNoNoDelta;
uint16_t minMaybeYes;
const UTrie2 *normTrie;
const UCPTrie *normTrie;
const uint16_t *maybeYesCompositions;
const uint16_t *extraData; // mappings and/or compositions for yesYes, yesNo & noNo characters
const uint8_t *smallFCD; // [0x100] one bit per 32 BMP code points, set if any FCD!=0
@ -764,7 +775,7 @@ unorm_getFCD16(UChar32 c);
/**
* Format of Normalizer2 .nrm data files.
* Format version 3.0.
* Format version 4.0.
*
* Normalizer2 .nrm data files provide data for the Unicode Normalization algorithms.
* ICU ships with data files for standard Unicode Normalization Forms
@ -818,7 +829,7 @@ unorm_getFCD16(UChar32 c);
* minMaybeYes=indexes[IX_MIN_MAYBE_YES];
* See the normTrie description below and the design doc for details.
*
* UTrie2 normTrie; -- see utrie2_impl.h and utrie2.h
* UCPTrie normTrie; -- see ucptrie_impl.h and ucptrie.h, same as Java CodePointTrie
*
* The trie holds the main normalization data. Each code point is mapped to a 16-bit value.
* Rather than using independent bits in the value (which would require more than 16 bits),
@ -946,6 +957,20 @@ unorm_getFCD16(UChar32 c);
* which is artificially assigned "worst case" values lccc=1 and tccc=255.
*
* - A mapping to an empty string has explicit lccc=1 and tccc=255 values.
*
* Changes from format version 3 to format version 4 (ICU 63) ------------------
*
* Switched from UTrie2 to UCPTrie/CodePointTrie.
*
* The new trie no longer stores different values for surrogate code *units* vs.
* surrogate code *points*.
* Lead surrogates still have values for optimized UTF-16 string processing.
* When looking up code point properties, the code now checks for lead surrogates and
* treats them as inert.
*
* gennorm2 now has to reject mappings for surrogate code points.
* UTS #46 maps unpaired surrogates to U+FFFD in code rather than via its
* custom normalization data file.
*/
#endif /* !UCONFIG_NO_NORMALIZATION */
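
The format version 4 notes above say that the trie keeps values for lead surrogate code units, while code point lookups treat lead surrogates as inert. The following is a minimal sketch of that split between a raw lookup and a code point lookup. It does not use ICU: ToyTrie, kInert, getRawValue() and getValue() are hypothetical names, and kInert = 1 is an assumed placeholder rather than a value taken from this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Assumed placeholder for the "inert" value (hypothetical, for illustration only).
constexpr uint16_t kInert = 1;

// Toy stand-in for the trie: a sparse map from a code point, or a lead
// surrogate code unit, to a 16-bit value. Unset entries read as kInert.
struct ToyTrie {
    std::unordered_map<int32_t, uint16_t> values;

    uint16_t rawGet(int32_t c) const {
        assert(0 <= c && c <= 0x10ffff);
        auto it = values.find(c);
        return it == values.end() ? kInert : it->second;
    }
};

// Same bit test as a lead-surrogate check: U+D800..U+DBFF.
inline bool isLeadSurrogate(int32_t c) {
    return (static_cast<uint32_t>(c) & 0xfffffc00u) == 0xd800u;
}

// Raw lookup: returns whatever is stored, including the value kept for a
// lead surrogate code unit to speed up UTF-16 string processing.
uint16_t getRawValue(const ToyTrie &trie, int32_t c) {
    return trie.rawGet(c);
}

// Code point lookup: a lead surrogate code point is always reported as inert.
uint16_t getValue(const ToyTrie &trie, int32_t c) {
    return isLeadSurrogate(c) ? kInert : trie.rawGet(c);
}
```

In this sketch the raw form is only appropriate when the caller already knows that c is not an unpaired lead surrogate, for example when c is the result of a mapping or a composition.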


@ -41,6 +41,7 @@
#include "propsvec.h"
#include "uassert.h"
#include "ucmndata.h"
#include "udataswp.h"
#include "uenumimp.h"
#include "cmemory.h"
#include "cstring.h"


@ -28,81 +28,6 @@
/* swapping ----------------------------------------------------------------- */
/*
* This performs data swapping for a folded trie (see utrie.c for details).
*/
U_CAPI int32_t U_EXPORT2
utrie_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
const UTrieHeader *inTrie;
UTrieHeader trie;
int32_t size;
UBool dataIs32;
if(pErrorCode==NULL || U_FAILURE(*pErrorCode)) {
return 0;
}
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
return 0;
}
/* setup and swapping */
if(length>=0 && (uint32_t)length<sizeof(UTrieHeader)) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
inTrie=(const UTrieHeader *)inData;
trie.signature=ds->readUInt32(inTrie->signature);
trie.options=ds->readUInt32(inTrie->options);
trie.indexLength=udata_readInt32(ds, inTrie->indexLength);
trie.dataLength=udata_readInt32(ds, inTrie->dataLength);
if( trie.signature!=0x54726965 ||
(trie.options&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_SHIFT ||
((trie.options>>UTRIE_OPTIONS_INDEX_SHIFT)&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_INDEX_SHIFT ||
trie.indexLength<UTRIE_BMP_INDEX_LENGTH ||
(trie.indexLength&(UTRIE_SURROGATE_BLOCK_COUNT-1))!=0 ||
trie.dataLength<UTRIE_DATA_BLOCK_LENGTH ||
(trie.dataLength&(UTRIE_DATA_GRANULARITY-1))!=0 ||
((trie.options&UTRIE_OPTIONS_LATIN1_IS_LINEAR)!=0 && trie.dataLength<(UTRIE_DATA_BLOCK_LENGTH+0x100))
) {
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
return 0;
}
dataIs32=(UBool)((trie.options&UTRIE_OPTIONS_DATA_IS_32_BIT)!=0);
size=sizeof(UTrieHeader)+trie.indexLength*2+trie.dataLength*(dataIs32?4:2);
if(length>=0) {
UTrieHeader *outTrie;
if(length<size) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
outTrie=(UTrieHeader *)outData;
/* swap the header */
ds->swapArray32(ds, inTrie, sizeof(UTrieHeader), outTrie, pErrorCode);
/* swap the index and the data */
if(dataIs32) {
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, trie.dataLength*4,
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
} else {
ds->swapArray16(ds, inTrie+1, (trie.indexLength+trie.dataLength)*2, outTrie+1, pErrorCode);
}
}
return size;
}
#if !UCONFIG_NO_COLLATION
U_CAPI UBool U_EXPORT2


@ -0,0 +1,573 @@
// © 2017 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// ucptrie.cpp (modified from utrie2.cpp)
// created: 2017dec29 Markus W. Scherer
// #define UCPTRIE_DEBUG
#ifdef UCPTRIE_DEBUG
# include <stdio.h>
#endif
#include "unicode/utypes.h"
#include "unicode/ucptrie.h"
#include "unicode/utf.h"
#include "unicode/utf8.h"
#include "unicode/utf16.h"
#include "cmemory.h"
#include "uassert.h"
#include "ucptrie_impl.h"
U_CAPI UCPTrie * U_EXPORT2
ucptrie_openFromBinary(UCPTrieType type, UCPTrieValueWidth valueWidth,
const void *data, int32_t length, int32_t *pActualLength,
UErrorCode *pErrorCode) {
if (U_FAILURE(*pErrorCode)) {
return nullptr;
}
if (length <= 0 || (U_POINTER_MASK_LSB(data, 3) != 0) ||
type < UCPTRIE_TYPE_ANY || UCPTRIE_TYPE_SMALL < type ||
valueWidth < UCPTRIE_VALUE_BITS_ANY || UCPTRIE_VALUE_BITS_8 < valueWidth) {
*pErrorCode = U_ILLEGAL_ARGUMENT_ERROR;
return nullptr;
}
// Enough data for a trie header?
if (length < (int32_t)sizeof(UCPTrieHeader)) {
*pErrorCode = U_INVALID_FORMAT_ERROR;
return nullptr;
}
// Check the signature.
const UCPTrieHeader *header = (const UCPTrieHeader *)data;
if (header->signature != UCPTRIE_SIG) {
*pErrorCode = U_INVALID_FORMAT_ERROR;
return nullptr;
}
int32_t options = header->options;
int32_t typeInt = (options >> 6) & 3;
int32_t valueWidthInt = options & UCPTRIE_OPTIONS_VALUE_BITS_MASK;
if (typeInt > UCPTRIE_TYPE_SMALL || valueWidthInt > UCPTRIE_VALUE_BITS_8 ||
(options & UCPTRIE_OPTIONS_RESERVED_MASK) != 0) {
*pErrorCode = U_INVALID_FORMAT_ERROR;
return nullptr;
}
UCPTrieType actualType = (UCPTrieType)typeInt;
UCPTrieValueWidth actualValueWidth = (UCPTrieValueWidth)valueWidthInt;
if (type < 0) {
type = actualType;
}
if (valueWidth < 0) {
valueWidth = actualValueWidth;
}
if (type != actualType || valueWidth != actualValueWidth) {
*pErrorCode = U_INVALID_FORMAT_ERROR;
return nullptr;
}
// Get the length values and offsets.
UCPTrie tempTrie;
uprv_memset(&tempTrie, 0, sizeof(tempTrie));
tempTrie.indexLength = header->indexLength;
tempTrie.dataLength =
((options & UCPTRIE_OPTIONS_DATA_LENGTH_MASK) << 4) | header->dataLength;
tempTrie.index3NullOffset = header->index3NullOffset;
tempTrie.dataNullOffset =
((options & UCPTRIE_OPTIONS_DATA_NULL_OFFSET_MASK) << 8) | header->dataNullOffset;
tempTrie.highStart = header->shiftedHighStart << UCPTRIE_SHIFT_2;
tempTrie.shifted12HighStart = (tempTrie.highStart + 0xfff) >> 12;
tempTrie.type = type;
tempTrie.valueWidth = valueWidth;
// Calculate the actual length.
int32_t actualLength = (int32_t)sizeof(UCPTrieHeader) + tempTrie.indexLength * 2;
if (valueWidth == UCPTRIE_VALUE_BITS_16) {
actualLength += tempTrie.dataLength * 2;
} else if (valueWidth == UCPTRIE_VALUE_BITS_32) {
actualLength += tempTrie.dataLength * 4;
} else {
actualLength += tempTrie.dataLength;
}
if (length < actualLength) {
*pErrorCode = U_INVALID_FORMAT_ERROR; // Not enough bytes.
return nullptr;
}
// Allocate the trie.
UCPTrie *trie = (UCPTrie *)uprv_malloc(sizeof(UCPTrie));
if (trie == nullptr) {
*pErrorCode = U_MEMORY_ALLOCATION_ERROR;
return nullptr;
}
uprv_memcpy(trie, &tempTrie, sizeof(tempTrie));
#ifdef UCPTRIE_DEBUG
trie->name = "fromSerialized";
#endif
// Set the pointers to its index and data arrays.
const uint16_t *p16 = (const uint16_t *)(header + 1);
trie->index = p16;
p16 += trie->indexLength;
// Get the data.
int32_t nullValueOffset = trie->dataNullOffset;
if (nullValueOffset >= trie->dataLength) {
nullValueOffset = trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
}
switch (valueWidth) {
case UCPTRIE_VALUE_BITS_16:
trie->data.ptr16 = p16;
trie->nullValue = trie->data.ptr16[nullValueOffset];
break;
case UCPTRIE_VALUE_BITS_32:
trie->data.ptr32 = (const uint32_t *)p16;
trie->nullValue = trie->data.ptr32[nullValueOffset];
break;
case UCPTRIE_VALUE_BITS_8:
trie->data.ptr8 = (const uint8_t *)p16;
trie->nullValue = trie->data.ptr8[nullValueOffset];
break;
default:
// Unreachable because valueWidth was checked above.
*pErrorCode = U_INVALID_FORMAT_ERROR;
return nullptr;
}
if (pActualLength != nullptr) {
*pActualLength = actualLength;
}
return trie;
}
U_CAPI void U_EXPORT2
ucptrie_close(UCPTrie *trie) {
uprv_free(trie);
}
U_CAPI UCPTrieType U_EXPORT2
ucptrie_getType(const UCPTrie *trie) {
return (UCPTrieType)trie->type;
}
U_CAPI UCPTrieValueWidth U_EXPORT2
ucptrie_getValueWidth(const UCPTrie *trie) {
return (UCPTrieValueWidth)trie->valueWidth;
}
U_CAPI int32_t U_EXPORT2
ucptrie_internalSmallIndex(const UCPTrie *trie, UChar32 c) {
int32_t i1 = c >> UCPTRIE_SHIFT_1;
if (trie->type == UCPTRIE_TYPE_FAST) {
U_ASSERT(0xffff < c && c < trie->highStart);
i1 += UCPTRIE_BMP_INDEX_LENGTH - UCPTRIE_OMITTED_BMP_INDEX_1_LENGTH;
} else {
U_ASSERT((uint32_t)c < (uint32_t)trie->highStart && trie->highStart > UCPTRIE_SMALL_LIMIT);
i1 += UCPTRIE_SMALL_INDEX_LENGTH;
}
int32_t i3Block = trie->index[
(int32_t)trie->index[i1] + ((c >> UCPTRIE_SHIFT_2) & UCPTRIE_INDEX_2_MASK)];
int32_t i3 = (c >> UCPTRIE_SHIFT_3) & UCPTRIE_INDEX_3_MASK;
int32_t dataBlock;
if ((i3Block & 0x8000) == 0) {
// 16-bit indexes
dataBlock = trie->index[i3Block + i3];
} else {
// 18-bit indexes stored in groups of 9 entries per 8 indexes.
i3Block = (i3Block & 0x7fff) + (i3 & ~7) + (i3 >> 3);
i3 &= 7;
dataBlock = ((int32_t)trie->index[i3Block++] << (2 + (2 * i3))) & 0x30000;
dataBlock |= trie->index[i3Block + i3];
}
return dataBlock + (c & UCPTRIE_SMALL_DATA_MASK);
}
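
The "18-bit indexes stored in groups of 9 entries per 8 indexes" branch above can be sketched in isolation. In this sketch unpackFromGroup() and groupStart() mirror the shifts and masks used in ucptrie_internalSmallIndex(), while packGroup() is a hypothetical encoder written only so the round trip can be checked; it is not the builder code from this change.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Offset of the group that holds index-3 entry i3: every 8 entries take 9 slots.
inline int32_t groupStart(int32_t i3) {
    return (i3 & ~7) + (i3 >> 3);
}

// Hypothetical encoder: packs eight 18-bit values into nine 16-bit entries.
// Entry 0 collects the top two bits of each value; entries 1..8 hold the low 16 bits.
std::array<uint16_t, 9> packGroup(const std::array<int32_t, 8> &values) {
    std::array<uint16_t, 9> group{};
    for (int k = 0; k < 8; ++k) {
        assert(0 <= values[k] && values[k] <= 0x3ffff);
        uint32_t v = static_cast<uint32_t>(values[k]);
        // Bits 17..16 of value k land in bits (15-2k)..(14-2k) of the head entry.
        group[0] = static_cast<uint16_t>(group[0] | ((v & 0x30000u) >> (2 + 2 * k)));
        group[k + 1] = static_cast<uint16_t>(v & 0xffffu);
    }
    return group;
}

// Decoder with the same shift and mask as the lookup code in the diff.
int32_t unpackFromGroup(const std::array<uint16_t, 9> &group, int k) {
    assert(0 <= k && k < 8);
    uint32_t value = (static_cast<uint32_t>(group[0]) << (2 + 2 * k)) & 0x30000u;
    value |= group[k + 1];
    return static_cast<int32_t>(value);
}
```

Storing the high bits together costs one extra 16-bit entry per eight indexes, rather than widening every index entry to 32 bits.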
U_CAPI int32_t U_EXPORT2
ucptrie_internalSmallU8Index(const UCPTrie *trie, int32_t lt1, uint8_t t2, uint8_t t3) {
UChar32 c = (lt1 << 12) | (t2 << 6) | t3;
if (c >= trie->highStart) {
// Possible because the UTF-8 macro compares with shifted12HighStart which may be higher.
return trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
}
return ucptrie_internalSmallIndex(trie, c);
}
U_CAPI int32_t U_EXPORT2
ucptrie_internalU8PrevIndex(const UCPTrie *trie, UChar32 c,
const uint8_t *start, const uint8_t *src) {
int32_t i, length;
// Support 64-bit pointers by avoiding cast of arbitrary difference.
if ((src - start) <= 7) {
i = length = (int32_t)(src - start);
} else {
i = length = 7;
start = src - 7;
}
c = utf8_prevCharSafeBody(start, 0, &i, c, -1);
i = length - i; // Number of bytes read backward from src.
int32_t idx = _UCPTRIE_CP_INDEX(trie, 0xffff, c);
return (idx << 3) | i;
}
namespace {
inline uint32_t getValue(UCPTrieData data, UCPTrieValueWidth valueWidth, int32_t dataIndex) {
switch (valueWidth) {
case UCPTRIE_VALUE_BITS_16:
return data.ptr16[dataIndex];
case UCPTRIE_VALUE_BITS_32:
return data.ptr32[dataIndex];
case UCPTRIE_VALUE_BITS_8:
return data.ptr8[dataIndex];
default:
// Unreachable if the trie is properly initialized.
return 0xffffffff;
}
}
} // namespace
U_CAPI uint32_t U_EXPORT2
ucptrie_get(const UCPTrie *trie, UChar32 c) {
int32_t dataIndex;
if ((uint32_t)c <= 0x7f) {
// linear ASCII
dataIndex = c;
} else {
UChar32 fastMax = trie->type == UCPTRIE_TYPE_FAST ? 0xffff : UCPTRIE_SMALL_MAX;
dataIndex = _UCPTRIE_CP_INDEX(trie, fastMax, c);
}
return getValue(trie->data, (UCPTrieValueWidth)trie->valueWidth, dataIndex);
}
namespace {
constexpr int32_t MAX_UNICODE = 0x10ffff;
inline uint32_t maybeFilterValue(uint32_t value, uint32_t trieNullValue, uint32_t nullValue,
UCPTrieValueFilter *filter, const void *context) {
if (value == trieNullValue) {
value = nullValue;
} else if (filter != nullptr) {
value = filter(context, value);
}
return value;
}
UChar32 getRange(const void *t, UChar32 start,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue) {
if ((uint32_t)start > MAX_UNICODE) {
return U_SENTINEL;
}
const UCPTrie *trie = reinterpret_cast<const UCPTrie *>(t);
UCPTrieValueWidth valueWidth = (UCPTrieValueWidth)trie->valueWidth;
if (start >= trie->highStart) {
if (pValue != nullptr) {
int32_t di = trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
uint32_t value = getValue(trie->data, valueWidth, di);
if (filter != nullptr) { value = filter(context, value); }
*pValue = value;
}
return MAX_UNICODE;
}
uint32_t nullValue = trie->nullValue;
if (filter != nullptr) { nullValue = filter(context, nullValue); }
const uint16_t *index = trie->index;
int32_t prevI3Block = -1;
int32_t prevBlock = -1;
UChar32 c = start;
uint32_t value;
bool haveValue = false;
do {
int32_t i3Block;
int32_t i3;
int32_t i3BlockLength;
int32_t dataBlockLength;
if (c <= 0xffff && (trie->type == UCPTRIE_TYPE_FAST || c <= UCPTRIE_SMALL_MAX)) {
i3Block = 0;
i3 = c >> UCPTRIE_FAST_SHIFT;
i3BlockLength = trie->type == UCPTRIE_TYPE_FAST ?
UCPTRIE_BMP_INDEX_LENGTH : UCPTRIE_SMALL_INDEX_LENGTH;
dataBlockLength = UCPTRIE_FAST_DATA_BLOCK_LENGTH;
} else {
// Use the multi-stage index.
int32_t i1 = c >> UCPTRIE_SHIFT_1;
if (trie->type == UCPTRIE_TYPE_FAST) {
U_ASSERT(0xffff < c && c < trie->highStart);
i1 += UCPTRIE_BMP_INDEX_LENGTH - UCPTRIE_OMITTED_BMP_INDEX_1_LENGTH;
} else {
U_ASSERT(c < trie->highStart && trie->highStart > UCPTRIE_SMALL_LIMIT);
i1 += UCPTRIE_SMALL_INDEX_LENGTH;
}
i3Block = trie->index[
(int32_t)trie->index[i1] + ((c >> UCPTRIE_SHIFT_2) & UCPTRIE_INDEX_2_MASK)];
if (i3Block == prevI3Block && (c - start) >= UCPTRIE_CP_PER_INDEX_2_ENTRY) {
// The index-3 block is the same as the previous one, and filled with value.
U_ASSERT((c & (UCPTRIE_CP_PER_INDEX_2_ENTRY - 1)) == 0);
c += UCPTRIE_CP_PER_INDEX_2_ENTRY;
continue;
}
prevI3Block = i3Block;
if (i3Block == trie->index3NullOffset) {
// This is the index-3 null block.
if (haveValue) {
if (nullValue != value) {
return c - 1;
}
} else {
value = nullValue;
if (pValue != nullptr) { *pValue = nullValue; }
haveValue = true;
}
prevBlock = trie->dataNullOffset;
c = (c + UCPTRIE_CP_PER_INDEX_2_ENTRY) & ~(UCPTRIE_CP_PER_INDEX_2_ENTRY - 1);
continue;
}
i3 = (c >> UCPTRIE_SHIFT_3) & UCPTRIE_INDEX_3_MASK;
i3BlockLength = UCPTRIE_INDEX_3_BLOCK_LENGTH;
dataBlockLength = UCPTRIE_SMALL_DATA_BLOCK_LENGTH;
}
// Enumerate data blocks for one index-3 block.
do {
int32_t block;
if ((i3Block & 0x8000) == 0) {
block = index[i3Block + i3];
} else {
// 18-bit indexes stored in groups of 9 entries per 8 indexes.
int32_t group = (i3Block & 0x7fff) + (i3 & ~7) + (i3 >> 3);
int32_t gi = i3 & 7;
block = ((int32_t)index[group++] << (2 + (2 * gi))) & 0x30000;
block |= index[group + gi];
}
if (block == prevBlock && (c - start) >= dataBlockLength) {
// The block is the same as the previous one, and filled with value.
U_ASSERT((c & (dataBlockLength - 1)) == 0);
c += dataBlockLength;
} else {
int32_t dataMask = dataBlockLength - 1;
prevBlock = block;
if (block == trie->dataNullOffset) {
// This is the data null block.
if (haveValue) {
if (nullValue != value) {
return c - 1;
}
} else {
value = nullValue;
if (pValue != nullptr) { *pValue = nullValue; }
haveValue = true;
}
c = (c + dataBlockLength) & ~dataMask;
} else {
int32_t di = block + (c & dataMask);
uint32_t value2 = getValue(trie->data, valueWidth, di);
value2 = maybeFilterValue(value2, trie->nullValue, nullValue,
filter, context);
if (haveValue) {
if (value2 != value) {
return c - 1;
}
} else {
value = value2;
if (pValue != nullptr) { *pValue = value; }
haveValue = true;
}
while ((++c & dataMask) != 0) {
if (maybeFilterValue(getValue(trie->data, valueWidth, ++di),
trie->nullValue, nullValue,
filter, context) != value) {
return c - 1;
}
}
}
}
} while (++i3 < i3BlockLength);
} while (c < trie->highStart);
U_ASSERT(haveValue);
int32_t di = trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
uint32_t highValue = getValue(trie->data, valueWidth, di);
if (maybeFilterValue(highValue, trie->nullValue, nullValue,
filter, context) != value) {
return c - 1;
} else {
return MAX_UNICODE;
}
}
} // namespace
U_CFUNC UChar32
ucptrie_internalGetRange(UCPTrieGetRange *getRange,
const void *trie, UChar32 start,
UCPTrieRangeOption option, uint32_t surrogateValue,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue) {
if (option == UCPTRIE_RANGE_NORMAL) {
return getRange(trie, start, filter, context, pValue);
}
uint32_t value;
if (pValue == nullptr) {
// We need to examine the range value even if the caller does not want it.
pValue = &value;
}
UChar32 surrEnd = option == UCPTRIE_RANGE_FIXED_ALL_SURROGATES ? 0xdfff : 0xdbff;
UChar32 end = getRange(trie, start, filter, context, pValue);
if (end < 0xd7ff || start > surrEnd) {
return end;
}
// The range overlaps with surrogates, or ends just before the first one.
if (*pValue == surrogateValue) {
if (end >= surrEnd) {
// Surrogates followed by a non-surrogateValue range,
// or surrogates are part of a larger surrogateValue range.
return end;
}
} else {
if (start <= 0xd7ff) {
return 0xd7ff; // Non-surrogateValue range ends before surrogateValue surrogates.
}
// Start is a surrogate with a non-surrogateValue code *unit* value.
// Return a surrogateValue code *point* range.
*pValue = surrogateValue;
if (end > surrEnd) {
return surrEnd; // Surrogate range ends before non-surrogateValue rest of range.
}
}
// See if the surrogateValue surrogate range can be merged with
// an immediately following range.
uint32_t value2;
UChar32 end2 = getRange(trie, surrEnd + 1, filter, context, &value2);
if (value2 == surrogateValue) {
return end2;
}
return surrEnd;
}
U_CAPI UChar32 U_EXPORT2
ucptrie_getRange(const UCPTrie *trie, UChar32 start,
UCPTrieRangeOption option, uint32_t surrogateValue,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue) {
return ucptrie_internalGetRange(getRange, trie, start,
option, surrogateValue,
filter, context, pValue);
}
U_CAPI int32_t U_EXPORT2
ucptrie_toBinary(const UCPTrie *trie,
void *data, int32_t capacity,
UErrorCode *pErrorCode) {
if (U_FAILURE(*pErrorCode)) {
return 0;
}
UCPTrieType type = (UCPTrieType)trie->type;
UCPTrieValueWidth valueWidth = (UCPTrieValueWidth)trie->valueWidth;
if (type < UCPTRIE_TYPE_FAST || UCPTRIE_TYPE_SMALL < type ||
valueWidth < UCPTRIE_VALUE_BITS_16 || UCPTRIE_VALUE_BITS_8 < valueWidth ||
capacity < 0 ||
(capacity > 0 && (data == nullptr || (U_POINTER_MASK_LSB(data, 3) != 0)))) {
*pErrorCode = U_ILLEGAL_ARGUMENT_ERROR;
return 0;
}
int32_t length = (int32_t)sizeof(UCPTrieHeader) + trie->indexLength * 2;
switch (valueWidth) {
case UCPTRIE_VALUE_BITS_16:
length += trie->dataLength * 2;
break;
case UCPTRIE_VALUE_BITS_32:
length += trie->dataLength * 4;
break;
case UCPTRIE_VALUE_BITS_8:
length += trie->dataLength;
break;
default:
// unreachable
break;
}
if (capacity < length) {
*pErrorCode = U_BUFFER_OVERFLOW_ERROR;
return length;
}
char *bytes = (char *)data;
UCPTrieHeader *header = (UCPTrieHeader *)bytes;
header->signature = UCPTRIE_SIG; // "Tri3"
header->options = (uint16_t)(
((trie->dataLength & 0xf0000) >> 4) |
((trie->dataNullOffset & 0xf0000) >> 8) |
(trie->type << 6) |
valueWidth);
header->indexLength = (uint16_t)trie->indexLength;
header->dataLength = (uint16_t)trie->dataLength;
header->index3NullOffset = trie->index3NullOffset;
header->dataNullOffset = (uint16_t)trie->dataNullOffset;
header->shiftedHighStart = trie->highStart >> UCPTRIE_SHIFT_2;
bytes += sizeof(UCPTrieHeader);
uprv_memcpy(bytes, trie->index, trie->indexLength * 2);
bytes += trie->indexLength * 2;
switch (valueWidth) {
case UCPTRIE_VALUE_BITS_16:
uprv_memcpy(bytes, trie->data.ptr16, trie->dataLength * 2);
break;
case UCPTRIE_VALUE_BITS_32:
uprv_memcpy(bytes, trie->data.ptr32, trie->dataLength * 4);
break;
case UCPTRIE_VALUE_BITS_8:
uprv_memcpy(bytes, trie->data.ptr8, trie->dataLength);
break;
default:
// unreachable
break;
}
return length;
}
namespace {
#ifdef UCPTRIE_DEBUG
long countNull(const UCPTrie *trie) {
uint32_t nullValue=trie->nullValue;
int32_t length=trie->dataLength;
long count=0;
switch (trie->valueWidth) {
case UCPTRIE_VALUE_BITS_16:
for(int32_t i=0; i<length; ++i) {
if(trie->data.ptr16[i]==nullValue) { ++count; }
}
break;
case UCPTRIE_VALUE_BITS_32:
for(int32_t i=0; i<length; ++i) {
if(trie->data.ptr32[i]==nullValue) { ++count; }
}
break;
case UCPTRIE_VALUE_BITS_8:
for(int32_t i=0; i<length; ++i) {
if(trie->data.ptr8[i]==nullValue) { ++count; }
}
break;
default:
// unreachable
break;
}
return count;
}
U_CFUNC void
ucptrie_printLengths(const UCPTrie *trie, const char *which) {
long indexLength=trie->indexLength;
long dataLength=(long)trie->dataLength;
long totalLength=(long)sizeof(UCPTrieHeader)+indexLength*2+
dataLength*(trie->valueWidth==UCPTRIE_VALUE_BITS_16 ? 2 :
trie->valueWidth==UCPTRIE_VALUE_BITS_32 ? 4 : 1);
printf("**UCPTrieLengths(%s %s)** index:%6ld data:%6ld countNull:%6ld serialized:%6ld\n",
which, trie->name, indexLength, dataLength, countNull(trie), totalLength);
}
#endif
} // namespace
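
Note on the header packing in ucptrie_toBinary() above: dataLength and dataNullOffset are 20-bit quantities whose low 16 bits get their own header fields, while bits 19..16 are folded into header.options next to the type and value width. The following is a minimal, ICU-independent sketch of that packing and of the inverse a reader would plausibly apply. The struct and function names (PackedFields, UnpackedFields, packFields, unpackFields) are hypothetical, and the unpacking is inferred from the documented bit layout rather than copied from ucptrie_openFromBinary().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the three UCPTrieHeader fields involved in the packing.
struct PackedFields {
    uint16_t options;         // bits 15..12, 11..8, 7..6, 2..0 as documented for the header
    uint16_t dataLength;      // data length bits 15..0
    uint16_t dataNullOffset;  // data null block offset bits 15..0
};

// Hypothetical unpacked view of the same information.
struct UnpackedFields {
    int32_t dataLength;
    int32_t dataNullOffset;
    int32_t type;
    int32_t valueWidth;
};

// Mirrors the expression used in ucptrie_toBinary().
PackedFields packFields(int32_t dataLength, int32_t dataNullOffset,
                        int32_t type, int32_t valueWidth) {
    assert(0 <= dataLength && dataLength <= 0xfffff);
    assert(0 <= dataNullOffset && dataNullOffset <= 0xfffff);
    PackedFields p;
    p.options = static_cast<uint16_t>(
        ((dataLength & 0xf0000) >> 4) |
        ((dataNullOffset & 0xf0000) >> 8) |
        (type << 6) |
        valueWidth);
    p.dataLength = static_cast<uint16_t>(dataLength);
    p.dataNullOffset = static_cast<uint16_t>(dataNullOffset);
    return p;
}

// Assumed inverse, using the mask values 0xf000, 0xf00 and 7 from ucptrie_impl.h.
UnpackedFields unpackFields(const PackedFields &p) {
    UnpackedFields u;
    u.dataLength = ((p.options & 0xf000) << 4) | p.dataLength;
    u.dataNullOffset = ((p.options & 0xf00) << 8) | p.dataNullOffset;
    u.type = (p.options >> 6) & 3;
    u.valueWidth = p.options & 7;
    return u;
}
```

A round trip through packFields() and unpackFields() should return the original four values, including the "no data null block" offset 0xfffff.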

@ -0,0 +1,284 @@
// © 2017 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// ucptrie_impl.h (modified from utrie2_impl.h)
// created: 2017dec29 Markus W. Scherer
#ifndef __UCPTRIE_IMPL_H__
#define __UCPTRIE_IMPL_H__
#include "unicode/ucptrie.h"
#ifdef UCPTRIE_DEBUG
#include "unicode/umutablecptrie.h"
#endif
// UCPTrie signature values, in platform endianness and opposite endianness.
// The UCPTrie signature ASCII byte values spell "Tri3".
#define UCPTRIE_SIG 0x54726933
#define UCPTRIE_OE_SIG 0x33697254
/**
* Header data for the binary, memory-mappable representation of a UCPTrie/CodePointTrie.
* @internal
*/
struct UCPTrieHeader {
/** "Tri3" in big-endian US-ASCII (0x54726933) */
uint32_t signature;
/**
* Options bit field:
* Bits 15..12: Data length bits 19..16.
* Bits 11..8: Data null block offset bits 19..16.
* Bits 7..6: UCPTrieType
* Bits 5..3: Reserved (0).
* Bits 2..0: UCPTrieValueWidth
*/
uint16_t options;
/** Total length of the index tables. */
uint16_t indexLength;
/** Data length bits 15..0. */
uint16_t dataLength;
/** Index-3 null block offset, 0x7fff or 0xffff if none. */
uint16_t index3NullOffset;
/** Data null block offset bits 15..0; the full 20-bit offset is 0xfffff if none. */
uint16_t dataNullOffset;
/**
* First code point of the single-value range ending with U+10ffff,
* rounded up and then shifted right by UCPTRIE_SHIFT_2.
*/
uint16_t shiftedHighStart;
};
/**
* Constants for use with UCPTrieHeader.options.
* @internal
*/
enum {
UCPTRIE_OPTIONS_DATA_LENGTH_MASK = 0xf000,
UCPTRIE_OPTIONS_DATA_NULL_OFFSET_MASK = 0xf00,
UCPTRIE_OPTIONS_RESERVED_MASK = 0x38,
UCPTRIE_OPTIONS_VALUE_BITS_MASK = 7,
/**
* Value for index3NullOffset which indicates that there is no index-3 null block.
* Bit 15 is unused for this value because this bit is used if the index-3 contains
* 18-bit indexes.
*/
UCPTRIE_NO_INDEX3_NULL_OFFSET = 0x7fff,
UCPTRIE_NO_DATA_NULL_OFFSET = 0xfffff
};
// Internal constants.
enum {
/** The length of the BMP index table. 1024=0x400 */
UCPTRIE_BMP_INDEX_LENGTH = 0x10000 >> UCPTRIE_FAST_SHIFT,
UCPTRIE_SMALL_LIMIT = 0x1000,
UCPTRIE_SMALL_INDEX_LENGTH = UCPTRIE_SMALL_LIMIT >> UCPTRIE_FAST_SHIFT,
/** Shift size for getting the index-3 table offset. */
UCPTRIE_SHIFT_3 = 4,
/** Shift size for getting the index-2 table offset. */
UCPTRIE_SHIFT_2 = 5 + UCPTRIE_SHIFT_3,
/** Shift size for getting the index-1 table offset. */
UCPTRIE_SHIFT_1 = 5 + UCPTRIE_SHIFT_2,
/**
* Difference between two shift sizes,
* for getting an index-2 offset from an index-3 offset. 5=9-4
*/
UCPTRIE_SHIFT_2_3 = UCPTRIE_SHIFT_2 - UCPTRIE_SHIFT_3,
/**
* Difference between two shift sizes,
* for getting an index-1 offset from an index-2 offset. 5=14-9
*/
UCPTRIE_SHIFT_1_2 = UCPTRIE_SHIFT_1 - UCPTRIE_SHIFT_2,
/**
* Number of index-1 entries for the BMP. (4)
* This part of the index-1 table is omitted from the serialized form.
*/
UCPTRIE_OMITTED_BMP_INDEX_1_LENGTH = 0x10000 >> UCPTRIE_SHIFT_1,
/** Number of entries in an index-2 block. 32=0x20 */
UCPTRIE_INDEX_2_BLOCK_LENGTH = 1 << UCPTRIE_SHIFT_1_2,
/** Mask for getting the lower bits for the in-index-2-block offset. */
UCPTRIE_INDEX_2_MASK = UCPTRIE_INDEX_2_BLOCK_LENGTH - 1,
/** Number of code points per index-2 table entry. 512=0x200 */
UCPTRIE_CP_PER_INDEX_2_ENTRY = 1 << UCPTRIE_SHIFT_2,
/** Number of entries in an index-3 block. 32=0x20 */
UCPTRIE_INDEX_3_BLOCK_LENGTH = 1 << UCPTRIE_SHIFT_2_3,
/** Mask for getting the lower bits for the in-index-3-block offset. */
UCPTRIE_INDEX_3_MASK = UCPTRIE_INDEX_3_BLOCK_LENGTH - 1,
/** Number of entries in a small data block. 16=0x10 */
UCPTRIE_SMALL_DATA_BLOCK_LENGTH = 1 << UCPTRIE_SHIFT_3,
/** Mask for getting the lower bits for the in-small-data-block offset. */
UCPTRIE_SMALL_DATA_MASK = UCPTRIE_SMALL_DATA_BLOCK_LENGTH - 1
};
typedef UChar32
UCPTrieGetRange(const void *trie, UChar32 start,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
U_CFUNC UChar32
ucptrie_internalGetRange(UCPTrieGetRange *getRange,
const void *trie, UChar32 start,
UCPTrieRangeOption option, uint32_t surrogateValue,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
#ifdef UCPTRIE_DEBUG
U_CFUNC void
ucptrie_printLengths(const UCPTrie *trie, const char *which);
U_CFUNC void umutablecptrie_setName(UMutableCPTrie *builder, const char *name);
#endif
/*
* Format of the binary, memory-mappable representation of a UCPTrie/CodePointTrie.
* For overview information see http://site.icu-project.org/design/struct/utrie
*
* The binary trie data should be 32-bit-aligned.
* The overall layout is:
*
* UCPTrieHeader header; -- 16 bytes, see struct definition above
* uint16_t index[header.indexLength];
* uintXY_t data[header.dataLength];
*
* The trie data array is an array of uint16_t, uint32_t, or uint8_t,
* specified via the UCPTrieValueWidth when building the trie.
* The data array is 32-bit-aligned for uint32_t, otherwise 16-bit-aligned.
* The overall length of the trie data is a multiple of 4 bytes.
* (Padding is added at the end of the index array and/or near the end of the data array as needed.)
*
* The length of the data array (dataLength) is stored as an integer split across two fields
* of the header struct (high bits in header.options).
*
* The trie type can be "fast" or "small" which determines the index structure,
* specified via the UCPTrieType when building the trie.
*
* The type and valueWidth are stored in the header.options.
* There are reserved type and valueWidth values, and reserved header.options bits.
* They could be used in future format extensions.
* Code reading the trie structure must fail with an error when unknown values or options are set.
*
* Values for ASCII characters (U+0000..U+007F) can always be found at the start of the data array.
*
* Values for code points below a type-specific fast-indexing limit are found via two-stage lookup.
* For a "fast" trie, the limit is the BMP/supplementary boundary at U+10000.
* For a "small" trie, the limit is UCPTRIE_SMALL_MAX+1=U+1000.
*
* All code points in the range highStart..U+10FFFF map to a single highValue
* which is stored at the second-to-last position of the data array.
* (See UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET.)
* The highStart value is header.shiftedHighStart<<UCPTRIE_SHIFT_2.
* (UCPTRIE_SHIFT_2=9)
*
* Values for code points fast_limit..highStart-1 are found via four-stage lookup.
* The data block size is smaller for this range than for the fast range.
* This together with more index stages with small blocks makes this range
* more easily compactable.
*
* There is also a trie error value stored at the last position of the data array.
* (See UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET.)
* It is intended to be returned for inputs that are not Unicode code points
* (outside U+0000..U+10FFFF), or in string processing for ill-formed input
* (unpaired surrogate in UTF-16, ill-formed UTF-8 subsequence).
*
* For a "fast" trie:
*
* The index array starts with the BMP index table for BMP code point lookup.
* Its length is 1024=0x400.
*
* The supplementary index-1 table follows the BMP index table.
* Variable length, for code points up to highStart-1.
* Maximum length 64=0x40=0x100000>>UCPTRIE_SHIFT_1.
* (For 0x100000 supplementary code points U+10000..U+10ffff.)
*
* After this index-1 table follow the variable-length index-3 and index-2 tables.
*
* The supplementary index tables are omitted completely
* if there is only BMP data (highStart<=U+10000).
*
* For a "small" trie:
*
* The index array starts with a fast-index table for lookup of code points U+0000..U+0FFF.
*
* The "supplementary" index tables are always stored.
* The index-1 table starts from U+0000, its maximum length is 68=0x44=0x110000>>UCPTRIE_SHIFT_1.
*
* For both trie types:
*
* The last index-2 block may be a partial block, storing indexes only for code points
* below highStart.
*
* Lookup for ASCII code point c:
*
* Linear access from the start of the data array.
*
* value = data[c];
*
* Lookup for fast-range code point c:
*
* Shift the code point right by UCPTRIE_FAST_SHIFT=6 bits,
* fetch the index array value at that offset,
* add the lower code point bits, index into the data array.
*
* value = data[index[c>>6] + (c&0x3f)];
*
* (This works for ASCII as well.)
*
* Lookup for small-range code point c below highStart:
*
* Split the code point into four bit fields using several sets of shifts & masks
* to read consecutive values from the index-1, index-2, index-3 and data tables.
*
* If all of the data block offsets in an index-3 block fit within 16 bits (up to 0xffff),
* then the data block offsets are stored directly as uint16_t.
*
* Otherwise (this is very unusual but possible), the index-2 entry for the index-3 block
* has bit 15 (0x8000) set, and each set of 8 index-3 entries is preceded by
* an additional uint16_t word. Data block offsets are 18 bits wide, with the top 2 bits stored
* in the additional word.
*
* See ucptrie_internalSmallIndex() for details.
*
* (In a "small" trie, this works for ASCII and below-fast_limit code points as well.)
*
* Compaction:
*
* Multiple code point ranges ("blocks") that are aligned on certain boundaries
* (determined by the shifting/bit fields of code points) and
* map to the same data values normally share a single subsequence of the data array.
* Data blocks can also overlap partially.
* (Depending on the builder code finding duplicate and overlapping blocks.)
*
* Iteration over same-value ranges:
*
* Range iteration (ucptrie_getRange()) walks the structure from a start code point
* until some code point is found that maps to a different value;
* the end of the returned range is just before that.
*
* The header.dataNullOffset (split across two header fields, high bits in header.options)
* is the offset of a widely shared data block filled with one single value.
* It helps quickly skip over large ranges of data with that value.
* Similarly, the header.index3NullOffset is the index-array offset of an index-3 block
* where all index entries point to the dataNullOffset.
* If there is no such data or index-3 block, then these offsets are set to
* values that cannot be reached (data offset out of range/reserved index offset),
* normally UCPTRIE_NO_DATA_NULL_OFFSET or UCPTRIE_NO_INDEX3_NULL_OFFSET respectively.
*/
#endif
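
Note on the 18-bit index-3 layout described in the format comment above: when bit 15 of the index-2 entry is set, every 8 data block offsets occupy 9 uint16_t slots, with the first slot carrying the top 2 bits of each of the following 8 offsets. The sketch below reproduces the decoding arithmetic used by the range enumeration code in ucptrie.cpp on a plain vector. The function names are hypothetical, and encode18() is only an assumed inverse written for illustration; it is not the builder's code. The shift is done on uint32_t here to keep the arithmetic unsigned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoder: packs 18-bit offsets into groups of 9 uint16_t words per 8 offsets.
// Assumes offsets.size() is a multiple of 8 and every offset is in 0..0x3ffff.
std::vector<uint16_t> encode18(const std::vector<int32_t> &offsets) {
    assert(offsets.size() % 8 == 0);
    std::vector<uint16_t> index;
    for (std::size_t g = 0; g < offsets.size(); g += 8) {
        uint32_t high = 0;
        for (int gi = 0; gi < 8; ++gi) {
            int32_t offset = offsets[g + static_cast<std::size_t>(gi)];
            assert(0 <= offset && offset <= 0x3ffff);
            // Entry 0 uses bits 15..14 of the extra word, entry 7 uses bits 1..0.
            high |= static_cast<uint32_t>(offset >> 16) << (14 - 2 * gi);
        }
        index.push_back(static_cast<uint16_t>(high));
        for (int gi = 0; gi < 8; ++gi) {
            index.push_back(static_cast<uint16_t>(offsets[g + static_cast<std::size_t>(gi)]));
        }
    }
    return index;
}

// Decoder following the arithmetic in the range enumeration code:
// blockStart is the index-3 block offset with bit 15 already masked off.
int32_t decode18(const std::vector<uint16_t> &index, int32_t blockStart, int32_t i3) {
    int32_t group = blockStart + (i3 & ~7) + (i3 >> 3);  // 9 slots per 8 entries
    int32_t gi = i3 & 7;
    uint32_t high = index[static_cast<std::size_t>(group)];
    ++group;
    int32_t block = static_cast<int32_t>((high << (2 + (2 * gi))) & 0x30000u);
    block |= index[static_cast<std::size_t>(group + gi)];
    return block;
}
```

Encoding 16 offsets should yield 18 words, and decode18() should return each original offset for i3 = 0..15, which exercises the stride of 9 between groups.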

@ -333,6 +333,43 @@ uprv_compareInvEbcdic(const UDataSwapper *ds,
# error Unknown charset family!
#endif
// utrie_swap.cpp -----------------------------------------------------------***
/**
* Swaps a serialized UTrie.
* @internal
*/
U_CAPI int32_t U_EXPORT2
utrie_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/**
* Swaps a serialized UTrie2.
* @internal
*/
U_CAPI int32_t U_EXPORT2
utrie2_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/**
* Swaps a serialized UCPTrie.
* @internal
*/
U_CAPI int32_t U_EXPORT2
ucptrie_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/**
* Swaps a serialized UTrie, UTrie2, or UCPTrie.
* @internal
*/
U_CAPI int32_t U_EXPORT2
utrie_swapAnyVersion(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/* material... -------------------------------------------------------------- */
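
Note on the swap functions declared above: a swapper has to decide whether serialized trie data is in platform byte order or in the opposite one. ucptrie_impl.h in this change defines a signature constant for each case, and the sketch below shows one plausible way to use them. This is an assumption for illustration only: the enum and function names are hypothetical, and the actual logic of ucptrie_swap() is not shown in this part of the diff.

```cpp
#include <cassert>
#include <cstdint>

// Signature constants with the values of UCPTRIE_SIG and UCPTRIE_OE_SIG from ucptrie_impl.h.
constexpr uint32_t kTrieSig = 0x54726933;          // "Tri3" in platform byte order
constexpr uint32_t kTrieOppositeSig = 0x33697254;  // "Tri3" in the opposite byte order

// Hypothetical classification result; not an ICU type.
enum class TrieByteOrder { kPlatform, kOpposite, kUnknown };

// Reverses the four bytes of a 32-bit word.
constexpr uint32_t byteSwap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

// The two signature constants are byte-swapped images of each other.
static_assert(byteSwap32(kTrieSig) == kTrieOppositeSig, "signatures must mirror each other");

// Classifies the first 32-bit word of serialized data, read in platform byte order.
TrieByteOrder classifySignature(uint32_t firstWord) {
    if (firstWord == kTrieSig) { return TrieByteOrder::kPlatform; }
    if (firstWord == kTrieOppositeSig) { return TrieByteOrder::kOpposite; }
    return TrieByteOrder::kUnknown;
}

// Converts a 32-bit word from the detected byte order to platform byte order.
uint32_t toPlatformOrder(uint32_t word, TrieByteOrder order) {
    assert(order != TrieByteOrder::kUnknown);
    return order == TrieByteOrder::kOpposite ? byteSwap32(word) : word;
}
```

The 16-bit header fields and index entries would need an analogous 16-bit swap; that part is omitted here.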

File diff suppressed because it is too large.

@ -0,0 +1,695 @@
// © 2017 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// ucptrie.h (modified from utrie2.h)
// created: 2017dec29 Markus W. Scherer
#ifndef __UCPTRIE_H__
#define __UCPTRIE_H__
#include "unicode/utypes.h"
#include "unicode/localpointer.h"
#include "unicode/utf8.h"
#include "putilimp.h"
#include "udataswp.h"
U_CDECL_BEGIN
/**
* \file
*
* This file defines an immutable Unicode code point trie.
*
* @see UCPTrie
* @see UMutableCPTrie
*/
/**
* Immutable Unicode code point trie structure.
* Fast, reasonably compact, map from Unicode code points (U+0000..U+10FFFF) to integer values.
* For details see http://site.icu-project.org/design/struct/utrie
*
* Do not access UCPTrie fields directly; use public functions and macros.
* Functions are easy to use: They support all trie types and value widths.
*
* When performance is really important, macros provide faster access.
* Most macros are specific to either "fast" or "small" tries, see UCPTrieType.
* There are "fast" macros for special optimized use cases.
*
* The macros will return bogus values, or may crash, if used on the wrong type or value width.
*
* @see UMutableCPTrie
* @draft ICU 63
*/
struct UCPTrie;
typedef struct UCPTrie UCPTrie;
/**
* Selectors for the type of a UCPTrie.
* Different trade-offs for size vs. speed.
*
* @see umutablecptrie_buildImmutable
* @see ucptrie_openFromBinary
* @see ucptrie_getType
* @draft ICU 63
*/
enum UCPTrieType {
/**
* For ucptrie_openFromBinary() to accept any type.
* ucptrie_getType() will return the actual type.
* @draft ICU 63
*/
UCPTRIE_TYPE_ANY = -1,
/**
* Fast/simple/larger BMP data structure. Use functions and "fast" macros.
* @draft ICU 63
*/
UCPTRIE_TYPE_FAST,
/**
* Small/slower BMP data structure. Use functions and "small" macros.
* @draft ICU 63
*/
UCPTRIE_TYPE_SMALL
};
typedef enum UCPTrieType UCPTrieType;
/**
* Selectors for the number of bits in a UCPTrie data value.
*
* @see umutablecptrie_buildImmutable
* @see ucptrie_openFromBinary
* @see ucptrie_getValueWidth
* @draft ICU 63
*/
enum UCPTrieValueWidth {
/**
* For ucptrie_openFromBinary() to accept any data value width.
* ucptrie_getValueWidth() will return the actual data value width.
* @draft ICU 63
*/
UCPTRIE_VALUE_BITS_ANY = -1,
/**
* 16 bits per UCPTrie data value.
* @draft ICU 63
*/
UCPTRIE_VALUE_BITS_16,
/**
* 32 bits per UCPTrie data value.
* @draft ICU 63
*/
UCPTRIE_VALUE_BITS_32,
/**
* 8 bits per UCPTrie data value.
* @draft ICU 63
*/
UCPTRIE_VALUE_BITS_8
};
typedef enum UCPTrieValueWidth UCPTrieValueWidth;
/**
* Selectors for how ucptrie_getRange() should report value ranges overlapping with surrogates.
* Most users should use UCPTRIE_RANGE_NORMAL.
*
* @see ucptrie_getRange
* @draft ICU 63
*/
enum UCPTrieRangeOption {
/**
* ucptrie_getRange() enumerates all same-value ranges as stored in the trie.
* Most users should use this option.
*/
UCPTRIE_RANGE_NORMAL,
/**
* ucptrie_getRange() enumerates all same-value ranges as stored in the trie,
* except that lead surrogates (U+D800..U+DBFF) are treated as having the
* surrogateValue, which is passed to getRange() as a separate parameter.
* The surrogateValue is not transformed via filter().
* See U_IS_LEAD(c).
*
* Most users should use UCPTRIE_RANGE_NORMAL instead.
*
* This option is useful for tries that map surrogate code *units* to
* special values optimized for UTF-16 string processing
* or for special error behavior for unpaired surrogates,
* but those values are not to be associated with the lead surrogate code *points*.
*/
UCPTRIE_RANGE_FIXED_LEAD_SURROGATES,
/**
* ucptrie_getRange() enumerates all same-value ranges as stored in the trie,
* except that all surrogates (U+D800..U+DFFF) are treated as having the
* surrogateValue, which is passed to getRange() as a separate parameter.
* The surrogateValue is not transformed via filter().
* See U_IS_SURROGATE(c).
*
* Most users should use UCPTRIE_RANGE_NORMAL instead.
*
* This option is useful for tries that map surrogate code *units* to
* special values optimized for UTF-16 string processing
* or for special error behavior for unpaired surrogates,
* but those values are not to be associated with the surrogate code *points*.
*/
UCPTRIE_RANGE_FIXED_ALL_SURROGATES
};
typedef enum UCPTrieRangeOption UCPTrieRangeOption;
/**
* Opens a trie from its binary form, stored in 32-bit-aligned memory.
* Inverse of ucptrie_toBinary().
*
* The memory must remain valid and unchanged as long as the trie is used.
* You must ucptrie_close() the trie once you are done using it.
*
* @param type selects the trie type; results in a
* U_INVALID_FORMAT_ERROR if it does not match the binary data;
* use UCPTRIE_TYPE_ANY to accept any type
* @param valueWidth selects the number of bits in a data value; results in a
* U_INVALID_FORMAT_ERROR if it does not match the binary data;
* use UCPTRIE_VALUE_BITS_ANY to accept any data value width
* @param data a pointer to 32-bit-aligned memory containing the binary data of a UCPTrie
* @param length the number of bytes available at data;
* can be more than necessary
* @param pActualLength receives the actual number of bytes at data taken up by the trie data;
* can be NULL
* @param pErrorCode an in/out ICU UErrorCode
* @return the trie
*
* @see umutablecptrie_open
* @see umutablecptrie_buildImmutable
* @see ucptrie_toBinary
* @draft ICU 63
*/
U_CAPI UCPTrie * U_EXPORT2
ucptrie_openFromBinary(UCPTrieType type, UCPTrieValueWidth valueWidth,
const void *data, int32_t length, int32_t *pActualLength,
UErrorCode *pErrorCode);
/**
* Closes a trie and releases associated memory.
*
* @param trie the trie
* @draft ICU 63
*/
U_CAPI void U_EXPORT2
ucptrie_close(UCPTrie *trie);
#if U_SHOW_CPLUSPLUS_API
U_NAMESPACE_BEGIN
/**
* \class LocalUCPTriePointer
* "Smart pointer" class, closes a UCPTrie via ucptrie_close().
* For most methods see the LocalPointerBase base class.
*
* @see LocalPointerBase
* @see LocalPointer
* @draft ICU 63
*/
U_DEFINE_LOCAL_OPEN_POINTER(LocalUCPTriePointer, UCPTrie, ucptrie_close);
U_NAMESPACE_END
#endif
/**
* Returns the trie type.
*
* @param trie the trie
* @return the trie type
* @see ucptrie_openFromBinary
* @see UCPTRIE_TYPE_ANY
* @draft ICU 63
*/
U_CAPI UCPTrieType U_EXPORT2
ucptrie_getType(const UCPTrie *trie);
/**
* Returns the number of bits in a trie data value.
*
* @param trie the trie
* @return the number of bits in a trie data value
* @see ucptrie_openFromBinary
* @see UCPTRIE_VALUE_BITS_ANY
* @draft ICU 63
*/
U_CAPI UCPTrieValueWidth U_EXPORT2
ucptrie_getValueWidth(const UCPTrie *trie);
/**
* Returns the value for a code point as stored in the trie, with range checking.
* Returns the trie error value if c is not in the range 0..U+10FFFF.
*
* Easier to use than UCPTRIE_FAST_GET() and similar macros but slower.
* Easier to use because, unlike the macros, this function works on all UCPTrie
* objects, for all types and value widths.
*
* @param trie the trie
* @param c the code point
* @return the trie value,
* or the trie error value if the code point is not in the range 0..U+10FFFF
* @draft ICU 63
*/
U_CAPI uint32_t U_EXPORT2
ucptrie_get(const UCPTrie *trie, UChar32 c);
/**
* Callback function type: Modifies a trie value.
* Optionally called by ucptrie_getRange() or umutablecptrie_getRange().
* The modified value will be returned by the getRange function.
*
* Can be used to ignore some of the value bits,
* make a filter for one of several values,
* return a value index computed from the trie value, etc.
*
* @param context an opaque pointer, as passed into the getRange function
* @param value a value from the trie
* @return the modified value
* @draft ICU 63
*/
typedef uint32_t U_CALLCONV
UCPTrieValueFilter(const void *context, uint32_t value);
/**
* Returns the last code point such that all those from start to there have the same value.
* Can be used to efficiently iterate over all same-value ranges in a trie.
*
* If the UCPTrieValueFilter function pointer is not NULL, then
* the value to be delivered is passed through that function, and the return value is the end
* of the range where all values are modified to the same actual value.
* The value is unchanged if that function pointer is NULL.
*
* Example:
* \code
* UChar32 start = 0, end;
* uint32_t value;
* while ((end = ucptrie_getRange(trie, start, UCPTRIE_RANGE_NORMAL, 0,
* NULL, NULL, &value)) >= 0) {
* // Work with the range start..end and its value.
* start = end + 1;
* }
* \endcode
*
* @param trie the trie
* @param start range start
* @param option defines whether surrogates are treated normally,
* or as having the surrogateValue; usually UCPTRIE_RANGE_NORMAL
* @param surrogateValue value for surrogates; ignored if option==UCPTRIE_RANGE_NORMAL
* @param filter a pointer to a function that may modify the trie data value,
* or NULL if the values from the trie are to be used unmodified
* @param context an opaque pointer that is passed on to the filter function
* @param pValue if not NULL, receives the value that every code point start..end has;
* may have been modified by filter(context, trie value)
* if that function pointer is not NULL
* @return the range end code point, or -1 if start is not a valid code point
* @draft ICU 63
*/
U_CAPI UChar32 U_EXPORT2
ucptrie_getRange(const UCPTrie *trie, UChar32 start,
UCPTrieRangeOption option, uint32_t surrogateValue,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
/**
* Writes a memory-mappable form of the trie into 32-bit-aligned memory.
* Inverse of ucptrie_openFromBinary().
*
* @param trie the trie
* @param data a pointer to 32-bit-aligned memory to be filled with the trie data;
* can be NULL if capacity==0
* @param capacity the number of bytes available at data, or 0 for pure preflighting
* @param pErrorCode an in/out ICU UErrorCode;
* U_BUFFER_OVERFLOW_ERROR if the capacity is too small
* @return the number of bytes written or (if buffer overflow) needed for the trie
*
* @see ucptrie_openFromBinary()
* @draft ICU 63
*/
U_CAPI int32_t U_EXPORT2
ucptrie_toBinary(const UCPTrie *trie, void *data, int32_t capacity, UErrorCode *pErrorCode);
/**
* Macro parameter value for a trie with 16-bit data values.
* Use the name of this macro as a "dataAccess" parameter in other macros.
* Do not use this macro in any other way.
*
* @see UCPTRIE_VALUE_BITS_16
* @draft ICU 63
*/
#define UCPTRIE_16(trie, i) ((trie)->data.ptr16[i])
/**
* Macro parameter value for a trie with 32-bit data values.
* Use the name of this macro as a "dataAccess" parameter in other macros.
* Do not use this macro in any other way.
*
* @see UCPTRIE_VALUE_BITS_32
* @draft ICU 63
*/
#define UCPTRIE_32(trie, i) ((trie)->data.ptr32[i])
/**
* Macro parameter value for a trie with 8-bit data values.
* Use the name of this macro as a "dataAccess" parameter in other macros.
* Do not use this macro in any other way.
*
* @see UCPTRIE_VALUE_BITS_8
* @draft ICU 63
*/
#define UCPTRIE_8(trie, i) ((trie)->data.ptr8[i])
/**
* Returns a trie value for a code point, with range checking.
* Returns the trie error value if c is not in the range 0..U+10FFFF.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param c (UChar32, in) the input code point
* @return The code point's trie value.
* @draft ICU 63
*/
#define UCPTRIE_FAST_GET(trie, dataAccess, c) dataAccess(trie, _UCPTRIE_CP_INDEX(trie, 0xffff, c))
/**
* Returns a trie value for a code point, with range checking.
* Returns the trie error value if c is not in the range U+0000..U+10FFFF.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_SMALL
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param c (UChar32, in) the input code point
* @return The code point's trie value.
* @draft ICU 63
*/
#define UCPTRIE_SMALL_GET(trie, dataAccess, c) \
dataAccess(trie, _UCPTRIE_CP_INDEX(trie, UCPTRIE_SMALL_MAX, c))
/**
* UTF-16: Reads the next code point (UChar32 c, out), post-increments src,
* and gets a value from the trie.
* Sets the trie error value if c is an unpaired surrogate.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param src (const UChar *, in/out) the source text pointer
* @param limit (const UChar *, in) the limit pointer for the text, or NULL if NUL-terminated
* @param c (UChar32, out) variable for the code point
* @param result (out) variable for the trie lookup result
* @draft ICU 63
*/
#define UCPTRIE_FAST_U16_NEXT(trie, dataAccess, src, limit, c, result) { \
(c) = *(src)++; \
int32_t __index; \
if (!U16_IS_SURROGATE(c)) { \
__index = _UCPTRIE_FAST_INDEX(trie, c); \
} else { \
uint16_t __c2; \
if (U16_IS_SURROGATE_LEAD(c) && (src) != (limit) && U16_IS_TRAIL(__c2 = *(src))) { \
++(src); \
(c) = U16_GET_SUPPLEMENTARY((c), __c2); \
__index = _UCPTRIE_SMALL_INDEX(trie, c); \
} else { \
__index = (trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET; \
} \
} \
(result) = dataAccess(trie, __index); \
}
/**
* UTF-16: Reads the previous code point (UChar32 c, out), pre-decrements src,
* and gets a value from the trie.
* Sets the trie error value if c is an unpaired surrogate.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param start (const UChar *, in) the start pointer for the text
* @param src (const UChar *, in/out) the source text pointer
* @param c (UChar32, out) variable for the code point
* @param result (out) variable for the trie lookup result
* @draft ICU 63
*/
#define UCPTRIE_FAST_U16_PREV(trie, dataAccess, start, src, c, result) { \
(c) = *--(src); \
int32_t __index; \
if (!U16_IS_SURROGATE(c)) { \
__index = _UCPTRIE_FAST_INDEX(trie, c); \
} else { \
uint16_t __c2; \
if (U16_IS_SURROGATE_TRAIL(c) && (src) != (start) && U16_IS_LEAD(__c2 = *((src) - 1))) { \
--(src); \
(c) = U16_GET_SUPPLEMENTARY(__c2, (c)); \
__index = _UCPTRIE_SMALL_INDEX(trie, c); \
} else { \
__index = (trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET; \
} \
} \
(result) = dataAccess(trie, __index); \
}
/**
* UTF-8: Post-increments src and gets a value from the trie.
* Sets the trie error value for an ill-formed byte sequence.
*
* Unlike UCPTRIE_FAST_U16_NEXT() this UTF-8 macro does not provide the code point
* because it would be more work to do so and is often not needed.
* If the trie value differs from the error value, then the byte sequence is well-formed,
* and the code point can be assembled without revalidation.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param src (const char *, in/out) the source text pointer
* @param limit (const char *, in) the limit pointer for the text (must not be NULL)
* @param result (out) variable for the trie lookup result
* @draft ICU 63
*/
#define UCPTRIE_FAST_U8_NEXT(trie, dataAccess, src, limit, result) { \
int32_t __lead = (uint8_t)*(src)++; \
if (!U8_IS_SINGLE(__lead)) { \
uint8_t __t1, __t2, __t3; \
if ((src) != (limit) && \
(__lead >= 0xe0 ? \
__lead < 0xf0 ? /* U+0800..U+FFFF except surrogates */ \
U8_LEAD3_T1_BITS[__lead &= 0xf] & (1 << ((__t1 = *(src)) >> 5)) && \
++(src) != (limit) && (__t2 = *(src) - 0x80) <= 0x3f && \
(__lead = ((int32_t)(trie)->index[(__lead << 6) + (__t1 & 0x3f)]) + __t2, 1) \
: /* U+10000..U+10FFFF */ \
(__lead -= 0xf0) <= 4 && \
U8_LEAD4_T1_BITS[(__t1 = *(src)) >> 4] & (1 << __lead) && \
(__lead = (__lead << 6) | (__t1 & 0x3f), ++(src) != (limit)) && \
(__t2 = *(src) - 0x80) <= 0x3f && \
++(src) != (limit) && (__t3 = *(src) - 0x80) <= 0x3f && \
(__lead = __lead >= (trie)->shifted12HighStart ? \
(trie)->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET : \
ucptrie_internalSmallU8Index((trie), __lead, __t2, __t3), 1) \
: /* U+0080..U+07FF */ \
__lead >= 0xc2 && (__t1 = *(src) - 0x80) <= 0x3f && \
(__lead = (int32_t)(trie)->index[__lead & 0x1f] + __t1, 1))) { \
++(src); \
} else { \
__lead = (trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET; /* ill-formed */ \
} \
} \
(result) = dataAccess(trie, __lead); \
}
/**
* UTF-8: Pre-decrements src and gets a value from the trie.
* Sets the trie error value for an ill-formed byte sequence.
*
* Unlike UCPTRIE_FAST_U16_PREV() this UTF-8 macro does not provide the code point
* because it would be more work to do so and is often not needed.
* If the trie value differs from the error value, then the byte sequence is well-formed,
* and the code point can be assembled without revalidation.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param start (const char *, in) the start pointer for the text
* @param src (const char *, in/out) the source text pointer
* @param result (out) variable for the trie lookup result
* @draft ICU 63
*/
#define UCPTRIE_FAST_U8_PREV(trie, dataAccess, start, src, result) { \
int32_t __index = (uint8_t)*--(src); \
if (!U8_IS_SINGLE(__index)) { \
__index = ucptrie_internalU8PrevIndex((trie), __index, (const uint8_t *)(start), \
(const uint8_t *)(src)); \
(src) -= __index & 7; \
__index >>= 3; \
} \
(result) = dataAccess(trie, __index); \
}
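The way UCPTRIE_FAST_U8_PREV() consumes the return value of ucptrie_internalU8PrevIndex() (`& 7` for the pointer adjustment, `>> 3` for the data index) suggests that the function packs two numbers into one int32_t. The sketch below shows such a packing under that assumption; the function body is not part of this section, and the helper names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Packs a data index and a count of additional bytes (0..7) into one int32_t. */
static int32_t pack_prev_result(int32_t dataIndex, int32_t extraBytes) {
    assert(0 <= extraBytes && extraBytes <= 7);
    assert(0 <= dataIndex && dataIndex <= (INT32_MAX >> 3));
    return (dataIndex << 3) | extraBytes;
}

/* Low three bits: how many more bytes the caller must step back over. */
static int32_t unpack_extra_bytes(int32_t packed) { return packed & 7; }

/* Remaining bits: the index into the trie's data array. */
static int32_t unpack_data_index(int32_t packed) { return packed >> 3; }
```

Packing both results into one return value avoids an extra output parameter in a macro that is expanded inline in tight loops.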
/**
* Returns a trie value for an ASCII code point, without range checking.
*
* @param trie (const UCPTrie *, in) the trie (of either fast or small type)
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param c (UChar32, in) the input code point; must be U+0000..U+007F
* @return The ASCII code point's trie value.
* @draft ICU 63
*/
#define UCPTRIE_ASCII_GET(trie, dataAccess, c) dataAccess(trie, c)
/**
* Returns a trie value for a BMP code point (U+0000..U+FFFF), without range checking.
* Can be used to look up a value for a UTF-16 code unit if other parts of
* the string processing check for surrogates.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param c (UChar32, in) the input code point, must be U+0000..U+FFFF
* @return The BMP code point's trie value.
* @draft ICU 63
*/
#define UCPTRIE_FAST_BMP_GET(trie, dataAccess, c) dataAccess(trie, _UCPTRIE_FAST_INDEX(trie, c))
/**
* Returns a trie value for a supplementary code point (U+10000..U+10FFFF),
* without range checking.
*
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie's value width
* @param c (UChar32, in) the input code point, must be U+10000..U+10FFFF
* @return The supplementary code point's trie value.
* @draft ICU 63
*/
#define UCPTRIE_FAST_SUPP_GET(trie, dataAccess, c) dataAccess(trie, _UCPTRIE_SMALL_INDEX(trie, c))
/* Internal definitions ----------------------------------------------------- */
/** @internal */
typedef union UCPTrieData {
/** @internal */
const void *ptr0;
/** @internal */
const uint16_t *ptr16;
/** @internal */
const uint32_t *ptr32;
/** @internal */
const uint8_t *ptr8;
} UCPTrieData;
/**
* Internal trie structure definition.
* Visible only for use by API macros.
* @internal
*/
struct UCPTrie {
/** @internal */
const uint16_t *index;
/** @internal */
UCPTrieData data;
/** @internal */
int32_t indexLength;
/** @internal */
int32_t dataLength;
/** Start of the last range which ends at U+10FFFF. @internal */
UChar32 highStart;
/** highStart>>12 @internal */
uint16_t shifted12HighStart;
/** @internal */
int8_t type; // UCPTrieType
/** @internal */
int8_t valueWidth; // UCPTrieValueWidth
/** padding/reserved @internal */
uint32_t reserved32;
/** padding/reserved @internal */
uint16_t reserved16;
/**
* Internal index-3 null block offset.
* Set to an impossibly high value (e.g., 0xffff) if there is no dedicated index-3 null block.
* @internal
*/
uint16_t index3NullOffset;
/**
* Internal data null block offset, not shifted.
* Set to an impossibly high value (e.g., 0xfffff) if there is no dedicated data null block.
* @internal
*/
int32_t dataNullOffset;
/** @internal */
uint32_t nullValue;
#ifdef UCPTRIE_DEBUG
/** @internal */
const char *name;
#endif
};
/**
* Internal implementation constants.
* These are needed for the API macros, but users should not use these directly.
* @internal
*/
enum {
/** @internal */
UCPTRIE_FAST_SHIFT = 6,
/** Number of entries in a data block for code points below the fast limit. 64=0x40 @internal */
UCPTRIE_FAST_DATA_BLOCK_LENGTH = 1 << UCPTRIE_FAST_SHIFT,
/** Mask for getting the lower bits for the in-fast-data-block offset. @internal */
UCPTRIE_FAST_DATA_MASK = UCPTRIE_FAST_DATA_BLOCK_LENGTH - 1,
/** @internal */
UCPTRIE_SMALL_MAX = 0xfff,
/**
* Offset from dataLength (to be subtracted) for fetching the
* value returned for out-of-range code points and ill-formed UTF-8/16.
* @internal
*/
UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET = 1,
/**
* Offset from dataLength (to be subtracted) for fetching the
* value returned for code points highStart..U+10FFFF.
* @internal
*/
UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET = 2
};
/* Internal functions and macros -------------------------------------------- */
/** @internal */
U_INTERNAL int32_t U_EXPORT2
ucptrie_internalSmallIndex(const UCPTrie *trie, UChar32 c);
/** @internal */
U_INTERNAL int32_t U_EXPORT2
ucptrie_internalSmallU8Index(const UCPTrie *trie, int32_t lt1, uint8_t t2, uint8_t t3);
/**
* Internal function for part of the UCPTRIE_FAST_U8_PREV() macro implementation.
* Do not call directly.
* @internal
*/
U_INTERNAL int32_t U_EXPORT2
ucptrie_internalU8PrevIndex(const UCPTrie *trie, UChar32 c,
const uint8_t *start, const uint8_t *src);
/** Internal trie getter for a code point below the fast limit. Returns the data index. @internal */
#define _UCPTRIE_FAST_INDEX(trie, c) \
((int32_t)(trie)->index[(c) >> UCPTRIE_FAST_SHIFT] + ((c) & UCPTRIE_FAST_DATA_MASK))
/** Internal trie getter for a code point at or above the fast limit. Returns the data index. @internal */
#define _UCPTRIE_SMALL_INDEX(trie, c) \
((c) >= (trie)->highStart ? \
(trie)->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET : \
ucptrie_internalSmallIndex(trie, c))
/**
* Internal trie getter for a code point, with checking that c is in U+0000..10FFFF.
* Returns the data index.
* @internal
*/
#define _UCPTRIE_CP_INDEX(trie, fastMax, c) \
((uint32_t)(c) <= (uint32_t)(fastMax) ? \
_UCPTRIE_FAST_INDEX(trie, c) : \
(uint32_t)(c) <= 0x10ffff ? \
_UCPTRIE_SMALL_INDEX(trie, c) : \
(trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET)
U_CDECL_END
#endif
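
The index arithmetic in _UCPTRIE_FAST_INDEX() and the three-way dispatch in _UCPTRIE_CP_INDEX() can be tried out on a toy structure. This is a sketch under simplifying assumptions, not the ICU data layout: `ToyTrie` and its functions are hypothetical, the data is 8 bits wide, and every code point above `fastMax` is treated as belonging to the high range, so the multi-stage small index is omitted.

```c
#include <assert.h>
#include <stdint.h>

enum { SHIFT = 6, BLOCK_LENGTH = 1 << SHIFT, MASK = BLOCK_LENGTH - 1 };

typedef struct ToyTrie {
    const uint16_t *index;  /* one entry per 64-code-point block up to fastMax */
    const uint8_t *data;    /* data blocks, then the high value, then the error value */
    int32_t dataLength;     /* data[dataLength-2] = high value, data[dataLength-1] = error value */
    int32_t fastMax;        /* last code point covered by the fast index */
} ToyTrie;

/* Returns the data index for c, mirroring the shape of _UCPTRIE_CP_INDEX(). */
static int32_t toy_index(const ToyTrie *t, int32_t c) {
    assert(t != 0 && t->dataLength >= 2);
    if ((uint32_t)c <= (uint32_t)t->fastMax) {
        /* block start from the index array, plus the offset within the block */
        return t->index[c >> SHIFT] + (c & MASK);
    } else if ((uint32_t)c <= 0x10ffff) {
        return t->dataLength - 2;  /* toy simplification: one shared high value */
    } else {
        return t->dataLength - 1;  /* out of range: error value */
    }
}

static uint8_t toy_get(const ToyTrie *t, int32_t c) {
    return t->data[toy_index(t, c)];
}
```

The unsigned comparisons fold the check for negative input into the upper-bound test: a negative `c` becomes a very large unsigned number and falls through to the error value.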

@ -0,0 +1,215 @@
// © 2017 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// umutablecptrie.h (split out of ucptrie.h)
// created: 2018jan24 Markus W. Scherer
#ifndef __UMUTABLECPTRIE_H__
#define __UMUTABLECPTRIE_H__
#include "unicode/utypes.h"
#include "unicode/localpointer.h"
#include "unicode/ucptrie.h"
#include "unicode/utf8.h"
#include "putilimp.h"
#include "udataswp.h"
U_CDECL_BEGIN
/**
* \file
*
* This file defines a mutable Unicode code point trie.
*
* @see UCPTrie
* @see UMutableCPTrie
*/
/**
* Mutable Unicode code point trie.
* Fast map from Unicode code points (U+0000..U+10FFFF) to 32-bit integer values.
* For details see http://site.icu-project.org/design/struct/utrie
*
* Setting values (especially ranges) and looking them up are fast.
* The mutable trie is only somewhat space-efficient.
* It builds a compacted, immutable UCPTrie.
*
* This trie can be modified while iterating over its contents.
* For example, it is possible to merge its values with those from another
* set of ranges (e.g., another mutable or immutable trie):
* Iterate over those source ranges; for each of them iterate over this trie;
* add the source value into the value of each trie range.
*
* @see UCPTrie
* @see umutablecptrie_buildImmutable
* @draft ICU 63
*/
struct UMutableCPTrie;
typedef struct UMutableCPTrie UMutableCPTrie;
/**
* Creates a mutable trie that initially maps each Unicode code point to the same value.
* It uses 32-bit data values until umutablecptrie_buildImmutable() is called.
* umutablecptrie_buildImmutable() takes a valueWidth parameter which
* determines the number of bits in the data value in the resulting UCPTrie.
* You must umutablecptrie_close() the trie once you are done using it.
*
* @param initialValue the initial value that is set for all code points
* @param errorValue the value for out-of-range code points and ill-formed UTF-8/16
* @param pErrorCode an in/out ICU UErrorCode
* @return the trie
* @draft ICU 63
*/
U_CAPI UMutableCPTrie * U_EXPORT2
umutablecptrie_open(uint32_t initialValue, uint32_t errorValue, UErrorCode *pErrorCode);
/**
* Clones a mutable trie.
* You must umutablecptrie_close() the clone once you are done using it.
*
* @param other the trie to clone
* @param pErrorCode an in/out ICU UErrorCode
* @return the trie clone
* @draft ICU 63
*/
U_CAPI UMutableCPTrie * U_EXPORT2
umutablecptrie_clone(const UMutableCPTrie *other, UErrorCode *pErrorCode);
/**
* Closes a mutable trie and releases associated memory.
*
* @param trie the trie
* @draft ICU 63
*/
U_CAPI void U_EXPORT2
umutablecptrie_close(UMutableCPTrie *trie);
#if U_SHOW_CPLUSPLUS_API
U_NAMESPACE_BEGIN
/**
* \class LocalUMutableCPTriePointer
* "Smart pointer" class, closes a UMutableCPTrie via umutablecptrie_close().
* For most methods see the LocalPointerBase base class.
*
* @see LocalPointerBase
* @see LocalPointer
* @draft ICU 63
*/
U_DEFINE_LOCAL_OPEN_POINTER(LocalUMutableCPTriePointer, UMutableCPTrie, umutablecptrie_close);
U_NAMESPACE_END
#endif
/**
* Creates a mutable trie with the same contents as the immutable one.
* You must umutablecptrie_close() the mutable trie once you are done using it.
*
* @param trie the immutable trie
* @param pErrorCode an in/out ICU UErrorCode
* @return the mutable trie
* @draft ICU 63
*/
U_CAPI UMutableCPTrie * U_EXPORT2
umutablecptrie_fromUCPTrie(const UCPTrie *trie, UErrorCode *pErrorCode);
/**
* Returns the value for a code point as stored in the trie.
*
* @param trie the trie
* @param c the code point
* @return the value
* @draft ICU 63
*/
U_CAPI uint32_t U_EXPORT2
umutablecptrie_get(const UMutableCPTrie *trie, UChar32 c);
/**
* Returns the last code point such that all those from start to there have the same value.
* Can be used to efficiently iterate over all same-value ranges in a trie.
* The trie can be modified between calls to this function.
*
* If the UCPTrieValueFilter function pointer is not NULL, then
* the value to be delivered is passed through that function, and the return value is the end
* of the range where all values are modified to the same actual value.
* The value is unchanged if that function pointer is NULL.
*
* See the same-signature ucptrie_getRange() for a code sample.
*
* @param trie the trie
* @param start range start
* @param option defines whether surrogates are treated normally,
* or as having the surrogateValue; usually UCPTRIE_RANGE_NORMAL
* @param surrogateValue value for surrogates; ignored if option==UCPTRIE_RANGE_NORMAL
* @param filter a pointer to a function that may modify the trie data value,
* or NULL if the values from the trie are to be used unmodified
* @param context an opaque pointer that is passed on to the filter function
* @param pValue if not NULL, receives the value that every code point start..end has;
* may have been modified by filter(context, trie value)
* if that function pointer is not NULL
* @return the range end code point, or -1 if start is not a valid code point
* @draft ICU 63
*/
U_CAPI UChar32 U_EXPORT2
umutablecptrie_getRange(const UMutableCPTrie *trie, UChar32 start,
UCPTrieRangeOption option, uint32_t surrogateValue,
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
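The range-iteration contract described above (return the end of the same-value range, or -1 for an invalid start) can be sketched on a flat array standing in for the trie. The names below are hypothetical, and the option, surrogateValue, filter, and context parameters are left out of this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy stand-in for a trie: a flat array with one value per position.
 * Returns the last position end such that values[start..end] all equal values[start],
 * or -1 if start is out of range. Stores the shared value in *pValue if not NULL.
 */
static int32_t toy_get_range(const uint32_t *values, int32_t length,
                             int32_t start, uint32_t *pValue) {
    assert(values != NULL);
    if (start < 0 || start >= length) { return -1; }
    uint32_t value = values[start];
    int32_t end = start;
    while (end + 1 < length && values[end + 1] == value) { ++end; }
    if (pValue != NULL) { *pValue = value; }
    return end;
}

/* Counts the same-value ranges by restarting each call just past the previous end. */
static int32_t toy_count_ranges(const uint32_t *values, int32_t length) {
    int32_t count = 0;
    int32_t start = 0, end;
    while ((end = toy_get_range(values, length, start, NULL)) >= 0) {
        ++count;
        start = end + 1;
    }
    return count;
}
```

Because each call takes an explicit start and keeps no iterator state, the caller may modify the underlying values between calls, which matches the note that the trie can be modified between calls to this function.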
/**
* Sets a value for a code point.
*
* @param trie the trie
* @param c the code point
* @param value the value
* @param pErrorCode an in/out ICU UErrorCode
* @draft ICU 63
*/
U_CAPI void U_EXPORT2
umutablecptrie_set(UMutableCPTrie *trie, UChar32 c, uint32_t value, UErrorCode *pErrorCode);
/**
* Sets a value for each code point [start..end].
* Faster and more space-efficient than setting the value for each code point separately.
*
* @param trie the trie
* @param start the first code point to get the value
* @param end the last code point to get the value (inclusive)
* @param value the value
* @param pErrorCode an in/out ICU UErrorCode
* @draft ICU 63
*/
U_CAPI void U_EXPORT2
umutablecptrie_setRange(UMutableCPTrie *trie,
UChar32 start, UChar32 end,
uint32_t value, UErrorCode *pErrorCode);
/**
* Compacts the data and builds an immutable UCPTrie according to the parameters.
* After this, the mutable trie will be empty.
*
* Not every possible set of mappings can be built into a UCPTrie,
* because of limitations resulting from speed and space optimizations.
* Every Unicode assigned character can be mapped to a unique value.
* Typical data yields data structures far smaller than the limitations.
*
* It is possible to construct extremely unusual mappings that exceed the data structure limits.
* In such a case this function will fail with a U_INDEX_OUTOFBOUNDS_ERROR.
*
* @param trie the trie
* @param type selects the trie type
* @param valueWidth selects the number of bits in a trie data value; if smaller than 32 bits,
* then the values stored in the trie will be truncated first
* @param pErrorCode an in/out ICU UErrorCode
*
* @see umutablecptrie_fromUCPTrie
* @draft ICU 63
*/
U_CAPI UCPTrie * U_EXPORT2
umutablecptrie_buildImmutable(UMutableCPTrie *trie, UCPTrieType type, UCPTrieValueWidth valueWidth,
UErrorCode *pErrorCode);
U_CDECL_END
#endif

@ -21,7 +21,6 @@
#include "unicode/utypes.h"
#include "unicode/utf16.h"
#include "udataswp.h"
U_CDECL_BEGIN
@ -732,17 +731,13 @@ utrie_serialize(UNewTrie *trie, void *data, int32_t capacity,
UBool reduceTo16Bits,
UErrorCode *pErrorCode);
/**
* Swap a serialized UTrie.
* @internal
*/
U_CAPI int32_t U_EXPORT2
utrie_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/* serialization ------------------------------------------------------------ */
// UTrie signature values, in platform endianness and opposite endianness.
// The UTrie signature ASCII byte values spell "Trie".
#define UTRIE_SIG 0x54726965
#define UTRIE_OE_SIG 0x65697254
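The relationship between the two signature constants can be checked with a small standalone sketch: reversing the bytes of the platform-endian signature gives the opposite-endian one, and the bytes read from most to least significant spell the ASCII name. The helper names are hypothetical and not part of ICU.

```c
#include <assert.h>
#include <stdint.h>

/* Reverses the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

/* Writes the four bytes of sig, most significant first, plus a NUL, into out. */
static void sig_to_ascii(uint32_t sig, char out[5]) {
    assert(out != 0);
    for (int i = 0; i < 4; ++i) {
        out[i] = (char)((sig >> (24 - 8 * i)) & 0xff);
    }
    out[4] = 0;
}
```

A reader that finds the opposite-endian value at the start of a serialized trie therefore knows that the data was written on a platform with the other byte order and has to be swapped before use.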
/**
* Trie data structure in serialized form:
*

@ -24,11 +24,10 @@
* This file contains only the runtime and enumeration code, for read-only access.
* See utrie2_builder.c for the builder code.
*/
#ifdef UTRIE2_DEBUG
# include <stdio.h>
#endif
#include "unicode/utypes.h"
#ifdef UCPTRIE_DEBUG
#include "unicode/umutablecptrie.h"
#endif
#include "unicode/utf.h"
#include "unicode/utf8.h"
#include "unicode/utf16.h"
@ -202,6 +201,9 @@ utrie2_openFromSerialized(UTrie2ValueBits valueBits,
trie->memory=(uint32_t *)data;
trie->length=actualLength;
trie->isMemoryOwned=FALSE;
#ifdef UTRIE2_DEBUG
trie->name="fromSerialized";
#endif
/* set the pointers to its index and data arrays */
p16=(const uint16_t *)(header+1);
@ -294,6 +296,9 @@ utrie2_openDummy(UTrie2ValueBits valueBits,
trie->errorValue=errorValue;
trie->highStart=0;
trie->highValueIndex=dataMove+UTRIE2_DATA_START_OFFSET;
#ifdef UTRIE2_DEBUG
trie->name="dummy";
#endif
/* set the header fields */
header=(UTrie2Header *)trie->memory;
@ -373,34 +378,15 @@ utrie2_close(UTrie2 *trie) {
}
if(trie->newTrie!=NULL) {
uprv_free(trie->newTrie->data);
#ifdef UCPTRIE_DEBUG
umutablecptrie_close(trie->newTrie->t3);
#endif
uprv_free(trie->newTrie);
}
uprv_free(trie);
}
}
U_CAPI int32_t U_EXPORT2
utrie2_getVersion(const void *data, int32_t length, UBool anyEndianOk) {
uint32_t signature;
if(length<16 || data==NULL || (U_POINTER_MASK_LSB(data, 3)!=0)) {
return 0;
}
signature=*(const uint32_t *)data;
if(signature==UTRIE2_SIG) {
return 2;
}
if(anyEndianOk && signature==UTRIE2_OE_SIG) {
return 2;
}
if(signature==UTRIE_SIG) {
return 1;
}
if(anyEndianOk && signature==UTRIE_OE_SIG) {
return 1;
}
return 0;
}
U_CAPI UBool U_EXPORT2
utrie2_isFrozen(const UTrie2 *trie) {
return (UBool)(trie->newTrie==NULL);
@ -430,96 +416,6 @@ utrie2_serialize(const UTrie2 *trie,
return trie->length;
}
U_CAPI int32_t U_EXPORT2
utrie2_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
const UTrie2Header *inTrie;
UTrie2Header trie;
int32_t dataLength, size;
UTrie2ValueBits valueBits;
if(U_FAILURE(*pErrorCode)) {
return 0;
}
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
return 0;
}
/* setup and swapping */
if(length>=0 && length<(int32_t)sizeof(UTrie2Header)) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
inTrie=(const UTrie2Header *)inData;
trie.signature=ds->readUInt32(inTrie->signature);
trie.options=ds->readUInt16(inTrie->options);
trie.indexLength=ds->readUInt16(inTrie->indexLength);
trie.shiftedDataLength=ds->readUInt16(inTrie->shiftedDataLength);
valueBits=(UTrie2ValueBits)(trie.options&UTRIE2_OPTIONS_VALUE_BITS_MASK);
dataLength=(int32_t)trie.shiftedDataLength<<UTRIE2_INDEX_SHIFT;
if( trie.signature!=UTRIE2_SIG ||
valueBits<0 || UTRIE2_COUNT_VALUE_BITS<=valueBits ||
trie.indexLength<UTRIE2_INDEX_1_OFFSET ||
dataLength<UTRIE2_DATA_START_OFFSET
) {
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
return 0;
}
size=sizeof(UTrie2Header)+trie.indexLength*2;
switch(valueBits) {
case UTRIE2_16_VALUE_BITS:
size+=dataLength*2;
break;
case UTRIE2_32_VALUE_BITS:
size+=dataLength*4;
break;
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
if(length>=0) {
UTrie2Header *outTrie;
if(length<size) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
outTrie=(UTrie2Header *)outData;
/* swap the header */
ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);
/* swap the index and the data */
switch(valueBits) {
case UTRIE2_16_VALUE_BITS:
ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
break;
case UTRIE2_32_VALUE_BITS:
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
break;
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
}
return size;
}
// utrie2_swapAnyVersion() should be defined here but lives in utrie2_builder.c
// to avoid a dependency from utrie2.cpp on utrie.c.
/* enumeration -------------------------------------------------------------- */
#define MIN_VALUE(a, b) ((a)<(b) ? (a) : (b))

@ -22,7 +22,6 @@
#include "unicode/utypes.h"
#include "unicode/utf8.h"
#include "putilimp.h"
#include "udataswp.h"
U_CDECL_BEGIN
@ -330,40 +329,6 @@ utrie2_serialize(const UTrie2 *trie,
/* Public UTrie2 API: miscellaneous functions ------------------------------- */
/**
* Get the UTrie version from 32-bit-aligned memory containing the serialized form
* of either a UTrie (version 1) or a UTrie2 (version 2).
*
* @param data a pointer to 32-bit-aligned memory containing the serialized form
* of a UTrie, version 1 or 2
* @param length the number of bytes available at data;
* can be more than necessary (see return value)
* @param anyEndianOk If FALSE, only platform-endian serialized forms are recognized.
* If TRUE, opposite-endian serialized forms are recognized as well.
* @return the UTrie version of the serialized form, or 0 if it is not
* recognized as a serialized UTrie
*/
U_CAPI int32_t U_EXPORT2
utrie2_getVersion(const void *data, int32_t length, UBool anyEndianOk);
/**
* Swap a serialized UTrie2.
* @internal
*/
U_CAPI int32_t U_EXPORT2
utrie2_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/**
* Swap a serialized UTrie or UTrie2.
* @internal
*/
U_CAPI int32_t U_EXPORT2
utrie2_swapAnyVersion(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode);
/**
* Build a UTrie2 (version 2) from a UTrie (version 1).
* Enumerates all values in the UTrie and builds a UTrie2 with the same values.
@ -709,6 +674,10 @@ struct UTrie2 {
UBool padding1;
int16_t padding2;
UNewTrie2 *newTrie; /* builder object; NULL when frozen */
#ifdef UTRIE2_DEBUG
const char *name;
#endif
};
/**

@ -24,16 +24,23 @@
* This file contains only the builder code.
* See utrie2.c for the runtime and enumeration code.
*/
// #define UTRIE2_DEBUG
#ifdef UTRIE2_DEBUG
# include <stdio.h>
#endif
// #define UCPTRIE_DEBUG
#include "unicode/utypes.h"
#ifdef UCPTRIE_DEBUG
#include "unicode/ucptrie.h"
#include "unicode/umutablecptrie.h"
#include "ucptrie_impl.h"
#endif
#include "cmemory.h"
#include "utrie2.h"
#include "utrie2_impl.h"
#include "utrie.h" /* for utrie2_fromUTrie() and utrie_swap() */
#include "utrie.h" // for utrie2_fromUTrie()
/* Implementation notes ----------------------------------------------------- */
@ -132,8 +139,14 @@ utrie2_open(uint32_t initialValue, uint32_t errorValue, UErrorCode *pErrorCode)
trie->errorValue=errorValue;
trie->highStart=0x110000;
trie->newTrie=newTrie;
#ifdef UTRIE2_DEBUG
trie->name="open";
#endif
newTrie->data=data;
#ifdef UCPTRIE_DEBUG
newTrie->t3=umutablecptrie_open(initialValue, errorValue, pErrorCode);
#endif
newTrie->dataCapacity=UNEWTRIE2_INITIAL_DATA_LENGTH;
newTrie->initialValue=initialValue;
newTrie->errorValue=errorValue;
@ -246,6 +259,14 @@ cloneBuilder(const UNewTrie2 *other) {
uprv_free(trie);
return NULL;
}
#ifdef UCPTRIE_DEBUG
if(other->t3==nullptr) {
trie->t3=nullptr;
} else {
UErrorCode errorCode=U_ZERO_ERROR;
trie->t3=umutablecptrie_clone(other->t3, &errorCode);
}
#endif
trie->dataCapacity=other->dataCapacity;
/* clone data */
@ -343,6 +364,22 @@ copyEnumRange(const void *context, UChar32 start, UChar32 end, uint32_t value) {
}
#ifdef UTRIE2_DEBUG
static long countInitial(const UTrie2 *trie) {
uint32_t initialValue=trie->initialValue;
int32_t length=trie->dataLength;
long count=0;
if(trie->data16!=nullptr) {
for(int32_t i=0; i<length; ++i) {
if(trie->data16[i]==initialValue) { ++count; }
}
} else {
for(int32_t i=0; i<length; ++i) {
if(trie->data32[i]==initialValue) { ++count; }
}
}
return count;
}
static void
utrie_printLengths(const UTrie *trie) {
long indexLength=trie->indexLength;
@ -357,8 +394,8 @@ utrie2_printLengths(const UTrie2 *trie, const char *which) {
long indexLength=trie->indexLength;
long dataLength=(long)trie->dataLength;
long totalLength=(long)sizeof(UTrie2Header)+indexLength*2+dataLength*(trie->data32!=NULL ? 4 : 2);
printf("**UTrie2Lengths(%s)** index:%6ld data:%6ld serialized:%6ld\n",
which, indexLength, dataLength, totalLength);
printf("**UTrie2Lengths(%s %s)** index:%6ld data:%6ld countInitial:%6ld serialized:%6ld\n",
which, trie->name, indexLength, dataLength, countInitial(trie), totalLength);
}
#endif
@ -622,6 +659,9 @@ set32(UNewTrie2 *trie,
*pErrorCode=U_NO_WRITE_PERMISSION;
return;
}
#ifdef UCPTRIE_DEBUG
umutablecptrie_set(trie->t3, c, value, pErrorCode);
#endif
block=getDataBlock(trie, c, forLSCP);
if(block<0) {
@ -717,6 +757,9 @@ utrie2_setRange32(UTrie2 *trie,
*pErrorCode=U_NO_WRITE_PERMISSION;
return;
}
#ifdef UCPTRIE_DEBUG
umutablecptrie_setRange(newTrie->t3, start, end, value, pErrorCode);
#endif
if(!overwrite && value==newTrie->initialValue) {
return; /* nothing to do */
}
@ -732,7 +775,7 @@ utrie2_setRange32(UTrie2 *trie,
return;
}
nextStart=(start+UTRIE2_DATA_BLOCK_LENGTH)&~UTRIE2_DATA_MASK;
nextStart=(start+UTRIE2_DATA_MASK)&~UTRIE2_DATA_MASK;
if(nextStart<=limit) {
fillBlock(newTrie->data+block, start&UTRIE2_DATA_MASK, UTRIE2_DATA_BLOCK_LENGTH,
value, newTrie->initialValue, overwrite);
@ -983,6 +1026,10 @@ findHighStart(UNewTrie2 *trie, uint32_t highValue) {
*/
static void
compactData(UNewTrie2 *trie) {
#ifdef UTRIE2_DEBUG
int32_t countSame=0, sumOverlaps=0;
#endif
int32_t start, newStart, movedStart;
int32_t blockLength, overlap;
int32_t i, mapIndex, blockCount;
@ -1023,6 +1070,9 @@ compactData(UNewTrie2 *trie) {
if( (movedStart=findSameDataBlock(trie->data, newStart, start, blockLength))
>=0
) {
#ifdef UTRIE2_DEBUG
++countSame;
#endif
/* found an identical block, set the other block's index value for the current block */
for(i=blockCount, mapIndex=start>>UTRIE2_SHIFT_2; i>0; --i) {
trie->map[mapIndex++]=movedStart;
@ -1042,6 +1092,9 @@ compactData(UNewTrie2 *trie) {
overlap>0 && !equal_uint32(trie->data+(newStart-overlap), trie->data+start, overlap);
overlap-=UTRIE2_DATA_GRANULARITY) {}
#ifdef UTRIE2_DEBUG
sumOverlaps+=overlap;
#endif
if(overlap>0 || newStart<start) {
/* some overlap, or just move the whole block */
movedStart=newStart-overlap;
@ -1081,8 +1134,8 @@ compactData(UNewTrie2 *trie) {
#ifdef UTRIE2_DEBUG
/* we saved some space */
printf("compacting UTrie2: count of 32-bit data words %lu->%lu\n",
(long)trie->dataLength, (long)newStart);
printf("compacting UTrie2: count of 32-bit data words %lu->%lu countSame=%ld sumOverlaps=%ld\n",
(long)trie->dataLength, (long)newStart, (long)countSame, (long)sumOverlaps);
#endif
trie->dataLength=newStart;
@ -1163,7 +1216,7 @@ compactIndex2(UNewTrie2 *trie) {
#ifdef UTRIE2_DEBUG
/* we saved some space */
printf("compacting UTrie2: count of 16-bit index-2 words %lu->%lu\n",
printf("compacting UTrie2: count of 16-bit index words %lu->%lu\n",
(long)trie->index2Length, (long)newStart);
#endif
@ -1193,7 +1246,7 @@ compactTrie(UTrie2 *trie, UErrorCode *pErrorCode) {
trie->highStart=newTrie->highStart=highStart;
#ifdef UTRIE2_DEBUG
printf("UTrie2: highStart U+%04lx highValue 0x%lx initialValue 0x%lx\n",
printf("UTrie2: highStart U+%06lx highValue 0x%lx initialValue 0x%lx\n",
(long)highStart, (long)highValue, (long)trie->initialValue);
#endif
@ -1211,7 +1264,7 @@ compactTrie(UTrie2 *trie, UErrorCode *pErrorCode) {
compactIndex2(newTrie);
#ifdef UTRIE2_DEBUG
} else {
printf("UTrie2: highStart U+%04lx count of 16-bit index-2 words %lu->%lu\n",
printf("UTrie2: highStart U+%04lx count of 16-bit index words %lu->%lu\n",
(long)highStart, (long)trie->newTrie->index2Length, (long)UTRIE2_INDEX_1_OFFSET);
#endif
}
@ -1411,31 +1464,18 @@ utrie2_freeze(UTrie2 *trie, UTrie2ValueBits valueBits, UErrorCode *pErrorCode) {
return;
}
#ifdef UTRIE2_DEBUG
utrie2_printLengths(trie, "");
#endif
#ifdef UCPTRIE_DEBUG
umutablecptrie_setName(newTrie->t3, trie->name);
ucptrie_close(
umutablecptrie_buildImmutable(
newTrie->t3, UCPTRIE_TYPE_FAST, (UCPTrieValueWidth)valueBits, pErrorCode));
#endif
/* Delete the UNewTrie2. */
uprv_free(newTrie->data);
uprv_free(newTrie);
trie->newTrie=NULL;
}
/*
* This is here to avoid a dependency from utrie2.cpp on utrie.c.
* This file already depends on utrie.c.
* Otherwise, this should be in utrie2.cpp right after utrie2_swap().
*/
U_CAPI int32_t U_EXPORT2
utrie2_swapAnyVersion(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
if(U_SUCCESS(*pErrorCode)) {
switch(utrie2_getVersion(inData, length, TRUE)) {
case 1:
return utrie_swap(ds, inData, length, outData, pErrorCode);
case 2:
return utrie2_swap(ds, inData, length, outData, pErrorCode);
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
}
return 0;
}

@ -22,22 +22,20 @@
#ifndef __UTRIE2_IMPL_H__
#define __UTRIE2_IMPL_H__
#ifdef UCPTRIE_DEBUG
#include "unicode/umutablecptrie.h"
#endif
#include "utrie2.h"
/* Public UTrie2 API implementation ----------------------------------------- */
/*
* These definitions are mostly needed by utrie2.c,
* These definitions are mostly needed by utrie2.cpp,
* but also by utrie2_serialize() and utrie2_swap().
*/
/*
* UTrie and UTrie2 signature values,
* in platform endianness and opposite endianness.
*/
#define UTRIE_SIG 0x54726965
#define UTRIE_OE_SIG 0x65697254
// UTrie2 signature values, in platform endianness and opposite endianness.
// The UTrie2 signature ASCII byte values spell "Tri2".
#define UTRIE2_SIG 0x54726932
#define UTRIE2_OE_SIG 0x32697254
@ -145,6 +143,9 @@ struct UNewTrie2 {
int32_t index1[UNEWTRIE2_INDEX_1_LENGTH];
int32_t index2[UNEWTRIE2_MAX_INDEX_2_LENGTH];
uint32_t *data;
#ifdef UCPTRIE_DEBUG
UMutableCPTrie *t3;
#endif
uint32_t initialValue, errorValue;
int32_t index2Length, dataCapacity, dataLength;

@ -0,0 +1,344 @@
// © 2018 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// utrie_swap.cpp
// created: 2018aug08 Markus W. Scherer
#include "unicode/utypes.h"
#include "cmemory.h"
#include "ucptrie_impl.h"
#include "udataswp.h"
#include "utrie.h"
#include "utrie2_impl.h"
// These functions for swapping different generations of ICU code point tries are here
// so that their implementation files need not depend on swapper code,
// need not depend on each other, and so that other swapper code
// need not depend on other trie code.
namespace {
constexpr int32_t ASCII_LIMIT = 0x80;
} // namespace
U_CAPI int32_t U_EXPORT2
utrie_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
const UTrieHeader *inTrie;
UTrieHeader trie;
int32_t size;
UBool dataIs32;
if(pErrorCode==NULL || U_FAILURE(*pErrorCode)) {
return 0;
}
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
return 0;
}
/* setup and swapping */
if(length>=0 && (uint32_t)length<sizeof(UTrieHeader)) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
inTrie=(const UTrieHeader *)inData;
trie.signature=ds->readUInt32(inTrie->signature);
trie.options=ds->readUInt32(inTrie->options);
trie.indexLength=udata_readInt32(ds, inTrie->indexLength);
trie.dataLength=udata_readInt32(ds, inTrie->dataLength);
if( trie.signature!=0x54726965 ||
(trie.options&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_SHIFT ||
((trie.options>>UTRIE_OPTIONS_INDEX_SHIFT)&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_INDEX_SHIFT ||
trie.indexLength<UTRIE_BMP_INDEX_LENGTH ||
(trie.indexLength&(UTRIE_SURROGATE_BLOCK_COUNT-1))!=0 ||
trie.dataLength<UTRIE_DATA_BLOCK_LENGTH ||
(trie.dataLength&(UTRIE_DATA_GRANULARITY-1))!=0 ||
((trie.options&UTRIE_OPTIONS_LATIN1_IS_LINEAR)!=0 && trie.dataLength<(UTRIE_DATA_BLOCK_LENGTH+0x100))
) {
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
return 0;
}
dataIs32=(UBool)((trie.options&UTRIE_OPTIONS_DATA_IS_32_BIT)!=0);
size=sizeof(UTrieHeader)+trie.indexLength*2+trie.dataLength*(dataIs32?4:2);
if(length>=0) {
UTrieHeader *outTrie;
if(length<size) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
outTrie=(UTrieHeader *)outData;
/* swap the header */
ds->swapArray32(ds, inTrie, sizeof(UTrieHeader), outTrie, pErrorCode);
/* swap the index and the data */
if(dataIs32) {
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, trie.dataLength*4,
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
} else {
ds->swapArray16(ds, inTrie+1, (trie.indexLength+trie.dataLength)*2, outTrie+1, pErrorCode);
}
}
return size;
}
U_CAPI int32_t U_EXPORT2
utrie2_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
const UTrie2Header *inTrie;
UTrie2Header trie;
int32_t dataLength, size;
UTrie2ValueBits valueBits;
if(U_FAILURE(*pErrorCode)) {
return 0;
}
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
return 0;
}
/* setup and swapping */
if(length>=0 && length<(int32_t)sizeof(UTrie2Header)) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
inTrie=(const UTrie2Header *)inData;
trie.signature=ds->readUInt32(inTrie->signature);
trie.options=ds->readUInt16(inTrie->options);
trie.indexLength=ds->readUInt16(inTrie->indexLength);
trie.shiftedDataLength=ds->readUInt16(inTrie->shiftedDataLength);
valueBits=(UTrie2ValueBits)(trie.options&UTRIE2_OPTIONS_VALUE_BITS_MASK);
dataLength=(int32_t)trie.shiftedDataLength<<UTRIE2_INDEX_SHIFT;
if( trie.signature!=UTRIE2_SIG ||
valueBits<0 || UTRIE2_COUNT_VALUE_BITS<=valueBits ||
trie.indexLength<UTRIE2_INDEX_1_OFFSET ||
dataLength<UTRIE2_DATA_START_OFFSET
) {
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie2 */
return 0;
}
size=sizeof(UTrie2Header)+trie.indexLength*2;
switch(valueBits) {
case UTRIE2_16_VALUE_BITS:
size+=dataLength*2;
break;
case UTRIE2_32_VALUE_BITS:
size+=dataLength*4;
break;
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
if(length>=0) {
UTrie2Header *outTrie;
if(length<size) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
outTrie=(UTrie2Header *)outData;
/* swap the header */
ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);
/* swap the index and the data */
switch(valueBits) {
case UTRIE2_16_VALUE_BITS:
ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
break;
case UTRIE2_32_VALUE_BITS:
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
break;
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
}
return size;
}
U_CAPI int32_t U_EXPORT2
ucptrie_swap(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
const UCPTrieHeader *inTrie;
UCPTrieHeader trie;
int32_t dataLength, size;
UCPTrieValueWidth valueWidth;
if(U_FAILURE(*pErrorCode)) {
return 0;
}
if(ds==nullptr || inData==nullptr || (length>=0 && outData==nullptr)) {
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
return 0;
}
/* setup and swapping */
if(length>=0 && length<(int32_t)sizeof(UCPTrieHeader)) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
inTrie=(const UCPTrieHeader *)inData;
trie.signature=ds->readUInt32(inTrie->signature);
trie.options=ds->readUInt16(inTrie->options);
trie.indexLength=ds->readUInt16(inTrie->indexLength);
trie.dataLength = ds->readUInt16(inTrie->dataLength);
UCPTrieType type = (UCPTrieType)((trie.options >> 6) & 3);
valueWidth = (UCPTrieValueWidth)(trie.options & UCPTRIE_OPTIONS_VALUE_BITS_MASK);
dataLength = ((int32_t)(trie.options & UCPTRIE_OPTIONS_DATA_LENGTH_MASK) << 4) | trie.dataLength;
int32_t minIndexLength = type == UCPTRIE_TYPE_FAST ?
UCPTRIE_BMP_INDEX_LENGTH : UCPTRIE_SMALL_INDEX_LENGTH;
if( trie.signature!=UCPTRIE_SIG ||
type > UCPTRIE_TYPE_SMALL ||
(trie.options & UCPTRIE_OPTIONS_RESERVED_MASK) != 0 ||
valueWidth > UCPTRIE_VALUE_BITS_8 ||
trie.indexLength < minIndexLength ||
dataLength < ASCII_LIMIT
) {
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UCPTrie */
return 0;
}
size=sizeof(UCPTrieHeader)+trie.indexLength*2;
switch(valueWidth) {
case UCPTRIE_VALUE_BITS_16:
size+=dataLength*2;
break;
case UCPTRIE_VALUE_BITS_32:
size+=dataLength*4;
break;
case UCPTRIE_VALUE_BITS_8:
size+=dataLength;
break;
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
if(length>=0) {
UCPTrieHeader *outTrie;
if(length<size) {
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
return 0;
}
outTrie=(UCPTrieHeader *)outData;
/* swap the header */
ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);
/* swap the index and the data */
switch(valueWidth) {
case UCPTRIE_VALUE_BITS_16:
ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
break;
case UCPTRIE_VALUE_BITS_32:
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
break;
case UCPTRIE_VALUE_BITS_8:
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
if(inTrie!=outTrie) {
uprv_memmove((uint16_t *)(outTrie+1)+trie.indexLength, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength);
}
break;
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
}
return size;
}
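The header decoding and size computation in `ucptrie_swap` can be sketched on their own. This is a minimal standalone sketch, not the ICU code: `sketchSerializedSize` is a hypothetical helper, and the mask values, the 16-byte header size, and the numeric values of the type and value-width enums are assumptions about `ucptrie_impl.h` and `ucptrie.h` that should be checked against the real headers.

```cpp
#include <cstdint>

// Assumed constants; the real ones live in ICU's ucptrie_impl.h.
constexpr uint16_t kDataLengthMask = 0xf000;  // bits 19..16 of the data length
constexpr uint16_t kReservedMask = 0x38;      // must be zero
constexpr uint16_t kValueBitsMask = 7;        // value width selector
constexpr int32_t kHeaderSize = 16;           // 4-byte signature + six 16-bit fields

// Hypothetical helper: returns the serialized size in bytes, or -1 if the
// options word is not recognized. Assumes value widths 0=16-bit, 1=32-bit,
// 2=8-bit and trie types 0=fast, 1=small.
int32_t sketchSerializedSize(uint16_t options, uint16_t indexLength, uint16_t dataLengthLow) {
    int32_t type = (options >> 6) & 3;
    int32_t valueWidth = options & kValueBitsMask;
    // The data length is 20 bits wide: 4 high bits come from the options word.
    int32_t dataLength =
        (static_cast<int32_t>(options & kDataLengthMask) << 4) | dataLengthLow;
    if (type > 1 || (options & kReservedMask) != 0 || valueWidth > 2) {
        return -1;
    }
    static const int32_t kBytesPerValue[3] = {2, 4, 1};
    // The index is always an array of 16-bit units; the data array follows it.
    return kHeaderSize + indexLength * 2 + dataLength * kBytesPerValue[valueWidth];
}
```

Because the data array starts `indexLength` 16-bit units after the header, any pointer arithmetic that locates it has to be done on `uint16_t *`, not on the header pointer type.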
namespace {
/**
* Gets the trie version from 32-bit-aligned memory containing the serialized form
* of a UTrie (version 1), a UTrie2 (version 2), or a UCPTrie (version 3).
*
* @param data a pointer to 32-bit-aligned memory containing the serialized form of a trie
* @param length the number of bytes available at data;
* can be more than necessary (see return value)
* @param anyEndianOk If FALSE, only platform-endian serialized forms are recognized.
* If TRUE, opposite-endian serialized forms are recognized as well.
* @return the trie version of the serialized form, or 0 if it is not
* recognized as a serialized trie
*/
int32_t
getVersion(const void *data, int32_t length, UBool anyEndianOk) {
uint32_t signature;
if(length<16 || data==nullptr || (U_POINTER_MASK_LSB(data, 3)!=0)) {
return 0;
}
signature=*(const uint32_t *)data;
if(signature==UCPTRIE_SIG) {
return 3;
}
if(anyEndianOk && signature==UCPTRIE_OE_SIG) {
return 3;
}
if(signature==UTRIE2_SIG) {
return 2;
}
if(anyEndianOk && signature==UTRIE2_OE_SIG) {
return 2;
}
if(signature==UTRIE_SIG) {
return 1;
}
if(anyEndianOk && signature==UTRIE_OE_SIG) {
return 1;
}
return 0;
}
} // namespace
U_CAPI int32_t U_EXPORT2
utrie_swapAnyVersion(const UDataSwapper *ds,
const void *inData, int32_t length, void *outData,
UErrorCode *pErrorCode) {
if(U_FAILURE(*pErrorCode)) { return 0; }
switch(getVersion(inData, length, TRUE)) {
case 1:
return utrie_swap(ds, inData, length, outData, pErrorCode);
case 2:
return utrie2_swap(ds, inData, length, outData, pErrorCode);
case 3:
return ucptrie_swap(ds, inData, length, outData, pErrorCode);
default:
*pErrorCode=U_INVALID_FORMAT_ERROR;
return 0;
}
}
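The signature check that drives `utrie_swapAnyVersion` can be illustrated without ICU. The following is a minimal sketch under stated assumptions, not the library code: `sketchGetVersion` and `byteSwap32` are hypothetical names, the version 1 and 2 signatures are the values used in the removed `GetVersionTest`, and `0x54726933` as the UCPTrie signature is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-reverses a 32-bit value, giving the opposite-endian form of a signature.
constexpr uint32_t byteSwap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

// Hypothetical stand-in for getVersion(): returns 1, 2, or 3 for a recognized
// trie signature, or 0 otherwise.
int32_t sketchGetVersion(const void *data, int32_t length, bool anyEndianOk) {
    // Index v holds the signature for trie version v+1.
    // The third value (UCPTrie) is an assumption.
    static const uint32_t kSignatures[3] = {0x54726965, 0x54726932, 0x54726933};
    if (data == nullptr || length < 16 ||
            (reinterpret_cast<uintptr_t>(data) & 3) != 0) {
        return 0;  // missing, too short, or not 32-bit-aligned
    }
    uint32_t signature;
    std::memcpy(&signature, data, sizeof(signature));
    for (int32_t v = 0; v < 3; ++v) {
        if (signature == kSignatures[v] ||
                (anyEndianOk && signature == byteSwap32(kSignatures[v]))) {
            return v + 1;
        }
    }
    return 0;
}
```

A caller would switch on the returned version and hand the data to the matching swapper, reporting a format error for 0.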

@ -557,7 +557,10 @@ UTS46::processUnicode(const UnicodeString &src,
destArray=dest.getBuffer();
destLength+=newLength-labelLength;
labelLimit=labelStart+=newLength+1;
} else if(0xdf<=c && c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
continue;
} else if(c<0xdf) {
// pass
} else if(c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
info.isTransDiff=TRUE;
if(doMapDevChars) {
destLength=mapDevChars(dest, labelStart, labelLimit, errorCode);
@ -565,15 +568,23 @@ UTS46::processUnicode(const UnicodeString &src,
return dest;
}
destArray=dest.getBuffer();
// Do not increment labelLimit in case c was removed.
// All deviation characters have been mapped, no need to check for them again.
doMapDevChars=FALSE;
} else {
++labelLimit;
// Do not increment labelLimit in case c was removed.
continue;
}
} else if(U16_IS_SURROGATE(c)) {
if(U16_IS_SURROGATE_LEAD(c) ?
(labelLimit+1)==destLength || !U16_IS_TRAIL(destArray[labelLimit+1]) :
labelLimit==labelStart || !U16_IS_LEAD(destArray[labelLimit-1])) {
// Map an unpaired surrogate to U+FFFD before normalization so that when
// that removes characters we do not turn two unpaired ones into a pair.
info.labelErrors|=UIDNA_ERROR_DISALLOWED;
dest.setCharAt(labelLimit, 0xfffd);
destArray=dest.getBuffer();
}
} else {
++labelLimit;
}
++labelLimit;
}
// Permit an empty label at the end (0<labelStart==labelLimit==destLength is ok)
// but not an empty label elsewhere nor a completely empty domain name.
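The reason for mapping unpaired surrogates to U+FFFD before normalization is that removing a character between a lone lead and a lone trail would otherwise join them into a valid pair. A minimal standalone sketch of that pre-pass, assuming a plain `std::u16string` instead of `UnicodeString` (the function and helper names are hypothetical):

```cpp
#include <cstddef>
#include <string>

inline bool isLead(char16_t c) { return (c & 0xfc00) == 0xd800; }
inline bool isTrail(char16_t c) { return (c & 0xfc00) == 0xdc00; }

// Replaces every unpaired surrogate in s with U+FFFD.
// Returns the number of code units replaced.
int replaceUnpairedSurrogates(std::u16string &s) {
    int count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t c = s[i];
        if (isLead(c)) {
            if (i + 1 < s.size() && isTrail(s[i + 1])) {
                ++i;  // well-formed pair: skip its trail surrogate
            } else {
                s[i] = 0xfffd;
                ++count;
            }
        } else if (isTrail(c)) {
            // A trail reached here was not consumed as part of a pair.
            s[i] = 0xfffd;
            ++count;
        }
    }
    return count;
}
```

After this pass the string contains no lone surrogates, so deleting any other code unit later cannot create a new supplementary code point.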

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

@ -4305,7 +4305,7 @@ D7A4..D7AF >FFFD # NA <reserved-D7A4>..<reserved-D7AF>
D7C7..D7CA >FFFD # NA <reserved-D7C7>..<reserved-D7CA>
# D7CB..D7FB valid # 5.2 HANGUL JONGSEONG NIEUN-RIEUL..HANGUL JONGSEONG PHIEUPH-THIEUTH
D7FC..D7FF >FFFD # NA <reserved-D7FC>..<reserved-D7FF>
D800..DFFF >FFFD # 2.0 <surrogate-D800>..<surrogate-DFFF>
# D800..DFFF >FFFD # 2.0 <surrogate-D800>..<surrogate-DFFF>
E000..F8FF >FFFD # 1.1 <private-use-E000>..<private-use-F8FF>
F900 >8C48 # 1.1 CJK COMPATIBILITY IDEOGRAPH-F900
F901 >66F4 # 1.1 CJK COMPATIBILITY IDEOGRAPH-F901

@ -20,7 +20,7 @@
#include "unicode/uspoof.h"
#include "unicode/uscript.h"
#include "unicode/udata.h"
#include "udataswp.h"
#include "utrie2.h"
#if !UCONFIG_NO_NORMALIZATION

@ -48,7 +48,7 @@ cnmdptst.o cnormtst.o cnumtst.o crelativedateformattest.o crestst.o creststn.o c
cucdapi.o cucdtst.o custrtst.o cstrcase.o cutiltst.o nucnvtst.o nccbtst.o bocu1tst.o \
cbiditst.o cbididat.o eurocreg.o udatatst.o utf16tst.o utransts.o \
ncnvfbts.o ncnvtst.o putiltst.o cstrtest.o udatpg_test.o utf8tst.o \
stdnmtst.o usrchtst.o custrtrn.o sorttest.o trietest.o trie2test.o usettest.o \
stdnmtst.o usrchtst.o custrtrn.o sorttest.o trietest.o trie2test.o ucptrietest.o usettest.o \
uenumtst.o utmstest.o currtest.o \
idnatest.o nfsprep.o spreptst.o sprpdata.o \
hpmufn.o tracetst.o reapits.o uregiontest.o ulistfmttest.o\

@ -182,6 +182,7 @@
<ClCompile Include="sorttest.c" />
<ClCompile Include="trie2test.c" />
<ClCompile Include="trietest.c" />
<ClCompile Include="ucptrietest.c" />
<ClCompile Include="uenumtst.c" />
<ClCompile Include="bocu1tst.c" />
<ClCompile Include="ccapitst.c" />
@ -284,4 +285,4 @@
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
<ImportGroup Label="ExtensionTargets">
</ImportGroup>
</Project>
</Project>

@ -123,6 +123,9 @@
<ClCompile Include="trietest.c">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="ucptrietest.c">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="uenumtst.c">
<Filter>collections</Filter>
</ClCompile>
@ -417,4 +420,4 @@
<Filter>sprep &amp; idna</Filter>
</ClInclude>
</ItemGroup>
</Project>
</Project>

@ -27,6 +27,7 @@ void addHashtableTest(TestNode** root);
void addCStringTest(TestNode** root);
void addTrieTest(TestNode** root);
void addTrie2Test(TestNode** root);
void addUCPTrieTest(TestNode** root);
void addEnumerationTest(TestNode** root);
void addPosixTest(TestNode** root);
void addSortTest(TestNode** root);
@ -38,6 +39,7 @@ void addUtility(TestNode** root)
addCStringTest(root);
addTrieTest(root);
addTrie2Test(root);
addUCPTrieTest(root);
addLocaleTest(root);
addCLDRTest(root);
addUnicodeTest(root);

@ -421,7 +421,7 @@ testTrieUTF8(const char *testName,
prevCP=c;
--c; /* end of the range */
U8_APPEND_UNSAFE(s, length, c);
if(U_IS_SURROGATE(prevCP)) {
if(U_IS_SURROGATE(c)) {
// A surrogate byte sequence counts as 3 single-byte errors.
values[countValues++]=errorValue;
values[countValues++]=errorValue;
@ -1287,31 +1287,6 @@ GrowDataArrayTest(void) {
/* versions 1 and 2 --------------------------------------------------------- */
static void
GetVersionTest(void) {
uint32_t data[4];
if( /* version 1 */
(data[0]=0x54726965, 1!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
(data[0]=0x54726965, 1!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
(data[0]=0x65697254, 0!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
(data[0]=0x65697254, 1!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
/* version 2 */
(data[0]=0x54726932, 2!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
(data[0]=0x54726932, 2!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
(data[0]=0x32697254, 0!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
(data[0]=0x32697254, 2!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
/* illegal arguments */
(data[0]=0x54726932, 0!=utrie2_getVersion(NULL, sizeof(data), FALSE)) ||
(data[0]=0x54726932, 0!=utrie2_getVersion(data, 3, FALSE)) ||
(data[0]=0x54726932, 0!=utrie2_getVersion((char *)data+1, sizeof(data), FALSE)) ||
/* unknown signature values */
(data[0]=0x11223344, 0!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
(data[0]=0x54726933, 0!=utrie2_getVersion(data, sizeof(data), FALSE))
) {
log_err("error: utrie2_getVersion() is not working as expected\n");
}
}
static UNewTrie *
makeNewTrie1WithRanges(const char *testName,
const SetRange setRanges[], int32_t countSetRanges,
@ -1455,6 +1430,5 @@ addTrie2Test(TestNode** root) {
addTest(root, &DummyTrieTest, "tsutil/trie2test/DummyTrieTest");
addTest(root, &FreeBlocksTest, "tsutil/trie2test/FreeBlocksTest");
addTest(root, &GrowDataArrayTest, "tsutil/trie2test/GrowDataArrayTest");
addTest(root, &GetVersionTest, "tsutil/trie2test/GetVersionTest");
addTest(root, &Trie12ConversionTest, "tsutil/trie2test/Trie12ConversionTest");
}

File diff suppressed because it is too large

@ -633,6 +633,29 @@ BasicNormalizerTest::TestPreviousNext(const UChar *src, int32_t srcLength,
const char *moves,
UNormalizationMode mode,
const char *name) {
// Sanity check non-iterative normalization.
{
IcuTestErrorCode errorCode(*this, "TestPreviousNext");
UnicodeString result;
Normalizer::normalize(UnicodeString(src, srcLength), mode, 0, result, errorCode);
if (errorCode.isFailure()) {
dataerrln("error: non-iterative normalization of %s failed: %s",
name, errorCode.errorName());
errorCode.reset();
return;
}
// UnicodeString::fromUTF32(expect, expectLength)
// would turn unpaired surrogates into U+FFFD.
for (int32_t i = 0, j = 0; i < result.length(); ++j) {
UChar32 c = result.char32At(i);
if (c != expect[j]) {
errln("error: non-iterative normalization of %s did not yield the expected result",
name);
}
i += U16_LENGTH(c);
}
}
// iterators
Normalizer iter(src, srcLength, mode);
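The sanity check above compares a UTF-16 result against a UTF-32 `expect` array one code point at a time, keeping unpaired surrogates as they are. A standalone sketch of that comparison follows, using standard containers instead of ICU types. `matchesCodePoints` is a hypothetical name, and the explicit length checks are an addition in this sketch, not something the test code performs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if the UTF-16 string decodes to exactly the expected code points.
// An unpaired surrogate is compared as its own code point value.
bool matchesCodePoints(const std::u16string &s, const std::vector<char32_t> &expect) {
    std::size_t i = 0, j = 0;
    while (i < s.size()) {
        char32_t c = s[i++];
        // Combine a lead surrogate with a following trail surrogate.
        if ((c & 0xfffffc00) == 0xd800 && i < s.size() && (s[i] & 0xfc00) == 0xdc00) {
            c = 0x10000 + ((c - 0xd800) << 10) + (s[i++] - 0xdc00);
        }
        if (j == expect.size() || c != expect[j]) {
            return false;  // too many code points, or a mismatch
        }
        ++j;
    }
    return j == expect.size();  // false if the result is too short
}
```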
@ -1432,9 +1455,14 @@ struct StringPair { const char *input, *expected; };
void
BasicNormalizerTest::TestCustomComp() {
static const StringPair pairs[]={
{ "\\uD801\\uE000\\uDFFE", "" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
// ICU 63 normalization with UCPTrie requires inert surrogate code points.
// { "\\uD801\\uE000\\uDFFE", "" },
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE002\\U000110B9\\u0327\\u0345" },
{ "\\uE010\\U000F0011\\uE012", "\\uE011\\uE012" },
{ "\\uE010\\U000F0011\\U000F0011\\uE012", "\\uE011\\U000F0010" },
@ -1462,9 +1490,14 @@ BasicNormalizerTest::TestCustomComp() {
void
BasicNormalizerTest::TestCustomFCC() {
static const StringPair pairs[]={
{ "\\uD801\\uE000\\uDFFE", "" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
// ICU 63 normalization with UCPTrie requires inert surrogate code points.
// { "\\uD801\\uE000\\uDFFE", "" },
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
// The following expected result is different from CustomComp
// because of only-contiguous composition.
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE001\\U000110B9\\u0327\\u0308\\u0345" },

@ -17,17 +17,20 @@ include $(top_builddir)/icudefs.mk
subdir = test/perf/normperf
## Extra files to remove for 'make clean'
CLEANFILES = *~ $(DEPS)
CLEANFILES = *~ $(DEPS) $(SIMPLE_DEPS)
## Target information
TARGET = normperf
SIMPLE = simplenormperf
CPPFLAGS += -I$(top_srcdir)/common -I$(top_srcdir)/tools/toolutil -I$(top_srcdir)/tools/ctestfw
LIBS = $(LIBCTESTFW) $(LIBICUI18N) $(LIBICUUC) $(LIBICUTOOLUTIL) $(DEFAULT_LIBS) $(LIB_M)
OBJECTS = normperf.o
SIMPLE_OBJ = simplenormperf.o
DEPS = $(OBJECTS:.o=.d)
SIMPLE_DEPS = $(SIMPLE_OBJ:.o=.d)
## List of phony targets
.PHONY : all all-local install install-local clean clean-local \
@ -44,7 +47,7 @@ distclean : distclean-local
dist: dist-local
check: all check-local
all-local: $(TARGET)
all-local: $(TARGET) $(SIMPLE)
install-local:
@ -52,7 +55,7 @@ dist-local:
clean-local:
test -z "$(CLEANFILES)" || $(RMV) $(CLEANFILES)
$(RMV) $(OBJECTS) $(TARGET)
$(RMV) $(OBJECTS) $(SIMPLE_OBJ) $(TARGET) $(SIMPLE)
distclean-local: clean-local
$(RMV) Makefile
@ -67,16 +70,21 @@ $(TARGET) : $(OBJECTS)
$(LINK.cc) -o $@ $^ $(LIBS)
$(POST_BUILD_STEP)
$(SIMPLE) : $(SIMPLE_OBJ)
$(LINK.cc) -o $@ $^ $(LIBS)
$(POST_BUILD_STEP)
invoke:
ICU_DATA=$${ICU_DATA:-$(top_builddir)/data/} TZ=PST8PDT $(INVOKE) $(INVOCATION)
ifeq (,$(MAKECMDGOALS))
-include $(DEPS)
-include $(SIMPLE_DEPS)
else
ifneq ($(patsubst %clean,,$(MAKECMDGOALS)),)
ifneq ($(patsubst %install,,$(MAKECMDGOALS)),)
-include $(DEPS)
-include $(SIMPLE_DEPS)
endif
endif
endif

@ -0,0 +1,352 @@
// © 2018 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html
// simplenormperf.cpp
// created: 2018mar15 Markus W. Scherer
#include <stdio.h>
#include <string>
#include "unicode/utypes.h"
#include "unicode/bytestream.h"
#include "unicode/normalizer2.h"
#include "unicode/stringpiece.h"
#include "unicode/unistr.h"
#include "unicode/utf8.h"
#include "unicode/utimer.h"
#include "cmemory.h"
using icu::Normalizer2;
using icu::UnicodeString;
namespace {
// Strings with commonly occurring BMP characters.
class CommonChars {
public:
static UnicodeString getMixed(int32_t minLength) {
return extend(UnicodeString(latin1).append(japanese).append(arabic), minLength);
}
static UnicodeString getLatin1(int32_t minLength) { return extend(latin1, minLength); }
static UnicodeString getLowercaseLatin1(int32_t minLength) { return extend(lowercaseLatin1, minLength); }
static UnicodeString getASCII(int32_t minLength) { return extend(ascii, minLength); }
static UnicodeString getJapanese(int32_t minLength) { return extend(japanese, minLength); }
// Returns an array of UTF-8 offsets, one per code point.
// Assumes all BMP characters.
static int32_t *toUTF8WithOffsets(const UnicodeString &s16, std::string &s8, int32_t &numCodePoints) {
s8.clear();
s8.reserve(s16.length());
s16.toUTF8String(s8);
const char *s = s8.data();
int32_t length = s8.length();
int32_t *offsets = new int32_t[length + 1];
int32_t numCP = 0;
for (int32_t i = 0; i < length;) {
offsets[numCP++] = i;
U8_FWD_1(s, i, length);
}
offsets[numCP] = length;
numCodePoints = numCP;
return offsets;
}
private:
static UnicodeString extend(const UnicodeString &s, int32_t minLength) {
UnicodeString result(s);
while (result.length() < minLength) {
UnicodeString twice = result + result;
result = std::move(twice);
}
return result;
}
static const UChar *const latin1;
static const UChar *const lowercaseLatin1;
static const UChar *const ascii;
static const UChar *const japanese;
static const UChar *const arabic;
};
const UChar *const CommonChars::latin1 =
// Goethes Bergschloß in normal sentence case.
u"Da droben auf jenem Berge, da steht ein altes Schloß, "
u"wo hinter Toren und Türen sonst lauerten Ritter und Roß.\n"
u"Verbrannt sind Türen und Tore, und überall ist es so still; "
u"das alte verfallne Gemäuer durchklettr ich, wie ich nur will.\n"
u"Hierneben lag ein Keller, so voll von köstlichem Wein; "
u"nun steiget nicht mehr mit Krügen die Kellnerin heiter hinein.\n"
u"Sie setzt den Gästen im Saale nicht mehr die Becher umher, "
u"sie füllt zum Heiligen Mahle dem Pfaffen das Fläschchen nicht mehr.\n"
u"Sie reicht dem lüsternen Knappen nicht mehr auf dem Gange den Trank, "
u"und nimmt für flüchtige Gabe nicht mehr den flüchtigen Dank.\n"
u"Denn alle Balken und Decken, sie sind schon lange verbrannt, "
u"und Trepp und Gang und Kapelle in Schutt und Trümmer verwandt.\n"
u"Doch als mit Zither und Flasche nach diesen felsigen Höhn "
u"ich an dem heitersten Tage mein Liebchen steigen gesehn,\n"
u"da drängte sich frohes Behagen hervor aus verödeter Ruh, "
u"da gings wie in alten Tagen recht feierlich wieder zu.\n"
u"Als wären für stattliche Gäste die weitesten Räume bereit, "
u"als käm ein Pärchen gegangen aus jener tüchtigen Zeit.\n"
u"Als stünd in seiner Kapelle der würdige Pfaffe schon da "
u"und fragte: Wollt ihr einander? Wir aber lächelten: Ja!\n"
u"Und tief bewegten Gesänge des Herzens innigsten Grund, "
u"Es zeugte, statt der Menge, der Echo schallender Mund.\n"
u"Und als sich gegen Abend im stillen alles verlor,"
u"da blickte die glühende Sonne zum schroffen Gipfel empor.\n"
u"Und Knapp und Kellnerin glänzen als Herren weit und breit; "
u"sie nimmt sich zum Kredenzen und er zum Danke sich Zeit.\n";
const UChar *const CommonChars::lowercaseLatin1 =
// Goethes Bergschloß in all lowercase
u"da droben auf jenem berge, da steht ein altes schloß, "
u"wo hinter toren und türen sonst lauerten ritter und roß.\n"
u"verbrannt sind türen und tore, und überall ist es so still; "
u"das alte verfallne gemäuer durchklettr ich, wie ich nur will.\n"
u"hierneben lag ein keller, so voll von köstlichem wein; "
u"nun steiget nicht mehr mit krügen die kellnerin heiter hinein.\n"
u"sie setzt den gästen im saale nicht mehr die becher umher, "
u"sie füllt zum heiligen mahle dem pfaffen das fläschchen nicht mehr.\n"
u"sie reicht dem lüsternen knappen nicht mehr auf dem gange den trank, "
u"und nimmt für flüchtige gabe nicht mehr den flüchtigen dank.\n"
u"denn alle balken und decken, sie sind schon lange verbrannt, "
u"und trepp und gang und kapelle in schutt und trümmer verwandt.\n"
u"doch als mit zither und flasche nach diesen felsigen höhn "
u"ich an dem heitersten tage mein liebchen steigen gesehn,\n"
u"da drängte sich frohes behagen hervor aus verödeter ruh, "
u"da gings wie in alten tagen recht feierlich wieder zu.\n"
u"als wären für stattliche gäste die weitesten räume bereit, "
u"als käm ein pärchen gegangen aus jener tüchtigen zeit.\n"
u"als stünd in seiner kapelle der würdige pfaffe schon da "
u"und fragte: wollt ihr einander? wir aber lächelten: ja!\n"
u"und tief bewegten gesänge des herzens innigsten grund, "
u"es zeugte, statt der menge, der echo schallender mund.\n"
u"und als sich gegen abend im stillen alles verlor,"
u"da blickte die glühende sonne zum schroffen gipfel empor.\n"
u"und knapp und kellnerin glänzen als herren weit und breit; "
u"sie nimmt sich zum kredenzen und er zum danke sich zeit.\n";
const UChar *const CommonChars::ascii =
// Goethes Bergschloß in normal sentence case but ASCII-fied
u"Da droben auf jenem Berge, da steht ein altes Schloss, "
u"wo hinter Toren und Tueren sonst lauerten Ritter und Ross.\n"
u"Verbrannt sind Tueren und Tore, und ueberall ist es so still; "
u"das alte verfallne Gemaeuer durchklettr ich, wie ich nur will.\n"
u"Hierneben lag ein Keller, so voll von koestlichem Wein; "
u"nun steiget nicht mehr mit Kruegen die Kellnerin heiter hinein.\n"
u"Sie setzt den Gaesten im Saale nicht mehr die Becher umher, "
u"sie fuellt zum Heiligen Mahle dem Pfaffen das Flaeschchen nicht mehr.\n"
u"Sie reicht dem luesternen Knappen nicht mehr auf dem Gange den Trank, "
u"und nimmt fuer fluechtige Gabe nicht mehr den fluechtigen Dank.\n"
u"Denn alle Balken und Decken, sie sind schon lange verbrannt, "
u"und Trepp und Gang und Kapelle in Schutt und Truemmer verwandt.\n"
u"Doch als mit Zither und Flasche nach diesen felsigen Hoehn "
u"ich an dem heitersten Tage mein Liebchen steigen gesehn,\n"
u"da draengte sich frohes Behagen hervor aus veroedeter Ruh, "
u"da gings wie in alten Tagen recht feierlich wieder zu.\n"
u"Als waeren fuer stattliche Gaeste die weitesten Raeume bereit, "
u"als kaem ein Paerchen gegangen aus jener tuechtigen Zeit.\n"
u"Als stuend in seiner Kapelle der wuerdige Pfaffe schon da "
u"und fragte: Wollt ihr einander? Wir aber laechelten: Ja!\n"
u"Und tief bewegten Gesaenge des Herzens innigsten Grund, "
u"Es zeugte, statt der Menge, der Echo schallender Mund.\n"
u"Und als sich gegen Abend im stillen alles verlor,"
u"da blickte die gluehende Sonne zum schroffen Gipfel empor.\n"
u"Und Knapp und Kellnerin glaenzen als Herren weit und breit; "
u"sie nimmt sich zum Kredenzen und er zum Danke sich Zeit.\n";
const UChar *const CommonChars::japanese =
// Ame ni mo makezu = Be not Defeated by the Rain, by Kenji Miyazawa.
u"雨にもまけず風にもまけず雪にも夏の暑さにもまけぬ"
u"丈夫なからだをもち慾はなく決して瞋らず"
u"いつもしずかにわらっている一日に玄米四合と"
u"味噌と少しの野菜をたべあらゆることを"
u"じぶんをかんじょうにいれずによくみききしわかり"
u"そしてわすれず野原の松の林の蔭の"
u"小さな萱ぶきの小屋にいて東に病気のこどもあれば"
u"行って看病してやり西につかれた母あれば"
u"行ってその稲の束を負い南に死にそうな人あれば"
u"行ってこわがらなくてもいいといい"
u"北にけんかやそしょうがあれば"
u"つまらないからやめろといいひでりのときはなみだをながし"
u"さむさのなつはおろおろあるきみんなにでくのぼうとよばれ"
u"ほめられもせずくにもされずそういうものにわたしはなりたい";
const UChar *const CommonChars::arabic =
// Some Arabic for variety. "What is Unicode?"
// http://www.unicode.org/standard/translations/arabic.html
u"تتعامل الحواسيب بالأسام مع الأرقام فقط، "
u"و تخزن الحروف و المحارف "
u"الأخرى بتخصيص رقم لكل واحد "
u"منها. قبل اختراع يونيكود كان هناك ";
// TODO: class BenchmarkPerCodePoint?
class Operation {
public:
Operation() {}
virtual ~Operation();
virtual double call(int32_t iterations, int32_t pieceLength) = 0;
protected:
UTimer startTime;
};
Operation::~Operation() {}
const int32_t kLengths[] = { 5, 12, 30, 100, 1000, 10000 };
int32_t getMaxLength() { return kLengths[UPRV_LENGTHOF(kLengths) - 1]; }
// Returns seconds per code point.
double measure(Operation &op, int32_t pieceLength) {
// Increase the number of iterations until we use at least one second.
int32_t iterations = 1;
for (;;) {
double seconds = op.call(iterations, pieceLength);
if (seconds >= 1) {
if (iterations > 1) {
return seconds / (iterations * pieceLength);
} else {
// Run it once more, to avoid measuring only the warm-up.
return op.call(1, pieceLength) / (iterations * pieceLength);
}
}
if (seconds < 0.01) {
iterations *= 10;
} else if (seconds < 0.55) {
iterations *= 1.1 / seconds;
} else {
iterations *= 2;
}
}
}
void benchmark(const char *name, Operation &op) {
for (int32_t i = 0; i < UPRV_LENGTHOF(kLengths); ++i) {
int32_t pieceLength = kLengths[i];
double secPerCp = measure(op, pieceLength);
printf("%s %6d %12f ns/cp\n", name, (int)pieceLength, secPerCp * 1000000000);
}
puts("");
}
class NormalizeUTF16 : public Operation {
public:
NormalizeUTF16(const Normalizer2 &n2, const UnicodeString &text) :
norm2(n2), src(text), s(src.getBuffer()) {}
virtual ~NormalizeUTF16();
virtual double call(int32_t iterations, int32_t pieceLength);
private:
const Normalizer2 &norm2;
UnicodeString src;
const UChar *s;
UnicodeString dest;
};
NormalizeUTF16::~NormalizeUTF16() {}
// Assumes all BMP characters.
double NormalizeUTF16::call(int32_t iterations, int32_t pieceLength) {
int32_t start = 0;
int32_t limit = src.length() - pieceLength;
UnicodeString piece;
UErrorCode errorCode = U_ZERO_ERROR;
utimer_getTime(&startTime);
for (int32_t i = 0; i < iterations; ++i) {
piece.setTo(FALSE, s + start, pieceLength);
norm2.normalize(piece, dest, errorCode);
start = (start + pieceLength) % limit;
}
return utimer_getElapsedSeconds(&startTime);
}
class NormalizeUTF8 : public Operation {
public:
NormalizeUTF8(const Normalizer2 &n2, const UnicodeString &text) : norm2(n2), sink(&dest) {
offsets = CommonChars::toUTF8WithOffsets(text, src, numCodePoints);
s = src.data();
}
virtual ~NormalizeUTF8();
virtual double call(int32_t iterations, int32_t pieceLength);
private:
const Normalizer2 &norm2;
std::string src;
const char *s;
int32_t *offsets;
int32_t numCodePoints;
std::string dest;
icu::StringByteSink<std::string> sink;
};
NormalizeUTF8::~NormalizeUTF8() {
delete[] offsets;
}
double NormalizeUTF8::call(int32_t iterations, int32_t pieceLength) {
int32_t start = 0;
int32_t limit = numCodePoints - pieceLength;
UErrorCode errorCode = U_ZERO_ERROR;
utimer_getTime(&startTime);
for (int32_t i = 0; i < iterations; ++i) {
int32_t start8 = offsets[start];
int32_t limit8 = offsets[start + pieceLength];
icu::StringPiece piece(s + start8, limit8 - start8);
norm2.normalizeUTF8(0, piece, sink, nullptr, errorCode);
start = (start + pieceLength) % limit;
}
return utimer_getElapsedSeconds(&startTime);
}
} // namespace
extern int main(int /*argc*/, const char * /*argv*/[]) {
// More than the longest piece length so that we read from different parts of the string
// for that piece length.
int32_t maxLength = getMaxLength() * 10;
UErrorCode errorCode = U_ZERO_ERROR;
const Normalizer2 *nfc = Normalizer2::getNFCInstance(errorCode);
const Normalizer2 *nfkc_cf = Normalizer2::getNFKCCasefoldInstance(errorCode);
if (U_FAILURE(errorCode)) {
fprintf(stderr,
"simplenormperf: failed to get Normalizer2 instances - %s\n",
u_errorName(errorCode));
return 1;
}
{
// Baseline: Should remain in the fast loop without trie lookups.
NormalizeUTF16 op(*nfc, CommonChars::getLatin1(maxLength));
benchmark("NFC/UTF-16/latin1", op);
}
{
// Baseline 2: Read UTF-8, trie lookups, but should have nothing to do.
NormalizeUTF8 op(*nfc, CommonChars::getJapanese(maxLength));
benchmark("NFC/UTF-8/japanese", op);
}
{
NormalizeUTF16 op(*nfkc_cf, CommonChars::getMixed(maxLength));
benchmark("NFKC_CF/UTF-16/mixed", op);
}
{
NormalizeUTF16 op(*nfkc_cf, CommonChars::getLowercaseLatin1(maxLength));
benchmark("NFKC_CF/UTF-16/lowercaseLatin1", op);
}
{
NormalizeUTF16 op(*nfkc_cf, CommonChars::getJapanese(maxLength));
benchmark("NFKC_CF/UTF-16/japanese", op);
}
{
NormalizeUTF8 op(*nfkc_cf, CommonChars::getMixed(maxLength));
benchmark("NFKC_CF/UTF-8/mixed", op);
}
{
NormalizeUTF8 op(*nfkc_cf, CommonChars::getLowercaseLatin1(maxLength));
benchmark("NFKC_CF/UTF-8/lowercaseLatin1", op);
}
{
NormalizeUTF8 op(*nfkc_cf, CommonChars::getJapanese(maxLength));
benchmark("NFKC_CF/UTF-8/japanese", op);
}
return 0;
}
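The per-code-point offset table that `toUTF8WithOffsets` builds is what lets the UTF-8 benchmark cut pieces of a given code point length out of the byte string. A minimal sketch of the same idea without ICU macros, assuming well-formed UTF-8 input and returning a `std::vector` rather than a `new[]` array (`utf8Offsets` is a hypothetical name):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte offset of each code point in well-formed UTF-8,
// followed by one final entry equal to s8.size().
std::vector<std::size_t> utf8Offsets(const std::string &s8) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < s8.size(); ++i) {
        // Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        if ((static_cast<unsigned char>(s8[i]) & 0xc0) != 0x80) {
            offsets.push_back(i);
        }
    }
    offsets.push_back(s8.size());
    return offsets;
}
```

With this table, the piece covering code points `start` up to `start + n` is the byte range from `offsets[start]` to `offsets[start + n]`, and the vector releases its storage without a manual `delete[]`.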

@ -44,9 +44,10 @@
0360..0361:234
0362:233
0363..036F:230
D802:2 # surrogates with non-zero combining classes
D803:3
D804:4
# ICU 63 normalization with UCPTrie requires inert surrogate code points.
# D802:2 # surrogates with non-zero combining classes
# D803:3
# D804:4
110B9:9
110BA:7
@ -58,10 +59,11 @@ D804:4
00C4=0041 0308
00C5=0041 030A
00C7=0043 0327
D800>D7FF # surrogates with mappings, and mappings to empty strings
D801>
DFFE>
DFFF>FFFF
# ICU 63 normalization with UCPTrie requires inert surrogate code points.
# D800>D7FF # surrogates with mappings, and mappings to empty strings
# D801>
# DFFE>
# DFFF>FFFF
E000>
E001=61 338 # composition with trail<=33FF and composite>7FFF
E002=E001 308 # recursive mapping needs reordering

@ -266,6 +266,11 @@ void parseFile(std::ifstream &f, Normalizer2DataBuilder &builder) {
fprintf(stderr, "gennorm2 error: parsing code point range from %s\n", line);
exit(errorCode.reset());
}
if (endCP >= 0xd800 && startCP <= 0xdfff) {
fprintf(stderr, "gennorm2 error: value or mapping for surrogate code points: %s\n",
line);
exit(U_ILLEGAL_ARGUMENT_ERROR);
}
delimiter=u_skipWhitespace(delimiter);
if(*delimiter==':') {
const char *s=u_skipWhitespace(delimiter+1);
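The new `gennorm2` check rejects any input range that touches the surrogate block, using the usual test for overlap of two closed intervals. A minimal sketch of that predicate on its own (`overlapsSurrogates` is a hypothetical name, not a function in the tool):

```cpp
#include <cstdint>

// True if the closed range [startCP, endCP] shares at least one code point
// with the surrogate block U+D800..U+DFFF. Assumes startCP <= endCP.
constexpr bool overlapsSurrogates(int32_t startCP, int32_t endCP) {
    return endCP >= 0xd800 && startCP <= 0xdfff;
}
```

Two closed intervals overlap exactly when each one starts no later than the other ends, so a range that merely spans the block, such as U+D000..U+E000, is rejected as well.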

@ -29,7 +29,9 @@
#include "unicode/errorcode.h"
#include "unicode/localpointer.h"
#include "unicode/putil.h"
#include "unicode/ucptrie.h"
#include "unicode/udata.h"
#include "unicode/umutablecptrie.h"
#include "unicode/uniset.h"
#include "unicode/unistr.h"
#include "unicode/usetiter.h"
@ -41,7 +43,6 @@
#include "norms.h"
#include "toolutil.h"
#include "unewdata.h"
#include "utrie2.h"
#include "uvectr32.h"
#include "writesrc.h"
@ -58,8 +59,8 @@ static UDataInfo dataInfo={
0,
{ 0x4e, 0x72, 0x6d, 0x32 }, /* dataFormat="Nrm2" */
{ 3, 0, 0, 0 }, /* formatVersion */
{ 10, 0, 0, 0 } /* dataVersion (Unicode version) */
{ 4, 0, 0, 0 }, /* formatVersion */
{ 11, 0, 0, 0 } /* dataVersion (Unicode version) */
};
U_NAMESPACE_BEGIN
@ -94,14 +95,14 @@ const HangulIterator::Range HangulIterator::ranges[4]={
Normalizer2DataBuilder::Normalizer2DataBuilder(UErrorCode &errorCode) :
norms(errorCode),
phase(0), overrideHandling(OVERRIDE_PREVIOUS), optimization(OPTIMIZE_NORMAL),
norm16Trie(nullptr), norm16TrieLength(0) {
norm16TrieBytes(nullptr), norm16TrieLength(0) {
memset(unicodeVersion, 0, sizeof(unicodeVersion));
memset(indexes, 0, sizeof(indexes));
memset(smallFCD, 0, sizeof(smallFCD));
}
Normalizer2DataBuilder::~Normalizer2DataBuilder() {
utrie2_close(norm16Trie);
delete[] norm16TrieBytes;
}
void
@ -407,11 +408,13 @@ void Normalizer2DataBuilder::postProcess(Norm &norm) {
class Norm16Writer : public Norms::Enumerator {
public:
Norm16Writer(Norms &n, Normalizer2DataBuilder &b) : Norms::Enumerator(n), builder(b) {}
Norm16Writer(UMutableCPTrie *trie, Norms &n, Normalizer2DataBuilder &b) :
Norms::Enumerator(n), builder(b), norm16Trie(trie) {}
void rangeHandler(UChar32 start, UChar32 end, Norm &norm) U_OVERRIDE {
builder.writeNorm16(start, end, norm);
builder.writeNorm16(norm16Trie, start, end, norm);
}
Normalizer2DataBuilder &builder;
UMutableCPTrie *norm16Trie;
};
void Normalizer2DataBuilder::setSmallFCD(UChar32 c) {
@ -419,7 +422,7 @@ void Normalizer2DataBuilder::setSmallFCD(UChar32 c) {
smallFCD[lead>>8]|=(uint8_t)1<<((lead>>5)&7);
}
void Normalizer2DataBuilder::writeNorm16(UChar32 start, UChar32 end, Norm &norm) {
void Normalizer2DataBuilder::writeNorm16(UMutableCPTrie *norm16Trie, UChar32 start, UChar32 end, Norm &norm) {
if((norm.leadCC|norm.trailCC)!=0) {
for(UChar32 c=start; c<=end; ++c) {
setSmallFCD(c);
@ -484,7 +487,7 @@ void Normalizer2DataBuilder::writeNorm16(UChar32 start, UChar32 end, Norm &norm)
norm16|=Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER;
}
IcuToolErrorCode errorCode("gennorm2/writeNorm16()");
utrie2_setRange32(norm16Trie, start, end, (uint32_t)norm16, TRUE, errorCode);
umutablecptrie_setRange(norm16Trie, start, end, (uint32_t)norm16, errorCode);
// Set the minimum code points for real data lookups in the quick check loops.
UBool isDecompNo=
@ -502,13 +505,13 @@ void Normalizer2DataBuilder::writeNorm16(UChar32 start, UChar32 end, Norm &norm)
}
}
void Normalizer2DataBuilder::setHangulData() {
void Normalizer2DataBuilder::setHangulData(UMutableCPTrie *norm16Trie) {
HangulIterator hi;
const HangulIterator::Range *range;
// Check that none of the Hangul/Jamo code points have data.
while((range=hi.nextRange())!=NULL) {
for(UChar32 c=range->start; c<=range->end; ++c) {
if(utrie2_get32(norm16Trie, c)>Normalizer2Impl::INERT) {
if(umutablecptrie_get(norm16Trie, c)>Normalizer2Impl::INERT) {
fprintf(stderr,
"gennorm2 error: "
"illegal mapping/composition/ccc data for Hangul or Jamo U+%04lX\n",
@ -524,13 +527,13 @@ void Normalizer2DataBuilder::setHangulData() {
if(Hangul::JAMO_V_BASE<indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]) {
indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]=Hangul::JAMO_V_BASE;
}
utrie2_setRange32(norm16Trie, Hangul::JAMO_L_BASE, Hangul::JAMO_L_END,
Normalizer2Impl::JAMO_L, TRUE, errorCode);
utrie2_setRange32(norm16Trie, Hangul::JAMO_V_BASE, Hangul::JAMO_V_END,
Normalizer2Impl::JAMO_VT, TRUE, errorCode);
umutablecptrie_setRange(norm16Trie, Hangul::JAMO_L_BASE, Hangul::JAMO_L_END,
Normalizer2Impl::JAMO_L, errorCode);
umutablecptrie_setRange(norm16Trie, Hangul::JAMO_V_BASE, Hangul::JAMO_V_END,
Normalizer2Impl::JAMO_VT, errorCode);
// JAMO_T_BASE+1: not U+11A7
utrie2_setRange32(norm16Trie, Hangul::JAMO_T_BASE+1, Hangul::JAMO_T_END,
Normalizer2Impl::JAMO_VT, TRUE, errorCode);
umutablecptrie_setRange(norm16Trie, Hangul::JAMO_T_BASE+1, Hangul::JAMO_T_END,
Normalizer2Impl::JAMO_VT, errorCode);
// Hangul LV encoded as minYesNo
uint32_t lv=indexes[Normalizer2Impl::IX_MIN_YES_NO];
@ -542,49 +545,16 @@ void Normalizer2DataBuilder::setHangulData() {
}
// Set the first LV, then write all other Hangul syllables as LVT,
// then overwrite the remaining LV.
// The UTrie2 should be able to compact this into 7 32-item blocks
// because JAMO_T_COUNT is 28 and the UTrie2 granularity is 4.
// (7*32=8*28 smallest common multiple)
utrie2_set32(norm16Trie, Hangul::HANGUL_BASE, lv, errorCode);
utrie2_setRange32(norm16Trie, Hangul::HANGUL_BASE+1, Hangul::HANGUL_END,
lvt, TRUE, errorCode);
umutablecptrie_set(norm16Trie, Hangul::HANGUL_BASE, lv, errorCode);
umutablecptrie_setRange(norm16Trie, Hangul::HANGUL_BASE+1, Hangul::HANGUL_END, lvt, errorCode);
UChar32 c=Hangul::HANGUL_BASE;
while((c+=Hangul::JAMO_T_COUNT)<=Hangul::HANGUL_END) {
utrie2_set32(norm16Trie, c, lv, errorCode);
umutablecptrie_set(norm16Trie, c, lv, errorCode);
}
errorCode.assertSuccess();
}
namespace {
struct Norm16Summary {
uint32_t maxNorm16;
// ANDing values yields 0 bits where any value has a 0.
// Used for worst-case HAS_COMP_BOUNDARY_AFTER.
uint32_t andedNorm16;
};
} // namespace
U_CDECL_BEGIN
static UBool U_CALLCONV
enumRangeMaxValue(const void *context, UChar32 /*start*/, UChar32 /*end*/, uint32_t value) {
Norm16Summary *p=(Norm16Summary *)context;
if(value>p->maxNorm16) {
p->maxNorm16=value;
}
p->andedNorm16&=value;
return TRUE;
}
U_CDECL_END
void Normalizer2DataBuilder::processData() {
IcuToolErrorCode errorCode("gennorm2/processData()");
norm16Trie=utrie2_open(Normalizer2Impl::INERT, Normalizer2Impl::INERT, errorCode);
errorCode.assertSuccess();
LocalUCPTriePointer Normalizer2DataBuilder::processData() {
// Build composition lists before recursive decomposition,
// so that we still have the raw, pair-wise mappings.
CompositionBuilder compBuilder(norms);
@ -652,13 +622,19 @@ void Normalizer2DataBuilder::processData() {
indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]=0x110000;
indexes[Normalizer2Impl::IX_MIN_LCCC_CP]=0x110000;
IcuToolErrorCode errorCode("gennorm2/processData()");
UMutableCPTrie *norm16Trie = umutablecptrie_open(
Normalizer2Impl::INERT, Normalizer2Impl::INERT, errorCode);
errorCode.assertSuccess();
// Map each code point to its norm16 value,
// including the properties that fit directly,
// and the offset to the "extra data" if necessary.
Norm16Writer norm16Writer(norms, *this);
Norm16Writer norm16Writer(norm16Trie, norms, *this);
norms.enumRanges(norm16Writer);
// TODO: iterate via getRange() instead of callback?
setHangulData();
setHangulData(norm16Trie);
// Look for the "worst" norm16 value of any supplementary code point
// corresponding to a lead surrogate, and set it as that surrogate's value.
@ -670,22 +646,63 @@ void Normalizer2DataBuilder::processData() {
// and select the best value that only breaks the composition and/or decomposition
// inner loops if necessary.
// However, that seems like overkill for an optimization for supplementary characters.
for(UChar lead=0xd800; lead<0xdc00; ++lead) {
uint32_t surrogateCPNorm16=utrie2_get32(norm16Trie, lead);
Norm16Summary summary={ surrogateCPNorm16, surrogateCPNorm16 };
utrie2_enumForLeadSurrogate(norm16Trie, lead, NULL, enumRangeMaxValue, &summary);
uint32_t norm16=summary.maxNorm16;
if(norm16>=(uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO] &&
norm16>(uint32_t)indexes[Normalizer2Impl::IX_MIN_NO_NO]) {
// Set noNo ("worst" value) if it got into "less-bad" maybeYes or ccc!=0.
// Otherwise it might end up at something like JAMO_VT which stays in
// the inner decomposition quick check loop.
norm16=(uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO]-1;
//
// First check that surrogate code *points* are inert.
// The parser should have rejected values/mappings for them.
uint32_t value;
UChar32 end = umutablecptrie_getRange(norm16Trie, 0xd800, UCPTRIE_RANGE_NORMAL, 0,
nullptr, nullptr, &value);
if (value != Normalizer2Impl::INERT || end < 0xdfff) {
fprintf(stderr,
"gennorm2 error: not all surrogate code points are inert: U+d800..U+%04x=%lx\n",
(int)end, (long)value);
exit(U_INTERNAL_PROGRAM_ERROR);
}
uint32_t maxNorm16 = 0;
// ANDing values yields 0 bits where any value has a 0.
// Used for worst-case HAS_COMP_BOUNDARY_AFTER.
uint32_t andedNorm16 = 0;
end = 0;
for (UChar32 start = 0x10000;;) {
if (start > end) {
end = umutablecptrie_getRange(norm16Trie, start, UCPTRIE_RANGE_NORMAL, 0,
nullptr, nullptr, &value);
if (end < 0) { break; }
}
if ((start & 0x3ff) == 0) {
// Data for a new lead surrogate.
maxNorm16 = andedNorm16 = value;
} else {
if (value > maxNorm16) {
maxNorm16 = value;
}
andedNorm16 &= value;
}
// Intersect each range with the code points for one lead surrogate.
UChar32 leadEnd = start | 0x3ff;
if (leadEnd <= end) {
// End of the supplementary block for a lead surrogate.
if (maxNorm16 >= (uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO]) {
// Set noNo ("worst" value) if it got into "less-bad" maybeYes or ccc!=0.
// Otherwise it might end up at something like JAMO_VT which stays in
// the inner decomposition quick check loop.
maxNorm16 = (uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO];
}
maxNorm16 =
(maxNorm16 & ~Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER)|
(andedNorm16 & Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER);
if (maxNorm16 != Normalizer2Impl::INERT) {
umutablecptrie_set(norm16Trie, U16_LEAD(start), maxNorm16, errorCode);
}
if (value == Normalizer2Impl::INERT) {
// Potentially skip inert supplementary blocks for several lead surrogates.
start = (end + 1) & ~0x3ff;
} else {
start = leadEnd + 1;
}
} else {
start = end + 1;
}
norm16=
(norm16&~Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER)|
(summary.andedNorm16&Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER);
utrie2_set32ForLeadSurrogateCodeUnit(norm16Trie, lead, norm16, errorCode);
}
// Adjust supplementary minimum code points to break quick check loops at their lead surrogates.
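The loop above intersects each same-value range with the 1024 supplementary code points that share one lead surrogate, and tracks both the maximum and the bitwise AND of the values per lead surrogate. The following is a simplified sketch of that aggregation over a plain list of ranges. The types CpRange and LeadSummary and the function summarizeByLead are assumptions made for illustration, and the sketch omits the skipping of inert blocks and the norm16-specific adjustments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical types for illustration only.
struct CpRange { int32_t start; int32_t end; uint32_t value; };
struct LeadSummary { uint32_t maxValue; uint32_t andedValue; };

// For every lead surrogate whose 1024-code-point block intersects a range,
// record the maximum value and the AND of all values seen in that block.
std::map<uint16_t, LeadSummary> summarizeByLead(const std::vector<CpRange>& ranges) {
    std::map<uint16_t, LeadSummary> result;
    for (const CpRange& r : ranges) {
        assert(0x10000 <= r.start && r.start <= r.end && r.end <= 0x10ffff);
        // Walk the blocks of 0x400 code points that the range touches.
        for (int32_t block = r.start & ~0x3ff; block <= r.end; block += 0x400) {
            // Same arithmetic as U16_LEAD(): (c >> 10) + 0xd7c0.
            uint16_t lead = static_cast<uint16_t>((block >> 10) + 0xd7c0);
            auto it = result.find(lead);
            if (it == result.end()) {
                result[lead] = LeadSummary{r.value, r.value};
            } else {
                it->second.maxValue = std::max(it->second.maxValue, r.value);
                it->second.andedValue &= r.value;
            }
        }
    }
    return result;
}
```

The ANDed value keeps a flag bit only if every code point under that lead surrogate has it, which is the worst-case treatment the diff applies to HAS_COMP_BOUNDARY_AFTER.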
@ -705,14 +722,19 @@ void Normalizer2DataBuilder::processData() {
indexes[Normalizer2Impl::IX_MIN_LCCC_CP]=U16_LEAD(minCP);
}
utrie2_freeze(norm16Trie, UTRIE2_16_VALUE_BITS, errorCode);
norm16TrieLength=utrie2_serialize(norm16Trie, NULL, 0, errorCode);
LocalUCPTriePointer builtTrie(
umutablecptrie_buildImmutable(norm16Trie, UCPTRIE_TYPE_FAST, UCPTRIE_VALUE_BITS_16, errorCode));
norm16TrieLength=ucptrie_toBinary(builtTrie.getAlias(), nullptr, 0, errorCode);
if(errorCode.get()!=U_BUFFER_OVERFLOW_ERROR) {
fprintf(stderr, "gennorm2 error: unable to freeze/serialize the normalization trie - %s\n",
fprintf(stderr, "gennorm2 error: unable to build/serialize the normalization trie - %s\n",
errorCode.errorName());
exit(errorCode.reset());
}
umutablecptrie_close(norm16Trie);
errorCode.reset();
norm16TrieBytes=new uint8_t[norm16TrieLength];
ucptrie_toBinary(builtTrie.getAlias(), norm16TrieBytes, norm16TrieLength, errorCode);
errorCode.assertSuccess();
int32_t offset=(int32_t)sizeof(indexes);
indexes[Normalizer2Impl::IX_NORM_TRIE_OFFSET]=offset;
@ -750,16 +772,13 @@ void Normalizer2DataBuilder::processData() {
u_versionFromString(unicodeVersion, U_UNICODE_VERSION);
}
memcpy(dataInfo.dataVersion, unicodeVersion, 4);
return builtTrie;
}
void Normalizer2DataBuilder::writeBinaryFile(const char *filename) {
processData();
IcuToolErrorCode errorCode("gennorm2/writeBinaryFile()");
LocalArray<uint8_t> norm16TrieBytes(new uint8_t[norm16TrieLength]);
utrie2_serialize(norm16Trie, norm16TrieBytes.getAlias(), norm16TrieLength, errorCode);
errorCode.assertSuccess();
UNewDataMemory *pData=
udata_create(NULL, NULL, filename, &dataInfo,
haveCopyright ? U_COPYRIGHT_STRING : NULL, errorCode);
@ -769,7 +788,7 @@ void Normalizer2DataBuilder::writeBinaryFile(const char *filename) {
exit(errorCode.reset());
}
udata_writeBlock(pData, indexes, sizeof(indexes));
udata_writeBlock(pData, norm16TrieBytes.getAlias(), norm16TrieLength);
udata_writeBlock(pData, norm16TrieBytes, norm16TrieLength);
udata_writeUString(pData, toUCharPtr(extraData.getBuffer()), extraData.length());
udata_writeBlock(pData, smallFCD, sizeof(smallFCD));
int32_t writtenSize=udata_finish(pData, errorCode);
@ -787,7 +806,7 @@ void Normalizer2DataBuilder::writeBinaryFile(const char *filename) {
void
Normalizer2DataBuilder::writeCSourceFile(const char *filename) {
processData();
LocalUCPTriePointer norm16Trie = processData();
IcuToolErrorCode errorCode("gennorm2/writeCSourceFile()");
const char *basename=findBasename(filename);
@ -797,10 +816,7 @@ Normalizer2DataBuilder::writeCSourceFile(const char *filename) {
if(extension!=NULL) {
dataName.truncate((int32_t)(extension-basename));
}
errorCode.assertSuccess();
LocalArray<uint8_t> norm16TrieBytes(new uint8_t[norm16TrieLength]);
utrie2_serialize(norm16Trie, norm16TrieBytes.getAlias(), norm16TrieLength, errorCode);
const char *name=dataName.data();
errorCode.assertSuccess();
FILE *f=usrc_create(path.data(), basename, "icu/source/tools/gennorm2/n2builder.cpp");
@ -808,43 +824,31 @@ Normalizer2DataBuilder::writeCSourceFile(const char *filename) {
fprintf(stderr, "gennorm2/writeCSourceFile() error: unable to create the output file %s\n",
filename);
exit(U_FILE_ACCESS_ERROR);
return;
}
fputs("#ifdef INCLUDED_FROM_NORMALIZER2_CPP\n\n", f);
char line[100];
sprintf(line, "static const UVersionInfo %s_formatVersion={", dataName.data());
char line[100], line2[100], line3[100];
sprintf(line, "static const UVersionInfo %s_formatVersion={", name);
usrc_writeArray(f, line, dataInfo.formatVersion, 8, 4, "};\n");
sprintf(line, "static const UVersionInfo %s_dataVersion={", dataName.data());
sprintf(line, "static const UVersionInfo %s_dataVersion={", name);
usrc_writeArray(f, line, dataInfo.dataVersion, 8, 4, "};\n\n");
sprintf(line, "static const int32_t %s_indexes[Normalizer2Impl::IX_COUNT]={\n",
dataName.data());
usrc_writeArray(f,
line,
indexes, 32, Normalizer2Impl::IX_COUNT,
"\n};\n\n");
sprintf(line, "static const uint16_t %s_trieIndex[%%ld]={\n", dataName.data());
usrc_writeUTrie2Arrays(f,
line, NULL,
norm16Trie,
"\n};\n\n");
sprintf(line, "static const uint16_t %s_extraData[%%ld]={\n", dataName.data());
usrc_writeArray(f,
line,
extraData.getBuffer(), 16, extraData.length(),
"\n};\n\n");
sprintf(line, "static const uint8_t %s_smallFCD[%%ld]={\n", dataName.data());
usrc_writeArray(f,
line,
smallFCD, 8, sizeof(smallFCD),
"\n};\n\n");
sprintf(line, "static const UTrie2 %s_trie={\n", dataName.data());
char line2[100];
sprintf(line2, "%s_trieIndex", dataName.data());
usrc_writeUTrie2Struct(f,
line,
norm16Trie, line2, NULL,
"};\n");
fputs("\n#endif // INCLUDED_FROM_NORMALIZER2_CPP\n", f);
sprintf(line, "static const int32_t %s_indexes[Normalizer2Impl::IX_COUNT]={\n", name);
usrc_writeArray(f, line, indexes, 32, Normalizer2Impl::IX_COUNT, "\n};\n\n");
sprintf(line, "static const uint16_t %s_trieIndex[%%ld]={\n", name);
sprintf(line2, "static const uint16_t %s_trieData[%%ld]={\n", name);
usrc_writeUCPTrieArrays(f, line, line2, norm16Trie.getAlias(), "\n};\n\n");
sprintf(line, "static const UCPTrie %s_trie={\n", name);
sprintf(line2, "%s_trieIndex", name);
sprintf(line3, "%s_trieData", name);
usrc_writeUCPTrieStruct(f, line, norm16Trie.getAlias(), line2, line3, "};\n\n");
sprintf(line, "static const uint16_t %s_extraData[%%ld]={\n", name);
usrc_writeArray(f, line, extraData.getBuffer(), 16, extraData.length(), "\n};\n\n");
sprintf(line, "static const uint8_t %s_smallFCD[%%ld]={\n", name);
usrc_writeArray(f, line, smallFCD, 8, sizeof(smallFCD), "\n};\n\n");
fputs("#endif // INCLUDED_FROM_NORMALIZER2_CPP\n", f);
fclose(f);
}
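processData() now serializes the built trie in two passes: a preflight call with a null buffer that is expected to report a buffer overflow and return the required length, then a second call into a buffer of exactly that size. The sketch below shows the same pattern without ICU. The names Status, toyToBinary and serialize are hypothetical stand-ins for illustration, not the ICU API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

enum class Status { Ok, BufferOverflow };

// Hypothetical stand-in for a "toBinary"-style serializer: it always returns
// the number of bytes needed and only writes when the capacity is sufficient.
int32_t toyToBinary(const std::vector<uint16_t>& units, void* dest,
                    int32_t capacity, Status& status) {
    int32_t needed = static_cast<int32_t>(units.size() * sizeof(uint16_t));
    if (capacity < needed) {
        status = Status::BufferOverflow;
        return needed;
    }
    if (needed > 0) {
        std::memcpy(dest, units.data(), static_cast<size_t>(needed));
    }
    status = Status::Ok;
    return needed;
}

// Two-pass serialization: preflight for the length, then fill the buffer.
std::vector<uint8_t> serialize(const std::vector<uint16_t>& units) {
    Status status = Status::Ok;
    int32_t length = toyToBinary(units, nullptr, 0, status);
    // For non-empty input the preflight is expected to overflow.
    assert(status == Status::BufferOverflow || length == 0);
    std::vector<uint8_t> bytes(static_cast<size_t>(length));
    status = Status::Ok;  // reset before the real call, as the diff does
    int32_t written = toyToBinary(units, bytes.data(), length, status);
    assert(status == Status::Ok && written == length);
    (void)written;  // silence unused-variable warnings when NDEBUG is defined
    return bytes;
}
```

Keeping the serialized bytes as a member, as the diff does with norm16TrieBytes, lets the mutable trie be closed immediately after the build step.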

@ -24,10 +24,10 @@
#if !UCONFIG_NO_NORMALIZATION
#include "unicode/errorcode.h"
#include "unicode/umutablecptrie.h"
#include "unicode/unistr.h"
#include "normalizer2impl.h" // for IX_COUNT
#include "toolutil.h"
#include "utrie2.h"
#include "norms.h"
U_NAMESPACE_BEGIN
@ -95,9 +95,9 @@ private:
return indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]-
((2*Normalizer2Impl::MAX_DELTA+1)<<Normalizer2Impl::DELTA_SHIFT);
}
void writeNorm16(UChar32 start, UChar32 end, Norm &norm);
void setHangulData();
void processData();
void writeNorm16(UMutableCPTrie *norm16Trie, UChar32 start, UChar32 end, Norm &norm);
void setHangulData(UMutableCPTrie *norm16Trie);
LocalUCPTriePointer processData();
Norms norms;
@ -107,7 +107,7 @@ private:
Optimization optimization;
int32_t indexes[Normalizer2Impl::IX_COUNT];
UTrie2 *norm16Trie;
uint8_t *norm16TrieBytes;
int32_t norm16TrieLength;
UnicodeString extraData;
uint8_t smallFCD[0x100];

@ -12,12 +12,12 @@
#include <stdio.h>
#include <stdlib.h>
#include "unicode/errorcode.h"
#include "unicode/umutablecptrie.h"
#include "unicode/unistr.h"
#include "unicode/utf16.h"
#include "normalizer2impl.h"
#include "norms.h"
#include "toolutil.h"
#include "utrie2.h"
#include "uvectr32.h"
U_NAMESPACE_BEGIN
@ -67,7 +67,7 @@ UChar32 Norm::combine(UChar32 trail) const {
}
Norms::Norms(UErrorCode &errorCode) {
normTrie=utrie2_open(0, 0, &errorCode);
normTrie = umutablecptrie_open(0, 0, &errorCode);
normMem=utm_open("gennorm2 normalization structs", 10000, 0x110100, sizeof(Norm));
// Default "inert" Norm struct at index 0. Practically immutable.
norms=allocNorm();
@ -75,7 +75,7 @@ Norms::Norms(UErrorCode &errorCode) {
}
Norms::~Norms() {
utrie2_close(normTrie);
umutablecptrie_close(normTrie);
int32_t normsLength=utm_countItems(normMem);
for(int32_t i=1; i<normsLength; ++i) {
delete norms[i].mapping;
@ -92,7 +92,7 @@ Norm *Norms::allocNorm() {
}
Norm *Norms::getNorm(UChar32 c) {
uint32_t i=utrie2_get32(normTrie, c);
uint32_t i = umutablecptrie_get(normTrie, c);
if(i==0) {
return nullptr;
}
@ -100,7 +100,7 @@ Norm *Norms::getNorm(UChar32 c) {
}
const Norm *Norms::getNorm(UChar32 c) const {
uint32_t i=utrie2_get32(normTrie, c);
uint32_t i = umutablecptrie_get(normTrie, c);
if(i==0) {
return nullptr;
}
@ -108,18 +108,18 @@ const Norm *Norms::getNorm(UChar32 c) const {
}
const Norm &Norms::getNormRef(UChar32 c) const {
return norms[utrie2_get32(normTrie, c)];
return norms[umutablecptrie_get(normTrie, c)];
}
Norm *Norms::createNorm(UChar32 c) {
uint32_t i=utrie2_get32(normTrie, c);
uint32_t i=umutablecptrie_get(normTrie, c);
if(i!=0) {
return norms+i;
} else {
/* allocate Norm */
Norm *p=allocNorm();
IcuToolErrorCode errorCode("gennorm2/createNorm()");
utrie2_set32(normTrie, c, (uint32_t)(p-norms), errorCode);
umutablecptrie_set(normTrie, c, (uint32_t)(p - norms), errorCode);
return p;
}
}
@ -153,28 +153,20 @@ UBool Norms::combinesWithCCBetween(const Norm &norm, uint8_t lowCC, int32_t high
return FALSE;
}
U_CDECL_BEGIN
static UBool U_CALLCONV
enumRangeHandler(const void *context, UChar32 start, UChar32 end, uint32_t value) {
return ((Norms::Enumerator *)context)->rangeHandler(start, end, value);
}
U_CDECL_END
void Norms::enumRanges(Enumerator &e) {
utrie2_enum(normTrie, nullptr, enumRangeHandler, &e);
UChar32 start = 0, end;
uint32_t i;
while ((end = umutablecptrie_getRange(normTrie, start, UCPTRIE_RANGE_NORMAL, 0,
nullptr, nullptr, &i)) >= 0) {
if (i > 0) {
e.rangeHandler(start, end, norms[i]);
}
start = end + 1;
}
}
Norms::Enumerator::~Enumerator() {}
UBool Norms::Enumerator::rangeHandler(UChar32 start, UChar32 end, uint32_t value) {
if(value!=0) {
rangeHandler(start, end, norms.getNormRefByIndex(value));
}
return TRUE;
}
void CompositionBuilder::rangeHandler(UChar32 start, UChar32 end, Norm &norm) {
if(norm.mappingType!=Norm::ROUND_TRIP) { return; }
if(start!=end) {
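Norms::enumRanges() replaces the C callback with a pull-style loop: ask for the range that begins at start, handle it, and continue from end + 1 until the lookup returns a negative value. A self-contained sketch of that iteration shape over a small array follows. The function toyGetRange is a hypothetical stand-in and does not model the options or filter parameters of umutablecptrie_getRange().

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only.
struct ValueRange { int32_t start; int32_t end; uint32_t value; };

// Returns the end of the run of equal values that begins at start,
// or -1 if start is outside the array. The run's value is stored in value.
int32_t toyGetRange(const std::vector<uint32_t>& values, int32_t start, uint32_t& value) {
    const int32_t length = static_cast<int32_t>(values.size());
    if (start < 0 || start >= length) {
        return -1;
    }
    value = values[start];
    int32_t end = start;
    while (end + 1 < length && values[end + 1] == value) {
        ++end;
    }
    return end;
}

// Pull-style enumeration: collect every run whose value is non-zero.
std::vector<ValueRange> nonZeroRanges(const std::vector<uint32_t>& values) {
    std::vector<ValueRange> result;
    int32_t start = 0;
    int32_t end;
    uint32_t value = 0;
    while ((end = toyGetRange(values, start, value)) >= 0) {
        assert(end >= start);
        if (value != 0) {
            result.push_back({start, end, value});
        }
        start = end + 1;  // continue right after the range just handled
    }
    return result;
}
```

With this shape the caller owns the loop, so no context pointer or extern "C" trampoline is needed.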

@ -15,12 +15,12 @@
#if !UCONFIG_NO_NORMALIZATION
#include "unicode/errorcode.h"
#include "unicode/umutablecptrie.h"
#include "unicode/uniset.h"
#include "unicode/unistr.h"
#include "unicode/utf16.h"
#include "normalizer2impl.h"
#include "toolutil.h"
#include "utrie2.h"
#include "uvectr32.h"
U_NAMESPACE_BEGIN
@ -176,8 +176,6 @@ public:
virtual ~Enumerator();
/** Called for enumerated value!=0. */
virtual void rangeHandler(UChar32 start, UChar32 end, Norm &norm) = 0;
/** @internal Public only for C callback. */
UBool rangeHandler(UChar32 start, UChar32 end, uint32_t value);
protected:
Norms &norms;
};
@ -190,7 +188,7 @@ private:
Norms(const Norms &other) = delete;
Norms &operator=(const Norms &other) = delete;
UTrie2 *normTrie;
UMutableCPTrie *normTrie;
UToolMemory *normMem;
Norm *norms;
};

@ -1018,6 +1018,11 @@ addCollation(ParseState* state, TableResource *result, const char *collationTyp
icu::CollationInfo::printReorderRanges(
*t->data, t->settings->reorderCodes, t->settings->reorderCodesLength);
}
#if 0 // debugging output
} else {
printf("%s~%s collation tailoring part sizes:\n", state->filename, collationType);
icu::CollationInfo::printSizes(totalSize, indexes);
#endif
}
struct SResource *collationBin = bin_open(state->bundle, "%%CollationBin", totalSize, dest, NULL, NULL, status);
result->add(collationBin, line, *status);

@ -243,7 +243,7 @@ uprops_swap(const UDataSwapper *ds,
* swap the main properties UTrie
* PT serialized properties trie, see utrie.h (byte size: 4*(i0-16))
*/
utrie2_swapAnyVersion(ds,
utrie_swapAnyVersion(ds,
inData32+UPROPS_INDEX_COUNT,
4*(dataIndexes[UPROPS_PROPS32_INDEX]-UPROPS_INDEX_COUNT),
outData32+UPROPS_INDEX_COUNT,
@ -274,7 +274,7 @@ uprops_swap(const UDataSwapper *ds,
* swap the additional UTrie
* i3 additionalTrieIndex; -- 32-bit unit index to the additional trie for more properties
*/
utrie2_swapAnyVersion(ds,
utrie_swapAnyVersion(ds,
inData32+dataIndexes[UPROPS_ADDITIONAL_TRIE_INDEX],
4*(dataIndexes[UPROPS_ADDITIONAL_VECTORS_INDEX]-dataIndexes[UPROPS_ADDITIONAL_TRIE_INDEX]),
outData32+dataIndexes[UPROPS_ADDITIONAL_TRIE_INDEX],
@ -391,7 +391,7 @@ ucase_swap(const UDataSwapper *ds,
/* swap the UTrie */
count=indexes[UCASE_IX_TRIE_SIZE];
utrie2_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
utrie_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
offset+=count;
/* swap the uint16_t exceptions[] and unfold[] */
@ -493,7 +493,7 @@ ubidi_swap(const UDataSwapper *ds,
/* swap the UTrie */
count=indexes[UBIDI_IX_TRIE_SIZE];
utrie2_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
utrie_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
offset+=count;
/* swap the uint32_t mirrors[] */

@ -22,6 +22,7 @@
#include <time.h>
#include "unicode/utypes.h"
#include "unicode/putil.h"
#include "unicode/ucptrie.h"
#include "utrie2.h"
#include "cstring.h"
#include "writesrc.h"
@ -228,6 +229,52 @@ usrc_writeUTrie2Struct(FILE *f,
}
}
U_CAPI void U_EXPORT2
usrc_writeUCPTrieArrays(FILE *f,
const char *indexPrefix, const char *dataPrefix,
const UCPTrie *pTrie,
const char *postfix) {
usrc_writeArray(f, indexPrefix, pTrie->index, 16, pTrie->indexLength, postfix);
int32_t width=
pTrie->valueWidth==UCPTRIE_VALUE_BITS_16 ? 16 :
pTrie->valueWidth==UCPTRIE_VALUE_BITS_32 ? 32 :
pTrie->valueWidth==UCPTRIE_VALUE_BITS_8 ? 8 : 0;
usrc_writeArray(f, dataPrefix, pTrie->data.ptr0, width, pTrie->dataLength, postfix);
}
U_CAPI void U_EXPORT2
usrc_writeUCPTrieStruct(FILE *f,
const char *prefix,
const UCPTrie *pTrie,
const char *indexName, const char *dataName,
const char *postfix) {
if(prefix!=NULL) {
fputs(prefix, f);
}
fprintf(
f,
" %s,\n" // index
" { %s },\n", // data (union)
indexName,
dataName);
fprintf(
f,
" %ld, %ld,\n" // indexLength, dataLength
" 0x%lx, 0x%x,\n" // highStart, shifted12HighStart
" %d, %d,\n" // type, valueWidth
" 0, 0,\n" // reserved32, reserved16
" 0x%x, 0x%lx,\n" // index3NullOffset, dataNullOffset
" 0x%lx,\n", // nullValue
(long)pTrie->indexLength, (long)pTrie->dataLength,
(long)pTrie->highStart, pTrie->shifted12HighStart,
pTrie->type, pTrie->valueWidth,
pTrie->index3NullOffset, (long)pTrie->dataNullOffset,
(long)pTrie->nullValue);
if(postfix!=NULL) {
fputs(postfix, f);
}
}
U_CAPI void U_EXPORT2
usrc_writeArrayOfMostlyInvChars(FILE *f,
const char *prefix,

@ -23,6 +23,7 @@
#include <stdio.h>
#include "unicode/utypes.h"
#include "unicode/ucptrie.h"
#include "utrie2.h"
/**
@ -75,6 +76,27 @@ usrc_writeUTrie2Struct(FILE *f,
const char *indexName, const char *dataName,
const char *postfix);
/**
* Calls usrc_writeArray() for the index and data arrays of a UCPTrie.
*/
U_CAPI void U_EXPORT2
usrc_writeUCPTrieArrays(FILE *f,
const char *indexPrefix, const char *dataPrefix,
const UCPTrie *pTrie,
const char *postfix);
/**
* Writes the UCPTrie struct values.
* The {} and declaration etc. need to be included in prefix/postfix or
* printed before and after the array contents.
*/
U_CAPI void U_EXPORT2
usrc_writeUCPTrieStruct(FILE *f,
const char *prefix,
const UCPTrie *pTrie,
const char *indexName, const char *dataName,
const char *postfix);
/**
* Writes the contents of an array of mostly invariant characters.
* Characters 0..0x1f are printed as numbers,

@ -652,6 +652,15 @@ public final class ICUBinary {
}
}
public static byte[] getBytes(ByteBuffer bytes, int length, int additionalSkipLength) {
byte[] dest = new byte[length];
bytes.get(dest);
if (additionalSkipLength > 0) {
skipBytes(bytes, additionalSkipLength);
}
return dest;
}
public static String getString(ByteBuffer bytes, int length, int additionalSkipLength) {
CharSequence cs = bytes.asCharBuffer();
String s = cs.subSequence(0, length).toString();

@ -12,11 +12,13 @@ package com.ibm.icu.impl;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.CodePointMap;
import com.ibm.icu.util.CodePointTrie;
import com.ibm.icu.util.ICUUncheckedIOException;
import com.ibm.icu.util.MutableCodePointTrie;
import com.ibm.icu.util.VersionInfo;
/**
@ -180,8 +182,7 @@ public final class Normalizer2Impl {
insert(c, cc);
}
}
// s must be in NFD, otherwise change the implementation.
public void append(CharSequence s, int start, int limit,
public void append(CharSequence s, int start, int limit, boolean isNFD,
int leadCC, int trailCC) {
if(start==limit) {
return;
@ -202,8 +203,11 @@ public final class Normalizer2Impl {
c=Character.codePointAt(s, start);
start+=Character.charCount(c);
if(start<limit) {
// s must be in NFD, otherwise we need to use getCC().
leadCC=getCCFromYesOrMaybe(impl.getNorm16(c));
if (isNFD) {
leadCC = getCCFromYesOrMaybe(impl.getNorm16(c));
} else {
leadCC = impl.getCC(impl.getNorm16(c));
}
} else {
leadCC=trailCC;
}
@ -359,6 +363,24 @@ public final class Normalizer2Impl {
// TODO: Propose widening UTF16 methods that take char to take int.
// TODO: Propose widening UTF16 methods that take String to take CharSequence.
public static final class UTF16Plus {
/**
* Is this code point a lead surrogate (U+d800..U+dbff)?
* @param c code unit or code point
* @return true or false
*/
public static boolean isLeadSurrogate(int c) { return (c & 0xfffffc00) == 0xd800; }
/**
* Is this code point a trail surrogate (U+dc00..U+dfff)?
* @param c code unit or code point
* @return true or false
*/
public static boolean isTrailSurrogate(int c) { return (c & 0xfffffc00) == 0xdc00; }
/**
* Is this code point a surrogate (U+d800..U+dfff)?
* @param c code unit or code point
* @return true or false
*/
public static boolean isSurrogate(int c) { return (c & 0xfffff800) == 0xd800; }
/**
* Assuming c is a surrogate code point (UTF16.isSurrogate(c)),
* is it a lead surrogate?
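The three new UTF16Plus predicates classify a code unit or code point with a single mask-and-compare. The same tests are sketched here in C++ rather than Java, so the function names are illustrative only. Masking off the low 10 bits isolates one 1024-value half of the surrogate block, and masking off the low 11 bits covers the whole block.

```cpp
#include <cassert>
#include <cstdint>

// Lead surrogates U+D800..U+DBFF: every bit above the low 10 must match 0xd800.
constexpr bool isLeadSurrogate(int32_t c) {
    return (static_cast<uint32_t>(c) & 0xfffffc00u) == 0xd800u;
}

// Trail surrogates U+DC00..U+DFFF.
constexpr bool isTrailSurrogate(int32_t c) {
    return (static_cast<uint32_t>(c) & 0xfffffc00u) == 0xdc00u;
}

// Any surrogate U+D800..U+DFFF: ignore the low 11 bits.
constexpr bool isSurrogate(int32_t c) {
    return (static_cast<uint32_t>(c) & 0xfffff800u) == 0xd800u;
}

// Once c is known to be a surrogate, bit 10 alone separates lead from trail.
inline bool isSurrogateLead(int32_t c) {
    assert(isSurrogate(c));
    return (c & 0x400) == 0;
}

static_assert(isLeadSurrogate(0xd800) && isLeadSurrogate(0xdbff), "lead range");
static_assert(isTrailSurrogate(0xdc00) && isTrailSurrogate(0xdfff), "trail range");
static_assert(!isSurrogate(0xd7ff) && !isSurrogate(0xe000), "neighbors");
```

Because the mask includes all high bits, supplementary code points and negative values are rejected without a separate range check.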
@ -420,7 +442,7 @@ public final class Normalizer2Impl {
private static final class IsAcceptable implements ICUBinary.Authenticate {
@Override
public boolean isDataVersionAcceptable(byte version[]) {
return version[0]==3;
return version[0]==4;
}
}
private static final IsAcceptable IS_ACCEPTABLE = new IsAcceptable();
@ -457,8 +479,9 @@ public final class Normalizer2Impl {
// Read the normTrie.
int offset=inIndexes[IX_NORM_TRIE_OFFSET];
int nextOffset=inIndexes[IX_EXTRA_DATA_OFFSET];
normTrie=Trie2_16.createFromSerialized(bytes);
int trieLength=normTrie.getSerializedLength();
int triePosition = bytes.position();
normTrie = CodePointTrie.Fast16.fromBinary(bytes);
int trieLength = bytes.position() - triePosition;
if(trieLength>(nextOffset-offset)) {
throw new ICUUncheckedIOException("Normalizer2 data: not enough bytes for normTrie");
}
@ -487,46 +510,46 @@ public final class Normalizer2Impl {
return load(ICUBinary.getRequiredData(name));
}
private void enumLcccRange(int start, int end, int norm16, UnicodeSet set) {
if (norm16 > MIN_NORMAL_MAYBE_YES && norm16 != JAMO_VT) {
set.add(start, end);
} else if (minNoNoCompNoMaybeCC <= norm16 && norm16 < limitNoNo) {
int fcd16=getFCD16(start);
if(fcd16>0xff) { set.add(start, end); }
}
}
private void enumNorm16PropertyStartsRange(int start, int end, int value, UnicodeSet set) {
/* add the start code point to the USet */
set.add(start);
if(start!=end && isAlgorithmicNoNo(value) && (value & DELTA_TCCC_MASK) > DELTA_TCCC_1) {
// Range of code points with same-norm16-value algorithmic decompositions.
// They might have different non-zero FCD16 values.
int prevFCD16=getFCD16(start);
while(++start<=end) {
int fcd16=getFCD16(start);
if(fcd16!=prevFCD16) {
set.add(start);
prevFCD16=fcd16;
}
}
}
}
public void addLcccChars(UnicodeSet set) {
Iterator<Trie2.Range> trieIterator=normTrie.iterator();
Trie2.Range range;
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
enumLcccRange(range.startCodePoint, range.endCodePoint, range.value, set);
int start = 0;
CodePointMap.Range range = new CodePointMap.Range();
while (normTrie.getRange(start, CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, INERT,
null, range)) {
int end = range.getEnd();
int norm16 = range.getValue();
if (norm16 > MIN_NORMAL_MAYBE_YES && norm16 != JAMO_VT) {
set.add(start, end);
} else if (minNoNoCompNoMaybeCC <= norm16 && norm16 < limitNoNo) {
int fcd16 = getFCD16(start);
if (fcd16 > 0xff) { set.add(start, end); }
}
start = end + 1;
}
}
public void addPropertyStarts(UnicodeSet set) {
/* add the start code point of each same-value range of each trie */
Iterator<Trie2.Range> trieIterator=normTrie.iterator();
Trie2.Range range;
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
enumNorm16PropertyStartsRange(range.startCodePoint, range.endCodePoint, range.value, set);
// Add the start code point of each same-value range of the trie.
int start = 0;
CodePointMap.Range range = new CodePointMap.Range();
while (normTrie.getRange(start, CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, INERT,
null, range)) {
int end = range.getEnd();
int value = range.getValue();
set.add(start);
if (start != end && isAlgorithmicNoNo(value) &&
(value & DELTA_TCCC_MASK) > DELTA_TCCC_1) {
// Range of code points with same-norm16-value algorithmic decompositions.
// They might have different non-zero FCD16 values.
int prevFCD16 = getFCD16(start);
while (++start <= end) {
int fcd16 = getFCD16(start);
if (fcd16 != prevFCD16) {
set.add(start);
prevFCD16 = fcd16;
}
}
}
start = end + 1;
}
/* add Hangul LV syllables and LV+1 because of skippables */
@ -538,20 +561,21 @@ public final class Normalizer2Impl {
}
public void addCanonIterPropertyStarts(UnicodeSet set) {
/* add the start code point of each same-value range of the canonical iterator data trie */
// Add the start code point of each same-value range of the canonical iterator data trie.
ensureCanonIterData();
// currently only used for the SEGMENT_STARTER property
Iterator<Trie2.Range> trieIterator=canonIterData.iterator(segmentStarterMapper);
Trie2.Range range;
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
/* add the start code point to the USet */
set.add(range.startCodePoint);
// Currently only used for the SEGMENT_STARTER property.
int start = 0;
CodePointMap.Range range = new CodePointMap.Range();
while (canonIterData.getRange(start, segmentStarterMapper, range)) {
set.add(start);
start = range.getEnd() + 1;
}
}
private static final Trie2.ValueMapper segmentStarterMapper=new Trie2.ValueMapper() {
private static final CodePointMap.ValueFilter segmentStarterMapper =
new CodePointMap.ValueFilter() {
@Override
public int map(int in) {
return in&CANON_NOT_SEGMENT_STARTER;
public int apply(int value) {
return value & CANON_NOT_SEGMENT_STARTER;
}
};
@ -574,12 +598,14 @@ public final class Normalizer2Impl {
*/
public synchronized Normalizer2Impl ensureCanonIterData() {
if(canonIterData==null) {
Trie2Writable newData=new Trie2Writable(0, 0);
MutableCodePointTrie mutableTrie = new MutableCodePointTrie(0, 0);
canonStartSets=new ArrayList<UnicodeSet>();
Iterator<Trie2.Range> trieIterator=normTrie.iterator();
Trie2.Range range;
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
final int norm16=range.value;
int start = 0;
CodePointMap.Range range = new CodePointMap.Range();
while (normTrie.getRange(start, CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, INERT,
null, range)) {
final int end = range.getEnd();
final int norm16 = range.getValue();
if(isInert(norm16) || (minYesNo<=norm16 && norm16<minNoNo)) {
// Inert, or 2-way mapping (including Hangul syllable).
// We do not write a canonStartSet for any yesNo character.
@ -587,10 +613,11 @@ public final class Normalizer2Impl {
// starter's compositions list, and the other characters in
// 2-way mappings get CANON_NOT_SEGMENT_STARTER set because they are
// "maybe" characters.
start = end + 1;
continue;
}
for(int c=range.startCodePoint; c<=range.endCodePoint; ++c) {
final int oldValue=newData.get(c);
for (int c = start; c <= end; ++c) {
final int oldValue = mutableTrie.get(c);
int newValue=oldValue;
if(isMaybeOrNonZeroCC(norm16)) {
// not a segment starter if it occurs in a decomposition or has cc!=0
@ -608,7 +635,7 @@ public final class Normalizer2Impl {
if (isDecompNoAlgorithmic(norm16_2)) {
// Maps to an isCompYesAndZeroCC.
c2 = mapAlgorithmic(c2, norm16_2);
norm16_2 = getNorm16(c2);
norm16_2 = getRawNorm16(c2);
// No compatibility mappings for the CanonicalIterator.
assert(!(isHangulLV(norm16_2) || isHangulLVT(norm16_2)));
}
@ -628,36 +655,43 @@ public final class Normalizer2Impl {
// add c to first code point's start set
int limit=mapping+length;
c2=extraData.codePointAt(mapping);
addToStartSet(newData, c, c2);
addToStartSet(mutableTrie, c, c2);
// Set CANON_NOT_SEGMENT_STARTER for each remaining code point of a
// one-way mapping. A 2-way mapping is possible here after
// intermediate algorithmic mapping.
if(norm16_2>=minNoNo) {
while((mapping+=Character.charCount(c2))<limit) {
c2=extraData.codePointAt(mapping);
int c2Value=newData.get(c2);
int c2Value = mutableTrie.get(c2);
if((c2Value&CANON_NOT_SEGMENT_STARTER)==0) {
newData.set(c2, c2Value|CANON_NOT_SEGMENT_STARTER);
mutableTrie.set(c2, c2Value|CANON_NOT_SEGMENT_STARTER);
}
}
}
}
} else {
// c decomposed to c2 algorithmically; c has cc==0
addToStartSet(newData, c, c2);
addToStartSet(mutableTrie, c, c2);
}
}
if(newValue!=oldValue) {
newData.set(c, newValue);
mutableTrie.set(c, newValue);
}
}
start = end + 1;
}
canonIterData=newData.toTrie2_32();
canonIterData = mutableTrie.buildImmutable(
CodePointTrie.Type.SMALL, CodePointTrie.ValueWidth.BITS_32);
}
return this;
}
public int getNorm16(int c) { return normTrie.get(c); }
// The trie stores values for lead surrogate code *units*.
// Surrogate code *points* are inert.
public int getNorm16(int c) {
return UTF16Plus.isLeadSurrogate(c) ? INERT : normTrie.get(c);
}
public int getRawNorm16(int c) { return normTrie.get(c); }
public int getCompQuickCheck(int norm16) {
if(norm16<minNoNo || MIN_YES_YES_WITH_CC<=norm16) {
@ -730,7 +764,7 @@ public final class Normalizer2Impl {
}
// Maps to an isCompYesAndZeroCC.
c=mapAlgorithmic(c, norm16);
norm16=getNorm16(c);
norm16 = getRawNorm16(c);
}
}
if(norm16<=minYesNo || isHangulLVT(norm16)) {
@ -763,7 +797,7 @@ public final class Normalizer2Impl {
// Maps to an isCompYesAndZeroCC.
decomp=c=mapAlgorithmic(c, norm16);
// The mapping might decompose further.
norm16 = getNorm16(c);
norm16 = getRawNorm16(c);
}
if (norm16 < minYesNo) {
if(decomp<0) {
@ -857,7 +891,7 @@ public final class Normalizer2Impl {
set.add(value);
}
if((canonValue&CANON_HAS_COMPOSITIONS)!=0) {
int norm16=getNorm16(c);
int norm16 = getRawNorm16(c);
if(norm16==JAMO_L) {
int syllable=Hangul.HANGUL_BASE+(c-Hangul.JAMO_L_BASE)*Hangul.JAMO_VT_COUNT;
set.add(syllable, syllable+Hangul.JAMO_VT_COUNT-1);
@ -975,27 +1009,23 @@ public final class Normalizer2Impl {
// count code units below the minimum or with irrelevant data for the quick check
for(prevSrc=src; src!=limit;) {
if( (c=s.charAt(src))<minNoCP ||
isMostDecompYesAndZeroCC(norm16=normTrie.getFromU16SingleLead((char)c))
isMostDecompYesAndZeroCC(norm16=normTrie.bmpGet(c))
) {
++src;
} else if(!UTF16.isSurrogate((char)c)) {
} else if (!UTF16Plus.isLeadSurrogate(c)) {
break;
} else {
char c2;
if(UTF16Plus.isSurrogateLead(c)) {
if((src+1)!=limit && Character.isLowSurrogate(c2=s.charAt(src+1))) {
c=Character.toCodePoint((char)c, c2);
if ((src + 1) != limit && Character.isLowSurrogate(c2 = s.charAt(src + 1))) {
c = Character.toCodePoint((char)c, c2);
norm16 = normTrie.suppGet(c);
if (isMostDecompYesAndZeroCC(norm16)) {
src += 2;
} else {
break;
}
} else /* trail surrogate */ {
if(prevSrc<src && Character.isHighSurrogate(c2=s.charAt(src-1))) {
--src;
c=Character.toCodePoint(c2, (char)c);
}
}
if(isMostDecompYesAndZeroCC(norm16=getNorm16(c))) {
src+=Character.charCount(c);
} else {
break;
++src; // unpaired lead surrogate: inert
}
}
}
@ -1055,7 +1085,7 @@ public final class Normalizer2Impl {
c=Character.codePointAt(s, src);
cc=getCC(getNorm16(c));
};
buffer.append(s, 0, src, firstCC, prevCC);
buffer.append(s, 0, src, false, firstCC, prevCC);
buffer.append(s, src, limit);
}
@ -1083,28 +1113,22 @@ public final class Normalizer2Impl {
return true;
}
if( (c=s.charAt(src))<minNoMaybeCP ||
isCompYesAndZeroCC(norm16=normTrie.getFromU16SingleLead((char)c))
isCompYesAndZeroCC(norm16=normTrie.bmpGet(c))
) {
++src;
} else {
prevSrc = src++;
if(!UTF16.isSurrogate((char)c)) {
if (!UTF16Plus.isLeadSurrogate(c)) {
break;
} else {
char c2;
if(UTF16Plus.isSurrogateLead(c)) {
if(src!=limit && Character.isLowSurrogate(c2=s.charAt(src))) {
++src;
c=Character.toCodePoint((char)c, c2);
if (src != limit && Character.isLowSurrogate(c2 = s.charAt(src))) {
++src;
c = Character.toCodePoint((char)c, c2);
norm16 = normTrie.suppGet(c);
if (!isCompYesAndZeroCC(norm16)) {
break;
}
} else /* trail surrogate */ {
if(prevBoundary<prevSrc && Character.isHighSurrogate(c2=s.charAt(prevSrc-1))) {
--prevSrc;
c=Character.toCodePoint(c2, (char)c);
}
}
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
break;
}
}
}
@ -1325,28 +1349,22 @@ public final class Normalizer2Impl {
return (src<<1)|qcResult; // "yes" or "maybe"
}
if( (c=s.charAt(src))<minNoMaybeCP ||
isCompYesAndZeroCC(norm16=normTrie.getFromU16SingleLead((char)c))
isCompYesAndZeroCC(norm16=normTrie.bmpGet(c))
) {
++src;
} else {
prevSrc = src++;
if(!UTF16.isSurrogate((char)c)) {
if (!UTF16Plus.isLeadSurrogate(c)) {
break;
} else {
char c2;
if(UTF16Plus.isSurrogateLead(c)) {
if(src!=limit && Character.isLowSurrogate(c2=s.charAt(src))) {
++src;
c=Character.toCodePoint((char)c, c2);
if (src != limit && Character.isLowSurrogate(c2 = s.charAt(src))) {
++src;
c = Character.toCodePoint((char)c, c2);
norm16 = normTrie.suppGet(c);
if (!isCompYesAndZeroCC(norm16)) {
break;
}
} else /* trail surrogate */ {
if(prevBoundary<prevSrc && Character.isHighSurrogate(c2=s.charAt(prevSrc-1))) {
--prevSrc;
c=Character.toCodePoint(c2, (char)c);
}
}
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
break;
}
}
}
@ -1468,17 +1486,10 @@ public final class Normalizer2Impl {
prevFCD16=0;
++src;
} else {
if(UTF16.isSurrogate((char)c)) {
if (UTF16Plus.isLeadSurrogate(c)) {
char c2;
if(UTF16Plus.isSurrogateLead(c)) {
if((src+1)!=limit && Character.isLowSurrogate(c2=s.charAt(src+1))) {
c=Character.toCodePoint((char)c, c2);
}
} else /* trail surrogate */ {
if(prevSrc<src && Character.isHighSurrogate(c2=s.charAt(src-1))) {
--src;
c=Character.toCodePoint(c2, (char)c);
}
if ((src + 1) != limit && Character.isLowSurrogate(c2 = s.charAt(src + 1))) {
c = Character.toCodePoint((char)c, c2);
}
}
if((fcd16=getFCD16FromNormData(c))<=0xff) {
@ -1810,7 +1821,7 @@ public final class Normalizer2Impl {
}
// Maps to an isCompYesAndZeroCC.
c=mapAlgorithmic(c, norm16);
norm16=getNorm16(c);
norm16 = getRawNorm16(c);
}
if (norm16 < minYesNo) {
// c does not decompose
@ -1831,7 +1842,7 @@ public final class Normalizer2Impl {
leadCC=0;
}
++mapping; // skip over the firstUnit
buffer.append(extraData, mapping, mapping+length, leadCC, trailCC);
buffer.append(extraData, mapping, mapping+length, true, leadCC, trailCC);
}
}
@ -1921,7 +1932,7 @@ public final class Normalizer2Impl {
}
int composite=compositeAndFwd>>1;
if((compositeAndFwd&1)!=0) {
addComposites(getCompositionsListForComposite(getNorm16(composite)), set);
addComposites(getCompositionsListForComposite(getRawNorm16(composite)), set);
}
set.add(composite);
} while((firstUnit&COMP_1_LAST_TUPLE)==0);
@ -2045,7 +2056,7 @@ public final class Normalizer2Impl {
// Is the composite a starter that combines forward?
if((compositeAndFwd&1)!=0) {
compositionsList=
getCompositionsListForComposite(getNorm16(composite));
getCompositionsListForComposite(getRawNorm16(composite));
} else {
compositionsList=-1;
}
@ -2083,7 +2094,7 @@ public final class Normalizer2Impl {
}
public int composePair(int a, int b) {
int norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16=0
int norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16
int list;
if(isInert(norm16)) {
return -1;
@ -2220,19 +2231,19 @@ public final class Normalizer2Impl {
return getFCD16(Character.codePointBefore(s, p));
}
private void addToStartSet(Trie2Writable newData, int origin, int decompLead) {
int canonValue=newData.get(decompLead);
private void addToStartSet(MutableCodePointTrie mutableTrie, int origin, int decompLead) {
int canonValue = mutableTrie.get(decompLead);
if((canonValue&(CANON_HAS_SET|CANON_VALUE_MASK))==0 && origin!=0) {
// origin is the first character whose decomposition starts with
// the character for which we are setting the value.
newData.set(decompLead, canonValue|origin);
mutableTrie.set(decompLead, canonValue|origin);
} else {
// origin is not the first character, or it is U+0000.
UnicodeSet set;
if((canonValue&CANON_HAS_SET)==0) {
int firstOrigin=canonValue&CANON_VALUE_MASK;
canonValue=(canonValue&~CANON_VALUE_MASK)|CANON_HAS_SET|canonStartSets.size();
newData.set(decompLead, canonValue);
mutableTrie.set(decompLead, canonValue);
canonStartSets.add(set=new UnicodeSet());
if(firstOrigin!=0) {
set.add(firstOrigin);
@ -2263,12 +2274,12 @@ public final class Normalizer2Impl {
private int centerNoNoDelta;
private int minMaybeYes;
private Trie2_16 normTrie;
private CodePointTrie.Fast16 normTrie;
private String maybeYesCompositions;
private String extraData; // mappings and/or compositions for yesYes, yesNo & noNo characters
private byte[] smallFCD; // [0x100] one bit per 32 BMP code points, set if any FCD!=0
private Trie2_32 canonIterData;
private CodePointTrie canonIterData;
private ArrayList<UnicodeSet> canonStartSets;
// bits in canonIterData
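
Review note on the fast-path loops in this file: a non-inert value for a code unit either ends the span or, for a lead surrogate, triggers a supplementary lookup, and an unpaired lead surrogate is skipped as inert. The following is a minimal, self-contained sketch of that control flow, not the ICU implementation. The class name, the two lookup functions, and the INERT value of 0 are assumptions made for illustration; they stand in for normTrie.bmpGet(), normTrie.suppGet(), and the norm16 predicates.

```java
import java.util.function.IntUnaryOperator;

final class LeadSurrogateScanDemo {
    // Hypothetical "nothing to do" value; the real norm16 constants differ.
    static final int INERT = 0;

    /**
     * Returns the index of the first code unit that needs further processing.
     * unitValue is consulted for every UTF-16 code unit; it is assumed to return
     * a non-INERT marker for lead surrogates so that they leave the fast path.
     * suppValue is consulted only for well-formed surrogate pairs.
     */
    static int spanInert(CharSequence s, IntUnaryOperator unitValue,
                         IntUnaryOperator suppValue) {
        int src = 0;
        int limit = s.length();
        while (src != limit) {
            char c = s.charAt(src);
            if (unitValue.applyAsInt(c) == INERT) {
                ++src;  // fast path: BMP code unit with nothing to do
            } else if (!Character.isHighSurrogate(c)) {
                break;  // a BMP character that needs real work
            } else if (src + 1 != limit && Character.isLowSurrogate(s.charAt(src + 1))) {
                int cp = Character.toCodePoint(c, s.charAt(src + 1));
                if (suppValue.applyAsInt(cp) == INERT) {
                    src += 2;  // inert supplementary code point
                } else {
                    break;
                }
            } else {
                ++src;  // unpaired lead surrogate: treated as inert
            }
        }
        return src;
    }

    public static void main(String[] args) {
        // Toy lookups: lead surrogates are flagged, 'x' needs work, U+10400 needs work.
        IntUnaryOperator unit = c -> Character.isHighSurrogate((char) c) ? 1 : (c == 'x' ? 2 : 0);
        IntUnaryOperator supp = cp -> cp == 0x10400 ? 2 : 0;
        System.out.println(spanInert("ab\uD800c", unit, supp));
        System.out.println(spanInert("a\uD801\uDC00b", unit, supp));
    }
}
```

Trail surrogates never reach the pairing branch in this sketch because the toy unit lookup reports them as inert, which mirrors the removal of the "trail surrogate" branches in the diff.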


@ -10,6 +10,7 @@ package com.ibm.icu.impl;
import java.util.EnumSet;
import com.ibm.icu.impl.Normalizer2Impl.UTF16Plus;
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UCharacterCategory;
import com.ibm.icu.lang.UCharacterDirection;
@ -223,19 +224,31 @@ public final class UTS46 extends IDNA {
promoteAndResetLabelErrors(info);
destLength+=newLength-labelLength;
labelLimit=labelStart+=newLength+1;
} else if(0xdf<=c && c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
continue;
} else if(c<0xdf) {
// pass
} else if(c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
setTransitionalDifferent(info);
if(doMapDevChars) {
destLength=mapDevChars(dest, labelStart, labelLimit);
// Do not increment labelLimit in case c was removed.
// All deviation characters have been mapped, no need to check for them again.
doMapDevChars=false;
} else {
++labelLimit;
// Do not increment labelLimit in case c was removed.
continue;
}
} else if(Character.isSurrogate(c)) {
if(UTF16Plus.isSurrogateLead(c) ?
(labelLimit+1)==destLength ||
!Character.isLowSurrogate(dest.charAt(labelLimit+1)) :
labelLimit==labelStart ||
!Character.isHighSurrogate(dest.charAt(labelLimit-1))) {
// Map an unpaired surrogate to U+FFFD before normalization so that when
// that removes characters we do not turn two unpaired ones into a pair.
addLabelError(info, Error.DISALLOWED);
dest.setCharAt(labelLimit, '\ufffd');
}
} else {
++labelLimit;
}
++labelLimit;
}
// Permit an empty label at the end (0<labelStart==labelLimit==destLength is ok)
// but not an empty label elsewhere nor a completely empty domain name.
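
Review note on the surrogate branch in this hunk: each unpaired surrogate is replaced with U+FFFD before normalization, so that removing characters later cannot turn two unpaired surrogates into a pair. A minimal sketch of that check on a plain StringBuilder follows. The class and method names are hypothetical, and the error reporting done by addLabelError() is reduced to a counter.

```java
final class UnpairedSurrogateDemo {
    /**
     * Replaces every unpaired surrogate code unit in sb with U+FFFD, in place.
     * Returns the number of replacements.
     */
    static int replaceUnpairedSurrogates(StringBuilder sb) {
        int count = 0;
        for (int i = 0; i < sb.length(); ++i) {
            char c = sb.charAt(i);
            if (!Character.isSurrogate(c)) {
                continue;
            }
            boolean unpaired;
            if (Character.isHighSurrogate(c)) {
                // A lead surrogate must be followed by a trail surrogate.
                unpaired = (i + 1) == sb.length()
                        || !Character.isLowSurrogate(sb.charAt(i + 1));
            } else {
                // A trail surrogate must be preceded by a lead surrogate.
                unpaired = i == 0
                        || !Character.isHighSurrogate(sb.charAt(i - 1));
            }
            if (unpaired) {
                sb.setCharAt(i, '\uFFFD');
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("a\uD800\uD801\uDFFE\uDFFFb");
        int replaced = replaceUnpairedSurrogates(sb);
        System.out.println(replaced + " replaced, length " + sb.length());
    }
}
```

Replacing in place keeps the string length unchanged, so indexes such as labelLimit stay valid while the loop continues.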


@ -0,0 +1,460 @@
// © 2018 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html#License
// created: 2018may10 Markus W. Scherer
package com.ibm.icu.util;
import java.util.Iterator;
import java.util.NoSuchElementException;
/**
* Abstract map from Unicode code points (U+0000..U+10FFFF) to integer values.
* This does not implement java.util.Map.
*
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public abstract class CodePointMap implements Iterable<CodePointMap.Range> {
/**
* Selectors for how getRange() should report value ranges overlapping with surrogates.
* Most users should use NORMAL.
*
* @see #getRange
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public enum RangeOption {
/**
* getRange() enumerates all same-value ranges as stored in the trie.
* Most users should use this option.
*
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
NORMAL,
/**
* getRange() enumerates all same-value ranges as stored in the trie,
* except that lead surrogates (U+D800..U+DBFF) are treated as having the
* surrogateValue, which is passed to getRange() as a separate parameter.
* The surrogateValue is not transformed via filter().
* See {@link Character#isHighSurrogate}.
*
* <p>Most users should use NORMAL instead.
*
* <p>This option is useful for tries that map surrogate code *units* to
* special values optimized for UTF-16 string processing
* or for special error behavior for unpaired surrogates,
* but those values are not to be associated with the lead surrogate code *points*.
*
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
FIXED_LEAD_SURROGATES,
/**
* getRange() enumerates all same-value ranges as stored in the trie,
* except that all surrogates (U+D800..U+DFFF) are treated as having the
* surrogateValue, which is passed to getRange() as a separate parameter.
* The surrogateValue is not transformed via filter().
* See {@link Character#isSurrogate}.
*
* <p>Most users should use NORMAL instead.
*
* <p>This option is useful for tries that map surrogate code *units* to
* special values optimized for UTF-16 string processing
* or for special error behavior for unpaired surrogates,
* but those values are not to be associated with the surrogate code *points*.
*
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
FIXED_ALL_SURROGATES
}
/**
* Callback function interface: Modifies a trie value.
* Optionally called by getRange().
* The modified value will be returned by the getRange() function.
*
* <p>Can be used to ignore some of the value bits,
* make a filter for one of several values,
* return a value index computed from the trie value, etc.
*
* @see #getRange
* @see #iterator
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public interface ValueFilter {
/**
* Modifies the trie value.
*
* @param value trie value
* @return modified value
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public int apply(int value);
}
/**
* Range iteration result data.
* Code points from start to end map to the same value.
* The value may have been modified by {@link ValueFilter#apply(int)},
* or it may be the surrogateValue if a RangeOption other than "normal" was used.
*
* @see #getRange
* @see #iterator
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public static final class Range {
private int start;
private int end;
private int value;
/**
* Constructor. Sets start and end to -1 and value to 0.
*
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public Range() {
start = end = -1;
value = 0;
}
/**
* @return the start code point
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public int getStart() { return start; }
/**
* @return the (inclusive) end code point
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public int getEnd() { return end; }
/**
* @return the range value
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public int getValue() { return value; }
/**
* Sets the range. When using {@link #iterator()},
* iteration will resume after the newly set end.
*
* @param start new start code point
* @param end new end code point
* @param value new value
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public void set(int start, int end, int value) {
this.start = start;
this.end = end;
this.value = value;
}
}
private final class RangeIterator implements Iterator<Range> {
private Range range = new Range();
@Override
public boolean hasNext() {
return -1 <= range.end && range.end < 0x10ffff;
}
@Override
public Range next() {
if (getRange(range.end + 1, null, range)) {
return range;
} else {
throw new NoSuchElementException();
}
}
@Override
public final void remove() {
throw new UnsupportedOperationException();
}
}
/**
* Iterates over code points of a string and fetches trie values.
* This does not implement java.util.Iterator.
*
* <pre>
* void onString(CodePointMap map, CharSequence s, int start) {
* CodePointMap.StringIterator iter = map.stringIterator(s, start);
* while (iter.next()) {
* int end = iter.getIndex(); // the code point spans string indexes start..end-1
* useValue(s, start, end, iter.getCodePoint(), iter.getValue());
* start = end;
* }
* }
* </pre>
*
* <p>This class is not intended for public subclassing.
*
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public class StringIterator {
/**
* @internal
* @deprecated This API is ICU internal only.
*/
@Deprecated
protected CharSequence s;
/**
* @internal
* @deprecated This API is ICU internal only.
*/
@Deprecated
protected int sIndex;
/**
* @internal
* @deprecated This API is ICU internal only.
*/
@Deprecated
protected int c;
/**
* @internal
* @deprecated This API is ICU internal only.
*/
@Deprecated
protected int value;
/**
* @internal
* @deprecated This API is ICU internal only.
*/
@Deprecated
protected StringIterator(CharSequence s, int sIndex) {
this.s = s;
this.sIndex = sIndex;
c = -1;
value = 0;
}
/**
* Resets the iterator to a new string and/or a new string index.
*
* @param s string to iterate over
* @param sIndex string index where the iteration will start
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public void reset(CharSequence s, int sIndex) {
this.s = s;
this.sIndex = sIndex;
c = -1;
value = 0;
}
/**
* Reads the next code point, post-increments the string index,
* and gets a value from the trie.
* Sets the trie error value if the code point is an unpaired surrogate.
*
* @return true if the string index was not yet at the end of the string;
* otherwise the iterator did not advance
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public boolean next() {
if (sIndex >= s.length()) {
return false;
}
c = Character.codePointAt(s, sIndex);
sIndex += Character.charCount(c);
value = get(c);
return true;
}
/**
* Reads the previous code point, pre-decrements the string index,
* and gets a value from the trie.
* Sets the trie error value if the code point is an unpaired surrogate.
*
* @return true if the string index was not yet at the start of the string;
* otherwise the iterator did not advance
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public boolean previous() {
if (sIndex <= 0) {
return false;
}
c = Character.codePointBefore(s, sIndex);
sIndex -= Character.charCount(c);
value = get(c);
return true;
}
/**
* @return the string index
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public final int getIndex() { return sIndex; }
/**
* @return the code point
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public final int getCodePoint() { return c; }
/**
* @return the trie value,
* or the trie error value if the code point is an unpaired surrogate
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public final int getValue() { return value; }
}
/**
* Returns the value for a code point as stored in the trie, with range checking.
* Returns the trie error value if c is not in the range 0..U+10FFFF.
*
* @param c the code point
* @return the trie value,
* or the trie error value if the code point is not in the range 0..U+10FFFF
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public abstract int get(int c);
/**
* Sets the range object to a range of code points beginning with the start parameter.
* The range end is the last code point such that
* all those from start to there have the same value.
* Returns false if start is not 0..U+10FFFF.
* Can be used to efficiently iterate over all same-value ranges in a trie.
*
* <p>If the {@link ValueFilter} parameter is not null, then
* the value to be delivered is passed through that filter, and the return value is the end
* of the range where all values are modified to the same actual value.
* The value is unchanged if that parameter is null.
*
* <p>Example:
* <pre>
* int start = 0;
* CodePointMap.Range range = new CodePointMap.Range();
* while (trie.getRange(start, null, range)) {
* int end = range.getEnd();
* int value = range.getValue();
* // Work with the range start..end and its value.
* start = end + 1;
* }
* </pre>
*
* @param start range start
* @param filter an object that may modify the trie data value,
* or null if the values from the trie are to be used unmodified
* @param range the range object that will be set to the code point range and value
* @return true if start is 0..U+10FFFF; otherwise no new range is fetched
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public abstract boolean getRange(int start, ValueFilter filter, Range range);
/**
* Sets the range object to a range of code points beginning with the start parameter.
* The range end is the last code point such that
* all those from start to there have the same value.
* Returns false if start is not 0..U+10FFFF.
*
* <p>Same as the simpler {@link #getRange(int, ValueFilter, Range)} but optionally
* modifies the range if it overlaps with surrogate code points.
*
* @param start range start
* @param option defines whether surrogates are treated normally,
* or as having the surrogateValue; usually {@link RangeOption#NORMAL}
* @param surrogateValue value for surrogates; ignored if option=={@link RangeOption#NORMAL}
* @param filter an object that may modify the trie data value,
* or null if the values from the trie are to be used unmodified
* @param range the range object that will be set to the code point range and value
* @return true if start is 0..U+10FFFF; otherwise no new range is fetched
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public boolean getRange(int start, RangeOption option, int surrogateValue,
ValueFilter filter, Range range) {
assert option != null;
if (!getRange(start, filter, range)) {
return false;
}
if (option == RangeOption.NORMAL) {
return true;
}
int surrEnd = option == RangeOption.FIXED_ALL_SURROGATES ? 0xdfff : 0xdbff;
int end = range.end;
if (end < 0xd7ff || start > surrEnd) {
return true;
}
// The range overlaps with surrogates, or ends just before the first one.
if (range.value == surrogateValue) {
if (end >= surrEnd) {
// Surrogates followed by a non-surrValue range,
// or surrogates are part of a larger surrValue range.
return true;
}
} else {
if (start <= 0xd7ff) {
range.end = 0xd7ff; // Non-surrValue range ends before surrValue surrogates.
return true;
}
// Start is a surrogate with a non-surrValue code *unit* value.
// Return a surrValue code *point* range.
range.value = surrogateValue;
if (end > surrEnd) {
range.end = surrEnd; // Surrogate range ends before non-surrValue rest of range.
return true;
}
}
// See if the surrValue surrogate range can be merged with
// an immediately following range.
if (getRange(surrEnd + 1, filter, range) && range.value == surrogateValue) {
range.start = start;
return true;
}
range.start = start;
range.end = surrEnd;
range.value = surrogateValue;
return true;
}
/**
* Convenience iterator over same-trie-value code point ranges.
* Same as looping over all ranges with {@link #getRange(int, ValueFilter, Range)}
* without filtering.
* Adjacent ranges have different trie values.
*
* <p>The iterator always returns the same Range object.
*
* @return a Range iterator
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
@Override
public Iterator<Range> iterator() {
return new RangeIterator();
}
/**
* Returns an iterator (not a java.util.Iterator) over code points of a string
* for fetching trie values.
*
* @param s string to iterate over
* @param sIndex string index where the iteration will start
* @return the iterator
* @draft ICU 63
* @provisional This API might change or be removed in a future release.
*/
public StringIterator stringIterator(CharSequence s, int sIndex) {
return new StringIterator(s, sIndex);
}
}
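
Review note on the getRange() contract documented in this file: the caller passes a start, receives the end of the same-value range, and resumes at end + 1; an optional filter merges ranges whose values become equal after filtering. The sketch below illustrates that contract on a small int array rather than a real trie. ToyRangeMap, its Range holder, and listRanges() are hypothetical names, and java.util.function.IntUnaryOperator stands in for CodePointMap.ValueFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

final class ToyRangeMap {
    private final int[] values;  // one value per key in 0..values.length-1

    ToyRangeMap(int[] values) {
        this.values = values.clone();
    }

    /** Reusable result holder, analogous to a range object that is set by getRange(). */
    static final class Range {
        int start = -1;
        int end = -1;
        int value = 0;
    }

    private static int apply(IntUnaryOperator filter, int value) {
        return filter == null ? value : filter.applyAsInt(value);
    }

    /** Sets range to the longest run from start with one (filtered) value; false if out of bounds. */
    boolean getRange(int start, IntUnaryOperator filter, Range range) {
        if (start < 0 || start >= values.length) {
            return false;
        }
        int value = apply(filter, values[start]);
        int end = start;
        while (end + 1 < values.length && apply(filter, values[end + 1]) == value) {
            ++end;
        }
        range.start = start;
        range.end = end;
        range.value = value;
        return true;
    }

    /** Enumerates all ranges with the start = end + 1 loop from the Javadoc example. */
    static List<String> listRanges(ToyRangeMap map, IntUnaryOperator filter) {
        List<String> out = new ArrayList<>();
        Range range = new Range();
        int start = 0;
        while (map.getRange(start, filter, range)) {
            out.add(range.start + ".." + range.end + "=" + range.value);
            start = range.end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        ToyRangeMap map = new ToyRangeMap(new int[] {0, 0, 1, 3, 2, 2});
        System.out.println(listRanges(map, null));
        System.out.println(listRanges(map, v -> v & 1));  // keep only the low bit
    }
}
```

With the low-bit filter, the values 1 and 3 collapse into one range, which is the same effect segmentStarterMapper relies on when it masks with CANON_NOT_SEGMENT_STARTER.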

File diff suppressed because it is too large

File diff suppressed because it is too large

icu4j/main/shared/data/icudata.jar Executable file → Normal file

@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:70c249360d5cc010c75203f5add8040cbcc4f33229e1d82d34b6185d69832143
size 12510210
oid sha256:a8be41753876c867630b4e740d692e0ae7ced119086a22cd4844ea7bf174d6f7
size 12509408


@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:93a0bf4221a173b33aeda78f4646092caad816a6832310a89278de249ec18634
oid sha256:55923dda88f8bf3affc2cf6d774a92a49e5fbc4be5583769bfe90fc7f319d2b1
size 92857

icu4j/main/shared/data/testdata.jar Executable file → Normal file

@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:47978ca4c19730c3d4387d9058679115dbf1e21964b993a889a38680fd3dfe47
size 813186
oid sha256:0d399ead8487d2beff526c723212022ba354501bb3777481f16b53241d24a8d1
size 813119


@ -2632,9 +2632,14 @@ public class BasicTest extends TestFmwk {
@Test
public void TestCustomComp() {
String [][] pairs={
{ "\\uD801\\uE000\\uDFFE", "" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
// ICU 63 normalization with CodePointTrie requires inert surrogate code points.
// { "\\uD801\\uE000\\uDFFE", "" },
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE002\\U000110B9\\u0327\\u0345" },
{ "\\uE010\\U000F0011\\uE012", "\\uE011\\uE012" },
{ "\\uE010\\U000F0011\\U000F0011\\uE012", "\\uE011\\U000F0010" },
@ -2661,9 +2666,14 @@ public class BasicTest extends TestFmwk {
@Test
public void TestCustomFCC() {
String[][] pairs={
{ "\\uD801\\uE000\\uDFFE", "" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
// ICU 63 normalization with CodePointTrie requires inert surrogate code points.
// { "\\uD801\\uE000\\uDFFE", "" },
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
// The following expected result is different from CustomComp
// because of only-contiguous composition.
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE001\\U000110B9\\u0327\\u0308\\u0345" },


@ -0,0 +1,985 @@
// © 2018 and later: Unicode, Inc. and others.
// License & terms of use: http://www.unicode.org/copyright.html#License
// created: 2018jul10 Markus W. Scherer
// This is a fairly straight port from cintltst/ucptrietest.c.
// It wants to remain close to the C code, rather than be completely colloquial Java.
package com.ibm.icu.dev.test.util;
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import com.ibm.icu.dev.test.TestFmwk;
import com.ibm.icu.impl.Normalizer2Impl.UTF16Plus;
import com.ibm.icu.util.CodePointMap;
import com.ibm.icu.util.CodePointTrie;
import com.ibm.icu.util.MutableCodePointTrie;
@RunWith(JUnit4.class)
public final class CodePointTrieTest extends TestFmwk {
/* Values for setting possibly overlapping, out-of-order ranges of values */
private static class SetRange {
SetRange(int start, int limit, int value) {
this.start = start;
this.limit = limit;
this.value = value;
}
final int start, limit;
final int value;
}
// Returned from getSpecialValues(). Values extracted from an array of CheckRange.
private static class SpecialValues {
SpecialValues(int i, int initialValue, int errorValue) {
this.i = i;
this.initialValue = initialValue;
this.errorValue = errorValue;
}
final int i;
final int initialValue;
final int errorValue;
}
/*
* Values for testing:
* value is set from the previous boundary's limit to before
* this boundary's limit
*
* There must be an entry with limit 0 and the initialValue.
* It may be preceded by an entry with negative limit and the errorValue.
*/
private static class CheckRange {
CheckRange(int limit, int value) {
this.limit = limit;
this.value = value;
}
final int limit;
final int value;
}
private static int skipSpecialValues(CheckRange checkRanges[]) {
int i;
for(i=0; i<checkRanges.length && checkRanges[i].limit<=0; ++i) {}
return i;
}
private static SpecialValues getSpecialValues(CheckRange checkRanges[]) {
int i=0;
int initialValue, errorValue;
if(i<checkRanges.length && checkRanges[i].limit<0) {
errorValue=checkRanges[i++].value;
} else {
errorValue=0xad;
}
if(i<checkRanges.length && checkRanges[i].limit==0) {
initialValue=checkRanges[i++].value;
} else {
initialValue=0;
}
return new SpecialValues(i, initialValue, errorValue);
}
/* ucptrie_enum() callback, modifies a value */
private static class TestValueFilter implements CodePointMap.ValueFilter {
@Override
public int apply(int value) {
return value ^ 0x5555;
}
}
private static final TestValueFilter testFilter = new TestValueFilter();
private boolean
doCheckRange(String name, String variant,
int start, boolean getRangeResult, CodePointMap.Range range,
int expEnd, int expValue) {
if (!getRangeResult) {
if (expEnd >= 0) {
fail(String.format( // log_err(
"error: %s getRanges (%s) fails to deliver range [U+%04x..U+%04x].0x%x\n",
name, variant, start, expEnd, expValue));
}
return false;
}
if (expEnd < 0) {
fail(String.format(
"error: %s getRanges (%s) delivers unexpected range [U+%04x..U+%04x].0x%x\n",
name, variant, range.getStart(), range.getEnd(), range.getValue()));
return false;
}
if (range.getStart() != start || range.getEnd() != expEnd || range.getValue() != expValue) {
fail(String.format(
"error: %s getRanges (%s) delivers wrong range [U+%04x..U+%04x].0x%x " +
"instead of [U+%04x..U+%04x].0x%x\n",
name, variant, range.getStart(), range.getEnd(), range.getValue(),
start, expEnd, expValue));
return false;
}
return true;
}
// Test iteration starting from various UTF-8/16 and trie structure boundaries.
// Also test starting partway through lead & trail surrogates for fixed-surrogate-value options,
// and partway through supplementary code points.
private static int iterStarts[] = {
0, 0x7f, 0x80, 0x7ff, 0x800, 0xfff, 0x1000,
0xd7ff, 0xd800, 0xd888, 0xdddd, 0xdfff, 0xe000,
0xffff, 0x10000, 0x12345, 0x10ffff, 0x110000
};
private void
testTrieGetRanges(String testName, CodePointMap trie,
CodePointMap.RangeOption option, int surrValue,
CheckRange checkRanges[]) {
String typeName = trie instanceof MutableCodePointTrie ? "mutableTrie" : "trie";
CodePointMap.Range range = new CodePointMap.Range();
for (int s = 0; s < iterStarts.length; ++s) {
int start = iterStarts[s];
int i, i0;
int expEnd;
int expValue;
boolean getRangeResult;
// No need to go from each iteration start to the very end.
int innerLoopCount;
String name = String.format("%s/%s(%s) min=U+%04x", typeName, option, testName, start);
// Skip over special values and low ranges.
for (i = 0; i < checkRanges.length && checkRanges[i].limit <= start; ++i) {}
i0 = i;
// without value handler
for (innerLoopCount = 0;; ++i, start = range.getEnd() + 1) {
if (i < checkRanges.length) {
expEnd = checkRanges[i].limit - 1;
expValue = checkRanges[i].value;
} else {
expEnd = -1;
expValue = 0x5005;
}
getRangeResult = option != CodePointMap.RangeOption.NORMAL ?
trie.getRange(start, option, surrValue, null, range) :
trie.getRange(start, null, range);
if (!doCheckRange(name, "without value handler",
start, getRangeResult, range, expEnd, expValue)) {
break;
}
if (s != 0 && ++innerLoopCount == 5) { break; }
}
// with value handler
for (i = i0, start = iterStarts[s], innerLoopCount = 0;;
++i, start = range.getEnd() + 1) {
if (i < checkRanges.length) {
expEnd = checkRanges[i].limit - 1;
expValue = checkRanges[i].value ^ 0x5555;
} else {
expEnd = -1;
expValue = 0x5005;
}
getRangeResult = trie.getRange(start, option, surrValue ^ 0x5555, testFilter, range);
if (!doCheckRange(name, "with value handler",
start, getRangeResult, range, expEnd, expValue)) {
break;
}
if (s != 0 && ++innerLoopCount == 5) { break; }
}
// C also tests without value (with a NULL value pointer),
// but that does not apply to Java.
}
}
// Note: There is much less to do here in polymorphic Java than in C
// where we have many specialized macros in addition to generic functions.
private void
testTrieGetters(String testName, CodePointTrie trie,
CodePointTrie.Type type, CodePointTrie.ValueWidth valueWidth,
CheckRange checkRanges[]) {
int value, value2;
int start, limit;
int i;
int countErrors=0;
CodePointTrie.Fast fastTrie =
type == CodePointTrie.Type.FAST ? (CodePointTrie.Fast)trie : null;
String typeName = "trie";
SpecialValues specials = getSpecialValues(checkRanges);
start=0;
for(i=specials.i; i<checkRanges.length; ++i) {
limit=checkRanges[i].limit;
value=checkRanges[i].value;
while(start<limit) {
if (start <= 0x7f) {
value2 = trie.asciiGet(start);
if (value != value2) {
fail(String.format(
"error: %s(%s).asciiGet(U+%04x)==0x%x instead of 0x%x\n",
typeName, testName, start, value2, value));
++countErrors;
}
}
if (fastTrie != null) {
if(start<=0xffff) {
value2 = fastTrie.bmpGet(start);
if(value!=value2) {
fail(String.format(
"error: %s(%s).bmpGet(U+%04x)==0x%x instead of 0x%x\n",
typeName, testName, start, value2, value));
++countErrors;
}
} else {
value2 = fastTrie.suppGet(start);
if(value!=value2) {
fail(String.format(
"error: %s(%s).suppGet(U+%04x)==0x%x instead of 0x%x\n",
typeName, testName, start, value2, value));
++countErrors;
}
}
}
value2 = trie.get(start);
if(value!=value2) {
fail(String.format(
"error: %s(%s).get(U+%04x)==0x%x instead of 0x%x\n",
typeName, testName, start, value2, value));
++countErrors;
}
++start;
if(countErrors>10) {
return;
}
}
}
/* test errorValue */
value = trie.get(-1);
value2 = trie.get(0x110000);
if(value!=specials.errorValue || value2!=specials.errorValue) {
fail(String.format(
"error: %s(%s).get(out of range) != errorValue\n",
typeName, testName));
}
}
private void
testBuilderGetters(String testName, MutableCodePointTrie mutableTrie, CheckRange checkRanges[]) {
int value, value2;
int start, limit;
int i;
int countErrors=0;
String typeName = "mutableTrie";
SpecialValues specials=getSpecialValues(checkRanges);
start=0;
for(i=specials.i; i<checkRanges.length; ++i) {
limit=checkRanges[i].limit;
value=checkRanges[i].value;
while(start<limit) {
value2=mutableTrie.get(start);
if(value!=value2) {
fail(String.format(
"error: %s(%s).get(U+%04x)==0x%x instead of 0x%x\n",
typeName, testName, start, value2, value));
++countErrors;
}
++start;
if(countErrors>10) {
return;
}
}
}
/* test errorValue */
value=mutableTrie.get(-1);
value2=mutableTrie.get(0x110000);
if(value!=specials.errorValue || value2!=specials.errorValue) {
fail(String.format(
"error: %s(%s).get(out of range) != errorValue\n",
typeName, testName));
}
}
private static boolean ACCIDENTAL_SURROGATE_PAIR(CharSequence s, int cp) {
return s.length() > 0 &&
Character.isHighSurrogate(s.charAt(s.length() - 1)) &&
UTF16Plus.isTrailSurrogate(cp);
}
private void
testTrieUTF16(String testName,
CodePointTrie trie, CodePointTrie.ValueWidth valueWidth,
CheckRange checkRanges[]) {
StringBuilder s = new StringBuilder();
int[] values = new int[16000];
int errorValue = trie.get(-1);
int value, expected;
int prevCP, c, c2;
int i, sIndex, countValues;
/* write a string */
prevCP=0;
countValues=0;
for(i=skipSpecialValues(checkRanges); i<checkRanges.length; ++i) {
value=checkRanges[i].value;
/* write three code points */
if(!ACCIDENTAL_SURROGATE_PAIR(s, prevCP)) {
s.appendCodePoint(prevCP); /* start of the range */
values[countValues++]=value;
}
c=checkRanges[i].limit;
prevCP=(prevCP+c)/2; /* middle of the range */
if(!ACCIDENTAL_SURROGATE_PAIR(s, prevCP)) {
s.appendCodePoint(prevCP);
values[countValues++]=value;
}
prevCP=c;
--c; /* end of the range */
if(!ACCIDENTAL_SURROGATE_PAIR(s, c)) {
s.appendCodePoint(c);
values[countValues++]=value;
}
}
CodePointMap.StringIterator si = trie.stringIterator(s, 0);
/* try forward */
sIndex = 0;
i=0;
while (sIndex < s.length()) {
c2 = s.codePointAt(sIndex);
sIndex += Character.charCount(c2);
assertTrue("next() at " + si.getIndex(), si.next());
c = si.getCodePoint();
value = si.getValue();
expected = UTF16Plus.isSurrogate(c) ? errorValue : values[i];
if(value!=expected) {
fail(String.format(
"error: wrong value from StringIterator.next() (%s)(U+%04x): 0x%x instead of 0x%x\n",
testName, c, value, expected));
}
if(c!=c2) {
fail(String.format(
"error: wrong code point from StringIterator.next() (%s): U+%04x != U+%04x\n",
testName, c, c2));
continue;
}
++i;
}
assertFalse("next() at the end", si.next());
/* try backward */
sIndex = s.length();
i=countValues;
while (sIndex > 0) {
--i;
c2 = s.codePointBefore(sIndex);
sIndex -= Character.charCount(c2);
assertTrue("previous() at " + si.getIndex(), si.previous());
c = si.getCodePoint();
value = si.getValue();
expected = UTF16Plus.isSurrogate(c) ? errorValue : values[i];
if(value!=expected) {
fail(String.format(
"error: wrong value from StringIterator.previous() (%s)(U+%04x): 0x%x instead of 0x%x\n",
testName, c, value, expected));
}
if(c!=c2) {
fail(String.format(
"error: wrong code point from StringIterator.previous() (%s): U+%04x != U+%04x\n",
testName, c, c2));
}
}
assertFalse("previous() at the start", si.previous());
}
private void
testTrie(String testName, CodePointTrie trie,
CodePointTrie.Type type, CodePointTrie.ValueWidth valueWidth,
CheckRange checkRanges[]) {
testTrieGetters(testName, trie, type, valueWidth, checkRanges);
testTrieGetRanges(testName, trie, CodePointMap.RangeOption.NORMAL, 0, checkRanges);
if (type == CodePointTrie.Type.FAST) {
testTrieUTF16(testName, trie, valueWidth, checkRanges);
// Java: no testTrieUTF8(testName, trie, valueWidth, checkRanges);
}
}
private void
testBuilder(String testName, MutableCodePointTrie mutableTrie, CheckRange checkRanges[]) {
testBuilderGetters(testName, mutableTrie, checkRanges);
testTrieGetRanges(testName, mutableTrie, CodePointMap.RangeOption.NORMAL, 0, checkRanges);
}
private void
testTrieSerialize(String testName, MutableCodePointTrie mutableTrie,
CodePointTrie.Type type, CodePointTrie.ValueWidth valueWidth, boolean withSwap,
CheckRange checkRanges[]) {
CodePointTrie trie;
int length1;
/* clone the trie so that the caller can reuse the original */
mutableTrie = mutableTrie.clone();
/*
* This is not a loop, but simply a block that we can exit with "break"
* when something goes wrong.
*/
do {
trie = mutableTrie.buildImmutable(type, valueWidth);
ByteArrayOutputStream os = new ByteArrayOutputStream();
length1=trie.toBinary(os);
assertEquals(testName + ".toBinary() length", os.size(), length1);
ByteBuffer storage = ByteBuffer.wrap(os.toByteArray());
// Java: no preflighting
testTrie(testName, trie, type, valueWidth, checkRanges);
trie=null;
// Java: There is no code for "swapping" the endianness of data.
// withSwap is unused.
trie = CodePointTrie.fromBinary(type, valueWidth, storage);
if(type != trie.getType()) {
fail(String.format(
"error: trie serialization (%s) did not preserve trie type\n", testName));
break;
}
if(valueWidth != trie.getValueWidth()) {
fail(String.format(
"error: trie serialization (%s) did not preserve data value width\n", testName));
break;
}
if(os.size()!=storage.position()) {
fail(String.format(
"error: trie serialization (%s) lengths different: " +
"serialize vs. unserialize\n", testName));
break;
}
{
storage.rewind();
CodePointTrie any = CodePointTrie.fromBinary(null, null, storage);
if (type != any.getType()) {
fail(String.format(
"error: (%s) CodePointTrie.fromBinary(null, null, storage).getType() wrong\n",
testName));
}
if (valueWidth != any.getValueWidth()) {
fail(String.format(
"error: (%s) CodePointTrie.fromBinary(null, null, storage).getValueWidth() wrong\n",
testName));
}
}
testTrie(testName, trie, type, valueWidth, checkRanges);
{
/* make a mutable trie from an immutable one */
int value, value2;
MutableCodePointTrie mutable2 = MutableCodePointTrie.fromCodePointMap(trie);
value=mutable2.get(0xa1);
mutable2.set(0xa1, 789);
value2=mutable2.get(0xa1);
mutable2.set(0xa1, value);
if(value2!=789) {
fail(String.format(
"error: modifying a MutableCodePointTrie made from a CodePointTrie (%s) failed\n",
testName));
}
testBuilder(testName, mutable2, checkRanges);
}
} while(false);
}
private MutableCodePointTrie
testTrieSerializeAllValueWidth(String testName,
MutableCodePointTrie mutableTrie, boolean withClone,
CheckRange checkRanges[]) {
int oredValues = 0;
int i;
for (i = 0; i < checkRanges.length; ++i) {
oredValues |= checkRanges[i].value;
}
testBuilder(testName, mutableTrie, checkRanges);
if (oredValues <= 0xffff) {
String name = testName + ".16";
testTrieSerialize(name, mutableTrie,
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_16, withClone,
checkRanges);
}
String name = testName + ".32";
testTrieSerialize(name, mutableTrie,
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_32, withClone,
checkRanges);
if (oredValues <= 0xff) {
name = testName + ".8";
testTrieSerialize(name, mutableTrie,
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_8, withClone,
checkRanges);
}
if (oredValues <= 0xffff) {
name = testName + ".small16";
testTrieSerialize(name, mutableTrie,
CodePointTrie.Type.SMALL, CodePointTrie.ValueWidth.BITS_16, withClone,
checkRanges);
}
return mutableTrie;
}
private MutableCodePointTrie
makeTrieWithRanges(String testName, boolean withClone,
SetRange setRanges[], CheckRange checkRanges[]) {
MutableCodePointTrie mutableTrie;
int value;
int start, limit;
int i;
System.out.println("\ntesting Trie " + testName);
SpecialValues specials = getSpecialValues(checkRanges);
mutableTrie = new MutableCodePointTrie(specials.initialValue, specials.errorValue);
/* set values from setRanges[] */
for(i=0; i<setRanges.length; ++i) {
if(withClone && i==setRanges.length/2) {
/* switch to a clone in the middle of setting values */
MutableCodePointTrie clone = mutableTrie.clone();
mutableTrie = clone;
}
start=setRanges[i].start;
limit=setRanges[i].limit;
value=setRanges[i].value;
if ((limit - start) == 1) {
mutableTrie.set(start, value);
} else {
mutableTrie.setRange(start, limit-1, value);
}
}
return mutableTrie;
}
private void
testTrieRanges(String testName, boolean withClone, SetRange setRanges[], CheckRange checkRanges[]) {
MutableCodePointTrie mutableTrie = makeTrieWithRanges(
testName, withClone, setRanges, checkRanges);
if (mutableTrie != null) {
mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, withClone, checkRanges);
}
}
/* test data ----------------------------------------------------------------*/
/* set consecutive ranges, even with value 0 */
private static final SetRange
setRanges1[]={
new SetRange(0, 0x40, 0 ),
new SetRange(0x40, 0xe7, 0x34),
new SetRange(0xe7, 0x3400, 0 ),
new SetRange(0x3400, 0x9fa6, 0x61),
new SetRange(0x9fa6, 0xda9e, 0x31),
new SetRange(0xdada, 0xeeee, 0xff),
new SetRange(0xeeee, 0x11111, 1 ),
new SetRange(0x11111, 0x44444, 0x61),
new SetRange(0x44444, 0x60003, 0 ),
new SetRange(0xf0003, 0xf0004, 0xf ),
new SetRange(0xf0004, 0xf0006, 0x10),
new SetRange(0xf0006, 0xf0007, 0x11),
new SetRange(0xf0007, 0xf0040, 0x12),
new SetRange(0xf0040, 0x110000, 0 )
};
private static final CheckRange
checkRanges1[]={
new CheckRange(0, 0),
new CheckRange(0x40, 0),
new CheckRange(0xe7, 0x34),
new CheckRange(0x3400, 0),
new CheckRange(0x9fa6, 0x61),
new CheckRange(0xda9e, 0x31),
new CheckRange(0xdada, 0),
new CheckRange(0xeeee, 0xff),
new CheckRange(0x11111, 1),
new CheckRange(0x44444, 0x61),
new CheckRange(0xf0003, 0),
new CheckRange(0xf0004, 0xf),
new CheckRange(0xf0006, 0x10),
new CheckRange(0xf0007, 0x11),
new CheckRange(0xf0040, 0x12),
new CheckRange(0x110000, 0)
};
/* set some interesting overlapping ranges */
private static final SetRange
setRanges2[]={
new SetRange(0x21, 0x7f, 0x5555),
new SetRange(0x2f800, 0x2fedc, 0x7a ),
new SetRange(0x72, 0xdd, 3 ),
new SetRange(0xdd, 0xde, 4 ),
new SetRange(0x201, 0x240, 6 ), /* 3 consecutive blocks with the same pattern but */
new SetRange(0x241, 0x280, 6 ), /* discontiguous value ranges, testing iteration */
new SetRange(0x281, 0x2c0, 6 ),
new SetRange(0x2f987, 0x2fa98, 5 ),
new SetRange(0x2f777, 0x2f883, 0 ),
new SetRange(0x2fedc, 0x2ffaa, 1 ),
new SetRange(0x2ffaa, 0x2ffab, 2 ),
new SetRange(0x2ffbb, 0x2ffc0, 7 )
};
private static final CheckRange
checkRanges2[]={
new CheckRange(0, 0),
new CheckRange(0x21, 0),
new CheckRange(0x72, 0x5555),
new CheckRange(0xdd, 3),
new CheckRange(0xde, 4),
new CheckRange(0x201, 0),
new CheckRange(0x240, 6),
new CheckRange(0x241, 0),
new CheckRange(0x280, 6),
new CheckRange(0x281, 0),
new CheckRange(0x2c0, 6),
new CheckRange(0x2f883, 0),
new CheckRange(0x2f987, 0x7a),
new CheckRange(0x2fa98, 5),
new CheckRange(0x2fedc, 0x7a),
new CheckRange(0x2ffaa, 1),
new CheckRange(0x2ffab, 2),
new CheckRange(0x2ffbb, 0),
new CheckRange(0x2ffc0, 7),
new CheckRange(0x110000, 0)
};
/* use a non-zero initial value */
private static final SetRange
setRanges3[]={
new SetRange(0x31, 0xa4, 1),
new SetRange(0x3400, 0x6789, 2),
new SetRange(0x8000, 0x89ab, 9),
new SetRange(0x9000, 0xa000, 4),
new SetRange(0xabcd, 0xbcde, 3),
new SetRange(0x55555, 0x110000, 6), /* highStart<U+ffff with non-initialValue */
new SetRange(0xcccc, 0x55555, 6)
};
private static final CheckRange
checkRanges3[]={
new CheckRange(0, 9), /* non-zero initialValue */
new CheckRange(0x31, 9),
new CheckRange(0xa4, 1),
new CheckRange(0x3400, 9),
new CheckRange(0x6789, 2),
new CheckRange(0x9000, 9),
new CheckRange(0xa000, 4),
new CheckRange(0xabcd, 9),
new CheckRange(0xbcde, 3),
new CheckRange(0xcccc, 9),
new CheckRange(0x110000, 6)
};
/* empty or single-value tries, testing highStart==0 */
private static final SetRange
setRangesEmpty[]={
// Intentionally empty. (The C version needs a placeholder entry to compile; Java does not.)
};
private static final CheckRange
checkRangesEmpty[]={
new CheckRange(0, 3),
new CheckRange(0x110000, 3)
};
private static final SetRange
setRangesSingleValue[]={
new SetRange(0, 0x110000, 5),
};
private static final CheckRange
checkRangesSingleValue[]={
new CheckRange(0, 3),
new CheckRange(0x110000, 5)
};
@Test
public void TrieTestSet1() {
testTrieRanges("set1", false, setRanges1, checkRanges1);
}
@Test
public void TrieTestSet2Overlap() {
testTrieRanges("set2-overlap", false, setRanges2, checkRanges2);
}
@Test
public void TrieTestSet3Initial9() {
testTrieRanges("set3-initial-9", false, setRanges3, checkRanges3);
}
@Test
public void TrieTestSetEmpty() {
testTrieRanges("set-empty", false, setRangesEmpty, checkRangesEmpty);
}
@Test
public void TrieTestSetSingleValue() {
testTrieRanges("set-single-value", false, setRangesSingleValue, checkRangesSingleValue);
}
@Test
public void TrieTestSet2OverlapWithClone() {
testTrieRanges("set2-overlap.withClone", true, setRanges2, checkRanges2);
}
/* test mutable-trie memory management -------------------------------------- */
@Test
public void FreeBlocksTest() {
final CheckRange
checkRanges[]={
new CheckRange(0, 1),
new CheckRange(0x740, 1),
new CheckRange(0x780, 2),
new CheckRange(0x880, 3),
new CheckRange(0x110000, 1)
};
String testName="free-blocks";
MutableCodePointTrie mutableTrie;
int i;
mutableTrie=new MutableCodePointTrie(1, 0xad);
/*
* Repeatedly set overlapping same-value ranges to stress the free-data-block management.
* If it fails, it will overflow the data array.
*/
for(i=0; i<(0x120000>>4)/2; ++i) { // 4=UCPTRIE_SHIFT_3
mutableTrie.setRange(0x740, 0x840-1, 1);
mutableTrie.setRange(0x780, 0x880-1, 1);
mutableTrie.setRange(0x740, 0x840-1, 2);
mutableTrie.setRange(0x780, 0x880-1, 3);
}
/* make blocks that will be free during compaction */
mutableTrie.setRange(0x1000, 0x3000-1, 2);
mutableTrie.setRange(0x2000, 0x4000-1, 3);
mutableTrie.setRange(0x1000, 0x4000-1, 1);
mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, false, checkRanges);
}
@Test
public void GrowDataArrayTest() {
final CheckRange
checkRanges[]={
new CheckRange(0, 1),
new CheckRange(0x720, 2),
new CheckRange(0x7a0, 3),
new CheckRange(0x8a0, 4),
new CheckRange(0x110000, 5)
};
String testName="grow-data";
MutableCodePointTrie mutableTrie;
int i;
mutableTrie=new MutableCodePointTrie(1, 0xad);
/*
* Use set() rather than setRange() to write non-initialValue data.
* This should grow/reallocate the data array to a sufficient length.
*/
for(i=0; i<0x1000; ++i) {
mutableTrie.set(i, 2);
}
for(i=0x720; i<0x1100; ++i) { /* some overlap */
mutableTrie.set(i, 3);
}
for(i=0x7a0; i<0x900; ++i) {
mutableTrie.set(i, 4);
}
for(i=0x8a0; i<0x110000; ++i) {
mutableTrie.set(i, 5);
}
mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, false, checkRanges);
}
@Test
public void ManyAllSameBlocksTest() {
String testName="many-all-same";
MutableCodePointTrie mutableTrie;
int i;
CheckRange[] checkRanges = new CheckRange[(0x110000 >> 12) + 1];
mutableTrie = new MutableCodePointTrie(0xff33, 0xad);
checkRanges[0] = new CheckRange(0, 0xff33); // initialValue
// Many all-same-value blocks.
for (i = 0; i < 0x110000; i += 0x1000) {
int value = i >> 12;
mutableTrie.setRange(i, i + 0xfff, value);
checkRanges[value + 1] = new CheckRange(i + 0x1000, value);
}
for (i = 0; i < 0x110000; i += 0x1000) {
int expected = i >> 12;
int v0 = mutableTrie.get(i);
int vfff = mutableTrie.get(i + 0xfff);
if (v0 != expected || vfff != expected) {
fail(String.format(
"error: MutableCodePointTrie U+%04x unexpected value\n", i));
}
}
mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, false, checkRanges);
}
@Test
public void MuchDataTest() {
String testName="much-data";
MutableCodePointTrie mutableTrie;
int r, c;
CheckRange[] checkRanges = new CheckRange[(0x10000 >> 6) + (0x10240 >> 4) + 10];
mutableTrie = new MutableCodePointTrie(0xff33, 0xad);
checkRanges[0] = new CheckRange(0, 0xff33); // initialValue
r = 1;
// Add much data that does not compact well,
// to get more than 128k data values after compaction.
for (c = 0; c < 0x10000; c += 0x40) {
int value = c >> 4;
mutableTrie.setRange(c, c + 0x3f, value);
checkRanges[r++] = new CheckRange(c + 0x40, value);
}
checkRanges[r++] = new CheckRange(0x20000, 0xff33);
for (c = 0x20000; c < 0x30230; c += 0x10) {
int value = c >> 4;
mutableTrie.setRange(c, c + 0xf, value);
checkRanges[r++] = new CheckRange(c + 0x10, value);
}
mutableTrie.setRange(0x30230, 0x30233, 0x3023);
checkRanges[r++] = new CheckRange(0x30234, 0x3023);
mutableTrie.setRange(0x30234, 0xdffff, 0x5005);
checkRanges[r++] = new CheckRange(0xe0000, 0x5005);
mutableTrie.setRange(0xe0000, 0x10ffff, 0x9009);
checkRanges[r++] = new CheckRange(0x110000, 0x9009);
checkRanges = Arrays.copyOf(checkRanges, r);
testBuilder(testName, mutableTrie, checkRanges);
testTrieSerialize("much-data.16", mutableTrie,
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_16, false,
checkRanges);
}
private void testGetRangesFixedSurr(String testName, MutableCodePointTrie mutableTrie,
CodePointMap.RangeOption option, CheckRange checkRanges[]) {
testTrieGetRanges(testName, mutableTrie, option, 5, checkRanges);
MutableCodePointTrie clone = mutableTrie.clone();
CodePointTrie trie =
clone.buildImmutable(CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_16);
testTrieGetRanges(testName, trie, option, 5, checkRanges);
}
@Test
public void TrieTestGetRangesFixedSurr() {
final SetRange
setRangesFixedSurr[]={
new SetRange(0xd000, 0xd7ff, 5),
new SetRange(0xd7ff, 0xe001, 3),
new SetRange(0xe001, 0xf900, 5),
};
final CheckRange
checkRangesFixedLeadSurr1[]={
new CheckRange(0, 0),
new CheckRange(0xd000, 0),
new CheckRange(0xd7ff, 5),
new CheckRange(0xd800, 3),
new CheckRange(0xdc00, 5),
new CheckRange(0xe001, 3),
new CheckRange(0xf900, 5),
new CheckRange(0x110000, 0)
};
final CheckRange
checkRangesFixedAllSurr1[]={
new CheckRange(0, 0),
new CheckRange(0xd000, 0),
new CheckRange(0xd7ff, 5),
new CheckRange(0xd800, 3),
new CheckRange(0xe000, 5),
new CheckRange(0xe001, 3),
new CheckRange(0xf900, 5),
new CheckRange(0x110000, 0)
};
final CheckRange
checkRangesFixedLeadSurr3[]={
new CheckRange(0, 0),
new CheckRange(0xd000, 0),
new CheckRange(0xdc00, 5),
new CheckRange(0xe001, 3),
new CheckRange(0xf900, 5),
new CheckRange(0x110000, 0)
};
final CheckRange
checkRangesFixedAllSurr3[]={
new CheckRange(0, 0),
new CheckRange(0xd000, 0),
new CheckRange(0xe000, 5),
new CheckRange(0xe001, 3),
new CheckRange(0xf900, 5),
new CheckRange(0x110000, 0)
};
final CheckRange
checkRangesFixedSurr4[]={
new CheckRange(0, 0),
new CheckRange(0xd000, 0),
new CheckRange(0xf900, 5),
new CheckRange(0x110000, 0)
};
MutableCodePointTrie mutableTrie = makeTrieWithRanges(
"fixedSurr", false, setRangesFixedSurr, checkRangesFixedLeadSurr1);
testGetRangesFixedSurr("fixedLeadSurr1", mutableTrie,
CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, checkRangesFixedLeadSurr1);
testGetRangesFixedSurr("fixedAllSurr1", mutableTrie,
CodePointMap.RangeOption.FIXED_ALL_SURROGATES, checkRangesFixedAllSurr1);
// Setting a range in the middle of lead surrogates makes no difference.
mutableTrie.setRange(0xd844, 0xd899, 5);
testGetRangesFixedSurr("fixedLeadSurr2", mutableTrie,
CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, checkRangesFixedLeadSurr1);
// Bridge the gap before the lead surrogates.
mutableTrie.set(0xd7ff, 5);
testGetRangesFixedSurr("fixedLeadSurr3", mutableTrie,
CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, checkRangesFixedLeadSurr3);
testGetRangesFixedSurr("fixedAllSurr3", mutableTrie,
CodePointMap.RangeOption.FIXED_ALL_SURROGATES, checkRangesFixedAllSurr3);
// Bridge the gap after the trail surrogates.
mutableTrie.set(0xe000, 5);
testGetRangesFixedSurr("fixedSurr4", mutableTrie,
CodePointMap.RangeOption.FIXED_ALL_SURROGATES, checkRangesFixedSurr4);
}
}