mirror of
https://github.com/unicode-org/icu.git
synced 2025-04-05 13:35:32 +00:00
ICU-13530 add UCPTrie/CodePointTrie, switch normalization to use it (#48)
* ICU-13530 copy C/C++ files UTrie2 -> UTrie3 X-SVN-Rev: 40754
* ICU-13530 UTrie3 new files copied from UTrie2: rename types/functions/macros X-SVN-Rev: 40755
* ICU-13530 debug-print building each UTrie2 X-SVN-Rev: 40756
* ICU-13530 remove two-byte-UTF-8 errorValue block; move highValue from end of data array into header; add errorValue to header X-SVN-Rev: 40762
* ICU-13530 UTrie3 U16_NEXT/PREV: errorValue for unpaired surrogates X-SVN-Rev: 40763
* ICU-13530 no more separate values for lead surrogate code units X-SVN-Rev: 40764
* ICU-13530 change from 11:5 trie bits to 10:6 for simpler UTF-8 code X-SVN-Rev: 40766
* ICU-13530 UTrie2 build UTrie3 as well, print sizes X-SVN-Rev: 40767
* ICU-13530 debug-print countSame, sumOverlaps, countInitial X-SVN-Rev: 40768
* ICU-13530 debug-print whether trie is for CanonIterData X-SVN-Rev: 40769
* ICU-13530 no index-shift for BMP data, no separate index-2 for 2-byte UTF-8; builder changes incomplete X-SVN-Rev: 40777
* ICU-13530 remove errorValue and highStart from UNewTrie3 X-SVN-Rev: 40778
* ICU-13530 rewrite UTrie3 builder code X-SVN-Rev: 40783
* ICU-13530 UTrie3 bug fixes X-SVN-Rev: 40788
* ICU-13530 fully re-inline _UTRIE3_U8_NEXT() X-SVN-Rev: 40790
* ICU-13530 find most common all-same data block for dataNullBlock and initialValue X-SVN-Rev: 40792
* ICU-13530 UTrie3 iterator functions take start and return the end of a range, rather than callback call for each range X-SVN-Rev: 40800
* ICU-13530 mask off unused data value bits before building a UTrie3 with values less than 32 bits wide X-SVN-Rev: 40803
* ICU-13530 split utrie3builder.h out of utrie3.h X-SVN-Rev: 40804
* ICU-13530 separate types UTrie3 vs. UTrie3Builder, implement builder as wrapper over C++ class Trie3Builder in .cpp X-SVN-Rev: 40809
* ICU-13530 function to make a UTrie3Builder from a UTrie3 X-SVN-Rev: 40810
* ICU-13530 debug-print some data; some cleanup X-SVN-Rev: 40865
* ICU-13530 BMP 10:6 but supplementary 10:6:4 X-SVN-Rev: 40984
* ICU-13530 move errorValue & highValue to the end of the data table, minimal padding to 4 bytes X-SVN-Rev: 41011
* ICU-13530 index-1 table gap of index-2 null blocks X-SVN-Rev: 41018
* ICU-13530 test with more than 128k compacted data X-SVN-Rev: 41034
* ICU-13530 supplementary bits 11:5:4 saves a little space X-SVN-Rev: 41039
* ICU-13530 supplementary bits 6:5:5:4 instead of gap: about same size but simpler X-SVN-Rev: 41050
* ICU-13530 remove unnecessary utrie3_clone(built trie) X-SVN-Rev: 41058
* ICU-13530 remove unnecessary UTrie3StringIterator X-SVN-Rev: 41059
* ICU-13530 back to UTRIE3_GET...() macros *returning* data values X-SVN-Rev: 41060
* ICU-13530 fast vs. small X-SVN-Rev: 41066
* ICU-13530 always load NFC data, add simple normalization performance test X-SVN-Rev: 41110
* ICU-13530 change normalization main trie to UTrie3 with special values for lead surrogates; forbid non-inert surrogate code *points* because unable to store values different from code *units*; runtime code work around that for code point lookup and iteration; adjust UTS 46 for normalization no longer mapping unpaired surrogates to U+FFFD X-SVN-Rev: 41122
* ICU-13530 simplenormperf bug fix and NFC base line X-SVN-Rev: 41126
* ICU-13530 move normalization getRange skipping lead surrogates to API getRangeSkipLead() X-SVN-Rev: 41182
* ICU-13530 switch CanonIterData and gennorm2 Norms to UTrie3 X-SVN-Rev: 41183
* ICU-13530 remove unused overwrite parameter from setRange() X-SVN-Rev: 41184
* ICU-13530 getRange skip lead -> fixed surrogates X-SVN-Rev: 41219
* ICU-13530 minor cleanup X-SVN-Rev: 41221
* ICU-13530 UTS 46 code map unpaired surrogates to U+FFFD before normalization X-SVN-Rev: 41224
* ICU-13530 minor internal-docs cleanup X-SVN-Rev: 41225
* ICU-13530 rename UTrie3 to UCPTrie, and other name changes X-SVN-Rev: 41226
* ICU-13530 add 8-bit data option; add type-any & valueBits-any for fromBinary(); macros consistently source type then data width X-SVN-Rev: 41234
* ICU-13530 scrub the API docs for the proposal X-SVN-Rev: 41319
* ICU-13530 tag internal definitions as such, or move them to an internal header X-SVN-Rev: 41320
* ICU-13530 Java API skeleton X-SVN-Rev: 41326
* ICU-13530 API feedback: ValueWidth, MutableCodePointTrie, base CodePointMap, ... X-SVN-Rev: 41382
* ICU-13530 add UCPTrie valueWidth field and padding, and combine data pointers into a union X-SVN-Rev: 41408
* ICU-13530 switch some macros to using dataAccess parameter: separate index vs. data lookups, no macro variant for each value width X-SVN-Rev: 41409
* ICU-13530 StringIterator is no longer a java.util.Iterator (bad fit) X-SVN-Rev: 41455
* ICU-13530 CodePointTrie.java code complete X-SVN-Rev: 41518
* ICU-13530 finish Java port incl test; keep C++ parallel
* ICU-13530 adjust API for feedback: rename HandleValue to FilterValue, change getRange+getRangeFixedSurr(bool allSurr) to enum RangeOption+getRange(enum option); change remaining C macros to use dataAccess for 16/32/8-bit value widths; fix/clarify some API docs
* ICU-13530 add javadoc
* ICU-13530 document UCPTrie binary data format
* ICU-13530 update .nrm formatVersion 3->4, document change in surrogate handling with new trie
* ICU-13530 re-hardcode NFC data
* move trie swapper code into new file; add new files to Windows project files; turn off trie debugging
* ICU-13530 minor cleanup
* ICU-13530 test more range starts; fix a C test leak
* ICU-13530 regenerate Java data from scratch
* ICU-13530 review feedback changes: API docs typos, more @internal, C++11 field initializers, fix potential leak in MutableCodePointTrie::fromUCPTrie()
* ICU-13530 rename interface FilterValue to ValueFilter
This commit is contained in:
parent
8a52f44951
commit
fe3eb3ed5c
60 changed files with 11129 additions and 1486 deletions
@@ -81,7 +81,7 @@ LIBS = $(LIBICUDT) $(DEFAULT_LIBS)

OBJECTS = errorcode.o putil.o umath.o utypes.o uinvchar.o umutex.o ucln_cmn.o \
uinit.o uobject.o cmemory.o charstr.o cstr.o \
udata.o ucmndata.o udatamem.o umapfile.o udataswp.o ucol_swp.o utrace.o \
udata.o ucmndata.o udatamem.o umapfile.o udataswp.o utrie_swap.o ucol_swp.o utrace.o \
uhash.o uhash_us.o uenum.o ustrenum.o uvector.o ustack.o uvectr32.o uvectr64.o \
ucnv.o ucnv_bld.o ucnv_cnv.o ucnv_io.o ucnv_cb.o ucnv_err.o ucnvlat1.o \
ucnv_u7.o ucnv_u8.o ucnv_u16.o ucnv_u32.o ucnvscsu.o ucnvbocu.o \
@@ -102,7 +102,8 @@ normalizer2impl.o normalizer2.o filterednormalizer2.o normlzr.o unorm.o unormcmp
chariter.o schriter.o uchriter.o uiter.o \
patternprops.o uchar.o uprops.o ucase.o propname.o ubidi_props.o ubidi.o ubidiwrt.o ubidiln.o ushape.o \
uscript.o uscript_props.o usc_impl.o unames.o \
utrie.o utrie2.o utrie2_builder.o bmpset.o unisetspan.o uset_props.o uniset_props.o uniset_closure.o uset.o uniset.o usetiter.o ruleiter.o caniter.o unifilt.o unifunct.o \
utrie.o utrie2.o utrie2_builder.o ucptrie.o umutablecptrie.o \
bmpset.o unisetspan.o uset_props.o uniset_props.o uniset_closure.o uset.o uniset.o usetiter.o ruleiter.o caniter.o unifilt.o unifunct.o \
uarrsort.o brkiter.o ubrk.o brkeng.o dictbe.o filteredbrk.o \
rbbi.o rbbidata.o rbbinode.o rbbirb.o rbbiscan.o rbbisetb.o rbbistbl.o rbbitblb.o rbbi_cache.o \
serv.o servnotf.o servls.o servlk.o servlkf.o servrbf.o servslkf.o \

@@ -181,6 +181,7 @@
<ClCompile Include="ustack.cpp" />
<ClCompile Include="ustrenum.cpp" />
<ClCompile Include="utrie.cpp" />
<ClCompile Include="utrie_swap.cpp" />
<ClCompile Include="utrie2.cpp" />
<ClCompile Include="utrie2_builder.cpp" />
<ClCompile Include="uvector.cpp" />
@@ -314,8 +315,10 @@
<ClCompile Include="ucharstriebuilder.cpp" />
<ClCompile Include="ucharstrieiterator.cpp" />
<ClCompile Include="uchriter.cpp" />
<ClCompile Include="ucptrie.cpp" />
<ClCompile Include="uinvchar.cpp" />
<ClCompile Include="uiter.cpp" />
<ClCompile Include="umutablecptrie.cpp" />
<ClCompile Include="unistr.cpp" />
<ClCompile Include="unistr_case.cpp" />
<ClCompile Include="unistr_case_locale.cpp" />

@@ -139,6 +139,9 @@
<ClCompile Include="utrie.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="utrie_swap.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="utrie2.cpp">
<Filter>collections</Filter>
</ClCompile>
@@ -589,6 +592,12 @@
<ClCompile Include="ucharstrieiterator.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="ucptrie.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="umutablecptrie.cpp">
<Filter>collections</Filter>
</ClCompile>
<ClCompile Include="patternprops.cpp">
<Filter>properties & sets</Filter>
</ClCompile>
@@ -1204,6 +1213,12 @@
<CustomBuild Include="unicode\ucharstriebuilder.h">
<Filter>collections</Filter>
</CustomBuild>
<CustomBuild Include="unicode\ucptrie.h">
<Filter>collections</Filter>
</CustomBuild>
<CustomBuild Include="unicode\umutablecptrie.h">
<Filter>collections</Filter>
</CustomBuild>
<CustomBuild Include="unicode\enumset.h">
<Filter>data & memory</Filter>
</CustomBuild>
@@ -1217,4 +1232,4 @@
<Filter>strings</Filter>
</CustomBuild>
</ItemGroup>
</Project>
</Project>

@@ -304,6 +304,7 @@
<ClCompile Include="ustack.cpp" />
<ClCompile Include="ustrenum.cpp" />
<ClCompile Include="utrie.cpp" />
<ClCompile Include="utrie_swap.cpp" />
<ClCompile Include="utrie2.cpp" />
<ClCompile Include="utrie2_builder.cpp" />
<ClCompile Include="uvector.cpp" />
@@ -439,9 +440,11 @@
<ClCompile Include="ucharstrie.cpp" />
<ClCompile Include="ucharstriebuilder.cpp" />
<ClCompile Include="ucharstrieiterator.cpp" />
<ClCompile Include="ucptrie.cpp" />
<ClCompile Include="uchriter.cpp" />
<ClCompile Include="uinvchar.cpp" />
<ClCompile Include="uiter.cpp" />
<ClCompile Include="umutablecptrie.cpp" />
<ClCompile Include="unistr.cpp" />
<ClCompile Include="unistr_case.cpp" />
<ClCompile Include="unistr_case_locale.cpp" />

@@ -18,6 +18,7 @@
#include "unicode/udata.h"
#include "unicode/localpointer.h"
#include "unicode/normalizer2.h"
#include "unicode/ucptrie.h"
#include "unicode/unistr.h"
#include "unicode/unorm.h"
#include "cstring.h"
@@ -42,12 +43,12 @@ private:
isAcceptable(void *context, const char *type, const char *name, const UDataInfo *pInfo);

UDataMemory *memory;
UTrie2 *ownedTrie;
UCPTrie *ownedTrie;
};

LoadedNormalizer2Impl::~LoadedNormalizer2Impl() {
udata_close(memory);
utrie2_close(ownedTrie);
ucptrie_close(ownedTrie);
}

UBool U_CALLCONV
@@ -62,7 +63,7 @@ LoadedNormalizer2Impl::isAcceptable(void * /*context*/,
pInfo->dataFormat[1]==0x72 &&
pInfo->dataFormat[2]==0x6d &&
pInfo->dataFormat[3]==0x32 &&
pInfo->formatVersion[0]==3
pInfo->formatVersion[0]==4
) {
// Normalizer2Impl *me=(Normalizer2Impl *)context;
// uprv_memcpy(me->dataVersion, pInfo->dataVersion, 4);
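The `isAcceptable()` hunk above bumps the accepted `.nrm` format version from 3 to 4. As a minimal, self-contained sketch of this kind of header check: `DataInfo` and `isAcceptableNrm` are hypothetical stand-ins, not the ICU `UDataInfo` API, and the first signature byte `0x4e` is an assumption because the hunk only shows bytes 1 to 3.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two header fields used by the check.
struct DataInfo {
    uint8_t dataFormat[4];
    uint8_t formatVersion[4];
};

// Accept only data with the expected 4-byte signature and major format version 4.
bool isAcceptableNrm(const DataInfo &info) {
    return info.dataFormat[0] == 0x4e &&  // 'N' (assumed; not shown in the hunk)
           info.dataFormat[1] == 0x72 &&  // 'r'
           info.dataFormat[2] == 0x6d &&  // 'm'
           info.dataFormat[3] == 0x32 &&  // '2'
           info.formatVersion[0] == 4;    // data built for the old trie (version 3) is rejected
}
```

Checking only the major version byte means that data files built for the previous trie layout fail early at load time instead of being misread.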
@@ -91,9 +92,9 @@ LoadedNormalizer2Impl::load(const char *packageName, const char *name, UErrorCod

int32_t offset=inIndexes[IX_NORM_TRIE_OFFSET];
int32_t nextOffset=inIndexes[IX_EXTRA_DATA_OFFSET];
ownedTrie=utrie2_openFromSerialized(UTRIE2_16_VALUE_BITS,
inBytes+offset, nextOffset-offset, NULL,
&errorCode);
ownedTrie=ucptrie_openFromBinary(UCPTRIE_TYPE_FAST, UCPTRIE_VALUE_BITS_16,
inBytes+offset, nextOffset-offset, NULL,
&errorCode);
if(U_FAILURE(errorCode)) {
return;
}
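The commit message mentions a 10:6 bit split for BMP code points, and the load code above opens the trie with 16-bit values. A minimal sketch of that style of two-stage lookup follows. `TinyTrie` is a hypothetical illustration with one index entry per 64-code-point block; it is not the UCPTrie binary layout or API, and it does no compaction beyond sharing one initial-value block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kShift = 6;                   // low 6 bits select a value inside a block
constexpr uint32_t kBlockLength = 1u << kShift;  // 64 values per block
constexpr uint32_t kMask = kBlockLength - 1;

// Hypothetical two-stage table for BMP code points (U+0000..U+FFFF).
struct TinyTrie {
    uint16_t initialValue;
    std::vector<uint32_t> index;  // 1024 entries: data offset of each block
    std::vector<uint16_t> data;   // block 0 is the shared all-initial-value block

    explicit TinyTrie(uint16_t initial)
        : initialValue(initial),
          index(0x10000u >> kShift, 0),
          data(kBlockLength, initial) {}

    // Sets one value; a block gets its own storage on the first write into it.
    void set(uint32_t c, uint16_t value) {
        assert(c <= 0xFFFF);
        uint32_t i = c >> kShift;
        if (index[i] == 0) {
            index[i] = static_cast<uint32_t>(data.size());
            data.resize(data.size() + kBlockLength, initialValue);
        }
        data[index[i] + (c & kMask)] = value;
    }

    // Two array reads: the upper 10 bits pick the block, the lower 6 bits pick the value.
    uint16_t get(uint32_t c) const {
        assert(c <= 0xFFFF);
        return data[index[c >> kShift] + (c & kMask)];
    }
};
```

Blocks that were never written keep pointing at the shared block, so sparse data stays small while a lookup remains a shift, a mask, and two loads.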
@@ -131,15 +132,26 @@ U_CDECL_BEGIN
static UBool U_CALLCONV uprv_loaded_normalizer2_cleanup();
U_CDECL_END

static Norm2AllModes *nfkcSingleton;
static Norm2AllModes *nfkc_cfSingleton;
static UHashtable *cache=NULL;
#if !NORM2_HARDCODE_NFC_DATA
static Norm2AllModes *nfcSingleton;
static icu::UInitOnce nfcInitOnce = U_INITONCE_INITIALIZER;
#endif

static Norm2AllModes *nfkcSingleton;
static icu::UInitOnce nfkcInitOnce = U_INITONCE_INITIALIZER;

static Norm2AllModes *nfkc_cfSingleton;
static icu::UInitOnce nfkc_cfInitOnce = U_INITONCE_INITIALIZER;

static UHashtable *cache=NULL;

// UInitOnce singleton initialization function
static void U_CALLCONV initSingletons(const char *what, UErrorCode &errorCode) {
#if !NORM2_HARDCODE_NFC_DATA
if (uprv_strcmp(what, "nfc") == 0) {
nfcSingleton = Norm2AllModes::createInstance(NULL, "nfc", errorCode);
} else
#endif
if (uprv_strcmp(what, "nfkc") == 0) {
nfkcSingleton = Norm2AllModes::createInstance(NULL, "nfkc", errorCode);
} else if (uprv_strcmp(what, "nfkc_cf") == 0) {
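The singletons above are created lazily through a `UInitOnce` flag per instance. A rough standard-C++ analogue of that run-once pattern is sketched below using a function-local static, whose initialization C++11 guarantees to happen exactly once. `Modes`, `createModes`, and `getNfkcInstance` are hypothetical names for illustration, and unlike the ICU code this sketch has no cleanup or reset step.

```cpp
#include <cassert>
#include <string>

struct Modes { std::string name; };  // hypothetical stand-in for Norm2AllModes

int gCreateCalls = 0;  // counts initializer runs, for illustration only

const Modes *createModes(const char *what) {
    ++gCreateCalls;
    return new Modes{what};
}

// The local static is initialized on the first call only;
// later calls return the cached pointer without running createModes() again.
const Modes *getNfkcInstance() {
    static const Modes *instance = createModes("nfkc");
    return instance;
}
```

The ICU code keeps the flag and the pointer as separate file-level variables so that the library cleanup function can delete the instance and reset the flag; a function-local static cannot be reset that way.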
@@ -157,19 +169,36 @@ static void U_CALLCONV deleteNorm2AllModes(void *allModes) {
}

static UBool U_CALLCONV uprv_loaded_normalizer2_cleanup() {
#if !NORM2_HARDCODE_NFC_DATA
delete nfcSingleton;
nfcSingleton = NULL;
nfcInitOnce.reset();
#endif

delete nfkcSingleton;
nfkcSingleton = NULL;
nfkcInitOnce.reset();

delete nfkc_cfSingleton;
nfkc_cfSingleton = NULL;
nfkc_cfInitOnce.reset();

uhash_close(cache);
cache=NULL;
nfkcInitOnce.reset();
nfkc_cfInitOnce.reset();
return TRUE;
}

U_CDECL_END

#if !NORM2_HARDCODE_NFC_DATA
const Norm2AllModes *
Norm2AllModes::getNFCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
umtx_initOnce(nfcInitOnce, &initSingletons, "nfc", errorCode);
return nfcSingleton;
}
#endif

const Norm2AllModes *
Norm2AllModes::getNFKCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
@@ -184,6 +213,36 @@ Norm2AllModes::getNFKC_CFInstance(UErrorCode &errorCode) {
return nfkc_cfSingleton;
}

#if !NORM2_HARDCODE_NFC_DATA
const Normalizer2 *
Normalizer2::getNFCInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->comp : NULL;
}

const Normalizer2 *
Normalizer2::getNFDInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->decomp : NULL;
}

const Normalizer2 *Normalizer2Factory::getFCDInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->fcd : NULL;
}

const Normalizer2 *Normalizer2Factory::getFCCInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? &allModes->fcc : NULL;
}

const Normalizer2Impl *
Normalizer2Factory::getNFCImpl(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? allModes->impl : NULL;
}
#endif

const Normalizer2 *
Normalizer2::getNFKCInstance(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFKCInstance(errorCode);

File diff suppressed because it is too large
@@ -34,9 +34,11 @@

using icu::Normalizer2Impl;

#if NORM2_HARDCODE_NFC_DATA
// NFC/NFD data machine-generated by gennorm2 --csource
#define INCLUDED_FROM_NORMALIZER2_CPP
#include "norm2_nfc_data.h"
#endif

U_NAMESPACE_BEGIN

@@ -176,6 +178,36 @@ FCDNormalizer2::~FCDNormalizer2() {}

// instance cache ---------------------------------------------------------- ***

U_CDECL_BEGIN
static UBool U_CALLCONV uprv_normalizer2_cleanup();
U_CDECL_END

static Normalizer2 *noopSingleton;
static icu::UInitOnce noopInitOnce = U_INITONCE_INITIALIZER;

static void U_CALLCONV initNoopSingleton(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return;
}
noopSingleton=new NoopNormalizer2;
if(noopSingleton==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
return;
}
ucln_common_registerCleanup(UCLN_COMMON_NORMALIZER2, uprv_normalizer2_cleanup);
}

const Normalizer2 *Normalizer2Factory::getNoopInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
umtx_initOnce(noopInitOnce, &initNoopSingleton, errorCode);
return noopSingleton;
}

const Normalizer2Impl *
Normalizer2Factory::getImpl(const Normalizer2 *norm2) {
return &((Normalizer2WithImpl *)norm2)->impl;
}

Norm2AllModes::~Norm2AllModes() {
delete impl;
}
@@ -195,6 +227,7 @@ Norm2AllModes::createInstance(Normalizer2Impl *impl, UErrorCode &errorCode) {
return allModes;
}

#if NORM2_HARDCODE_NFC_DATA
Norm2AllModes *
Norm2AllModes::createNFCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
@@ -210,48 +243,15 @@ Norm2AllModes::createNFCInstance(UErrorCode &errorCode) {
return createInstance(impl, errorCode);
}

U_CDECL_BEGIN
static UBool U_CALLCONV uprv_normalizer2_cleanup();
U_CDECL_END

static Norm2AllModes *nfcSingleton;
static Normalizer2 *noopSingleton;

static icu::UInitOnce nfcInitOnce = U_INITONCE_INITIALIZER;
static icu::UInitOnce noopInitOnce = U_INITONCE_INITIALIZER;

// UInitOnce singleton initialization functions
static void U_CALLCONV initNFCSingleton(UErrorCode &errorCode) {
nfcSingleton=Norm2AllModes::createNFCInstance(errorCode);
ucln_common_registerCleanup(UCLN_COMMON_NORMALIZER2, uprv_normalizer2_cleanup);
}

static void U_CALLCONV initNoopSingleton(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) {
return;
}
noopSingleton=new NoopNormalizer2;
if(noopSingleton==NULL) {
errorCode=U_MEMORY_ALLOCATION_ERROR;
return;
}
ucln_common_registerCleanup(UCLN_COMMON_NORMALIZER2, uprv_normalizer2_cleanup);
}

U_CDECL_BEGIN

static UBool U_CALLCONV uprv_normalizer2_cleanup() {
delete nfcSingleton;
nfcSingleton = NULL;
delete noopSingleton;
noopSingleton = NULL;
nfcInitOnce.reset();
noopInitOnce.reset();
return TRUE;
}

U_CDECL_END

const Norm2AllModes *
Norm2AllModes::getNFCInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
@@ -281,23 +281,29 @@ const Normalizer2 *Normalizer2Factory::getFCCInstance(UErrorCode &errorCode) {
return allModes!=NULL ? &allModes->fcc : NULL;
}

const Normalizer2 *Normalizer2Factory::getNoopInstance(UErrorCode &errorCode) {
if(U_FAILURE(errorCode)) { return NULL; }
umtx_initOnce(noopInitOnce, &initNoopSingleton, errorCode);
return noopSingleton;
}

const Normalizer2Impl *
Normalizer2Factory::getNFCImpl(UErrorCode &errorCode) {
const Norm2AllModes *allModes=Norm2AllModes::getNFCInstance(errorCode);
return allModes!=NULL ? allModes->impl : NULL;
}
#endif // NORM2_HARDCODE_NFC_DATA

const Normalizer2Impl *
Normalizer2Factory::getImpl(const Normalizer2 *norm2) {
return &((Normalizer2WithImpl *)norm2)->impl;
U_CDECL_BEGIN

static UBool U_CALLCONV uprv_normalizer2_cleanup() {
delete noopSingleton;
noopSingleton = NULL;
noopInitOnce.reset();
#if NORM2_HARDCODE_NFC_DATA
delete nfcSingleton;
nfcSingleton = NULL;
nfcInitOnce.reset();
#endif
return TRUE;
}

U_CDECL_END

U_NAMESPACE_END

// C API ------------------------------------------------------------------- ***

@@ -16,6 +16,8 @@
* created by: Markus W. Scherer
*/

// #define UCPTRIE_DEBUG

#include "unicode/utypes.h"

#if !UCONFIG_NO_NORMALIZATION
@@ -24,7 +26,9 @@
#include "unicode/edits.h"
#include "unicode/normalizer2.h"
#include "unicode/stringoptions.h"
#include "unicode/ucptrie.h"
#include "unicode/udata.h"
#include "unicode/umutablecptrie.h"
#include "unicode/ustring.h"
#include "unicode/utf16.h"
#include "unicode/utf8.h"
@@ -34,8 +38,8 @@
#include "normalizer2impl.h"
#include "putilimp.h"
#include "uassert.h"
#include "ucptrie_impl.h"
#include "uset_imp.h"
#include "utrie2.h"
#include "uvector.h"

U_NAMESPACE_BEGIN
@@ -62,7 +66,7 @@ inline uint8_t leadByteForCP(UChar32 c) {
* Returns the code point from one single well-formed UTF-8 byte sequence
* between cpStart and cpLimit.
*
* UTrie2 UTF-8 macros do not assemble whole code points (for efficiency).
* Trie UTF-8 macros do not assemble whole code points (for efficiency).
* When we do need the code point, we call this function.
* We should not need it for normalization-inert data (norm16==0).
* Illegal sequences yield the error value norm16==0 just like real normalization-inert code points.
@@ -253,7 +257,7 @@ UBool ReorderingBuffer::appendSupplementary(UChar32 c, uint8_t cc, UErrorCode &e
return TRUE;
}

UBool ReorderingBuffer::append(const UChar *s, int32_t length,
UBool ReorderingBuffer::append(const UChar *s, int32_t length, UBool isNFD,
uint8_t leadCC, uint8_t trailCC,
UErrorCode &errorCode) {
if(length==0) {
@@ -280,8 +284,11 @@ UBool ReorderingBuffer::append(const UChar *s, int32_t length,
while(i<length) {
U16_NEXT(s, i, length, c);
if(i<length) {
// s must be in NFD, otherwise we need to use getCC().
leadCC=Normalizer2Impl::getCCFromYesOrMaybe(impl.getNorm16(c));
if (isNFD) {
leadCC = Normalizer2Impl::getCCFromYesOrMaybe(impl.getRawNorm16(c));
} else {
leadCC = impl.getCC(impl.getNorm16(c));
}
} else {
leadCC=trailCC;
}
@@ -411,7 +418,8 @@ struct CanonIterData : public UMemory {
CanonIterData(UErrorCode &errorCode);
~CanonIterData();
void addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode &errorCode);
UTrie2 *trie;
UMutableCPTrie *mutableTrie;
UCPTrie *trie;
UVector canonStartSets; // contains UnicodeSet *
};

@@ -420,7 +428,7 @@ Normalizer2Impl::~Normalizer2Impl() {
}

void
Normalizer2Impl::init(const int32_t *inIndexes, const UTrie2 *inTrie,
Normalizer2Impl::init(const int32_t *inIndexes, const UCPTrie *inTrie,
const uint16_t *inExtraData, const uint8_t *inSmallFCD) {
minDecompNoCP = static_cast<UChar>(inIndexes[IX_MIN_DECOMP_NO_CP]);
minCompNoMaybeCP = static_cast<UChar>(inIndexes[IX_MIN_COMP_NO_MAYBE_CP]);
@@ -445,75 +453,8 @@ Normalizer2Impl::init(const int32_t *inIndexes, const UTrie2 *inTrie,
smallFCD=inSmallFCD;
}

class LcccContext {
public:
LcccContext(const Normalizer2Impl &ni, UnicodeSet &s) : impl(ni), set(s) {}

void handleRange(UChar32 start, UChar32 end, uint16_t norm16) {
if (norm16 > Normalizer2Impl::MIN_NORMAL_MAYBE_YES &&
norm16 != Normalizer2Impl::JAMO_VT) {
set.add(start, end);
} else if (impl.minNoNoCompNoMaybeCC <= norm16 && norm16 < impl.limitNoNo) {
uint16_t fcd16=impl.getFCD16(start);
if(fcd16>0xff) { set.add(start, end); }
}
}

private:
const Normalizer2Impl &impl;
UnicodeSet &set;
};

namespace {

struct PropertyStartsContext {
PropertyStartsContext(const Normalizer2Impl &ni, const USetAdder *adder)
: impl(ni), sa(adder) {}

const Normalizer2Impl &impl;
const USetAdder *sa;
};

} // namespace

U_CDECL_BEGIN

static UBool U_CALLCONV
enumLcccRange(const void *context, UChar32 start, UChar32 end, uint32_t value) {
((LcccContext *)context)->handleRange(start, end, (uint16_t)value);
return TRUE;
}

static UBool U_CALLCONV
enumNorm16PropertyStartsRange(const void *context, UChar32 start, UChar32 end, uint32_t value) {
/* add the start code point to the USet */
const PropertyStartsContext *ctx=(const PropertyStartsContext *)context;
const USetAdder *sa=ctx->sa;
sa->add(sa->set, start);
if (start != end && ctx->impl.isAlgorithmicNoNo((uint16_t)value) &&
(value & Normalizer2Impl::DELTA_TCCC_MASK) > Normalizer2Impl::DELTA_TCCC_1) {
// Range of code points with same-norm16-value algorithmic decompositions.
// They might have different non-zero FCD16 values.
uint16_t prevFCD16=ctx->impl.getFCD16(start);
while(++start<=end) {
uint16_t fcd16=ctx->impl.getFCD16(start);
if(fcd16!=prevFCD16) {
sa->add(sa->set, start);
prevFCD16=fcd16;
}
}
}
return TRUE;
}

static UBool U_CALLCONV
enumPropertyStartsRange(const void *context, UChar32 start, UChar32 /*end*/, uint32_t /*value*/) {
/* add the start code point to the USet */
const USetAdder *sa=(const USetAdder *)context;
sa->add(sa->set, start);
return TRUE;
}

static uint32_t U_CALLCONV
segmentStarterMapper(const void * /*context*/, uint32_t value) {
return value&CANON_NOT_SEGMENT_STARTER;
@@ -523,15 +464,44 @@ U_CDECL_END

void
Normalizer2Impl::addLcccChars(UnicodeSet &set) const {
LcccContext context(*this, set);
utrie2_enum(normTrie, NULL, enumLcccRange, &context);
UChar32 start = 0, end;
uint32_t norm16;
while ((end = ucptrie_getRange(normTrie, start, UCPTRIE_RANGE_FIXED_LEAD_SURROGATES, INERT,
nullptr, nullptr, &norm16)) >= 0) {
if (norm16 > Normalizer2Impl::MIN_NORMAL_MAYBE_YES &&
norm16 != Normalizer2Impl::JAMO_VT) {
set.add(start, end);
} else if (minNoNoCompNoMaybeCC <= norm16 && norm16 < limitNoNo) {
uint16_t fcd16 = getFCD16(start);
if (fcd16 > 0xff) { set.add(start, end); }
}
start = end + 1;
}
}

void
Normalizer2Impl::addPropertyStarts(const USetAdder *sa, UErrorCode & /*errorCode*/) const {
/* add the start code point of each same-value range of each trie */
PropertyStartsContext context(*this, sa);
utrie2_enum(normTrie, NULL, enumNorm16PropertyStartsRange, &context);
// Add the start code point of each same-value range of the trie.
UChar32 start = 0, end;
uint32_t value;
while ((end = ucptrie_getRange(normTrie, start, UCPTRIE_RANGE_FIXED_LEAD_SURROGATES, INERT,
nullptr, nullptr, &value)) >= 0) {
sa->add(sa->set, start);
if (start != end && isAlgorithmicNoNo((uint16_t)value) &&
(value & Normalizer2Impl::DELTA_TCCC_MASK) > Normalizer2Impl::DELTA_TCCC_1) {
// Range of code points with same-norm16-value algorithmic decompositions.
// They might have different non-zero FCD16 values.
uint16_t prevFCD16 = getFCD16(start);
while (++start <= end) {
uint16_t fcd16 = getFCD16(start);
if (fcd16 != prevFCD16) {
sa->add(sa->set, start);
prevFCD16 = fcd16;
}
}
}
start = end + 1;
}

/* add Hangul LV syllables and LV+1 because of skippables */
for(UChar c=Hangul::HANGUL_BASE; c<Hangul::HANGUL_LIMIT; c+=Hangul::JAMO_T_COUNT) {
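The hunk above replaces callback-based enumeration with a pull-style loop: ask for the end of the range that begins at `start`, handle the range, then continue at `end + 1` until the call returns a negative value. A minimal sketch of that loop shape over a plain array follows. `getRange` here is a hypothetical helper with a similar calling convention, not `ucptrie_getRange()`, and it has no surrogate options.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Range { int32_t start; int32_t end; uint32_t value; };

// Hypothetical helper: returns the last index of the run of equal values
// beginning at `start`, or -1 if `start` is out of range.
int32_t getRange(const std::vector<uint32_t> &values, int32_t start, uint32_t *pValue) {
    if (start < 0 || static_cast<size_t>(start) >= values.size()) { return -1; }
    uint32_t v = values[start];
    size_t end = static_cast<size_t>(start);
    while (end + 1 < values.size() && values[end + 1] == v) { ++end; }
    if (pValue != nullptr) { *pValue = v; }
    return static_cast<int32_t>(end);
}

// Same loop shape as the new code: no callback, no context object.
std::vector<Range> allRanges(const std::vector<uint32_t> &values) {
    std::vector<Range> ranges;
    int32_t start = 0, end;
    uint32_t value = 0;
    while ((end = getRange(values, start, &value)) >= 0) {
        ranges.push_back({start, end, value});
        start = end + 1;
    }
    return ranges;
}
```

Because the caller owns the loop, the per-range logic can use member functions and local variables directly, which is why the `LcccContext` and `PropertyStartsContext` helper types could be removed.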
@@ -543,10 +513,15 @@ Normalizer2Impl::addPropertyStarts(const USetAdder *sa, UErrorCode & /*errorCode

void
Normalizer2Impl::addCanonIterPropertyStarts(const USetAdder *sa, UErrorCode &errorCode) const {
/* add the start code point of each same-value range of the canonical iterator data trie */
if(ensureCanonIterData(errorCode)) {
// currently only used for the SEGMENT_STARTER property
utrie2_enum(fCanonIterData->trie, segmentStarterMapper, enumPropertyStartsRange, sa);
// Add the start code point of each same-value range of the canonical iterator data trie.
if (!ensureCanonIterData(errorCode)) { return; }
// Currently only used for the SEGMENT_STARTER property.
UChar32 start = 0, end;
uint32_t value;
while ((end = ucptrie_getRange(fCanonIterData->trie, start, UCPTRIE_RANGE_NORMAL, 0,
segmentStarterMapper, nullptr, &value)) >= 0) {
sa->add(sa->set, start);
start = end + 1;
}
}

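Here the range loop passes `segmentStarterMapper` as a value filter, so ranges are delimited by the masked value rather than the stored one. The sketch below illustrates that idea under the assumption that the filter is applied before values are compared, so neighbours with different stored values but equal filtered values merge into one range. `ValueFilter`, `lowBitOnly`, and `getFilteredRange` are hypothetical names, not the ICU API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical filter type: maps a stored value to the value the caller cares about.
using ValueFilter = uint32_t (*)(uint32_t value);

// Keeps only the lowest bit, similar in spirit to masking with a single flag.
uint32_t lowBitOnly(uint32_t value) { return value & 1; }

// Returns the last index of the run starting at `start` whose filtered values are equal,
// or -1 if `start` is out of range. A null filter compares the stored values directly.
int32_t getFilteredRange(const std::vector<uint32_t> &values, int32_t start,
                         ValueFilter filter, uint32_t *pValue) {
    if (start < 0 || static_cast<size_t>(start) >= values.size()) { return -1; }
    auto apply = [filter](uint32_t v) { return filter != nullptr ? filter(v) : v; };
    uint32_t first = apply(values[start]);
    size_t end = static_cast<size_t>(start);
    while (end + 1 < values.size() && apply(values[end + 1]) == first) { ++end; }
    if (pValue != nullptr) { *pValue = first; }
    return static_cast<int32_t>(end);
}

// Collects the start index of every filtered range.
std::vector<int32_t> filteredRangeStarts(const std::vector<uint32_t> &values,
                                         ValueFilter filter) {
    std::vector<int32_t> starts;
    int32_t start = 0, end;
    while ((end = getFilteredRange(values, start, filter, nullptr)) >= 0) {
        starts.push_back(start);
        start = end + 1;
    }
    return starts;
}
```

With the stored values `{2, 4, 1, 3, 6}` and `lowBitOnly`, the filtered sequence is `0, 0, 1, 1, 0`, so only three range starts are reported instead of five.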
@@ -633,27 +608,23 @@ Normalizer2Impl::decompose(const UChar *src, const UChar *limit,
// count code units below the minimum or with irrelevant data for the quick check
for(prevSrc=src; src!=limit;) {
if( (c=*src)<minNoCP ||
isMostDecompYesAndZeroCC(norm16=UTRIE2_GET16_FROM_U16_SINGLE_LEAD(normTrie, c))
isMostDecompYesAndZeroCC(norm16=UCPTRIE_FAST_BMP_GET(normTrie, UCPTRIE_16, c))
) {
++src;
} else if(!U16_IS_SURROGATE(c)) {
} else if(!U16_IS_LEAD(c)) {
break;
} else {
UChar c2;
if(U16_IS_SURROGATE_LEAD(c)) {
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
c=U16_GET_SUPPLEMENTARY(c, c2);
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
c=U16_GET_SUPPLEMENTARY(c, c2);
norm16=UCPTRIE_FAST_SUPP_GET(normTrie, UCPTRIE_16, c);
if(isMostDecompYesAndZeroCC(norm16)) {
src+=2;
} else {
break;
}
} else /* trail surrogate */ {
if(prevSrc<src && U16_IS_LEAD(c2=*(src-1))) {
--src;
c=U16_GET_SUPPLEMENTARY(c2, c);
}
}
if(isMostDecompYesAndZeroCC(norm16=getNorm16(c))) {
src+=U16_LENGTH(c);
} else {
break;
++src; // unpaired lead surrogate: inert
}
}
}
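In the new fast path above, a lead surrogate followed by a trail surrogate is combined into a supplementary code point before the lookup, while an unpaired lead surrogate is simply skipped as inert. The sketch below shows the surrogate arithmetic involved. The helper functions are hypothetical re-implementations written for this example (they mirror what `U16_IS_LEAD`, `U16_IS_TRAIL`, and `U16_GET_SUPPLEMENTARY` are used for in the diff), and `nextCodePoint` is not part of ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Lead surrogates are U+D800..U+DBFF, trail surrogates are U+DC00..U+DFFF.
bool isLead(char16_t c) { return (c & 0xFC00) == 0xD800; }
bool isTrail(char16_t c) { return (c & 0xFC00) == 0xDC00; }

// Combines a surrogate pair into a supplementary code point (U+10000..U+10FFFF).
char32_t combine(char16_t lead, char16_t trail) {
    return (static_cast<char32_t>(lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
}

// Reads one code point starting at s[i] and advances i.
// An unpaired surrogate is returned unchanged as a single code unit.
char32_t nextCodePoint(const std::u16string &s, size_t &i) {
    assert(i < s.size());
    char16_t c = s[i++];
    if (isLead(c) && i < s.size() && isTrail(s[i])) {
        return combine(c, s[i++]);
    }
    return c;
}
```

For example, the code units `0xD83D 0xDE00` combine to U+1F600, whereas a lone `0xD800` at the end of the string comes back as `0xD800` with the index advanced by one.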
@@ -713,7 +684,7 @@ Normalizer2Impl::decomposeShort(const UChar *src, const UChar *limit,
const UChar *prevSrc = src;
UChar32 c;
uint16_t norm16;
UTRIE2_U16_NEXT16(normTrie, src, limit, c, norm16);
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, src, limit, c, norm16);
if (stopAtCompBoundary && norm16HasCompBoundaryBefore(norm16)) {
return prevSrc;
}
@@ -737,7 +708,7 @@ UBool Normalizer2Impl::decompose(UChar32 c, uint16_t norm16,
}
// Maps to an isCompYesAndZeroCC.
c=mapAlgorithmic(c, norm16);
norm16=getNorm16(c);
norm16=getRawNorm16(c);
}
if (norm16 < minYesNo) {
// c does not decompose
@ -758,7 +729,7 @@ UBool Normalizer2Impl::decompose(UChar32 c, uint16_t norm16,
|
|||
} else {
|
||||
leadCC=0;
|
||||
}
|
||||
return buffer.append((const UChar *)mapping+1, length, leadCC, trailCC, errorCode);
|
||||
return buffer.append((const UChar *)mapping+1, length, TRUE, leadCC, trailCC, errorCode);
|
||||
}
|
||||
|
||||
const uint8_t *
|
||||
|
@ -771,7 +742,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
while (src < limit) {
|
||||
const uint8_t *prevSrc = src;
|
||||
uint16_t norm16;
|
||||
UTRIE2_U8_NEXT16(normTrie, src, limit, norm16);
|
||||
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, src, limit, norm16);
|
||||
// Get the decomposition and the lead and trail cc's.
|
||||
UChar32 c = U_SENTINEL;
|
||||
if (norm16 >= limitNoNo) {
|
||||
|
@ -789,7 +760,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
}
|
||||
c = codePointFromValidUTF8(prevSrc, src);
|
||||
c = mapAlgorithmic(c, norm16);
|
||||
norm16 = getNorm16(c);
|
||||
norm16 = getRawNorm16(c);
|
||||
} else if (stopAtCompBoundary && norm16 < minNoNoCompNoMaybeCC) {
|
||||
return prevSrc;
|
||||
}
|
||||
|
@ -828,7 +799,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
} else {
|
||||
leadCC = 0;
|
||||
}
|
||||
if (!buffer.append((const char16_t *)mapping+1, length, leadCC, trailCC, errorCode)) {
|
||||
if (!buffer.append((const char16_t *)mapping+1, length, TRUE, leadCC, trailCC, errorCode)) {
|
||||
return nullptr;
|
||||
}
|
||||
}
|
||||
|
@ -854,7 +825,7 @@ Normalizer2Impl::getDecomposition(UChar32 c, UChar buffer[4], int32_t &length) c
|
|||
length=0;
|
||||
U16_APPEND_UNSAFE(buffer, length, c);
|
||||
// The mapping might decompose further.
|
||||
norm16 = getNorm16(c);
|
||||
norm16 = getRawNorm16(c);
|
||||
}
|
||||
if (norm16 < minYesNo) {
|
||||
return decomp;
|
||||
|
@ -926,19 +897,30 @@ void Normalizer2Impl::decomposeAndAppend(const UChar *src, const UChar *limit,
|
|||
return;
|
||||
}
|
||||
// Just merge the strings at the boundary.
|
||||
ForwardUTrie2StringIterator iter(normTrie, src, limit);
|
||||
uint8_t firstCC, prevCC, cc;
|
||||
firstCC=prevCC=cc=getCC(iter.next16());
|
||||
while(cc!=0) {
|
||||
prevCC=cc;
|
||||
cc=getCC(iter.next16());
|
||||
};
|
||||
bool isFirst = true;
|
||||
uint8_t firstCC = 0, prevCC = 0, cc;
|
||||
const UChar *p = src;
|
||||
while (p != limit) {
|
||||
const UChar *codePointStart = p;
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
|
||||
if ((cc = getCC(norm16)) == 0) {
|
||||
p = codePointStart;
|
||||
break;
|
||||
}
|
||||
if (isFirst) {
|
||||
firstCC = cc;
|
||||
isFirst = false;
|
||||
}
|
||||
prevCC = cc;
|
||||
}
|
||||
if(limit==NULL) { // appendZeroCC() needs limit!=NULL
|
||||
limit=u_strchr(iter.codePointStart, 0);
|
||||
limit=u_strchr(p, 0);
|
||||
}
|
||||
|
||||
if (buffer.append(src, (int32_t)(iter.codePointStart-src), firstCC, prevCC, errorCode)) {
|
||||
buffer.appendZeroCC(iter.codePointStart, limit, errorCode);
|
||||
if (buffer.append(src, (int32_t)(p - src), FALSE, firstCC, prevCC, errorCode)) {
|
||||
buffer.appendZeroCC(p, limit, errorCode);
|
||||
}
|
||||
}
|
||||
|
||||
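The rewritten `decomposeAndAppend` boundary merge replaces the string iterator with an explicit loop that walks the leading run of characters with nonzero combining class and records the first and last class seen. A minimal sketch of that scan, assuming the combining classes are already available as one byte per code point (the `LeadingMarks` struct and `scanLeadingMarks` function are hypothetical and not part of ICU):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of scanning the leading run of elements with nonzero combining class.
struct LeadingMarks {
    size_t length;    // number of elements in the leading run
    uint8_t firstCC;  // cc of the first element, 0 if the run is empty
    uint8_t lastCC;   // cc of the last element, 0 if the run is empty
};

// ccs holds one precomputed combining class per code point (an assumption
// made for this sketch; the real code looks each value up in the trie).
inline LeadingMarks scanLeadingMarks(const std::vector<uint8_t> &ccs) {
    LeadingMarks r{0, 0, 0};
    for (uint8_t cc : ccs) {
        if (cc == 0) {
            break;  // first starter: the run ends before this element
        }
        if (r.length == 0) {
            r.firstCC = cc;
        }
        r.lastCC = cc;
        ++r.length;
    }
    assert(r.length <= ccs.size());
    return r;
}
```

Initializing both classes to 0 covers the empty-run case, which is what the `firstCC = 0, prevCC = 0` initializers in the hunk do as well.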
|
@ -1085,7 +1067,7 @@ void Normalizer2Impl::addComposites(const uint16_t *list, UnicodeSet &set) const
|
|||
}
|
||||
UChar32 composite=compositeAndFwd>>1;
|
||||
if((compositeAndFwd&1)!=0) {
|
||||
addComposites(getCompositionsListForComposite(getNorm16(composite)), set);
|
||||
addComposites(getCompositionsListForComposite(getRawNorm16(composite)), set);
|
||||
}
|
||||
set.add(composite);
|
||||
} while((firstUnit&COMP_1_LAST_TUPLE)==0);
|
||||
|
@ -1124,7 +1106,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
|
|||
prevCC=0;
|
||||
|
||||
for(;;) {
|
||||
UTRIE2_U16_NEXT16(normTrie, p, limit, c, norm16);
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
|
||||
cc=getCCFromYesOrMaybe(norm16);
|
||||
if( // this character combines backward and
|
||||
isMaybe(norm16) &&
|
||||
|
@ -1229,7 +1211,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
|
|||
// Is the composite a starter that combines forward?
|
||||
if(compositeAndFwd&1) {
|
||||
compositionsList=
|
||||
getCompositionsListForComposite(getNorm16(composite));
|
||||
getCompositionsListForComposite(getRawNorm16(composite));
|
||||
} else {
|
||||
compositionsList=NULL;
|
||||
}
|
||||
|
@ -1268,7 +1250,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
|
|||
|
||||
UChar32
|
||||
Normalizer2Impl::composePair(UChar32 a, UChar32 b) const {
|
||||
uint16_t norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16=0
|
||||
uint16_t norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16
|
||||
const uint16_t *list;
|
||||
if(isInert(norm16)) {
|
||||
return U_SENTINEL;
|
||||
|
@ -1359,28 +1341,22 @@ Normalizer2Impl::compose(const UChar *src, const UChar *limit,
|
|||
return TRUE;
|
||||
}
|
||||
if( (c=*src)<minNoMaybeCP ||
|
||||
isCompYesAndZeroCC(norm16=UTRIE2_GET16_FROM_U16_SINGLE_LEAD(normTrie, c))
|
||||
isCompYesAndZeroCC(norm16=UCPTRIE_FAST_BMP_GET(normTrie, UCPTRIE_16, c))
|
||||
) {
|
||||
++src;
|
||||
} else {
|
||||
prevSrc = src++;
|
||||
if(!U16_IS_SURROGATE(c)) {
|
||||
if(!U16_IS_LEAD(c)) {
|
||||
break;
|
||||
} else {
|
||||
UChar c2;
|
||||
if(U16_IS_SURROGATE_LEAD(c)) {
|
||||
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
|
||||
++src;
|
||||
c=U16_GET_SUPPLEMENTARY(c, c2);
|
||||
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
|
||||
++src;
|
||||
c=U16_GET_SUPPLEMENTARY(c, c2);
|
||||
norm16=UCPTRIE_FAST_SUPP_GET(normTrie, UCPTRIE_16, c);
|
||||
if(!isCompYesAndZeroCC(norm16)) {
|
||||
break;
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevBoundary<prevSrc && U16_IS_LEAD(c2=*(prevSrc-1))) {
|
||||
--prevSrc;
|
||||
c=U16_GET_SUPPLEMENTARY(c2, c);
|
||||
}
|
||||
}
|
||||
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -1529,7 +1505,7 @@ Normalizer2Impl::compose(const UChar *src, const UChar *limit,
|
|||
}
|
||||
uint8_t prevCC = cc;
|
||||
nextSrc = src;
|
||||
UTRIE2_U16_NEXT16(normTrie, nextSrc, limit, c, n16);
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, c, n16);
|
||||
if (n16 >= MIN_YES_YES_WITH_CC) {
|
||||
cc = getCCFromNormalYesOrMaybe(n16);
|
||||
if (prevCC > cc) {
|
||||
|
@ -1559,7 +1535,7 @@ Normalizer2Impl::compose(const UChar *src, const UChar *limit,
|
|||
// decompose and recompose.
|
||||
if (prevBoundary != prevSrc && !norm16HasCompBoundaryBefore(norm16)) {
|
||||
const UChar *p = prevSrc;
|
||||
UTRIE2_U16_PREV16(normTrie, prevBoundary, p, c, norm16);
|
||||
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, prevBoundary, p, c, norm16);
|
||||
if (!norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
|
||||
prevSrc = p;
|
||||
}
|
||||
|
@ -1626,28 +1602,22 @@ Normalizer2Impl::composeQuickCheck(const UChar *src, const UChar *limit,
|
|||
return src;
|
||||
}
|
||||
if( (c=*src)<minNoMaybeCP ||
|
||||
isCompYesAndZeroCC(norm16=UTRIE2_GET16_FROM_U16_SINGLE_LEAD(normTrie, c))
|
||||
isCompYesAndZeroCC(norm16=UCPTRIE_FAST_BMP_GET(normTrie, UCPTRIE_16, c))
|
||||
) {
|
||||
++src;
|
||||
} else {
|
||||
prevSrc = src++;
|
||||
if(!U16_IS_SURROGATE(c)) {
|
||||
if(!U16_IS_LEAD(c)) {
|
||||
break;
|
||||
} else {
|
||||
UChar c2;
|
||||
if(U16_IS_SURROGATE_LEAD(c)) {
|
||||
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
|
||||
++src;
|
||||
c=U16_GET_SUPPLEMENTARY(c, c2);
|
||||
if(src!=limit && U16_IS_TRAIL(c2=*src)) {
|
||||
++src;
|
||||
c=U16_GET_SUPPLEMENTARY(c, c2);
|
||||
norm16=UCPTRIE_FAST_SUPP_GET(normTrie, UCPTRIE_16, c);
|
||||
if(!isCompYesAndZeroCC(norm16)) {
|
||||
break;
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevBoundary<prevSrc && U16_IS_LEAD(c2=*(prevSrc-1))) {
|
||||
--prevSrc;
|
||||
c=U16_GET_SUPPLEMENTARY(c2, c);
|
||||
}
|
||||
}
|
||||
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -1665,7 +1635,7 @@ Normalizer2Impl::composeQuickCheck(const UChar *src, const UChar *limit,
|
|||
} else {
|
||||
const UChar *p = prevSrc;
|
||||
uint16_t n16;
|
||||
UTRIE2_U16_PREV16(normTrie, prevBoundary, p, c, n16);
|
||||
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, prevBoundary, p, c, n16);
|
||||
if (norm16HasCompBoundaryAfter(n16, onlyContiguous)) {
|
||||
prevBoundary = prevSrc;
|
||||
} else {
|
||||
|
@ -1699,7 +1669,7 @@ Normalizer2Impl::composeQuickCheck(const UChar *src, const UChar *limit,
|
|||
}
|
||||
uint8_t prevCC = cc;
|
||||
nextSrc = src;
|
||||
UTRIE2_U16_NEXT16(normTrie, nextSrc, limit, c, norm16);
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, c, norm16);
|
||||
if (isMaybeOrNonZeroCC(norm16)) {
|
||||
cc = getCCFromYesOrMaybe(norm16);
|
||||
if (!(prevCC <= cc || cc == 0)) {
|
||||
|
@ -1786,7 +1756,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
|
|||
++src;
|
||||
} else {
|
||||
prevSrc = src;
|
||||
UTRIE2_U8_NEXT16(normTrie, src, limit, norm16);
|
||||
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, src, limit, norm16);
|
||||
if (!isCompYesAndZeroCC(norm16)) {
|
||||
break;
|
||||
}
|
||||
|
@ -1945,7 +1915,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
|
|||
}
|
||||
uint8_t prevCC = cc;
|
||||
nextSrc = src;
|
||||
UTRIE2_U8_NEXT16(normTrie, nextSrc, limit, n16);
|
||||
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, n16);
|
||||
if (n16 >= MIN_YES_YES_WITH_CC) {
|
||||
cc = getCCFromNormalYesOrMaybe(n16);
|
||||
if (prevCC > cc) {
|
||||
|
@ -1975,7 +1945,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
|
|||
// decompose and recompose.
|
||||
if (prevBoundary != prevSrc && !norm16HasCompBoundaryBefore(norm16)) {
|
||||
const uint8_t *p = prevSrc;
|
||||
UTRIE2_U8_PREV16(normTrie, prevBoundary, p, norm16);
|
||||
UCPTRIE_FAST_U8_PREV(normTrie, UCPTRIE_16, prevBoundary, p, norm16);
|
||||
if (!norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
|
||||
prevSrc = p;
|
||||
}
|
||||
|
@ -2023,7 +1993,7 @@ UBool Normalizer2Impl::hasCompBoundaryBefore(const UChar *src, const UChar *limi
|
|||
}
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UTRIE2_U16_NEXT16(normTrie, src, limit, c, norm16);
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, src, limit, c, norm16);
|
||||
return norm16HasCompBoundaryBefore(norm16);
|
||||
}
|
||||
|
||||
|
@ -2032,7 +2002,7 @@ UBool Normalizer2Impl::hasCompBoundaryBefore(const uint8_t *src, const uint8_t *
|
|||
return TRUE;
|
||||
}
|
||||
uint16_t norm16;
|
||||
UTRIE2_U8_NEXT16(normTrie, src, limit, norm16);
|
||||
UCPTRIE_FAST_U8_NEXT(normTrie, UCPTRIE_16, src, limit, norm16);
|
||||
return norm16HasCompBoundaryBefore(norm16);
|
||||
}
|
||||
|
||||
|
@ -2043,7 +2013,7 @@ UBool Normalizer2Impl::hasCompBoundaryAfter(const UChar *start, const UChar *p,
|
|||
}
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UTRIE2_U16_PREV16(normTrie, start, p, c, norm16);
|
||||
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, start, p, c, norm16);
|
||||
return norm16HasCompBoundaryAfter(norm16, onlyContiguous);
|
||||
}
|
||||
|
||||
|
@ -2053,36 +2023,42 @@ UBool Normalizer2Impl::hasCompBoundaryAfter(const uint8_t *start, const uint8_t
|
|||
return TRUE;
|
||||
}
|
||||
uint16_t norm16;
|
||||
UTRIE2_U8_PREV16(normTrie, start, p, norm16);
|
||||
UCPTRIE_FAST_U8_PREV(normTrie, UCPTRIE_16, start, p, norm16);
|
||||
return norm16HasCompBoundaryAfter(norm16, onlyContiguous);
|
||||
}
|
||||
|
||||
const UChar *Normalizer2Impl::findPreviousCompBoundary(const UChar *start, const UChar *p,
|
||||
UBool onlyContiguous) const {
|
||||
BackwardUTrie2StringIterator iter(normTrie, start, p);
|
||||
for(;;) {
|
||||
uint16_t norm16=iter.previous16();
|
||||
while (p != start) {
|
||||
const UChar *codePointLimit = p;
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, start, p, c, norm16);
|
||||
if (norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
|
||||
return iter.codePointLimit;
|
||||
return codePointLimit;
|
||||
}
|
||||
if (hasCompBoundaryBefore(iter.codePoint, norm16)) {
|
||||
return iter.codePointStart;
|
||||
if (hasCompBoundaryBefore(c, norm16)) {
|
||||
return p;
|
||||
}
|
||||
}
|
||||
return p;
|
||||
}
|
||||
|
||||
const UChar *Normalizer2Impl::findNextCompBoundary(const UChar *p, const UChar *limit,
|
||||
UBool onlyContiguous) const {
|
||||
ForwardUTrie2StringIterator iter(normTrie, p, limit);
|
||||
for(;;) {
|
||||
uint16_t norm16=iter.next16();
|
||||
if (hasCompBoundaryBefore(iter.codePoint, norm16)) {
|
||||
return iter.codePointStart;
|
||||
while (p != limit) {
|
||||
const UChar *codePointStart = p;
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
|
||||
if (hasCompBoundaryBefore(c, norm16)) {
|
||||
return codePointStart;
|
||||
}
|
||||
if (norm16HasCompBoundaryAfter(norm16, onlyContiguous)) {
|
||||
return iter.codePointLimit;
|
||||
return p;
|
||||
}
|
||||
}
|
||||
return p;
|
||||
}
|
||||
|
||||
uint8_t Normalizer2Impl::getPreviousTrailCC(const UChar *start, const UChar *p) const {
|
||||
|
@ -2130,7 +2106,7 @@ uint16_t Normalizer2Impl::getFCD16FromNormData(UChar32 c) const {
|
|||
}
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c=mapAlgorithmic(c, norm16);
|
||||
norm16=getNorm16(c);
|
||||
norm16=getRawNorm16(c);
|
||||
}
|
||||
}
|
||||
if(norm16<=minYesNo || isHangulLVT(norm16)) {
|
||||
|
@ -2195,17 +2171,10 @@ Normalizer2Impl::makeFCD(const UChar *src, const UChar *limit,
|
|||
prevFCD16=0;
|
||||
++src;
|
||||
} else {
|
||||
if(U16_IS_SURROGATE(c)) {
|
||||
if(U16_IS_LEAD(c)) {
|
||||
UChar c2;
|
||||
if(U16_IS_SURROGATE_LEAD(c)) {
|
||||
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
|
||||
c=U16_GET_SUPPLEMENTARY(c, c2);
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevSrc<src && U16_IS_LEAD(c2=*(src-1))) {
|
||||
--src;
|
||||
c=U16_GET_SUPPLEMENTARY(c2, c);
|
||||
}
|
||||
if((src+1)!=limit && U16_IS_TRAIL(c2=src[1])) {
|
||||
c=U16_GET_SUPPLEMENTARY(c, c2);
|
||||
}
|
||||
}
|
||||
if((fcd16=getFCD16FromNormData(c))<=0xff) {
|
||||
|
@ -2336,7 +2305,7 @@ const UChar *Normalizer2Impl::findPreviousFCDBoundary(const UChar *start, const
|
|||
const UChar *codePointLimit = p;
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UTRIE2_U16_PREV16(normTrie, start, p, c, norm16);
|
||||
UCPTRIE_FAST_U16_PREV(normTrie, UCPTRIE_16, start, p, c, norm16);
|
||||
if (c < minDecompNoCP || norm16HasDecompBoundaryAfter(norm16)) {
|
||||
return codePointLimit;
|
||||
}
|
||||
|
@ -2352,7 +2321,7 @@ const UChar *Normalizer2Impl::findNextFCDBoundary(const UChar *p, const UChar *l
|
|||
const UChar *codePointStart=p;
|
||||
UChar32 c;
|
||||
uint16_t norm16;
|
||||
UTRIE2_U16_NEXT16(normTrie, p, limit, c, norm16);
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
|
||||
if (c < minLcccCP || norm16HasDecompBoundaryBefore(norm16)) {
|
||||
return codePointStart;
|
||||
}
|
||||
|
@ -2366,19 +2335,20 @@ const UChar *Normalizer2Impl::findNextFCDBoundary(const UChar *p, const UChar *l
|
|||
// CanonicalIterator data -------------------------------------------------- ***
|
||||
|
||||
CanonIterData::CanonIterData(UErrorCode &errorCode) :
|
||||
trie(utrie2_open(0, 0, &errorCode)),
|
||||
mutableTrie(umutablecptrie_open(0, 0, &errorCode)), trie(nullptr),
|
||||
canonStartSets(uprv_deleteUObject, NULL, errorCode) {}
|
||||
|
||||
CanonIterData::~CanonIterData() {
|
||||
utrie2_close(trie);
|
||||
umutablecptrie_close(mutableTrie);
|
||||
ucptrie_close(trie);
|
||||
}
|
||||
|
||||
void CanonIterData::addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode &errorCode) {
|
||||
uint32_t canonValue=utrie2_get32(trie, decompLead);
|
||||
uint32_t canonValue = umutablecptrie_get(mutableTrie, decompLead);
|
||||
if((canonValue&(CANON_HAS_SET|CANON_VALUE_MASK))==0 && origin!=0) {
|
||||
// origin is the first character whose decomposition starts with
|
||||
// the character for which we are setting the value.
|
||||
utrie2_set32(trie, decompLead, canonValue|origin, &errorCode);
|
||||
umutablecptrie_set(mutableTrie, decompLead, canonValue|origin, &errorCode);
|
||||
} else {
|
||||
// origin is not the first character, or it is U+0000.
|
||||
UnicodeSet *set;
|
||||
|
@ -2390,7 +2360,7 @@ void CanonIterData::addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode
|
|||
}
|
||||
UChar32 firstOrigin=(UChar32)(canonValue&CANON_VALUE_MASK);
|
||||
canonValue=(canonValue&~CANON_VALUE_MASK)|CANON_HAS_SET|(uint32_t)canonStartSets.size();
|
||||
utrie2_set32(trie, decompLead, canonValue, &errorCode);
|
||||
umutablecptrie_set(mutableTrie, decompLead, canonValue, &errorCode);
|
||||
canonStartSets.addElement(set, errorCode);
|
||||
if(firstOrigin!=0) {
|
||||
set->add(firstOrigin);
|
||||
|
@ -2406,7 +2376,6 @@ void CanonIterData::addToStartSet(UChar32 origin, UChar32 decompLead, UErrorCode
|
|||
class InitCanonIterData {
|
||||
public:
|
||||
static void doInit(Normalizer2Impl *impl, UErrorCode &errorCode);
|
||||
static void handleRange(Normalizer2Impl *impl, UChar32 start, UChar32 end, uint16_t value, UErrorCode &errorCode);
|
||||
};
|
||||
|
||||
U_CDECL_BEGIN
|
||||
|
@ -2417,18 +2386,6 @@ initCanonIterData(Normalizer2Impl *impl, UErrorCode &errorCode) {
|
|||
InitCanonIterData::doInit(impl, errorCode);
|
||||
}
|
||||
|
||||
// Call Normalizer2Impl::makeCanonIterDataFromNorm16() for a range of same-norm16 characters.
|
||||
// context: the Normalizer2Impl
|
||||
static UBool U_CALLCONV
|
||||
enumCIDRangeHandler(const void *context, UChar32 start, UChar32 end, uint32_t value) {
|
||||
UErrorCode errorCode = U_ZERO_ERROR;
|
||||
if (value != Normalizer2Impl::INERT) {
|
||||
Normalizer2Impl *impl = (Normalizer2Impl *)context;
|
||||
InitCanonIterData::handleRange(impl, start, end, (uint16_t)value, errorCode);
|
||||
}
|
||||
return U_SUCCESS(errorCode);
|
||||
}
|
||||
|
||||
U_CDECL_END
|
||||
|
||||
void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
|
||||
|
@ -2438,8 +2395,24 @@ void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
|
|||
errorCode=U_MEMORY_ALLOCATION_ERROR;
|
||||
}
|
||||
if (U_SUCCESS(errorCode)) {
|
||||
utrie2_enum(impl->normTrie, NULL, enumCIDRangeHandler, impl);
|
||||
utrie2_freeze(impl->fCanonIterData->trie, UTRIE2_32_VALUE_BITS, &errorCode);
|
||||
UChar32 start = 0, end;
|
||||
uint32_t value;
|
||||
while ((end = ucptrie_getRange(impl->normTrie, start,
|
||||
UCPTRIE_RANGE_FIXED_LEAD_SURROGATES, Normalizer2Impl::INERT,
|
||||
nullptr, nullptr, &value)) >= 0) {
|
||||
// Call Normalizer2Impl::makeCanonIterDataFromNorm16() for a range of same-norm16 characters.
|
||||
if (value != Normalizer2Impl::INERT) {
|
||||
impl->makeCanonIterDataFromNorm16(start, end, value, *impl->fCanonIterData, errorCode);
|
||||
}
|
||||
start = end + 1;
|
||||
}
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
umutablecptrie_setName(impl->fCanonIterData->mutableTrie, "CanonIterData");
|
||||
#endif
|
||||
impl->fCanonIterData->trie = umutablecptrie_buildImmutable(
|
||||
impl->fCanonIterData->mutableTrie, UCPTRIE_TYPE_SMALL, UCPTRIE_VALUE_BITS_32, &errorCode);
|
||||
umutablecptrie_close(impl->fCanonIterData->mutableTrie);
|
||||
impl->fCanonIterData->mutableTrie = nullptr;
|
||||
}
|
||||
if (U_FAILURE(errorCode)) {
|
||||
delete impl->fCanonIterData;
|
||||
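The hunk above swaps the `utrie2_enum` callback for a pull-style loop: ask for the range that begins at `start`, handle it if its value is not inert, then continue at `end + 1` until the getter returns a negative value. The control flow can be sketched over a plain array. This is an illustration only; `getRange` here is a toy stand-in defined in the block, not the `ucptrie_getRange` signature, and `Range` and `collectNonInert` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Range {
    int32_t start;
    int32_t end;
    uint32_t value;
};

// Toy range getter: returns the last index of the run of equal values that
// begins at 'start' and stores the run's value, or returns -1 if 'start' is
// out of bounds.
inline int32_t getRange(const std::vector<uint32_t> &values, int32_t start,
                        uint32_t *pValue) {
    const int32_t size = static_cast<int32_t>(values.size());
    if (start < 0 || start >= size) {
        return -1;
    }
    const uint32_t v = values[start];
    int32_t end = start;
    while (end + 1 < size && values[end + 1] == v) {
        ++end;
    }
    assert(pValue != nullptr);
    *pValue = v;
    return end;
}

// Collect every range whose value differs from 'inert', using the same
// "start = end + 1" loop shape as the hunk.
inline std::vector<Range> collectNonInert(const std::vector<uint32_t> &values,
                                          uint32_t inert) {
    std::vector<Range> out;
    int32_t start = 0, end;
    uint32_t value;
    while ((end = getRange(values, start, &value)) >= 0) {
        if (value != inert) {
            out.push_back({start, end, value});
        }
        start = end + 1;
    }
    return out;
}
```

Compared with a callback, this shape keeps the per-range work inline and lets the caller pass its own error code through, which is presumably why the separate `handleRange` helper could be dropped.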
|
@ -2447,11 +2420,6 @@ void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
|
|||
}
|
||||
}
|
||||
|
||||
void InitCanonIterData::handleRange(
|
||||
Normalizer2Impl *impl, UChar32 start, UChar32 end, uint16_t value, UErrorCode &errorCode) {
|
||||
impl->makeCanonIterDataFromNorm16(start, end, value, *impl->fCanonIterData, errorCode);
|
||||
}
|
||||
|
||||
void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, const uint16_t norm16,
|
||||
CanonIterData &newData,
|
||||
UErrorCode &errorCode) const {
|
||||
|
@ -2465,7 +2433,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
|
|||
return;
|
||||
}
|
||||
for(UChar32 c=start; c<=end; ++c) {
|
||||
uint32_t oldValue=utrie2_get32(newData.trie, c);
|
||||
uint32_t oldValue = umutablecptrie_get(newData.mutableTrie, c);
|
||||
uint32_t newValue=oldValue;
|
||||
if(isMaybeOrNonZeroCC(norm16)) {
|
||||
// not a segment starter if it occurs in a decomposition or has cc!=0
|
||||
|
@ -2483,7 +2451,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
|
|||
if (isDecompNoAlgorithmic(norm16_2)) {
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c2 = mapAlgorithmic(c2, norm16_2);
|
||||
norm16_2 = getNorm16(c2);
|
||||
norm16_2 = getRawNorm16(c2);
|
||||
// No compatibility mappings for the CanonicalIterator.
|
||||
U_ASSERT(!(isHangulLV(norm16_2) || isHangulLVT(norm16_2)));
|
||||
}
|
||||
|
@ -2510,10 +2478,10 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
|
|||
if(norm16_2>=minNoNo) {
|
||||
while(i<length) {
|
||||
U16_NEXT_UNSAFE(mapping, i, c2);
|
||||
uint32_t c2Value=utrie2_get32(newData.trie, c2);
|
||||
uint32_t c2Value = umutablecptrie_get(newData.mutableTrie, c2);
|
||||
if((c2Value&CANON_NOT_SEGMENT_STARTER)==0) {
|
||||
utrie2_set32(newData.trie, c2, c2Value|CANON_NOT_SEGMENT_STARTER,
|
||||
&errorCode);
|
||||
umutablecptrie_set(newData.mutableTrie, c2,
|
||||
c2Value|CANON_NOT_SEGMENT_STARTER, &errorCode);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -2524,7 +2492,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
|
|||
}
|
||||
}
|
||||
if(newValue!=oldValue) {
|
||||
utrie2_set32(newData.trie, c, newValue, &errorCode);
|
||||
umutablecptrie_set(newData.mutableTrie, c, newValue, &errorCode);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -2537,7 +2505,7 @@ UBool Normalizer2Impl::ensureCanonIterData(UErrorCode &errorCode) const {
|
|||
}
|
||||
|
||||
int32_t Normalizer2Impl::getCanonValue(UChar32 c) const {
|
||||
return (int32_t)utrie2_get32(fCanonIterData->trie, c);
|
||||
return (int32_t)ucptrie_get(fCanonIterData->trie, c);
|
||||
}
|
||||
|
||||
const UnicodeSet &Normalizer2Impl::getCanonStartSet(int32_t n) const {
|
||||
|
@ -2561,7 +2529,7 @@ UBool Normalizer2Impl::getCanonStartSet(UChar32 c, UnicodeSet &set) const {
|
|||
set.add(value);
|
||||
}
|
||||
if((canonValue&CANON_HAS_COMPOSITIONS)!=0) {
|
||||
uint16_t norm16=getNorm16(c);
|
||||
uint16_t norm16=getRawNorm16(c);
|
||||
if(norm16==JAMO_L) {
|
||||
UChar32 syllable=
|
||||
(UChar32)(Hangul::HANGUL_BASE+(c-Hangul::JAMO_L_BASE)*Hangul::JAMO_VT_COUNT);
|
||||
|
@ -2608,7 +2576,7 @@ unorm2_swap(const UDataSwapper *ds,
|
|||
pInfo->dataFormat[1]==0x72 &&
|
||||
pInfo->dataFormat[2]==0x6d &&
|
||||
pInfo->dataFormat[3]==0x32 &&
|
||||
(1<=formatVersion0 && formatVersion0<=3)
|
||||
(1<=formatVersion0 && formatVersion0<=4)
|
||||
)) {
|
||||
udata_printError(ds, "unorm2_swap(): data format %02x.%02x.%02x.%02x (format version %02x) is not recognized as Normalizer2 data\n",
|
||||
pInfo->dataFormat[0], pInfo->dataFormat[1],
|
||||
|
@ -2669,9 +2637,9 @@ unorm2_swap(const UDataSwapper *ds,
|
|||
ds->swapArray32(ds, inBytes, nextOffset-offset, outBytes, pErrorCode);
|
||||
offset=nextOffset;
|
||||
|
||||
/* swap the UTrie2 */
|
||||
/* swap the trie */
|
||||
nextOffset=indexes[Normalizer2Impl::IX_EXTRA_DATA_OFFSET];
|
||||
utrie2_swap(ds, inBytes+offset, nextOffset-offset, outBytes+offset, pErrorCode);
|
||||
utrie_swapAnyVersion(ds, inBytes+offset, nextOffset-offset, outBytes+offset, pErrorCode);
|
||||
offset=nextOffset;
|
||||
|
||||
/* swap the uint16_t extraData[] */
|
||||
|
|
|
@ -24,12 +24,19 @@
|
|||
#if !UCONFIG_NO_NORMALIZATION
|
||||
|
||||
#include "unicode/normalizer2.h"
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "unicode/unistr.h"
|
||||
#include "unicode/unorm.h"
|
||||
#include "unicode/utf.h"
|
||||
#include "unicode/utf16.h"
|
||||
#include "mutex.h"
|
||||
#include "uset_imp.h"
|
||||
#include "utrie2.h"
|
||||
|
||||
// When the nfc.nrm data is *not* hardcoded into the common library
|
||||
// (with this constant set to 0),
|
||||
// then it needs to be built into the data package:
|
||||
// Add nfc.nrm to icu4c/source/data/Makefile.in DAT_FILES_SHORT
|
||||
#define NORM2_HARDCODE_NFC_DATA 1
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
||||
|
@ -158,8 +165,7 @@ public:
|
|||
appendBMP((UChar)c, cc, errorCode) :
|
||||
appendSupplementary(c, cc, errorCode);
|
||||
}
|
||||
// s must be in NFD, otherwise change the implementation.
|
||||
UBool append(const UChar *s, int32_t length,
|
||||
UBool append(const UChar *s, int32_t length, UBool isNFD,
|
||||
uint8_t leadCC, uint8_t trailCC,
|
||||
UErrorCode &errorCode);
|
||||
UBool appendBMP(UChar c, uint8_t cc, UErrorCode &errorCode) {
|
||||
|
@ -243,7 +249,7 @@ public:
|
|||
}
|
||||
virtual ~Normalizer2Impl();
|
||||
|
||||
void init(const int32_t *inIndexes, const UTrie2 *inTrie,
|
||||
void init(const int32_t *inIndexes, const UCPTrie *inTrie,
|
||||
const uint16_t *inExtraData, const uint8_t *inSmallFCD);
|
||||
|
||||
void addLcccChars(UnicodeSet &set) const;
|
||||
|
@ -254,7 +260,12 @@ public:
|
|||
|
||||
UBool ensureCanonIterData(UErrorCode &errorCode) const;
|
||||
|
||||
uint16_t getNorm16(UChar32 c) const { return UTRIE2_GET16(normTrie, c); }
|
||||
// The trie stores values for lead surrogate code *units*.
|
||||
// Surrogate code *points* are inert.
|
||||
uint16_t getNorm16(UChar32 c) const {
|
||||
return U_IS_LEAD(c) ? INERT : UCPTRIE_FAST_GET(normTrie, UCPTRIE_16, c);
|
||||
}
|
||||
uint16_t getRawNorm16(UChar32 c) const { return UCPTRIE_FAST_GET(normTrie, UCPTRIE_16, c); }
|
||||
|
||||
UNormalizationCheckResult getCompQuickCheck(uint16_t norm16) const {
|
||||
if(norm16<minNoNo || MIN_YES_YES_WITH_CC<=norm16) {
|
||||
|
@ -704,7 +715,7 @@ private:
|
|||
uint16_t centerNoNoDelta;
|
||||
uint16_t minMaybeYes;
|
||||
|
||||
const UTrie2 *normTrie;
|
||||
const UCPTrie *normTrie;
|
||||
const uint16_t *maybeYesCompositions;
|
||||
const uint16_t *extraData; // mappings and/or compositions for yesYes, yesNo & noNo characters
|
||||
const uint8_t *smallFCD; // [0x100] one bit per 32 BMP code points, set if any FCD!=0
|
||||
|
@ -764,7 +775,7 @@ unorm_getFCD16(UChar32 c);
|
|||
|
||||
/**
|
||||
* Format of Normalizer2 .nrm data files.
|
||||
* Format version 3.0.
|
||||
* Format version 4.0.
|
||||
*
|
||||
* Normalizer2 .nrm data files provide data for the Unicode Normalization algorithms.
|
||||
* ICU ships with data files for standard Unicode Normalization Forms
|
||||
|
@ -818,7 +829,7 @@ unorm_getFCD16(UChar32 c);
|
|||
* minMaybeYes=indexes[IX_MIN_MAYBE_YES];
|
||||
* See the normTrie description below and the design doc for details.
|
||||
*
|
||||
* UTrie2 normTrie; -- see utrie2_impl.h and utrie2.h
|
||||
* UCPTrie normTrie; -- see ucptrie_impl.h and ucptrie.h, same as Java CodePointTrie
|
||||
*
|
||||
* The trie holds the main normalization data. Each code point is mapped to a 16-bit value.
|
||||
* Rather than using independent bits in the value (which would require more than 16 bits),
|
||||
|
@ -946,6 +957,20 @@ unorm_getFCD16(UChar32 c);
|
|||
* which is artificially assigned "worst case" values lccc=1 and tccc=255.
|
||||
*
|
||||
* - A mapping to an empty string has explicit lccc=1 and tccc=255 values.
|
||||
*
|
||||
* Changes from format version 3 to format version 4 (ICU 63) ------------------
|
||||
*
|
||||
* Switched from UTrie2 to UCPTrie/CodePointTrie.
|
||||
*
|
||||
* The new trie no longer stores different values for surrogate code *units* vs.
|
||||
* surrogate code *points*.
|
||||
* Lead surrogates still have values for optimized UTF-16 string processing.
|
||||
* When looking up code point properties, the code now checks for lead surrogates and
|
||||
* treats them as inert.
|
||||
*
|
||||
* gennorm2 now has to reject mappings for surrogate code points.
|
||||
* UTS #46 maps unpaired surrogates to U+FFFD in code rather than via its
|
||||
* custom normalization data file.
|
||||
*/
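The format-version-4 note above says the trie keeps values for lead surrogate code units while code point lookups treat lead surrogates as inert. The difference between the two lookups can be sketched with a toy table. Everything in this block is an assumption for illustration: `kInert`, `kLeadUnitValue`, `rawLookup` and `codePointLookup` are hypothetical stand-ins for `INERT`, a stored lead-unit value, `getRawNorm16()` and `getNorm16()`.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t kInert = 1;               // placeholder for the inert value
constexpr uint16_t kLeadUnitValue = 0x1234;  // placeholder for a stored lead-unit value

// True for UTF-16 lead surrogates U+D800..U+DBFF.
inline bool isLeadSurrogate(int32_t c) {
    return (static_cast<uint32_t>(c) & 0xfffffc00u) == 0xd800u;
}

// Toy "raw" lookup: returns whatever the table stores, including the special
// values kept for lead surrogate code units.
inline uint16_t rawLookup(int32_t c) {
    assert(0 <= c && c <= 0x10ffff);
    return isLeadSurrogate(c) ? kLeadUnitValue : kInert;
}

// Code point lookup: a lead surrogate code point is reported as inert,
// everything else falls through to the raw lookup.
inline uint16_t codePointLookup(int32_t c) {
    return isLeadSurrogate(c) ? kInert : rawLookup(c);
}
```

The raw variant skips the surrogate check, so it only suits call sites where the argument cannot be a lead surrogate, for example the result of an algorithmic mapping.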
|
||||
|
||||
#endif /* !UCONFIG_NO_NORMALIZATION */
|
||||
|
|
|
@ -41,6 +41,7 @@
|
|||
#include "propsvec.h"
|
||||
#include "uassert.h"
|
||||
#include "ucmndata.h"
|
||||
#include "udataswp.h"
|
||||
#include "uenumimp.h"
|
||||
#include "cmemory.h"
|
||||
#include "cstring.h"
|
||||
|
|
|
@ -28,81 +28,6 @@
|
|||
|
||||
/* swapping ----------------------------------------------------------------- */
|
||||
|
||||
/*
|
||||
* This performs data swapping for a folded trie (see utrie.c for details).
|
||||
*/
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
const UTrieHeader *inTrie;
|
||||
UTrieHeader trie;
|
||||
int32_t size;
|
||||
UBool dataIs32;
|
||||
|
||||
if(pErrorCode==NULL || U_FAILURE(*pErrorCode)) {
|
||||
return 0;
|
||||
}
|
||||
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
|
||||
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* setup and swapping */
|
||||
if(length>=0 && (uint32_t)length<sizeof(UTrieHeader)) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
inTrie=(const UTrieHeader *)inData;
|
||||
trie.signature=ds->readUInt32(inTrie->signature);
|
||||
trie.options=ds->readUInt32(inTrie->options);
|
||||
trie.indexLength=udata_readInt32(ds, inTrie->indexLength);
|
||||
trie.dataLength=udata_readInt32(ds, inTrie->dataLength);
|
||||
|
||||
if( trie.signature!=0x54726965 ||
|
||||
(trie.options&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_SHIFT ||
|
||||
((trie.options>>UTRIE_OPTIONS_INDEX_SHIFT)&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_INDEX_SHIFT ||
|
||||
trie.indexLength<UTRIE_BMP_INDEX_LENGTH ||
|
||||
(trie.indexLength&(UTRIE_SURROGATE_BLOCK_COUNT-1))!=0 ||
|
||||
trie.dataLength<UTRIE_DATA_BLOCK_LENGTH ||
|
||||
(trie.dataLength&(UTRIE_DATA_GRANULARITY-1))!=0 ||
|
||||
((trie.options&UTRIE_OPTIONS_LATIN1_IS_LINEAR)!=0 && trie.dataLength<(UTRIE_DATA_BLOCK_LENGTH+0x100))
|
||||
) {
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
|
||||
return 0;
|
||||
}
|
||||
|
||||
dataIs32=(UBool)((trie.options&UTRIE_OPTIONS_DATA_IS_32_BIT)!=0);
|
||||
size=sizeof(UTrieHeader)+trie.indexLength*2+trie.dataLength*(dataIs32?4:2);
|
||||
|
||||
if(length>=0) {
|
||||
UTrieHeader *outTrie;
|
||||
|
||||
if(length<size) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
outTrie=(UTrieHeader *)outData;
|
||||
|
||||
/* swap the header */
|
||||
ds->swapArray32(ds, inTrie, sizeof(UTrieHeader), outTrie, pErrorCode);
|
||||
|
||||
/* swap the index and the data */
|
||||
if(dataIs32) {
|
||||
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
|
||||
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, trie.dataLength*4,
|
||||
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
|
||||
} else {
|
||||
ds->swapArray16(ds, inTrie+1, (trie.indexLength+trie.dataLength)*2, outTrie+1, pErrorCode);
|
||||
}
|
||||
}
|
||||
|
||||
return size;
|
||||
}
|
||||
|
||||
#if !UCONFIG_NO_COLLATION
|
||||
|
||||
U_CAPI UBool U_EXPORT2
|
||||
|
|
573
icu4c/source/common/ucptrie.cpp
Normal file
|
@ -0,0 +1,573 @@
|
|||
// © 2017 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
// ucptrie.cpp (modified from utrie2.cpp)
|
||||
// created: 2017dec29 Markus W. Scherer
|
||||
|
||||
// #define UCPTRIE_DEBUG
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
# include <stdio.h>
|
||||
#endif
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "unicode/utf.h"
|
||||
#include "unicode/utf8.h"
|
||||
#include "unicode/utf16.h"
|
||||
#include "cmemory.h"
|
||||
#include "uassert.h"
|
||||
#include "ucptrie_impl.h"
|
||||
|
||||
U_CAPI UCPTrie * U_EXPORT2
|
||||
ucptrie_openFromBinary(UCPTrieType type, UCPTrieValueWidth valueWidth,
|
||||
const void *data, int32_t length, int32_t *pActualLength,
|
||||
UErrorCode *pErrorCode) {
|
||||
if (U_FAILURE(*pErrorCode)) {
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
if (length <= 0 || (U_POINTER_MASK_LSB(data, 3) != 0) ||
|
||||
type < UCPTRIE_TYPE_ANY || UCPTRIE_TYPE_SMALL < type ||
|
||||
valueWidth < UCPTRIE_VALUE_BITS_ANY || UCPTRIE_VALUE_BITS_8 < valueWidth) {
|
||||
*pErrorCode = U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
// Enough data for a trie header?
|
||||
if (length < (int32_t)sizeof(UCPTrieHeader)) {
|
||||
*pErrorCode = U_INVALID_FORMAT_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
// Check the signature.
|
||||
const UCPTrieHeader *header = (const UCPTrieHeader *)data;
|
||||
if (header->signature != UCPTRIE_SIG) {
|
||||
*pErrorCode = U_INVALID_FORMAT_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
int32_t options = header->options;
|
||||
int32_t typeInt = (options >> 6) & 3;
|
||||
int32_t valueWidthInt = options & UCPTRIE_OPTIONS_VALUE_BITS_MASK;
|
||||
if (typeInt > UCPTRIE_TYPE_SMALL || valueWidthInt > UCPTRIE_VALUE_BITS_8 ||
|
||||
(options & UCPTRIE_OPTIONS_RESERVED_MASK) != 0) {
|
||||
*pErrorCode = U_INVALID_FORMAT_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
UCPTrieType actualType = (UCPTrieType)typeInt;
|
||||
UCPTrieValueWidth actualValueWidth = (UCPTrieValueWidth)valueWidthInt;
|
||||
if (type < 0) {
|
||||
type = actualType;
|
||||
}
|
||||
if (valueWidth < 0) {
|
||||
valueWidth = actualValueWidth;
|
||||
}
|
||||
if (type != actualType || valueWidth != actualValueWidth) {
|
||||
*pErrorCode = U_INVALID_FORMAT_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
// Get the length values and offsets.
|
||||
UCPTrie tempTrie;
|
||||
uprv_memset(&tempTrie, 0, sizeof(tempTrie));
|
||||
tempTrie.indexLength = header->indexLength;
|
||||
tempTrie.dataLength =
|
||||
((options & UCPTRIE_OPTIONS_DATA_LENGTH_MASK) << 4) | header->dataLength;
|
||||
tempTrie.index3NullOffset = header->index3NullOffset;
|
||||
tempTrie.dataNullOffset =
|
||||
((options & UCPTRIE_OPTIONS_DATA_NULL_OFFSET_MASK) << 8) | header->dataNullOffset;
|
||||
|
||||
tempTrie.highStart = header->shiftedHighStart << UCPTRIE_SHIFT_2;
|
||||
tempTrie.shifted12HighStart = (tempTrie.highStart + 0xfff) >> 12;
|
||||
tempTrie.type = type;
|
||||
tempTrie.valueWidth = valueWidth;
|
||||
|
||||
// Calculate the actual length.
|
||||
int32_t actualLength = (int32_t)sizeof(UCPTrieHeader) + tempTrie.indexLength * 2;
|
||||
if (valueWidth == UCPTRIE_VALUE_BITS_16) {
|
||||
actualLength += tempTrie.dataLength * 2;
|
||||
} else if (valueWidth == UCPTRIE_VALUE_BITS_32) {
|
||||
actualLength += tempTrie.dataLength * 4;
|
||||
} else {
|
||||
actualLength += tempTrie.dataLength;
|
||||
}
|
||||
if (length < actualLength) {
|
||||
*pErrorCode = U_INVALID_FORMAT_ERROR; // Not enough bytes.
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
// Allocate the trie.
|
||||
UCPTrie *trie = (UCPTrie *)uprv_malloc(sizeof(UCPTrie));
|
||||
if (trie == nullptr) {
|
||||
*pErrorCode = U_MEMORY_ALLOCATION_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
uprv_memcpy(trie, &tempTrie, sizeof(tempTrie));
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
trie->name = "fromSerialized";
|
||||
#endif
|
||||
|
||||
// Set the pointers to its index and data arrays.
|
||||
const uint16_t *p16 = (const uint16_t *)(header + 1);
|
||||
trie->index = p16;
|
||||
p16 += trie->indexLength;
|
||||
|
||||
// Get the data.
|
||||
int32_t nullValueOffset = trie->dataNullOffset;
|
||||
if (nullValueOffset >= trie->dataLength) {
|
||||
nullValueOffset = trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
|
||||
}
|
||||
switch (valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
trie->data.ptr16 = p16;
|
||||
trie->nullValue = trie->data.ptr16[nullValueOffset];
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
trie->data.ptr32 = (const uint32_t *)p16;
|
||||
trie->nullValue = trie->data.ptr32[nullValueOffset];
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
trie->data.ptr8 = (const uint8_t *)p16;
|
||||
trie->nullValue = trie->data.ptr8[nullValueOffset];
|
||||
break;
|
||||
default:
|
||||
// Unreachable because valueWidth was checked above.
|
||||
*pErrorCode = U_INVALID_FORMAT_ERROR;
|
||||
return nullptr;
|
||||
}
|
||||
|
||||
if (pActualLength != nullptr) {
|
||||
*pActualLength = actualLength;
|
||||
}
|
||||
return trie;
|
||||
}
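The header decoding in ucptrie_openFromBinary() above can be sketched in isolation. This is a minimal standalone illustration, not ICU code: `ToyHeader`, `ToyFields` and `decodeHeader` are hypothetical names, and the masks and shifts are copied as plain numbers from the constants shown in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the six header fields read above; not the ICU struct itself.
struct ToyHeader {
    uint32_t signature;
    uint16_t options;
    uint16_t indexLength;
    uint16_t dataLength;        // low 16 bits only
    uint16_t index3NullOffset;
    uint16_t dataNullOffset;    // low 16 bits only
    uint16_t shiftedHighStart;
};
static_assert(sizeof(ToyHeader) == 16, "the serialized header is 16 bytes");

struct ToyFields {
    int32_t type;
    int32_t valueWidth;
    int32_t dataLength;
    int32_t dataNullOffset;
    int32_t highStart;
    bool reservedBitsClear;
};

// Same masks and shifts as in ucptrie_openFromBinary() above.
inline ToyFields decodeHeader(const ToyHeader &h) {
    int32_t options = h.options;
    ToyFields f;
    f.type = (options >> 6) & 3;                                      // bits 7..6
    f.valueWidth = options & 7;                                       // bits 2..0
    f.dataLength = ((options & 0xf000) << 4) | h.dataLength;          // bits 19..16 from options
    f.dataNullOffset = ((options & 0xf00) << 8) | h.dataNullOffset;   // bits 19..16 from options
    f.highStart = h.shiftedHighStart << 9;                            // UCPTRIE_SHIFT_2 = 9
    f.reservedBitsClear = (options & 0x38) == 0;                      // bits 5..3 must be 0
    assert(f.dataLength <= 0xfffff && f.dataNullOffset <= 0xfffff);   // both are 20-bit values
    return f;
}
```

The two 20-bit quantities are the reason `options` carries more than flags: the 16-bit header fields alone could not address a data array longer than 0xffff entries.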
|
||||
|
||||
U_CAPI void U_EXPORT2
|
||||
ucptrie_close(UCPTrie *trie) {
|
||||
uprv_free(trie);
|
||||
}
|
||||
|
||||
U_CAPI UCPTrieType U_EXPORT2
|
||||
ucptrie_getType(const UCPTrie *trie) {
|
||||
return (UCPTrieType)trie->type;
|
||||
}
|
||||
|
||||
U_CAPI UCPTrieValueWidth U_EXPORT2
|
||||
ucptrie_getValueWidth(const UCPTrie *trie) {
|
||||
return (UCPTrieValueWidth)trie->valueWidth;
|
||||
}
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_internalSmallIndex(const UCPTrie *trie, UChar32 c) {
|
||||
int32_t i1 = c >> UCPTRIE_SHIFT_1;
|
||||
if (trie->type == UCPTRIE_TYPE_FAST) {
|
||||
U_ASSERT(0xffff < c && c < trie->highStart);
|
||||
i1 += UCPTRIE_BMP_INDEX_LENGTH - UCPTRIE_OMITTED_BMP_INDEX_1_LENGTH;
|
||||
} else {
|
||||
U_ASSERT((uint32_t)c < (uint32_t)trie->highStart && trie->highStart > UCPTRIE_SMALL_LIMIT);
|
||||
i1 += UCPTRIE_SMALL_INDEX_LENGTH;
|
||||
}
|
||||
int32_t i3Block = trie->index[
|
||||
(int32_t)trie->index[i1] + ((c >> UCPTRIE_SHIFT_2) & UCPTRIE_INDEX_2_MASK)];
|
||||
int32_t i3 = (c >> UCPTRIE_SHIFT_3) & UCPTRIE_INDEX_3_MASK;
|
||||
int32_t dataBlock;
|
||||
if ((i3Block & 0x8000) == 0) {
|
||||
// 16-bit indexes
|
||||
dataBlock = trie->index[i3Block + i3];
|
||||
} else {
|
||||
// 18-bit indexes stored in groups of 9 entries per 8 indexes.
|
||||
i3Block = (i3Block & 0x7fff) + (i3 & ~7) + (i3 >> 3);
|
||||
i3 &= 7;
|
||||
dataBlock = ((int32_t)trie->index[i3Block++] << (2 + (2 * i3))) & 0x30000;
|
||||
dataBlock |= trie->index[i3Block + i3];
|
||||
}
|
||||
return dataBlock + (c & UCPTRIE_SMALL_DATA_MASK);
|
||||
}
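The "9 entries per 8 indexes" layout used above is easier to follow on a small array. The sketch below is a standalone illustration under the assumption that the index-3 block is laid out exactly as ucptrie_internalSmallIndex() reads it; `index3DataBlock` is a hypothetical helper name, and it uses unsigned arithmetic for the shift.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reads the data block offset for entry i3 of an index-3 block.
// i3Block is the index-2 entry; bit 15 set means the block stores 18-bit offsets.
inline int32_t index3DataBlock(const uint16_t *index, int32_t i3Block, int32_t i3) {
    assert(0 <= i3 && i3 < 32);
    if ((i3Block & 0x8000) == 0) {
        return index[i3Block + i3];  // plain 16-bit offsets
    }
    // Each group of 8 offsets occupies 9 words:
    // one word with eight packed 2-bit high parts, then the 8 low words.
    int32_t group = (i3Block & 0x7fff) + (i3 & ~7) + (i3 >> 3);
    int32_t gi = i3 & 7;
    // Entry 0 uses bits 15..14 of the first word, entry 7 uses bits 1..0.
    uint32_t high = ((uint32_t)index[group] << (2 + 2 * gi)) & 0x30000;
    return (int32_t)(high | index[group + 1 + gi]);
}
```

The term `(i3 >> 3)` accounts for the extra high-bits word of each preceding group, which is why the second group of a block starts 9 words after the first.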
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_internalSmallU8Index(const UCPTrie *trie, int32_t lt1, uint8_t t2, uint8_t t3) {
|
||||
UChar32 c = (lt1 << 12) | (t2 << 6) | t3;
|
||||
if (c >= trie->highStart) {
|
||||
// Possible because the UTF-8 macro compares with shifted12HighStart which may be higher.
|
||||
return trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
|
||||
}
|
||||
return ucptrie_internalSmallIndex(trie, c);
|
||||
}
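The `c >= trie->highStart` check above exists because shifted12HighStart is rounded up. A small standalone sketch of that arithmetic follows; the helper names are hypothetical, and it assumes (as the code comment above indicates) that the caller's coarse test compares the top bits of the code point with shifted12HighStart.

```cpp
#include <cassert>
#include <cstdint>

// highStart is stored in the header shifted right by 9 (UCPTRIE_SHIFT_2 in this diff).
inline int32_t highStartFromHeader(uint16_t shiftedHighStart) {
    return (int32_t)shiftedHighStart << 9;
}

// Rounded-up copy in units of 0x1000 code points, as computed in ucptrie_openFromBinary().
inline int32_t shifted12(int32_t highStart) {
    assert(0 <= highStart && highStart <= 0x110000);
    return (highStart + 0xfff) >> 12;
}

// Code point assembled the same way as in ucptrie_internalSmallU8Index():
// lt1 holds the bits above bit 11, t2 and t3 are 6-bit trail byte values.
inline int32_t assembleCodePoint(int32_t lt1, uint8_t t2, uint8_t t3) {
    return (lt1 << 12) | (t2 << 6) | t3;
}
```

With a header value of 0x89, highStart is 0x11200 while the rounded value is 0x12, so a code point such as 0x11400 passes the coarse test (0x11 < 0x12) and still lies at or above highStart.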
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_internalU8PrevIndex(const UCPTrie *trie, UChar32 c,
|
||||
const uint8_t *start, const uint8_t *src) {
|
||||
int32_t i, length;
|
||||
// Support 64-bit pointers by avoiding cast of arbitrary difference.
|
||||
if ((src - start) <= 7) {
|
||||
i = length = (int32_t)(src - start);
|
||||
} else {
|
||||
i = length = 7;
|
||||
start = src - 7;
|
||||
}
|
||||
c = utf8_prevCharSafeBody(start, 0, &i, c, -1);
|
||||
i = length - i; // Number of bytes read backward from src.
|
||||
int32_t idx = _UCPTRIE_CP_INDEX(trie, 0xffff, c);
|
||||
return (idx << 3) | i;
|
||||
}
|
||||
|
||||
namespace {
|
||||
|
||||
inline uint32_t getValue(UCPTrieData data, UCPTrieValueWidth valueWidth, int32_t dataIndex) {
|
||||
switch (valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
return data.ptr16[dataIndex];
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
return data.ptr32[dataIndex];
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
return data.ptr8[dataIndex];
|
||||
default:
|
||||
// Unreachable if the trie is properly initialized.
|
||||
return 0xffffffff;
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
U_CAPI uint32_t U_EXPORT2
|
||||
ucptrie_get(const UCPTrie *trie, UChar32 c) {
|
||||
int32_t dataIndex;
|
||||
if ((uint32_t)c <= 0x7f) {
|
||||
// linear ASCII
|
||||
dataIndex = c;
|
||||
} else {
|
||||
UChar32 fastMax = trie->type == UCPTRIE_TYPE_FAST ? 0xffff : UCPTRIE_SMALL_MAX;
|
||||
dataIndex = _UCPTRIE_CP_INDEX(trie, fastMax, c);
|
||||
}
|
||||
return getValue(trie->data, (UCPTrieValueWidth)trie->valueWidth, dataIndex);
|
||||
}
|
||||
|
||||
namespace {
|
||||
|
||||
constexpr int32_t MAX_UNICODE = 0x10ffff;
|
||||
|
||||
inline uint32_t maybeFilterValue(uint32_t value, uint32_t trieNullValue, uint32_t nullValue,
|
||||
UCPTrieValueFilter *filter, const void *context) {
|
||||
if (value == trieNullValue) {
|
||||
value = nullValue;
|
||||
} else if (filter != nullptr) {
|
||||
value = filter(context, value);
|
||||
}
|
||||
return value;
|
||||
}
|
||||
|
||||
UChar32 getRange(const void *t, UChar32 start,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue) {
|
||||
if ((uint32_t)start > MAX_UNICODE) {
|
||||
return U_SENTINEL;
|
||||
}
|
||||
const UCPTrie *trie = reinterpret_cast<const UCPTrie *>(t);
|
||||
UCPTrieValueWidth valueWidth = (UCPTrieValueWidth)trie->valueWidth;
|
||||
if (start >= trie->highStart) {
|
||||
if (pValue != nullptr) {
|
||||
int32_t di = trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
|
||||
uint32_t value = getValue(trie->data, valueWidth, di);
|
||||
if (filter != nullptr) { value = filter(context, value); }
|
||||
*pValue = value;
|
||||
}
|
||||
return MAX_UNICODE;
|
||||
}
|
||||
|
||||
uint32_t nullValue = trie->nullValue;
|
||||
if (filter != nullptr) { nullValue = filter(context, nullValue); }
|
||||
const uint16_t *index = trie->index;
|
||||
|
||||
int32_t prevI3Block = -1;
|
||||
int32_t prevBlock = -1;
|
||||
UChar32 c = start;
|
||||
uint32_t value;
|
||||
bool haveValue = false;
|
||||
do {
|
||||
int32_t i3Block;
|
||||
int32_t i3;
|
||||
int32_t i3BlockLength;
|
||||
int32_t dataBlockLength;
|
||||
if (c <= 0xffff && (trie->type == UCPTRIE_TYPE_FAST || c <= UCPTRIE_SMALL_MAX)) {
|
||||
i3Block = 0;
|
||||
i3 = c >> UCPTRIE_FAST_SHIFT;
|
||||
i3BlockLength = trie->type == UCPTRIE_TYPE_FAST ?
|
||||
UCPTRIE_BMP_INDEX_LENGTH : UCPTRIE_SMALL_INDEX_LENGTH;
|
||||
dataBlockLength = UCPTRIE_FAST_DATA_BLOCK_LENGTH;
|
||||
} else {
|
||||
// Use the multi-stage index.
|
||||
int32_t i1 = c >> UCPTRIE_SHIFT_1;
|
||||
if (trie->type == UCPTRIE_TYPE_FAST) {
|
||||
U_ASSERT(0xffff < c && c < trie->highStart);
|
||||
i1 += UCPTRIE_BMP_INDEX_LENGTH - UCPTRIE_OMITTED_BMP_INDEX_1_LENGTH;
|
||||
} else {
|
||||
U_ASSERT(c < trie->highStart && trie->highStart > UCPTRIE_SMALL_LIMIT);
|
||||
i1 += UCPTRIE_SMALL_INDEX_LENGTH;
|
||||
}
|
||||
i3Block = trie->index[
|
||||
(int32_t)trie->index[i1] + ((c >> UCPTRIE_SHIFT_2) & UCPTRIE_INDEX_2_MASK)];
|
||||
if (i3Block == prevI3Block && (c - start) >= UCPTRIE_CP_PER_INDEX_2_ENTRY) {
|
||||
// The index-3 block is the same as the previous one, and filled with value.
|
||||
U_ASSERT((c & (UCPTRIE_CP_PER_INDEX_2_ENTRY - 1)) == 0);
|
||||
c += UCPTRIE_CP_PER_INDEX_2_ENTRY;
|
||||
continue;
|
||||
}
|
||||
prevI3Block = i3Block;
|
||||
if (i3Block == trie->index3NullOffset) {
|
||||
// This is the index-3 null block.
|
||||
if (haveValue) {
|
||||
if (nullValue != value) {
|
||||
return c - 1;
|
||||
}
|
||||
} else {
|
||||
value = nullValue;
|
||||
if (pValue != nullptr) { *pValue = nullValue; }
|
||||
haveValue = true;
|
||||
}
|
||||
prevBlock = trie->dataNullOffset;
|
||||
c = (c + UCPTRIE_CP_PER_INDEX_2_ENTRY) & ~(UCPTRIE_CP_PER_INDEX_2_ENTRY - 1);
|
||||
continue;
|
||||
}
|
||||
i3 = (c >> UCPTRIE_SHIFT_3) & UCPTRIE_INDEX_3_MASK;
|
||||
i3BlockLength = UCPTRIE_INDEX_3_BLOCK_LENGTH;
|
||||
dataBlockLength = UCPTRIE_SMALL_DATA_BLOCK_LENGTH;
|
||||
}
|
||||
// Enumerate data blocks for one index-3 block.
|
||||
do {
|
||||
int32_t block;
|
||||
if ((i3Block & 0x8000) == 0) {
|
||||
block = index[i3Block + i3];
|
||||
} else {
|
||||
// 18-bit indexes stored in groups of 9 entries per 8 indexes.
|
||||
int32_t group = (i3Block & 0x7fff) + (i3 & ~7) + (i3 >> 3);
|
||||
int32_t gi = i3 & 7;
|
||||
block = ((int32_t)index[group++] << (2 + (2 * gi))) & 0x30000;
|
||||
block |= index[group + gi];
|
||||
}
|
||||
if (block == prevBlock && (c - start) >= dataBlockLength) {
|
||||
// The block is the same as the previous one, and filled with value.
|
||||
U_ASSERT((c & (dataBlockLength - 1)) == 0);
|
||||
c += dataBlockLength;
|
||||
} else {
|
||||
int32_t dataMask = dataBlockLength - 1;
|
||||
prevBlock = block;
|
||||
if (block == trie->dataNullOffset) {
|
||||
// This is the data null block.
|
||||
if (haveValue) {
|
||||
if (nullValue != value) {
|
||||
return c - 1;
|
||||
}
|
||||
} else {
|
||||
value = nullValue;
|
||||
if (pValue != nullptr) { *pValue = nullValue; }
|
||||
haveValue = true;
|
||||
}
|
||||
c = (c + dataBlockLength) & ~dataMask;
|
||||
} else {
|
||||
int32_t di = block + (c & dataMask);
|
||||
uint32_t value2 = getValue(trie->data, valueWidth, di);
|
||||
value2 = maybeFilterValue(value2, trie->nullValue, nullValue,
|
||||
filter, context);
|
||||
if (haveValue) {
|
||||
if (value2 != value) {
|
||||
return c - 1;
|
||||
}
|
||||
} else {
|
||||
value = value2;
|
||||
if (pValue != nullptr) { *pValue = value; }
|
||||
haveValue = true;
|
||||
}
|
||||
while ((++c & dataMask) != 0) {
|
||||
if (maybeFilterValue(getValue(trie->data, valueWidth, ++di),
|
||||
trie->nullValue, nullValue,
|
||||
filter, context) != value) {
|
||||
return c - 1;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
} while (++i3 < i3BlockLength);
|
||||
} while (c < trie->highStart);
|
||||
U_ASSERT(haveValue);
|
||||
int32_t di = trie->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET;
|
||||
uint32_t highValue = getValue(trie->data, valueWidth, di);
|
||||
if (maybeFilterValue(highValue, trie->nullValue, nullValue,
|
||||
filter, context) != value) {
|
||||
return c - 1;
|
||||
} else {
|
||||
return MAX_UNICODE;
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
U_CFUNC UChar32
|
||||
ucptrie_internalGetRange(UCPTrieGetRange *getRange,
|
||||
const void *trie, UChar32 start,
|
||||
UCPTrieRangeOption option, uint32_t surrogateValue,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue) {
|
||||
if (option == UCPTRIE_RANGE_NORMAL) {
|
||||
return getRange(trie, start, filter, context, pValue);
|
||||
}
|
||||
uint32_t value;
|
||||
if (pValue == nullptr) {
|
||||
// We need to examine the range value even if the caller does not want it.
|
||||
pValue = &value;
|
||||
}
|
||||
UChar32 surrEnd = option == UCPTRIE_RANGE_FIXED_ALL_SURROGATES ? 0xdfff : 0xdbff;
|
||||
UChar32 end = getRange(trie, start, filter, context, pValue);
|
||||
if (end < 0xd7ff || start > surrEnd) {
|
||||
return end;
|
||||
}
|
||||
// The range overlaps with surrogates, or ends just before the first one.
|
||||
if (*pValue == surrogateValue) {
|
||||
if (end >= surrEnd) {
|
||||
// Surrogates followed by a non-surrogateValue range,
|
||||
// or surrogates are part of a larger surrogateValue range.
|
||||
return end;
|
||||
}
|
||||
} else {
|
||||
if (start <= 0xd7ff) {
|
||||
return 0xd7ff; // Non-surrogateValue range ends before surrogateValue surrogates.
|
||||
}
|
||||
// Start is a surrogate with a non-surrogateValue code *unit* value.
|
||||
// Return a surrogateValue code *point* range.
|
||||
*pValue = surrogateValue;
|
||||
if (end > surrEnd) {
|
||||
return surrEnd; // Surrogate range ends before non-surrogateValue rest of range.
|
||||
}
|
||||
}
|
||||
// See if the surrogateValue surrogate range can be merged with
|
||||
// an immediately following range.
|
||||
uint32_t value2;
|
||||
UChar32 end2 = getRange(trie, surrEnd + 1, filter, context, &value2);
|
||||
if (value2 == surrogateValue) {
|
||||
return end2;
|
||||
}
|
||||
return surrEnd;
|
||||
}
|
||||
|
||||
U_CAPI UChar32 U_EXPORT2
|
||||
ucptrie_getRange(const UCPTrie *trie, UChar32 start,
|
||||
UCPTrieRangeOption option, uint32_t surrogateValue,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue) {
|
||||
return ucptrie_internalGetRange(getRange, trie, start,
|
||||
option, surrogateValue,
|
||||
filter, context, pValue);
|
||||
}
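Callers of a range getter like the one above typically loop by restarting just after the returned end until a negative value comes back. The following standalone sketch shows only that calling pattern: `toyGetRange` is a hypothetical stand-in that scans a flat table, not the trie walk implemented in this file, and it ignores the filter and surrogate options.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ToyRange { int32_t start, end; uint32_t value; };

// Naive stand-in for a range getter: scans a flat table instead of walking trie blocks.
// Returns the last index with the same value as values[start], or -1 if start is out of range.
inline int32_t toyGetRange(const std::vector<uint32_t> &values, int32_t start, uint32_t *pValue) {
    if (start < 0 || start >= (int32_t)values.size()) { return -1; }
    uint32_t value = values[start];
    int32_t c = start;
    while (c + 1 < (int32_t)values.size() && values[c + 1] == value) { ++c; }
    if (pValue != nullptr) { *pValue = value; }
    return c;
}

// Caller pattern: restart each query just after the end of the previous range.
inline std::vector<ToyRange> allRanges(const std::vector<uint32_t> &values) {
    std::vector<ToyRange> ranges;
    int32_t start = 0, end;
    uint32_t value = 0;
    while ((end = toyGetRange(values, start, &value)) >= 0) {
        assert(end >= start);
        ranges.push_back({start, end, value});
        start = end + 1;
    }
    return ranges;
}
```

The real function avoids the per-code-point scan by skipping whole blocks when a block offset repeats or equals the null block offset, which is what the prevBlock and prevI3Block variables above are for.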
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_toBinary(const UCPTrie *trie,
|
||||
void *data, int32_t capacity,
|
||||
UErrorCode *pErrorCode) {
|
||||
if (U_FAILURE(*pErrorCode)) {
|
||||
return 0;
|
||||
}
|
||||
|
||||
UCPTrieType type = (UCPTrieType)trie->type;
|
||||
UCPTrieValueWidth valueWidth = (UCPTrieValueWidth)trie->valueWidth;
|
||||
if (type < UCPTRIE_TYPE_FAST || UCPTRIE_TYPE_SMALL < type ||
|
||||
valueWidth < UCPTRIE_VALUE_BITS_16 || UCPTRIE_VALUE_BITS_8 < valueWidth ||
|
||||
capacity < 0 ||
|
||||
(capacity > 0 && (data == nullptr || (U_POINTER_MASK_LSB(data, 3) != 0)))) {
|
||||
*pErrorCode = U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
int32_t length = (int32_t)sizeof(UCPTrieHeader) + trie->indexLength * 2;
|
||||
switch (valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
length += trie->dataLength * 2;
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
length += trie->dataLength * 4;
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
length += trie->dataLength;
|
||||
break;
|
||||
default:
|
||||
// unreachable
|
||||
break;
|
||||
}
|
||||
if (capacity < length) {
|
||||
*pErrorCode = U_BUFFER_OVERFLOW_ERROR;
|
||||
return length;
|
||||
}
|
||||
|
||||
char *bytes = (char *)data;
|
||||
UCPTrieHeader *header = (UCPTrieHeader *)bytes;
|
||||
header->signature = UCPTRIE_SIG; // "Tri3"
|
||||
header->options = (uint16_t)(
|
||||
((trie->dataLength & 0xf0000) >> 4) |
|
||||
((trie->dataNullOffset & 0xf0000) >> 8) |
|
||||
(trie->type << 6) |
|
||||
valueWidth);
|
||||
header->indexLength = (uint16_t)trie->indexLength;
|
||||
header->dataLength = (uint16_t)trie->dataLength;
|
||||
header->index3NullOffset = trie->index3NullOffset;
|
||||
header->dataNullOffset = (uint16_t)trie->dataNullOffset;
|
||||
header->shiftedHighStart = trie->highStart >> UCPTRIE_SHIFT_2;
|
||||
bytes += sizeof(UCPTrieHeader);
|
||||
|
||||
uprv_memcpy(bytes, trie->index, trie->indexLength * 2);
|
||||
bytes += trie->indexLength * 2;
|
||||
|
||||
switch (valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
uprv_memcpy(bytes, trie->data.ptr16, trie->dataLength * 2);
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
uprv_memcpy(bytes, trie->data.ptr32, trie->dataLength * 4);
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
uprv_memcpy(bytes, trie->data.ptr8, trie->dataLength);
|
||||
break;
|
||||
default:
|
||||
// unreachable
|
||||
break;
|
||||
}
|
||||
return length;
|
||||
}
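The size computation and the options packing in ucptrie_toBinary() above can be checked with a few lines of standalone arithmetic. This is a sketch with hypothetical helper names; the numeric width codes (0 = 16-bit, 1 = 32-bit, 2 = 8-bit) are an assumption inferred from the range checks in this file, since the enum itself is declared in a header not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per data value for the assumed width codes: 0 = 16-bit, 1 = 32-bit, 2 = 8-bit.
inline int32_t bytesPerValue(int32_t valueWidth) {
    assert(0 <= valueWidth && valueWidth <= 2);
    return valueWidth == 0 ? 2 : valueWidth == 1 ? 4 : 1;
}

// 16-byte header, then 16-bit index entries, then the data array.
inline int32_t serializedLength(int32_t indexLength, int32_t dataLength, int32_t valueWidth) {
    return 16 + indexLength * 2 + dataLength * bytesPerValue(valueWidth);
}

// Same packing expression as in ucptrie_toBinary() above.
inline uint16_t packOptions(int32_t dataLength, int32_t dataNullOffset,
                            int32_t type, int32_t valueWidth) {
    return (uint16_t)(((dataLength & 0xf0000) >> 4) |
                      ((dataNullOffset & 0xf0000) >> 8) |
                      (type << 6) |
                      valueWidth);
}
```

Because the function returns the required length together with U_BUFFER_OVERFLOW_ERROR when capacity is too small, a caller can first pass a capacity of 0 to learn the size and then call again with a large enough, suitably aligned buffer.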
|
||||
|
||||
namespace {
|
||||
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
long countNull(const UCPTrie *trie) {
|
||||
uint32_t nullValue=trie->nullValue;
|
||||
int32_t length=trie->dataLength;
|
||||
long count=0;
|
||||
switch (trie->valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
for(int32_t i=0; i<length; ++i) {
|
||||
if(trie->data.ptr16[i]==nullValue) { ++count; }
|
||||
}
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
for(int32_t i=0; i<length; ++i) {
|
||||
if(trie->data.ptr32[i]==nullValue) { ++count; }
|
||||
}
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
for(int32_t i=0; i<length; ++i) {
|
||||
if(trie->data.ptr8[i]==nullValue) { ++count; }
|
||||
}
|
||||
break;
|
||||
default:
|
||||
// unreachable
|
||||
break;
|
||||
}
|
||||
return count;
|
||||
}
|
||||
|
||||
U_CFUNC void
|
||||
ucptrie_printLengths(const UCPTrie *trie, const char *which) {
|
||||
long indexLength=trie->indexLength;
|
||||
long dataLength=(long)trie->dataLength;
|
||||
long totalLength=(long)sizeof(UCPTrieHeader)+indexLength*2+
|
||||
dataLength*(trie->valueWidth==UCPTRIE_VALUE_BITS_16 ? 2 :
|
||||
trie->valueWidth==UCPTRIE_VALUE_BITS_32 ? 4 : 1);
|
||||
printf("**UCPTrieLengths(%s %s)** index:%6ld data:%6ld countNull:%6ld serialized:%6ld\n",
|
||||
which, trie->name, indexLength, dataLength, countNull(trie), totalLength);
|
||||
}
|
||||
#endif
|
||||
|
||||
} // namespace
|
284
icu4c/source/common/ucptrie_impl.h
Normal file
|
@ -0,0 +1,284 @@
|
|||
// © 2017 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
// ucptrie_impl.h (modified from utrie2_impl.h)
|
||||
// created: 2017dec29 Markus W. Scherer
|
||||
|
||||
#ifndef __UCPTRIE_IMPL_H__
|
||||
#define __UCPTRIE_IMPL_H__
|
||||
|
||||
#include "unicode/ucptrie.h"
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#endif
|
||||
|
||||
// UCPTrie signature values, in platform endianness and opposite endianness.
|
||||
// The UCPTrie signature ASCII byte values spell "Tri3".
|
||||
#define UCPTRIE_SIG 0x54726933
|
||||
#define UCPTRIE_OE_SIG 0x33697254
|
||||
|
||||
/**
|
||||
* Header data for the binary, memory-mappable representation of a UCPTrie/CodePointTrie.
|
||||
* @internal
|
||||
*/
|
||||
struct UCPTrieHeader {
|
||||
/** "Tri3" in big-endian US-ASCII (0x54726933) */
|
||||
uint32_t signature;
|
||||
|
||||
/**
|
||||
* Options bit field:
|
||||
* Bits 15..12: Data length bits 19..16.
|
||||
* Bits 11..8: Data null block offset bits 19..16.
|
||||
* Bits 7..6: UCPTrieType
|
||||
* Bits 5..3: Reserved (0).
|
||||
* Bits 2..0: UCPTrieValueWidth
|
||||
*/
|
||||
uint16_t options;
|
||||
|
||||
/** Total length of the index tables. */
|
||||
uint16_t indexLength;
|
||||
|
||||
/** Data length bits 15..0. */
|
||||
uint16_t dataLength;
|
||||
|
||||
/** Index-3 null block offset, 0x7fff or 0xffff if none. */
|
||||
uint16_t index3NullOffset;
|
||||
|
||||
/** Data null block offset bits 15..0; the full 20-bit offset is 0xfffff if none. */
|
||||
uint16_t dataNullOffset;
|
||||
|
||||
/**
|
||||
* First code point of the single-value range ending with U+10ffff,
|
||||
* rounded up and then shifted right by UCPTRIE_SHIFT_2.
|
||||
*/
|
||||
uint16_t shiftedHighStart;
|
||||
};
|
||||
|
||||
/**
|
||||
* Constants for use with UCPTrieHeader.options.
|
||||
* @internal
|
||||
*/
|
||||
enum {
|
||||
UCPTRIE_OPTIONS_DATA_LENGTH_MASK = 0xf000,
|
||||
UCPTRIE_OPTIONS_DATA_NULL_OFFSET_MASK = 0xf00,
|
||||
UCPTRIE_OPTIONS_RESERVED_MASK = 0x38,
|
||||
UCPTRIE_OPTIONS_VALUE_BITS_MASK = 7,
|
||||
/**
|
||||
* Value for index3NullOffset which indicates that there is no index-3 null block.
|
||||
* Bit 15 is unused for this value because this bit is used if the index-3 contains
|
||||
* 18-bit indexes.
|
||||
*/
|
||||
UCPTRIE_NO_INDEX3_NULL_OFFSET = 0x7fff,
|
||||
UCPTRIE_NO_DATA_NULL_OFFSET = 0xfffff
|
||||
};
|
||||
|
||||
// Internal constants.
|
||||
enum {
|
||||
/** The length of the BMP index table. 1024=0x400 */
|
||||
UCPTRIE_BMP_INDEX_LENGTH = 0x10000 >> UCPTRIE_FAST_SHIFT,
|
||||
|
||||
UCPTRIE_SMALL_LIMIT = 0x1000,
|
||||
UCPTRIE_SMALL_INDEX_LENGTH = UCPTRIE_SMALL_LIMIT >> UCPTRIE_FAST_SHIFT,
|
||||
|
||||
/** Shift size for getting the index-3 table offset. */
|
||||
UCPTRIE_SHIFT_3 = 4,
|
||||
|
||||
/** Shift size for getting the index-2 table offset. */
|
||||
UCPTRIE_SHIFT_2 = 5 + UCPTRIE_SHIFT_3,
|
||||
|
||||
/** Shift size for getting the index-1 table offset. */
|
||||
UCPTRIE_SHIFT_1 = 5 + UCPTRIE_SHIFT_2,
|
||||
|
||||
/**
|
||||
* Difference between two shift sizes,
|
||||
* for getting an index-2 offset from an index-3 offset. 5=9-4
|
||||
*/
|
||||
UCPTRIE_SHIFT_2_3 = UCPTRIE_SHIFT_2 - UCPTRIE_SHIFT_3,
|
||||
|
||||
/**
|
||||
* Difference between two shift sizes,
|
||||
* for getting an index-1 offset from an index-2 offset. 5=14-9
|
||||
*/
|
||||
UCPTRIE_SHIFT_1_2 = UCPTRIE_SHIFT_1 - UCPTRIE_SHIFT_2,
|
||||
|
||||
/**
|
||||
* Number of index-1 entries for the BMP. (4)
|
||||
* This part of the index-1 table is omitted from the serialized form.
|
||||
*/
|
||||
UCPTRIE_OMITTED_BMP_INDEX_1_LENGTH = 0x10000 >> UCPTRIE_SHIFT_1,
|
||||
|
||||
/** Number of entries in an index-2 block. 32=0x20 */
|
||||
UCPTRIE_INDEX_2_BLOCK_LENGTH = 1 << UCPTRIE_SHIFT_1_2,
|
||||
|
||||
/** Mask for getting the lower bits for the in-index-2-block offset. */
|
||||
UCPTRIE_INDEX_2_MASK = UCPTRIE_INDEX_2_BLOCK_LENGTH - 1,
|
||||
|
||||
/** Number of code points per index-2 table entry. 512=0x200 */
|
||||
UCPTRIE_CP_PER_INDEX_2_ENTRY = 1 << UCPTRIE_SHIFT_2,
|
||||
|
||||
/** Number of entries in an index-3 block. 32=0x20 */
|
||||
UCPTRIE_INDEX_3_BLOCK_LENGTH = 1 << UCPTRIE_SHIFT_2_3,
|
||||
|
||||
/** Mask for getting the lower bits for the in-index-3-block offset. */
|
||||
UCPTRIE_INDEX_3_MASK = UCPTRIE_INDEX_3_BLOCK_LENGTH - 1,
|
||||
|
||||
/** Number of entries in a small data block. 16=0x10 */
|
||||
UCPTRIE_SMALL_DATA_BLOCK_LENGTH = 1 << UCPTRIE_SHIFT_3,
|
||||
|
||||
/** Mask for getting the lower bits for the in-small-data-block offset. */
|
||||
UCPTRIE_SMALL_DATA_MASK = UCPTRIE_SMALL_DATA_BLOCK_LENGTH - 1
|
||||
};
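The shift constants above split a code point into three index fields and a data offset. A standalone sketch of that split follows, with hypothetical names (`ToySplit`, `splitCodePoint`) and the shift widths copied from the enum; it deliberately leaves out the type-specific offset that the lookup code adds to the index-1 position.

```cpp
#include <cassert>
#include <cstdint>

// Shift widths copied from the enum above: 4 data bits, then 5 + 5 index bits.
constexpr int32_t kShift3 = 4;
constexpr int32_t kShift2 = 5 + kShift3;   // 9
constexpr int32_t kShift1 = 5 + kShift2;   // 14

struct ToySplit { int32_t i1, i2, i3, dataOffset; };

// Splits c into the index-1 position (before any table offset is added),
// the positions inside the index-2 and index-3 blocks, and the offset in the data block.
inline ToySplit splitCodePoint(int32_t c) {
    assert(0 <= c && c <= 0x10ffff);
    return { c >> kShift1, (c >> kShift2) & 0x1f, (c >> kShift3) & 0x1f, c & 0xf };
}
```

For U+1F600 this gives i1 = 7, i2 = 27, i3 = 0 and data offset 0; recombining the fields with the same shifts reproduces the code point.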
|
||||
|
||||
typedef UChar32
|
||||
UCPTrieGetRange(const void *trie, UChar32 start,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
|
||||
|
||||
U_CFUNC UChar32
|
||||
ucptrie_internalGetRange(UCPTrieGetRange *getRange,
|
||||
const void *trie, UChar32 start,
|
||||
UCPTrieRangeOption option, uint32_t surrogateValue,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
|
||||
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
U_CFUNC void
|
||||
ucptrie_printLengths(const UCPTrie *trie, const char *which);
|
||||
|
||||
U_CFUNC void umutablecptrie_setName(UMutableCPTrie *builder, const char *name);
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Format of the binary, memory-mappable representation of a UCPTrie/CodePointTrie.
|
||||
* For overview information see http://site.icu-project.org/design/struct/utrie
|
||||
*
|
||||
* The binary trie data should be 32-bit-aligned.
|
||||
* The overall layout is:
|
||||
*
|
||||
* UCPTrieHeader header; -- 16 bytes, see struct definition above
|
||||
* uint16_t index[header.indexLength];
|
||||
* uintXY_t data[header.dataLength];
|
||||
*
|
||||
* The trie data array is an array of uint16_t, uint32_t, or uint8_t,
|
||||
* specified via the UCPTrieValueWidth when building the trie.
|
||||
* The data array is 32-bit-aligned for uint32_t, otherwise 16-bit-aligned.
|
||||
* The overall length of the trie data is a multiple of 4 bytes.
|
||||
* (Padding is added at the end of the index array and/or near the end of the data array as needed.)
|
||||
*
|
||||
* The length of the data array (dataLength) is stored as an integer split across two fields
|
||||
* of the header struct (high bits in header.options).
|
||||
*
|
||||
* The trie type can be "fast" or "small" which determines the index structure,
|
||||
* specified via the UCPTrieType when building the trie.
|
||||
*
|
||||
* The type and valueWidth are stored in the header.options.
|
||||
* There are reserved type and valueWidth values, and reserved header.options bits.
|
||||
* They could be used in future format extensions.
|
||||
* Code reading the trie structure must fail with an error when unknown values or options are set.
|
||||
*
|
||||
* Values for ASCII characters (U+0000..U+007F) can always be found at the start of the data array.
|
||||
*
|
||||
* Values for code points below a type-specific fast-indexing limit are found via two-stage lookup.
|
||||
* For a "fast" trie, the limit is the BMP/supplementary boundary at U+10000.
|
||||
* For a "small" trie, the limit is UCPTRIE_SMALL_MAX+1=U+1000.
|
||||
*
|
||||
* All code points in the range highStart..U+10FFFF map to a single highValue
|
||||
* which is stored at the second-to-last position of the data array.
|
||||
* (See UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET.)
|
||||
* The highStart value is header.shiftedHighStart<<UCPTRIE_SHIFT_2.
|
||||
* (UCPTRIE_SHIFT_2=9)
|
||||
*
|
||||
* Values for code points fast_limit..highStart-1 are found via four-stage lookup.
|
||||
* The data block size is smaller for this range than for the fast range.
|
||||
* This together with more index stages with small blocks makes this range
|
||||
* more easily compactable.
|
||||
*
|
||||
* There is also a trie error value stored at the last position of the data array.
|
||||
* (See UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET.)
|
||||
* It is intended to be returned for inputs that are not Unicode code points
|
||||
* (outside U+0000..U+10FFFF), or in string processing for ill-formed input
|
||||
* (unpaired surrogate in UTF-16, ill-formed UTF-8 subsequence).
|
||||
*
|
||||
* For a "fast" trie:
|
||||
*
|
||||
* The index array starts with the BMP index table for BMP code point lookup.
|
||||
* Its length is 1024=0x400.
|
||||
*
|
||||
* The supplementary index-1 table follows the BMP index table.
|
||||
* Variable length, for code points up to highStart-1.
|
||||
* Maximum length 64=0x40=0x100000>>UCPTRIE_SHIFT_1.
|
||||
* (For 0x100000 supplementary code points U+10000..U+10ffff.)
|
||||
*
|
||||
* After this index-1 table follow the variable-length index-3 and index-2 tables.
|
||||
*
|
||||
* The supplementary index tables are omitted completely
|
||||
* if there is only BMP data (highStart<=U+10000).
|
||||
*
|
||||
* For a "small" trie:
|
||||
*
|
||||
* The index array starts with a fast-index table for lookup of code points U+0000..U+0FFF.
|
||||
*
|
||||
* The "supplementary" index tables are always stored.
|
||||
* The index-1 table starts from U+0000, its maximum length is 68=0x44=0x110000>>UCPTRIE_SHIFT_1.
|
||||
*
|
||||
* For both trie types:
|
||||
*
|
||||
* The last index-2 block may be a partial block, storing indexes only for code points
|
||||
* below highStart.
|
||||
*
|
||||
* Lookup for ASCII code point c:
|
||||
*
|
||||
* Linear access from the start of the data array.
|
||||
*
|
||||
* value = data[c];
|
||||
*
|
||||
* Lookup for fast-range code point c:
|
||||
*
|
||||
* Shift the code point right by UCPTRIE_FAST_SHIFT=6 bits,
|
||||
* fetch the index array value at that offset,
|
||||
* add the lower code point bits, index into the data array.
|
||||
*
|
||||
* value = data[index[c>>6] + (c&0x3f)];
|
||||
*
|
||||
* (This works for ASCII as well.)
|
||||
*
|
||||
* Lookup for small-range code point c below highStart:
|
||||
*
|
||||
* Split the code point into four bit fields using several sets of shifts & masks
|
||||
* to read consecutive values from the index-1, index-2, index-3 and data tables.
|
||||
*
|
||||
* If all of the data block offsets in an index-3 block fit within 16 bits (up to 0xffff),
|
||||
* then the data block offsets are stored directly as uint16_t.
|
||||
*
|
||||
* Otherwise (this is very unusual but possible), the index-2 entry for the index-3 block
|
||||
* has bit 15 (0x8000) set, and each set of 8 index-3 entries is preceded by
|
||||
* an additional uint16_t word. Data block offsets are 18 bits wide, with the top 2 bits stored
|
||||
* in the additional word.
|
||||
*
|
||||
* See ucptrie_internalSmallIndex() for details.
|
||||
*
|
||||
* (In a "small" trie, this works for ASCII and below-fast_limit code points as well.)
|
||||
*
|
||||
* Compaction:
|
||||
*
|
||||
* Multiple code point ranges ("blocks") that are aligned on certain boundaries
|
||||
* (determined by the shifting/bit fields of code points) and
|
||||
* map to the same data values normally share a single subsequence of the data array.
|
||||
* Data blocks can also overlap partially.
|
||||
* (Depending on the builder code finding duplicate and overlapping blocks.)
|
||||
*
|
||||
* Iteration over same-value ranges:
|
||||
*
|
||||
* Range iteration (ucptrie_getRange()) walks the structure from a start code point
|
||||
* until some code point is found that maps to a different value;
|
||||
* the end of the returned range is just before that.
|
||||
*
|
||||
* The header.dataNullOffset (split across two header fields, high bits in header.options)
|
||||
* is the offset of a widely shared data block filled with one single value.
|
||||
* It helps quickly skip over large ranges of data with that value.
|
||||
* Similarly, the header.index3NullOffset is the index-array offset of an index-3 block
|
||||
* where all index entries point to the dataNullOffset.
|
||||
* If there is no such data or index-3 block, then these offsets are set to
|
||||
* values that cannot be reached (data offset out of range/reserved index offset),
|
||||
* normally UCPTRIE_NO_DATA_NULL_OFFSET or UCPTRIE_NO_INDEX3_NULL_OFFSET respectively.
|
||||
*/
|
||||
|
||||
#endif
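The fast-range lookup described in the comment above, `data[index[c>>6] + (c&0x3f)]`, together with the block sharing described under "Compaction", can be illustrated with a toy two-stage table. This is only a sketch under stated assumptions: `toyIndex`, `toyData`, `toy_build()` and `toy_get()` are hypothetical names that are not part of ICU, and the tables cover only U+0000..U+00FF. A real UCPTrie is produced by the builder and read via ucptrie_get() or the access macros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    TOY_SHIFT = 6,
    TOY_BLOCK_LENGTH = 1 << TOY_SHIFT,   /* 64 code points per data block */
    TOY_MASK = TOY_BLOCK_LENGTH - 1
};

/* Hypothetical toy tables: four index entries for U+0000..U+00FF. */
static uint16_t toyIndex[4];
static uint8_t toyData[2 * TOY_BLOCK_LENGTH];

/* Fills the tables: 1 for A-Z, 2 for a-z, 0 for everything else. */
void toy_build(void) {
    memset(toyData, 0, sizeof(toyData));
    for (int c = 0x41; c <= 0x5a; ++c) { toyData[TOY_BLOCK_LENGTH + (c & TOY_MASK)] = 1; }
    for (int c = 0x61; c <= 0x7a; ++c) { toyData[TOY_BLOCK_LENGTH + (c & TOY_MASK)] = 2; }
    /* Blocks 0, 2 and 3 are all-zero and share the data block at offset 0.
     * Only block 1 (U+0040..U+007F) needs its own data block. */
    toyIndex[0] = 0;
    toyIndex[1] = TOY_BLOCK_LENGTH;
    toyIndex[2] = 0;
    toyIndex[3] = 0;
}

/* Two-stage lookup: the upper bits select the block, the lower 6 bits the entry. */
uint8_t toy_get(int32_t c) {
    assert(c >= 0 && c <= 0xff);
    return toyData[toyIndex[c >> TOY_SHIFT] + (c & TOY_MASK)];
}
```

Three of the four index entries point at the same all-zero data block, which is the kind of sharing that the compaction step is described as finding.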
|
|
@ -333,6 +333,43 @@ uprv_compareInvEbcdic(const UDataSwapper *ds,
|
|||
# error Unknown charset family!
|
||||
#endif
|
||||
|
||||
// utrie_swap.cpp -----------------------------------------------------------***
|
||||
|
||||
/**
|
||||
* Swaps a serialized UTrie.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Swaps a serialized UTrie2.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Swaps a serialized UCPTrie.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Swaps a serialized UTrie, UTrie2, or UCPTrie.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie_swapAnyVersion(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/* material... -------------------------------------------------------------- */
|
||||
|
||||
|
|
1605
icu4c/source/common/umutablecptrie.cpp
Normal file
File diff suppressed because it is too large
695
icu4c/source/common/unicode/ucptrie.h
Normal file
|
@ -0,0 +1,695 @@
|
|||
// © 2017 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
// ucptrie.h (modified from utrie2.h)
|
||||
// created: 2017dec29 Markus W. Scherer
|
||||
|
||||
#ifndef __UCPTRIE_H__
|
||||
#define __UCPTRIE_H__
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/localpointer.h"
|
||||
#include "unicode/utf8.h"
|
||||
#include "putilimp.h"
|
||||
#include "udataswp.h"
|
||||
|
||||
U_CDECL_BEGIN
|
||||
|
||||
/**
|
||||
* \file
|
||||
*
|
||||
* This file defines an immutable Unicode code point trie.
|
||||
*
|
||||
* @see UCPTrie
|
||||
* @see UMutableCPTrie
|
||||
*/
|
||||
|
||||
/**
|
||||
* Immutable Unicode code point trie structure.
|
||||
* Fast, reasonably compact, map from Unicode code points (U+0000..U+10FFFF) to integer values.
|
||||
* For details see http://site.icu-project.org/design/struct/utrie
|
||||
*
|
||||
* Do not access UCPTrie fields directly; use public functions and macros.
|
||||
* Functions are easy to use: They support all trie types and value widths.
|
||||
*
|
||||
* When performance is really important, macros provide faster access.
|
||||
* Most macros are specific to either "fast" or "small" tries, see UCPTrieType.
|
||||
* There are "fast" macros for special optimized use cases.
|
||||
*
|
||||
* The macros will return bogus values, or may crash, if used on the wrong type or value width.
|
||||
*
|
||||
* @see UMutableCPTrie
|
||||
* @draft ICU 63
|
||||
*/
|
||||
struct UCPTrie;
|
||||
typedef struct UCPTrie UCPTrie;
|
||||
|
||||
/**
|
||||
* Selectors for the type of a UCPTrie.
|
||||
* Different trade-offs for size vs. speed.
|
||||
*
|
||||
* @see umutablecptrie_buildImmutable
|
||||
* @see ucptrie_openFromBinary
|
||||
* @see ucptrie_getType
|
||||
* @draft ICU 63
|
||||
*/
|
||||
enum UCPTrieType {
|
||||
/**
|
||||
* For ucptrie_openFromBinary() to accept any type.
|
||||
* ucptrie_getType() will return the actual type.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_TYPE_ANY = -1,
|
||||
/**
|
||||
* Fast/simple/larger BMP data structure. Use functions and "fast" macros.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_TYPE_FAST,
|
||||
/**
|
||||
* Small/slower BMP data structure. Use functions and "small" macros.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_TYPE_SMALL
|
||||
};
|
||||
typedef enum UCPTrieType UCPTrieType;
|
||||
|
||||
/**
|
||||
* Selectors for the number of bits in a UCPTrie data value.
|
||||
*
|
||||
* @see umutablecptrie_buildImmutable
|
||||
* @see ucptrie_openFromBinary
|
||||
* @see ucptrie_getValueWidth
|
||||
* @draft ICU 63
|
||||
*/
|
||||
enum UCPTrieValueWidth {
|
||||
/**
|
||||
* For ucptrie_openFromBinary() to accept any data value width.
|
||||
* ucptrie_getValueWidth() will return the actual data value width.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_VALUE_BITS_ANY = -1,
|
||||
/**
|
||||
* 16 bits per UCPTrie data value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_VALUE_BITS_16,
|
||||
/**
|
||||
* 32 bits per UCPTrie data value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_VALUE_BITS_32,
|
||||
/**
|
||||
* 8 bits per UCPTrie data value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
UCPTRIE_VALUE_BITS_8
|
||||
};
|
||||
typedef enum UCPTrieValueWidth UCPTrieValueWidth;
|
||||
|
||||
/**
|
||||
* Selectors for how ucptrie_getRange() should report value ranges overlapping with surrogates.
|
||||
* Most users should use UCPTRIE_RANGE_NORMAL.
|
||||
*
|
||||
* @see ucptrie_getRange
|
||||
* @draft ICU 63
|
||||
*/
|
||||
enum UCPTrieRangeOption {
|
||||
/**
|
||||
* ucptrie_getRange() enumerates all same-value ranges as stored in the trie.
|
||||
* Most users should use this option.
|
||||
*/
|
||||
UCPTRIE_RANGE_NORMAL,
|
||||
/**
|
||||
* ucptrie_getRange() enumerates all same-value ranges as stored in the trie,
|
||||
* except that lead surrogates (U+D800..U+DBFF) are treated as having the
|
||||
* surrogateValue, which is passed to getRange() as a separate parameter.
|
||||
* The surrogateValue is not transformed via filter().
|
||||
* See U_IS_LEAD(c).
|
||||
*
|
||||
* Most users should use UCPTRIE_RANGE_NORMAL instead.
|
||||
*
|
||||
* This option is useful for tries that map surrogate code *units* to
|
||||
* special values optimized for UTF-16 string processing
|
||||
* or for special error behavior for unpaired surrogates,
|
||||
* but those values are not to be associated with the lead surrogate code *points*.
|
||||
*/
|
||||
UCPTRIE_RANGE_FIXED_LEAD_SURROGATES,
|
||||
/**
|
||||
* ucptrie_getRange() enumerates all same-value ranges as stored in the trie,
|
||||
* except that all surrogates (U+D800..U+DFFF) are treated as having the
|
||||
* surrogateValue, which is passed to getRange() as a separate parameter.
|
||||
* The surrogateValue is not transformed via filter().
|
||||
* See U_IS_SURROGATE(c).
|
||||
*
|
||||
* Most users should use UCPTRIE_RANGE_NORMAL instead.
|
||||
*
|
||||
* This option is useful for tries that map surrogate code *units* to
|
||||
* special values optimized for UTF-16 string processing
|
||||
* or for special error behavior for unpaired surrogates,
|
||||
* but those values are not to be associated with the surrogate code *points*.
|
||||
*/
|
||||
UCPTRIE_RANGE_FIXED_ALL_SURROGATES
|
||||
};
|
||||
typedef enum UCPTrieRangeOption UCPTrieRangeOption;
|
||||
|
||||
/**
|
||||
* Opens a trie from its binary form, stored in 32-bit-aligned memory.
|
||||
* Inverse of ucptrie_toBinary().
|
||||
*
|
||||
* The memory must remain valid and unchanged as long as the trie is used.
|
||||
* You must ucptrie_close() the trie once you are done using it.
|
||||
*
|
||||
* @param type selects the trie type; results in a
|
||||
* U_INVALID_FORMAT_ERROR if it does not match the binary data;
|
||||
* use UCPTRIE_TYPE_ANY to accept any type
|
||||
* @param valueWidth selects the number of bits in a data value; results in a
|
||||
* U_INVALID_FORMAT_ERROR if it does not match the binary data;
|
||||
* use UCPTRIE_VALUE_BITS_ANY to accept any data value width
|
||||
* @param data a pointer to 32-bit-aligned memory containing the binary data of a UCPTrie
|
||||
* @param length the number of bytes available at data;
|
||||
* can be more than necessary
|
||||
* @param pActualLength receives the actual number of bytes at data taken up by the trie data;
|
||||
* can be NULL
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
* @return the trie
|
||||
*
|
||||
* @see umutablecptrie_open
|
||||
* @see umutablecptrie_buildImmutable
|
||||
* @see ucptrie_toBinary
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UCPTrie * U_EXPORT2
|
||||
ucptrie_openFromBinary(UCPTrieType type, UCPTrieValueWidth valueWidth,
|
||||
const void *data, int32_t length, int32_t *pActualLength,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Closes a trie and releases associated memory.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI void U_EXPORT2
|
||||
ucptrie_close(UCPTrie *trie);
|
||||
|
||||
#if U_SHOW_CPLUSPLUS_API
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
||||
/**
|
||||
* \class LocalUCPTriePointer
|
||||
* "Smart pointer" class, closes a UCPTrie via ucptrie_close().
|
||||
* For most methods see the LocalPointerBase base class.
|
||||
*
|
||||
* @see LocalPointerBase
|
||||
* @see LocalPointer
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_DEFINE_LOCAL_OPEN_POINTER(LocalUCPTriePointer, UCPTrie, ucptrie_close);
|
||||
|
||||
U_NAMESPACE_END
|
||||
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Returns the trie type.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @return the trie type
|
||||
* @see ucptrie_openFromBinary
|
||||
* @see UCPTRIE_TYPE_ANY
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UCPTrieType U_EXPORT2
|
||||
ucptrie_getType(const UCPTrie *trie);
|
||||
|
||||
/**
|
||||
* Returns the number of bits in a trie data value.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @return the number of bits in a trie data value
|
||||
* @see ucptrie_openFromBinary
|
||||
* @see UCPTRIE_VALUE_BITS_ANY
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UCPTrieValueWidth U_EXPORT2
|
||||
ucptrie_getValueWidth(const UCPTrie *trie);
|
||||
|
||||
/**
|
||||
* Returns the value for a code point as stored in the trie, with range checking.
|
||||
* Returns the trie error value if c is not in the range 0..U+10FFFF.
|
||||
*
|
||||
* Easier to use than UCPTRIE_FAST_GET() and similar macros but slower.
|
||||
* Easier to use because, unlike the macros, this function works on all UCPTrie
|
||||
* objects, for all types and value widths.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param c the code point
|
||||
* @return the trie value,
|
||||
* or the trie error value if the code point is not in the range 0..U+10FFFF
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI uint32_t U_EXPORT2
|
||||
ucptrie_get(const UCPTrie *trie, UChar32 c);
|
||||
|
||||
/**
|
||||
* Callback function type: Modifies a trie value.
|
||||
* Optionally called by ucptrie_getRange() or umutablecptrie_getRange().
|
||||
* The modified value will be returned by the getRange function.
|
||||
*
|
||||
* Can be used to ignore some of the value bits,
|
||||
* make a filter for one of several values,
|
||||
* return a value index computed from the trie value, etc.
|
||||
*
|
||||
* @param context an opaque pointer, as passed into the getRange function
|
||||
* @param value a value from the trie
|
||||
* @return the modified value
|
||||
* @draft ICU 63
|
||||
*/
|
||||
typedef uint32_t U_CALLCONV
|
||||
UCPTrieValueFilter(const void *context, uint32_t value);
|
||||
|
||||
/**
|
||||
* Returns the last code point such that all those from start to there have the same value.
|
||||
* Can be used to efficiently iterate over all same-value ranges in a trie.
|
||||
*
|
||||
* If the UCPTrieValueFilter function pointer is not NULL, then
|
||||
* the value to be delivered is passed through that function, and the return value is the end
|
||||
* of the range where all values are modified to the same actual value.
|
||||
* The value is unchanged if that function pointer is NULL.
|
||||
*
|
||||
* Example:
|
||||
* \code
|
||||
* UChar32 start = 0, end;
|
||||
* uint32_t value;
|
||||
* while ((end = ucptrie_getRange(trie, start, UCPTRIE_RANGE_NORMAL, 0,
|
||||
* NULL, NULL, &value)) >= 0) {
|
||||
* // Work with the range start..end and its value.
|
||||
* start = end + 1;
|
||||
* }
|
||||
* \endcode
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param start range start
|
||||
* @param option defines whether surrogates are treated normally,
|
||||
* or as having the surrogateValue; usually UCPTRIE_RANGE_NORMAL
|
||||
* @param surrogateValue value for surrogates; ignored if option==UCPTRIE_RANGE_NORMAL
|
||||
* @param filter a pointer to a function that may modify the trie data value,
|
||||
* or NULL if the values from the trie are to be used unmodified
|
||||
* @param context an opaque pointer that is passed on to the filter function
|
||||
* @param pValue if not NULL, receives the value that every code point start..end has;
|
||||
* may have been modified by filter(context, trie value)
|
||||
* if that function pointer is not NULL
|
||||
* @return the range end code point, or -1 if start is not a valid code point
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UChar32 U_EXPORT2
|
||||
ucptrie_getRange(const UCPTrie *trie, UChar32 start,
|
||||
UCPTrieRangeOption option, uint32_t surrogateValue,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
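The contract documented for ucptrie_getRange() (return the last position of a same-value run starting at start, optionally after passing each value through a filter, or -1 for an invalid start) can be sketched over a plain array. The sketch below is not ICU code: `toy_getRange()`, `toy_lowBit()` and `toy_countRanges()` are hypothetical, and the linear scan is a simplification. The real function walks the trie structure and can skip shared null blocks quickly.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as the value filter callback: maps a stored value to a reported value. */
typedef uint32_t ToyValueFilter(const void *context, uint32_t value);

/* Returns the last index "end" such that all filtered values in start..end are equal,
 * or -1 if start is out of range. Hypothetical stand-in operating on a flat array. */
int32_t toy_getRange(const uint32_t *values, int32_t length, int32_t start,
                     ToyValueFilter *filter, const void *context, uint32_t *pValue) {
    if (start < 0 || start >= length) { return -1; }
    uint32_t value = values[start];
    if (filter != NULL) { value = filter(context, value); }
    int32_t end = start;
    while (end + 1 < length) {
        uint32_t next = values[end + 1];
        if (filter != NULL) { next = filter(context, next); }
        if (next != value) { break; }
        ++end;
    }
    if (pValue != NULL) { *pValue = value; }
    return end;
}

/* Example filter: ignore all but the lowest value bit. */
uint32_t toy_lowBit(const void *context, uint32_t value) {
    (void)context;
    return value & 1;
}

/* Iterates over all ranges with the same loop shape as the documented example. */
int32_t toy_countRanges(const uint32_t *values, int32_t length, ToyValueFilter *filter) {
    int32_t count = 0, start = 0, end;
    uint32_t value;
    while ((end = toy_getRange(values, length, start, filter, NULL, &value)) >= 0) {
        ++count;
        start = end + 1;
    }
    return count;
}
```

With a filter that discards value bits, adjacent runs whose filtered values coincide merge into one reported range, which is why the filtered iteration can return fewer ranges than the unfiltered one.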
|
||||
|
||||
/**
|
||||
* Writes a memory-mappable form of the trie into 32-bit aligned memory.
|
||||
* Inverse of ucptrie_openFromBinary().
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param data a pointer to 32-bit-aligned memory to be filled with the trie data;
|
||||
* can be NULL if capacity==0
|
||||
* @param capacity the number of bytes available at data, or 0 for pure preflighting
|
||||
* @param pErrorCode an in/out ICU UErrorCode;
|
||||
* U_BUFFER_OVERFLOW_ERROR if the capacity is too small
|
||||
* @return the number of bytes written or (if buffer overflow) needed for the trie
|
||||
*
|
||||
* @see ucptrie_openFromBinary()
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_toBinary(const UCPTrie *trie, void *data, int32_t capacity, UErrorCode *pErrorCode);
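The capacity==0 "pure preflighting" mode described above suggests a two-pass pattern: ask for the needed size, allocate, then write. The following sketch shows that pattern with a hypothetical stand-in serializer. `toy_serialize()` and `toy_serializeAlloc()` are not ICU functions, and the int overflow flag merely plays the role that U_BUFFER_OVERFLOW_ERROR plays in the real API. With the real function, the error code set by the preflight call would need to be reset before the second call.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a preflighting serializer:
 * always returns the number of bytes needed; writes only if they fit. */
int32_t toy_serialize(const void *src, int32_t length,
                      void *dest, int32_t capacity, int *pOverflow) {
    if (capacity < length) {
        *pOverflow = 1;    /* analogous to a buffer overflow error */
        return length;
    }
    *pOverflow = 0;
    if (length > 0) { memcpy(dest, src, (size_t)length); }
    return length;
}

/* Two-pass use: preflight with capacity 0, allocate, then serialize for real.
 * The caller must free() the returned buffer. Returns NULL on failure. */
void *toy_serializeAlloc(const void *src, int32_t length, int32_t *pLength) {
    int overflow;
    int32_t needed = toy_serialize(src, length, NULL, 0, &overflow);
    void *buffer = malloc(needed > 0 ? (size_t)needed : 1);
    if (buffer == NULL) { return NULL; }
    toy_serialize(src, length, buffer, needed, &overflow);
    if (overflow) {
        free(buffer);
        return NULL;
    }
    *pLength = needed;
    return buffer;
}
```

Memory from malloc() is suitably aligned for any object type, so a buffer obtained this way would also satisfy a 32-bit alignment requirement.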
|
||||
|
||||
/**
|
||||
* Macro parameter value for a trie with 16-bit data values.
|
||||
* Use the name of this macro as a "dataAccess" parameter in other macros.
|
||||
* Do not use this macro in any other way.
|
||||
*
|
||||
* @see UCPTRIE_VALUE_BITS_16
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_16(trie, i) ((trie)->data.ptr16[i])
|
||||
|
||||
/**
|
||||
* Macro parameter value for a trie with 32-bit data values.
|
||||
* Use the name of this macro as a "dataAccess" parameter in other macros.
|
||||
* Do not use this macro in any other way.
|
||||
*
|
||||
* @see UCPTRIE_VALUE_BITS_32
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_32(trie, i) ((trie)->data.ptr32[i])
|
||||
|
||||
/**
|
||||
* Macro parameter value for a trie with 8-bit data values.
|
||||
* Use the name of this macro as a "dataAccess" parameter in other macros.
|
||||
* Do not use this macro in any other way.
|
||||
*
|
||||
* @see UCPTRIE_VALUE_BITS_8
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_8(trie, i) ((trie)->data.ptr8[i])
|
||||
|
||||
/**
|
||||
* Returns a trie value for a code point, with range checking.
|
||||
* Returns the trie error value if c is not in the range 0..U+10FFFF.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param c (UChar32, in) the input code point
|
||||
* @return The code point's trie value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_GET(trie, dataAccess, c) dataAccess(trie, _UCPTRIE_CP_INDEX(trie, 0xffff, c))
|
||||
|
||||
/**
|
||||
* Returns a trie value for a code point, with range checking.
|
||||
* Returns the trie error value if c is not in the range U+0000..U+10FFFF.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_SMALL
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param c (UChar32, in) the input code point
|
||||
* @return The code point's trie value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_SMALL_GET(trie, dataAccess, c) \
|
||||
dataAccess(trie, _UCPTRIE_CP_INDEX(trie, UCPTRIE_SMALL_MAX, c))
|
||||
|
||||
/**
|
||||
* UTF-16: Reads the next code point (UChar32 c, out), post-increments src,
|
||||
* and gets a value from the trie.
|
||||
* Sets the trie error value if c is an unpaired surrogate.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param src (const UChar *, in/out) the source text pointer
|
||||
* @param limit (const UChar *, in) the limit pointer for the text, or NULL if NUL-terminated
|
||||
* @param c (UChar32, out) variable for the code point
|
||||
* @param result (out) variable for the trie lookup result
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_U16_NEXT(trie, dataAccess, src, limit, c, result) { \
|
||||
(c) = *(src)++; \
|
||||
int32_t __index; \
|
||||
if (!U16_IS_SURROGATE(c)) { \
|
||||
__index = _UCPTRIE_FAST_INDEX(trie, c); \
|
||||
} else { \
|
||||
uint16_t __c2; \
|
||||
if (U16_IS_SURROGATE_LEAD(c) && (src) != (limit) && U16_IS_TRAIL(__c2 = *(src))) { \
|
||||
++(src); \
|
||||
(c) = U16_GET_SUPPLEMENTARY((c), __c2); \
|
||||
__index = _UCPTRIE_SMALL_INDEX(trie, c); \
|
||||
} else { \
|
||||
__index = (trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET; \
|
||||
} \
|
||||
} \
|
||||
(result) = dataAccess(trie, __index); \
|
||||
}
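The surrogate handling in UCPTRIE_FAST_U16_NEXT() can be read in isolation from the trie lookup. The sketch below reproduces only the decoding step in plain C, without the ICU macros. `toy_u16Next()` and `TOY_U16_ERROR` are hypothetical names. Unlike the macro, which leaves the unpaired surrogate in c and selects the trie error value for result, this sketch returns an error marker in place of the code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_U16_ERROR (-1)

/* Reads one code point from UTF-16 text at *pSrc and advances *pSrc.
 * Requires *pSrc != limit. Returns TOY_U16_ERROR for an unpaired surrogate,
 * which is consumed as a single code unit. */
int32_t toy_u16Next(const uint16_t **pSrc, const uint16_t *limit) {
    const uint16_t *src = *pSrc;
    assert(src != limit);
    int32_t c = *src++;
    if ((c & 0xf800) == 0xd800) {                       /* any surrogate code unit */
        if ((c & 0x400) == 0 &&                         /* lead surrogate */
                src != limit && (*src & 0xfc00) == 0xdc00) {  /* followed by a trail */
            c = 0x10000 + ((c - 0xd800) << 10) + (*src++ - 0xdc00);
        } else {
            c = TOY_U16_ERROR;                          /* unpaired lead or trail */
        }
    }
    *pSrc = src;
    return c;
}
```

An unpaired lead surrogate does not consume the following unit, so iteration resumes at that unit, matching the macro, which only increments src past a trail surrogate.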
|
||||
|
||||
/**
|
||||
* UTF-16: Reads the previous code point (UChar32 c, out), pre-decrements src,
|
||||
* and gets a value from the trie.
|
||||
* Sets the trie error value if c is an unpaired surrogate.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param start (const UChar *, in) the start pointer for the text
|
||||
* @param src (const UChar *, in/out) the source text pointer
|
||||
* @param c (UChar32, out) variable for the code point
|
||||
* @param result (out) variable for the trie lookup result
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_U16_PREV(trie, dataAccess, start, src, c, result) { \
|
||||
(c) = *--(src); \
|
||||
int32_t __index; \
|
||||
if (!U16_IS_SURROGATE(c)) { \
|
||||
__index = _UCPTRIE_FAST_INDEX(trie, c); \
|
||||
} else { \
|
||||
uint16_t __c2; \
|
||||
if (U16_IS_SURROGATE_TRAIL(c) && (src) != (start) && U16_IS_LEAD(__c2 = *((src) - 1))) { \
|
||||
--(src); \
|
||||
(c) = U16_GET_SUPPLEMENTARY(__c2, (c)); \
|
||||
__index = _UCPTRIE_SMALL_INDEX(trie, c); \
|
||||
} else { \
|
||||
__index = (trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET; \
|
||||
} \
|
||||
} \
|
||||
(result) = dataAccess(trie, __index); \
|
||||
}
|
||||
|
||||
/**
|
||||
* UTF-8: Post-increments src and gets a value from the trie.
|
||||
* Sets the trie error value for an ill-formed byte sequence.
|
||||
*
|
||||
* Unlike UCPTRIE_FAST_U16_NEXT() this UTF-8 macro does not provide the code point
|
||||
* because it would be more work to do so and is often not needed.
|
||||
* If the trie value differs from the error value, then the byte sequence is well-formed,
|
||||
* and the code point can be assembled without revalidation.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param src (const char *, in/out) the source text pointer
|
||||
* @param limit (const char *, in) the limit pointer for the text (must not be NULL)
|
||||
* @param result (out) variable for the trie lookup result
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_U8_NEXT(trie, dataAccess, src, limit, result) { \
|
||||
int32_t __lead = (uint8_t)*(src)++; \
|
||||
if (!U8_IS_SINGLE(__lead)) { \
|
||||
uint8_t __t1, __t2, __t3; \
|
||||
if ((src) != (limit) && \
|
||||
(__lead >= 0xe0 ? \
|
||||
__lead < 0xf0 ? /* U+0800..U+FFFF except surrogates */ \
|
||||
U8_LEAD3_T1_BITS[__lead &= 0xf] & (1 << ((__t1 = *(src)) >> 5)) && \
|
||||
++(src) != (limit) && (__t2 = *(src) - 0x80) <= 0x3f && \
|
||||
(__lead = ((int32_t)(trie)->index[(__lead << 6) + (__t1 & 0x3f)]) + __t2, 1) \
|
||||
: /* U+10000..U+10FFFF */ \
|
||||
(__lead -= 0xf0) <= 4 && \
|
||||
U8_LEAD4_T1_BITS[(__t1 = *(src)) >> 4] & (1 << __lead) && \
|
||||
(__lead = (__lead << 6) | (__t1 & 0x3f), ++(src) != (limit)) && \
|
||||
(__t2 = *(src) - 0x80) <= 0x3f && \
|
||||
++(src) != (limit) && (__t3 = *(src) - 0x80) <= 0x3f && \
|
||||
(__lead = __lead >= (trie)->shifted12HighStart ? \
|
||||
(trie)->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET : \
|
||||
ucptrie_internalSmallU8Index((trie), __lead, __t2, __t3), 1) \
|
||||
: /* U+0080..U+07FF */ \
|
||||
__lead >= 0xc2 && (__t1 = *(src) - 0x80) <= 0x3f && \
|
||||
(__lead = (int32_t)(trie)->index[__lead & 0x1f] + __t1, 1))) { \
|
||||
++(src); \
|
||||
} else { \
|
||||
__lead = (trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET; /* ill-formed */ \
|
||||
} \
|
||||
} \
|
||||
(result) = dataAccess(trie, __lead); \
|
||||
}
|
||||
|
||||
/**
|
||||
* UTF-8: Pre-decrements src and gets a value from the trie.
|
||||
* Sets the trie error value for an ill-formed byte sequence.
|
||||
*
|
||||
* Unlike UCPTRIE_FAST_U16_PREV() this UTF-8 macro does not provide the code point
|
||||
* because it would be more work to do so and is often not needed.
|
||||
* If the trie value differs from the error value, then the byte sequence is well-formed,
|
||||
* and the code point can be assembled without revalidation.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param start (const char *, in) the start pointer for the text
|
||||
* @param src (const char *, in/out) the source text pointer
|
||||
* @param result (out) variable for the trie lookup result
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_U8_PREV(trie, dataAccess, start, src, result) { \
|
||||
int32_t __index = (uint8_t)*--(src); \
|
||||
if (!U8_IS_SINGLE(__index)) { \
|
||||
__index = ucptrie_internalU8PrevIndex((trie), __index, (const uint8_t *)(start), \
|
||||
(const uint8_t *)(src)); \
|
||||
(src) -= __index & 7; \
|
||||
__index >>= 3; \
|
||||
} \
|
||||
(result) = dataAccess(trie, __index); \
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns a trie value for an ASCII code point, without range checking.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie (of either fast or small type)
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param c (UChar32, in) the input code point; must be U+0000..U+007F
|
||||
* @return The ASCII code point's trie value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_ASCII_GET(trie, dataAccess, c) dataAccess(trie, c)
|
||||
|
||||
/**
|
||||
* Returns a trie value for a BMP code point (U+0000..U+FFFF), without range checking.
|
||||
* Can be used to look up a value for a UTF-16 code unit if other parts of
|
||||
* the string processing check for surrogates.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param c (UChar32, in) the input code point, must be U+0000..U+FFFF
|
||||
* @return The BMP code point's trie value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_BMP_GET(trie, dataAccess, c) dataAccess(trie, _UCPTRIE_FAST_INDEX(trie, c))
|
||||
|
||||
/**
|
||||
* Returns a trie value for a supplementary code point (U+10000..U+10FFFF),
|
||||
* without range checking.
|
||||
*
|
||||
* @param trie (const UCPTrie *, in) the trie; must have type UCPTRIE_TYPE_FAST
|
||||
* @param dataAccess UCPTRIE_16, UCPTRIE_32, or UCPTRIE_8 according to the trie’s value width
|
||||
* @param c (UChar32, in) the input code point, must be U+10000..U+10FFFF
|
||||
* @return The supplementary code point's trie value.
|
||||
* @draft ICU 63
|
||||
*/
|
||||
#define UCPTRIE_FAST_SUPP_GET(trie, dataAccess, c) dataAccess(trie, _UCPTRIE_SMALL_INDEX(trie, c))
|
||||
|
||||
/* Internal definitions ----------------------------------------------------- */
|
||||
|
||||
/** @internal */
|
||||
typedef union UCPTrieData {
|
||||
/** @internal */
|
||||
const void *ptr0;
|
||||
/** @internal */
|
||||
const uint16_t *ptr16;
|
||||
/** @internal */
|
||||
const uint32_t *ptr32;
|
||||
/** @internal */
|
||||
const uint8_t *ptr8;
|
||||
} UCPTrieData;
|
||||
|
||||
/**
|
||||
* Internal trie structure definition.
|
||||
* Visible only for use by API macros.
|
||||
* @internal
|
||||
*/
|
||||
struct UCPTrie {
|
||||
/** @internal */
|
||||
const uint16_t *index;
|
||||
/** @internal */
|
||||
UCPTrieData data;
|
||||
|
||||
/** @internal */
|
||||
int32_t indexLength;
|
||||
/** @internal */
|
||||
int32_t dataLength;
|
||||
/** Start of the last range which ends at U+10FFFF. @internal */
|
||||
UChar32 highStart;
|
||||
/** highStart>>12 @internal */
|
||||
uint16_t shifted12HighStart;
|
||||
|
||||
/** @internal */
|
||||
int8_t type; // UCPTrieType
|
||||
/** @internal */
|
||||
int8_t valueWidth; // UCPTrieValueWidth
|
||||
|
||||
/** padding/reserved @internal */
|
||||
uint32_t reserved32;
|
||||
/** padding/reserved @internal */
|
||||
uint16_t reserved16;
|
||||
|
||||
/**
|
||||
* Internal index-3 null block offset.
|
||||
* Set to an impossibly high value (e.g., 0xffff) if there is no dedicated index-3 null block.
|
||||
* @internal
|
||||
*/
|
||||
uint16_t index3NullOffset;
|
||||
/**
|
||||
* Internal data null block offset, not shifted.
|
||||
* Set to an impossibly high value (e.g., 0xfffff) if there is no dedicated data null block.
|
||||
* @internal
|
||||
*/
|
||||
int32_t dataNullOffset;
|
||||
/** @internal */
|
||||
uint32_t nullValue;
|
||||
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
/** @internal */
|
||||
const char *name;
|
||||
#endif
|
||||
};
|
||||
|
||||
/**
|
||||
* Internal implementation constants.
|
||||
* These are needed for the API macros, but users should not use these directly.
|
||||
* @internal
|
||||
*/
|
||||
enum {
|
||||
/** @internal */
|
||||
UCPTRIE_FAST_SHIFT = 6,
|
||||
|
||||
/** Number of entries in a data block for code points below the fast limit. 64=0x40 @internal */
|
||||
UCPTRIE_FAST_DATA_BLOCK_LENGTH = 1 << UCPTRIE_FAST_SHIFT,
|
||||
|
||||
/** Mask for getting the lower bits for the in-fast-data-block offset. @internal */
|
||||
UCPTRIE_FAST_DATA_MASK = UCPTRIE_FAST_DATA_BLOCK_LENGTH - 1,
|
||||
|
||||
/** @internal */
|
||||
UCPTRIE_SMALL_MAX = 0xfff,
|
||||
|
||||
/**
|
||||
* Offset from dataLength (to be subtracted) for fetching the
|
||||
* value returned for out-of-range code points and ill-formed UTF-8/16.
|
||||
* @internal
|
||||
*/
|
||||
UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET = 1,
|
||||
/**
|
||||
* Offset from dataLength (to be subtracted) for fetching the
|
||||
* value returned for code points highStart..U+10FFFF.
|
||||
* @internal
|
||||
*/
|
||||
UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET = 2
|
||||
};
|
||||
|
||||
/* Internal functions and macros -------------------------------------------- */
|
||||
|
||||
/** @internal */
|
||||
U_INTERNAL int32_t U_EXPORT2
|
||||
ucptrie_internalSmallIndex(const UCPTrie *trie, UChar32 c);
|
||||
|
||||
/** @internal */
|
||||
U_INTERNAL int32_t U_EXPORT2
|
||||
ucptrie_internalSmallU8Index(const UCPTrie *trie, int32_t lt1, uint8_t t2, uint8_t t3);
|
||||
|
||||
/**
|
||||
* Internal function for part of the UCPTRIE_FAST_U8_PREV() macro implementation.
|
||||
* Do not call directly.
|
||||
* @internal
|
||||
*/
|
||||
U_INTERNAL int32_t U_EXPORT2
|
||||
ucptrie_internalU8PrevIndex(const UCPTrie *trie, UChar32 c,
|
||||
const uint8_t *start, const uint8_t *src);
|
||||
|
||||
/** Internal trie getter for a code point below the fast limit. Returns the data index. @internal */
|
||||
#define _UCPTRIE_FAST_INDEX(trie, c) \
|
||||
((int32_t)(trie)->index[(c) >> UCPTRIE_FAST_SHIFT] + ((c) & UCPTRIE_FAST_DATA_MASK))
|
||||
|
||||
/** Internal trie getter for a code point at or above the fast limit. Returns the data index. @internal */
|
||||
#define _UCPTRIE_SMALL_INDEX(trie, c) \
|
||||
((c) >= (trie)->highStart ? \
|
||||
(trie)->dataLength - UCPTRIE_HIGH_VALUE_NEG_DATA_OFFSET : \
|
||||
ucptrie_internalSmallIndex(trie, c))
|
||||
|
||||
/**
|
||||
* Internal trie getter for a code point, with checking that c is in U+0000..U+10FFFF.
|
||||
* Returns the data index.
|
||||
* @internal
|
||||
*/
|
||||
#define _UCPTRIE_CP_INDEX(trie, fastMax, c) \
|
||||
((uint32_t)(c) <= (uint32_t)(fastMax) ? \
|
||||
_UCPTRIE_FAST_INDEX(trie, c) : \
|
||||
(uint32_t)(c) <= 0x10ffff ? \
|
||||
_UCPTRIE_SMALL_INDEX(trie, c) : \
|
||||
(trie)->dataLength - UCPTRIE_ERROR_VALUE_NEG_DATA_OFFSET)
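The branch structure of _UCPTRIE_CP_INDEX() and _UCPTRIE_SMALL_INDEX() can be sketched as a small classifier. The unsigned casts let a single comparison reject negative inputs, because a negative int32_t converts to a very large uint32_t. `ToyPath`, `toy_classify()` and `toy_specialIndex()` below are hypothetical illustration names, not ICU API. The tail offsets 1 and 2 mirror the two NEG_DATA_OFFSET constants defined above.

```c
#include <stdint.h>

typedef enum ToyPath {
    TOY_PATH_FAST,    /* index[c >> 6] + (c & 0x3f) */
    TOY_PATH_SMALL,   /* multi-stage lookup below highStart */
    TOY_PATH_HIGH,    /* shared value for highStart..U+10FFFF */
    TOY_PATH_ERROR    /* not a code point */
} ToyPath;

/* Mirrors the order of the checks in the macros above. */
ToyPath toy_classify(int32_t c, int32_t fastMax, int32_t highStart) {
    if ((uint32_t)c <= (uint32_t)fastMax) {
        return TOY_PATH_FAST;          /* negative c fails here: it converts to a huge value */
    }
    if ((uint32_t)c <= 0x10ffff) {
        return c >= highStart ? TOY_PATH_HIGH : TOY_PATH_SMALL;
    }
    return TOY_PATH_ERROR;
}

/* Data index of the two special values stored at the end of the data array.
 * Returns -1 for paths that need a real table lookup. */
int32_t toy_specialIndex(int32_t dataLength, ToyPath path) {
    switch (path) {
    case TOY_PATH_ERROR: return dataLength - 1;   /* error value: last entry */
    case TOY_PATH_HIGH:  return dataLength - 2;   /* high value: second to last */
    default:             return -1;
    }
}
```

Passing 0xffff as fastMax corresponds to the "fast" type and 0xfff to the "small" type, so a code point such as U+1000 takes the fast path in one and the small path in the other.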
|
||||
|
||||
U_CDECL_END
|
||||
|
||||
#endif
|
215
icu4c/source/common/unicode/umutablecptrie.h
Normal file
|
@ -0,0 +1,215 @@
|
|||
// © 2017 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
// umutablecptrie.h (split out of ucptrie.h)
|
||||
// created: 2018jan24 Markus W. Scherer
|
||||
|
||||
#ifndef __UMUTABLECPTRIE_H__
|
||||
#define __UMUTABLECPTRIE_H__
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/localpointer.h"
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "unicode/utf8.h"
|
||||
#include "putilimp.h"
|
||||
#include "udataswp.h"
|
||||
|
||||
U_CDECL_BEGIN
|
||||
|
||||
/**
|
||||
* \file
|
||||
*
|
||||
* This file defines a mutable Unicode code point trie.
|
||||
*
|
||||
* @see UCPTrie
|
||||
* @see UMutableCPTrie
|
||||
*/
|
||||
|
||||
/**
|
||||
* Mutable Unicode code point trie.
|
||||
* Fast map from Unicode code points (U+0000..U+10FFFF) to 32-bit integer values.
|
||||
* For details see http://site.icu-project.org/design/struct/utrie
|
||||
*
|
||||
* Setting values (especially ranges) and looking them up are fast.
|
||||
* The mutable trie is only somewhat space-efficient.
|
||||
* It builds a compacted, immutable UCPTrie.
|
||||
*
|
||||
* This trie can be modified while iterating over its contents.
|
||||
* For example, it is possible to merge its values with those from another
|
||||
* set of ranges (e.g., another mutable or immutable trie):
|
||||
* Iterate over those source ranges; for each of them iterate over this trie;
|
||||
* add the source value into the value of each trie range.
|
||||
*
|
||||
* @see UCPTrie
|
||||
* @see umutablecptrie_buildImmutable
|
||||
* @draft ICU 63
|
||||
*/
|
||||
struct UMutableCPTrie;
|
||||
typedef struct UMutableCPTrie UMutableCPTrie;
|
||||
|
||||
/**
|
||||
* Creates a mutable trie that initially maps each Unicode code point to the same value.
|
||||
* It uses 32-bit data values until umutablecptrie_buildImmutable() is called.
|
||||
* umutablecptrie_buildImmutable() takes a valueWidth parameter which
|
||||
* determines the number of bits in the data value in the resulting UCPTrie.
|
||||
* You must umutablecptrie_close() the trie once you are done using it.
|
||||
*
|
||||
* @param initialValue the initial value that is set for all code points
|
||||
* @param errorValue the value for out-of-range code points and ill-formed UTF-8/16
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
* @return the trie
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UMutableCPTrie * U_EXPORT2
|
||||
umutablecptrie_open(uint32_t initialValue, uint32_t errorValue, UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Clones a mutable trie.
|
||||
* You must umutablecptrie_close() the clone once you are done using it.
|
||||
*
|
||||
* @param other the trie to clone
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
* @return the trie clone
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UMutableCPTrie * U_EXPORT2
|
||||
umutablecptrie_clone(const UMutableCPTrie *other, UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Closes a mutable trie and releases associated memory.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI void U_EXPORT2
|
||||
umutablecptrie_close(UMutableCPTrie *trie);
|
||||
|
||||
#if U_SHOW_CPLUSPLUS_API
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
||||
/**
|
||||
* \class LocalUMutableCPTriePointer
|
||||
* "Smart pointer" class, closes a UMutableCPTrie via umutablecptrie_close().
|
||||
* For most methods see the LocalPointerBase base class.
|
||||
*
|
||||
* @see LocalPointerBase
|
||||
* @see LocalPointer
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_DEFINE_LOCAL_OPEN_POINTER(LocalUMutableCPTriePointer, UMutableCPTrie, umutablecptrie_close);
|
||||
|
||||
U_NAMESPACE_END
|
||||
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Creates a mutable trie with the same contents as the immutable one.
|
||||
* You must umutablecptrie_close() the mutable trie once you are done using it.
|
||||
*
|
||||
* @param trie the immutable trie
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
* @return the mutable trie
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UMutableCPTrie * U_EXPORT2
|
||||
umutablecptrie_fromUCPTrie(const UCPTrie *trie, UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Returns the value for a code point as stored in the trie.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param c the code point
|
||||
* @return the value
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI uint32_t U_EXPORT2
|
||||
umutablecptrie_get(const UMutableCPTrie *trie, UChar32 c);
|
||||
|
||||
/**
|
||||
* Returns the last code point such that all those from start to there have the same value.
|
||||
* Can be used to efficiently iterate over all same-value ranges in a trie.
|
||||
* The trie can be modified between calls to this function.
|
||||
*
|
||||
* If the UCPTrieValueFilter function pointer is not NULL, then
|
||||
* the value to be delivered is passed through that function, and the return value is the end
|
||||
* of the range where all values are modified to the same actual value.
|
||||
* The value is unchanged if that function pointer is NULL.
|
||||
*
|
||||
* See the same-signature ucptrie_getRange() for a code sample.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param start range start
|
||||
* @param option defines whether surrogates are treated normally,
|
||||
* or as having the surrogateValue; usually UCPTRIE_RANGE_NORMAL
|
||||
* @param surrogateValue value for surrogates; ignored if option==UCPTRIE_RANGE_NORMAL
|
||||
* @param filter a pointer to a function that may modify the trie data value,
|
||||
* or NULL if the values from the trie are to be used unmodified
|
||||
* @param context an opaque pointer that is passed on to the filter function
|
||||
* @param pValue if not NULL, receives the value that every code point start..end has;
|
||||
* may have been modified by filter(context, trie value)
|
||||
* if that function pointer is not NULL
|
||||
* @return the range end code point, or -1 if start is not a valid code point
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UChar32 U_EXPORT2
|
||||
umutablecptrie_getRange(const UMutableCPTrie *trie, UChar32 start,
|
||||
UCPTrieRangeOption option, uint32_t surrogateValue,
|
||||
UCPTrieValueFilter *filter, const void *context, uint32_t *pValue);
|
||||
|
||||
/**
|
||||
* Sets a value for a code point.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param c the code point
|
||||
* @param value the value
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI void U_EXPORT2
|
||||
umutablecptrie_set(UMutableCPTrie *trie, UChar32 c, uint32_t value, UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Sets a value for each code point [start..end].
|
||||
* Faster and more space-efficient than setting the value for each code point separately.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param start the first code point to be set to the value
|
||||
* @param end the last code point to be set to the value (inclusive)
|
||||
* @param value the value
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI void U_EXPORT2
|
||||
umutablecptrie_setRange(UMutableCPTrie *trie,
|
||||
UChar32 start, UChar32 end,
|
||||
uint32_t value, UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Compacts the data and builds an immutable UCPTrie according to the parameters.
|
||||
* After this, the mutable trie will be empty.
|
||||
*
|
||||
* Not every possible set of mappings can be built into a UCPTrie,
|
||||
* because of limitations resulting from speed and space optimizations.
|
||||
* Every Unicode assigned character can be mapped to a unique value.
|
||||
* Typical data yields data structures far smaller than the limitations.
|
||||
*
|
||||
* It is possible to construct extremely unusual mappings that exceed the data structure limits.
|
||||
* In such a case this function will fail with a U_INDEX_OUTOFBOUNDS_ERROR.
|
||||
*
|
||||
* @param trie the trie
|
||||
* @param type selects the trie type
|
||||
* @param valueWidth selects the number of bits in a trie data value; if smaller than 32 bits,
|
||||
* then the values stored in the trie will be truncated first
|
||||
* @param pErrorCode an in/out ICU UErrorCode
|
||||
*
|
||||
* @see umutablecptrie_fromUCPTrie
|
||||
* @draft ICU 63
|
||||
*/
|
||||
U_CAPI UCPTrie * U_EXPORT2
|
||||
umutablecptrie_buildImmutable(UMutableCPTrie *trie, UCPTrieType type, UCPTrieValueWidth valueWidth,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
U_CDECL_END
|
||||
|
||||
#endif
|
|
@ -21,7 +21,6 @@
|
|||
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/utf16.h"
|
||||
#include "udataswp.h"
|
||||
|
||||
U_CDECL_BEGIN
|
||||
|
||||
|
@ -732,17 +731,13 @@ utrie_serialize(UNewTrie *trie, void *data, int32_t capacity,
|
|||
UBool reduceTo16Bits,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Swap a serialized UTrie.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/* serialization ------------------------------------------------------------ */
|
||||
|
||||
// UTrie signature values, in platform endianness and opposite endianness.
|
||||
// The UTrie signature ASCII byte values spell "Trie".
|
||||
#define UTRIE_SIG 0x54726965
|
||||
#define UTRIE_OE_SIG 0x65697254
|
||||
|
||||
/**
|
||||
* Trie data structure in serialized form:
|
||||
*
|
||||
|
|
|
@ -24,11 +24,10 @@
|
|||
* This file contains only the runtime and enumeration code, for read-only access.
|
||||
* See utrie2_builder.c for the builder code.
|
||||
*/
|
||||
#ifdef UTRIE2_DEBUG
|
||||
# include <stdio.h>
|
||||
#endif
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#endif
|
||||
#include "unicode/utf.h"
|
||||
#include "unicode/utf8.h"
|
||||
#include "unicode/utf16.h"
|
||||
|
@ -202,6 +201,9 @@ utrie2_openFromSerialized(UTrie2ValueBits valueBits,
|
|||
trie->memory=(uint32_t *)data;
|
||||
trie->length=actualLength;
|
||||
trie->isMemoryOwned=FALSE;
|
||||
#ifdef UTRIE2_DEBUG
|
||||
trie->name="fromSerialized";
|
||||
#endif
|
||||
|
||||
/* set the pointers to its index and data arrays */
|
||||
p16=(const uint16_t *)(header+1);
|
||||
|
@ -294,6 +296,9 @@ utrie2_openDummy(UTrie2ValueBits valueBits,
|
|||
trie->errorValue=errorValue;
|
||||
trie->highStart=0;
|
||||
trie->highValueIndex=dataMove+UTRIE2_DATA_START_OFFSET;
|
||||
#ifdef UTRIE2_DEBUG
|
||||
trie->name="dummy";
|
||||
#endif
|
||||
|
||||
/* set the header fields */
|
||||
header=(UTrie2Header *)trie->memory;
|
||||
|
@ -373,34 +378,15 @@ utrie2_close(UTrie2 *trie) {
|
|||
}
|
||||
if(trie->newTrie!=NULL) {
|
||||
uprv_free(trie->newTrie->data);
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
umutablecptrie_close(trie->newTrie->t3);
|
||||
#endif
|
||||
uprv_free(trie->newTrie);
|
||||
}
|
||||
uprv_free(trie);
|
||||
}
|
||||
}
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_getVersion(const void *data, int32_t length, UBool anyEndianOk) {
|
||||
uint32_t signature;
|
||||
if(length<16 || data==NULL || (U_POINTER_MASK_LSB(data, 3)!=0)) {
|
||||
return 0;
|
||||
}
|
||||
signature=*(const uint32_t *)data;
|
||||
if(signature==UTRIE2_SIG) {
|
||||
return 2;
|
||||
}
|
||||
if(anyEndianOk && signature==UTRIE2_OE_SIG) {
|
||||
return 2;
|
||||
}
|
||||
if(signature==UTRIE_SIG) {
|
||||
return 1;
|
||||
}
|
||||
if(anyEndianOk && signature==UTRIE_OE_SIG) {
|
||||
return 1;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
U_CAPI UBool U_EXPORT2
|
||||
utrie2_isFrozen(const UTrie2 *trie) {
|
||||
return (UBool)(trie->newTrie==NULL);
|
||||
|
@ -430,96 +416,6 @@ utrie2_serialize(const UTrie2 *trie,
|
|||
return trie->length;
|
||||
}
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
const UTrie2Header *inTrie;
|
||||
UTrie2Header trie;
|
||||
int32_t dataLength, size;
|
||||
UTrie2ValueBits valueBits;
|
||||
|
||||
if(U_FAILURE(*pErrorCode)) {
|
||||
return 0;
|
||||
}
|
||||
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
|
||||
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* setup and swapping */
|
||||
if(length>=0 && length<(int32_t)sizeof(UTrie2Header)) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
inTrie=(const UTrie2Header *)inData;
|
||||
trie.signature=ds->readUInt32(inTrie->signature);
|
||||
trie.options=ds->readUInt16(inTrie->options);
|
||||
trie.indexLength=ds->readUInt16(inTrie->indexLength);
|
||||
trie.shiftedDataLength=ds->readUInt16(inTrie->shiftedDataLength);
|
||||
|
||||
valueBits=(UTrie2ValueBits)(trie.options&UTRIE2_OPTIONS_VALUE_BITS_MASK);
|
||||
dataLength=(int32_t)trie.shiftedDataLength<<UTRIE2_INDEX_SHIFT;
|
||||
|
||||
if( trie.signature!=UTRIE2_SIG ||
|
||||
valueBits<0 || UTRIE2_COUNT_VALUE_BITS<=valueBits ||
|
||||
trie.indexLength<UTRIE2_INDEX_1_OFFSET ||
|
||||
dataLength<UTRIE2_DATA_START_OFFSET
|
||||
) {
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
|
||||
return 0;
|
||||
}
|
||||
|
||||
size=sizeof(UTrie2Header)+trie.indexLength*2;
|
||||
switch(valueBits) {
|
||||
case UTRIE2_16_VALUE_BITS:
|
||||
size+=dataLength*2;
|
||||
break;
|
||||
case UTRIE2_32_VALUE_BITS:
|
||||
size+=dataLength*4;
|
||||
break;
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
if(length>=0) {
|
||||
UTrie2Header *outTrie;
|
||||
|
||||
if(length<size) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
outTrie=(UTrie2Header *)outData;
|
||||
|
||||
/* swap the header */
|
||||
ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
|
||||
ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);
|
||||
|
||||
/* swap the index and the data */
|
||||
switch(valueBits) {
|
||||
case UTRIE2_16_VALUE_BITS:
|
||||
ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
|
||||
break;
|
||||
case UTRIE2_32_VALUE_BITS:
|
||||
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
|
||||
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
|
||||
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
|
||||
break;
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
return size;
|
||||
}
|
||||
|
||||
// utrie2_swapAnyVersion() should be defined here but lives in utrie2_builder.c
|
||||
// to avoid a dependency from utrie2.cpp on utrie.c.
|
||||
|
||||
/* enumeration -------------------------------------------------------------- */
|
||||
|
||||
#define MIN_VALUE(a, b) ((a)<(b) ? (a) : (b))
|
||||
|
|
|
@ -22,7 +22,6 @@
|
|||
#include "unicode/utypes.h"
|
||||
#include "unicode/utf8.h"
|
||||
#include "putilimp.h"
|
||||
#include "udataswp.h"
|
||||
|
||||
U_CDECL_BEGIN
|
||||
|
||||
|
@ -330,40 +329,6 @@ utrie2_serialize(const UTrie2 *trie,
|
|||
|
||||
/* Public UTrie2 API: miscellaneous functions ------------------------------- */
|
||||
|
||||
/**
|
||||
* Get the UTrie version from 32-bit-aligned memory containing the serialized form
|
||||
* of either a UTrie (version 1) or a UTrie2 (version 2).
|
||||
*
|
||||
* @param data a pointer to 32-bit-aligned memory containing the serialized form
|
||||
* of a UTrie, version 1 or 2
|
||||
* @param length the number of bytes available at data;
|
||||
* can be more than necessary (see return value)
|
||||
* @param anyEndianOk If FALSE, only platform-endian serialized forms are recognized.
|
||||
* If TRUE, opposite-endian serialized forms are recognized as well.
|
||||
* @return the UTrie version of the serialized form, or 0 if it is not
|
||||
* recognized as a serialized UTrie
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_getVersion(const void *data, int32_t length, UBool anyEndianOk);
|
||||
|
||||
/**
|
||||
* Swap a serialized UTrie2.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Swap a serialized UTrie or UTrie2.
|
||||
* @internal
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_swapAnyVersion(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode);
|
||||
|
||||
/**
|
||||
* Build a UTrie2 (version 2) from a UTrie (version 1).
|
||||
* Enumerates all values in the UTrie and builds a UTrie2 with the same values.
|
||||
|
@ -709,6 +674,10 @@ struct UTrie2 {
|
|||
UBool padding1;
|
||||
int16_t padding2;
|
||||
UNewTrie2 *newTrie; /* builder object; NULL when frozen */
|
||||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
const char *name;
|
||||
#endif
|
||||
};
|
||||
|
||||
/**
|
||||
|
|
|
@ -24,16 +24,23 @@
|
|||
* This file contains only the builder code.
|
||||
* See utrie2.c for the runtime and enumeration code.
|
||||
*/
|
||||
// #define UTRIE2_DEBUG
|
||||
#ifdef UTRIE2_DEBUG
|
||||
# include <stdio.h>
|
||||
#endif
|
||||
// #define UCPTRIE_DEBUG
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#include "ucptrie_impl.h"
|
||||
#endif
|
||||
#include "cmemory.h"
|
||||
#include "utrie2.h"
|
||||
#include "utrie2_impl.h"
|
||||
|
||||
#include "utrie.h" /* for utrie2_fromUTrie() and utrie_swap() */
|
||||
#include "utrie.h" // for utrie2_fromUTrie()
|
||||
|
||||
/* Implementation notes ----------------------------------------------------- */
|
||||
|
||||
|
@ -132,8 +139,14 @@ utrie2_open(uint32_t initialValue, uint32_t errorValue, UErrorCode *pErrorCode)
|
|||
trie->errorValue=errorValue;
|
||||
trie->highStart=0x110000;
|
||||
trie->newTrie=newTrie;
|
||||
#ifdef UTRIE2_DEBUG
|
||||
trie->name="open";
|
||||
#endif
|
||||
|
||||
newTrie->data=data;
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
newTrie->t3=umutablecptrie_open(initialValue, errorValue, pErrorCode);
|
||||
#endif
|
||||
newTrie->dataCapacity=UNEWTRIE2_INITIAL_DATA_LENGTH;
|
||||
newTrie->initialValue=initialValue;
|
||||
newTrie->errorValue=errorValue;
|
||||
|
@ -246,6 +259,14 @@ cloneBuilder(const UNewTrie2 *other) {
|
|||
uprv_free(trie);
|
||||
return NULL;
|
||||
}
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
if(other->t3==nullptr) {
|
||||
trie->t3=nullptr;
|
||||
} else {
|
||||
UErrorCode errorCode=U_ZERO_ERROR;
|
||||
trie->t3=umutablecptrie_clone(other->t3, &errorCode);
|
||||
}
|
||||
#endif
|
||||
trie->dataCapacity=other->dataCapacity;
|
||||
|
||||
/* clone data */
|
||||
|
@ -343,6 +364,22 @@ copyEnumRange(const void *context, UChar32 start, UChar32 end, uint32_t value) {
|
|||
}
|
||||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
static long countInitial(const UTrie2 *trie) {
|
||||
uint32_t initialValue=trie->initialValue;
|
||||
int32_t length=trie->dataLength;
|
||||
long count=0;
|
||||
if(trie->data16!=nullptr) {
|
||||
for(int32_t i=0; i<length; ++i) {
|
||||
if(trie->data16[i]==initialValue) { ++count; }
|
||||
}
|
||||
} else {
|
||||
for(int32_t i=0; i<length; ++i) {
|
||||
if(trie->data32[i]==initialValue) { ++count; }
|
||||
}
|
||||
}
|
||||
return count;
|
||||
}
|
||||
|
||||
static void
|
||||
utrie_printLengths(const UTrie *trie) {
|
||||
long indexLength=trie->indexLength;
|
||||
|
@ -357,8 +394,8 @@ utrie2_printLengths(const UTrie2 *trie, const char *which) {
|
|||
long indexLength=trie->indexLength;
|
||||
long dataLength=(long)trie->dataLength;
|
||||
long totalLength=(long)sizeof(UTrie2Header)+indexLength*2+dataLength*(trie->data32!=NULL ? 4 : 2);
|
||||
printf("**UTrie2Lengths(%s)** index:%6ld data:%6ld serialized:%6ld\n",
|
||||
which, indexLength, dataLength, totalLength);
|
||||
printf("**UTrie2Lengths(%s %s)** index:%6ld data:%6ld countInitial:%6ld serialized:%6ld\n",
|
||||
which, trie->name, indexLength, dataLength, countInitial(trie), totalLength);
|
||||
}
|
||||
#endif
|
||||
|
||||
|
@ -622,6 +659,9 @@ set32(UNewTrie2 *trie,
|
|||
*pErrorCode=U_NO_WRITE_PERMISSION;
|
||||
return;
|
||||
}
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
umutablecptrie_set(trie->t3, c, value, pErrorCode);
|
||||
#endif
|
||||
|
||||
block=getDataBlock(trie, c, forLSCP);
|
||||
if(block<0) {
|
||||
|
@ -717,6 +757,9 @@ utrie2_setRange32(UTrie2 *trie,
|
|||
*pErrorCode=U_NO_WRITE_PERMISSION;
|
||||
return;
|
||||
}
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
umutablecptrie_setRange(newTrie->t3, start, end, value, pErrorCode);
|
||||
#endif
|
||||
if(!overwrite && value==newTrie->initialValue) {
|
||||
return; /* nothing to do */
|
||||
}
|
||||
|
@ -732,7 +775,7 @@ utrie2_setRange32(UTrie2 *trie,
|
|||
return;
|
||||
}
|
||||
|
||||
nextStart=(start+UTRIE2_DATA_BLOCK_LENGTH)&~UTRIE2_DATA_MASK;
|
||||
nextStart=(start+UTRIE2_DATA_MASK)&~UTRIE2_DATA_MASK;
|
||||
if(nextStart<=limit) {
|
||||
fillBlock(newTrie->data+block, start&UTRIE2_DATA_MASK, UTRIE2_DATA_BLOCK_LENGTH,
|
||||
value, newTrie->initialValue, overwrite);
|
||||
|
@ -983,6 +1026,10 @@ findHighStart(UNewTrie2 *trie, uint32_t highValue) {
|
|||
*/
|
||||
static void
|
||||
compactData(UNewTrie2 *trie) {
|
||||
#ifdef UTRIE2_DEBUG
|
||||
int32_t countSame=0, sumOverlaps=0;
|
||||
#endif
|
||||
|
||||
int32_t start, newStart, movedStart;
|
||||
int32_t blockLength, overlap;
|
||||
int32_t i, mapIndex, blockCount;
|
||||
|
@ -1023,6 +1070,9 @@ compactData(UNewTrie2 *trie) {
|
|||
if( (movedStart=findSameDataBlock(trie->data, newStart, start, blockLength))
|
||||
>=0
|
||||
) {
|
||||
#ifdef UTRIE2_DEBUG
|
||||
++countSame;
|
||||
#endif
|
||||
/* found an identical block, set the other block's index value for the current block */
|
||||
for(i=blockCount, mapIndex=start>>UTRIE2_SHIFT_2; i>0; --i) {
|
||||
trie->map[mapIndex++]=movedStart;
|
||||
|
@ -1042,6 +1092,9 @@ compactData(UNewTrie2 *trie) {
|
|||
overlap>0 && !equal_uint32(trie->data+(newStart-overlap), trie->data+start, overlap);
|
||||
overlap-=UTRIE2_DATA_GRANULARITY) {}
|
||||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
sumOverlaps+=overlap;
|
||||
#endif
|
||||
if(overlap>0 || newStart<start) {
|
||||
/* some overlap, or just move the whole block */
|
||||
movedStart=newStart-overlap;
|
||||
|
@ -1081,8 +1134,8 @@ compactData(UNewTrie2 *trie) {
|
|||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
/* we saved some space */
|
||||
printf("compacting UTrie2: count of 32-bit data words %lu->%lu\n",
|
||||
(long)trie->dataLength, (long)newStart);
|
||||
printf("compacting UTrie2: count of 32-bit data words %lu->%lu countSame=%ld sumOverlaps=%ld\n",
|
||||
(long)trie->dataLength, (long)newStart, (long)countSame, (long)sumOverlaps);
|
||||
#endif
|
||||
|
||||
trie->dataLength=newStart;
|
||||
|
@ -1163,7 +1216,7 @@ compactIndex2(UNewTrie2 *trie) {
|
|||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
/* we saved some space */
|
||||
printf("compacting UTrie2: count of 16-bit index-2 words %lu->%lu\n",
|
||||
printf("compacting UTrie2: count of 16-bit index words %lu->%lu\n",
|
||||
(long)trie->index2Length, (long)newStart);
|
||||
#endif
|
||||
|
||||
|
@ -1193,7 +1246,7 @@ compactTrie(UTrie2 *trie, UErrorCode *pErrorCode) {
|
|||
trie->highStart=newTrie->highStart=highStart;
|
||||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
printf("UTrie2: highStart U+%04lx highValue 0x%lx initialValue 0x%lx\n",
|
||||
printf("UTrie2: highStart U+%06lx highValue 0x%lx initialValue 0x%lx\n",
|
||||
(long)highStart, (long)highValue, (long)trie->initialValue);
|
||||
#endif
|
||||
|
||||
|
@ -1211,7 +1264,7 @@ compactTrie(UTrie2 *trie, UErrorCode *pErrorCode) {
|
|||
compactIndex2(newTrie);
|
||||
#ifdef UTRIE2_DEBUG
|
||||
} else {
|
||||
printf("UTrie2: highStart U+%04lx count of 16-bit index-2 words %lu->%lu\n",
|
||||
printf("UTrie2: highStart U+%04lx count of 16-bit index words %lu->%lu\n",
|
||||
(long)highStart, (long)trie->newTrie->index2Length, (long)UTRIE2_INDEX_1_OFFSET);
|
||||
#endif
|
||||
}
|
||||
|
@ -1411,31 +1464,18 @@ utrie2_freeze(UTrie2 *trie, UTrie2ValueBits valueBits, UErrorCode *pErrorCode) {
|
|||
return;
|
||||
}
|
||||
|
||||
#ifdef UTRIE2_DEBUG
|
||||
utrie2_printLengths(trie, "");
|
||||
#endif
|
||||
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
umutablecptrie_setName(newTrie->t3, trie->name);
|
||||
ucptrie_close(
|
||||
umutablecptrie_buildImmutable(
|
||||
newTrie->t3, UCPTRIE_TYPE_FAST, (UCPTrieValueWidth)valueBits, pErrorCode));
|
||||
#endif
|
||||
/* Delete the UNewTrie2. */
|
||||
uprv_free(newTrie->data);
|
||||
uprv_free(newTrie);
|
||||
trie->newTrie=NULL;
|
||||
}
|
||||
|
||||
/*
|
||||
* This is here to avoid a dependency from utrie2.cpp on utrie.c.
|
||||
* This file already depends on utrie.c.
|
||||
* Otherwise, this should be in utrie2.cpp right after utrie2_swap().
|
||||
*/
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_swapAnyVersion(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
if(U_SUCCESS(*pErrorCode)) {
|
||||
switch(utrie2_getVersion(inData, length, TRUE)) {
|
||||
case 1:
|
||||
return utrie_swap(ds, inData, length, outData, pErrorCode);
|
||||
case 2:
|
||||
return utrie2_swap(ds, inData, length, outData, pErrorCode);
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
|
|
@ -22,22 +22,20 @@
|
|||
#ifndef __UTRIE2_IMPL_H__
|
||||
#define __UTRIE2_IMPL_H__
|
||||
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#endif
|
||||
#include "utrie2.h"
|
||||
|
||||
/* Public UTrie2 API implementation ----------------------------------------- */
|
||||
|
||||
/*
|
||||
* These definitions are mostly needed by utrie2.c,
|
||||
* These definitions are mostly needed by utrie2.cpp,
|
||||
* but also by utrie2_serialize() and utrie2_swap().
|
||||
*/
|
||||
|
||||
/*
|
||||
* UTrie and UTrie2 signature values,
|
||||
* in platform endianness and opposite endianness.
|
||||
*/
|
||||
#define UTRIE_SIG 0x54726965
|
||||
#define UTRIE_OE_SIG 0x65697254
|
||||
|
||||
// UTrie2 signature values, in platform endianness and opposite endianness.
|
||||
// The UTrie2 signature ASCII byte values spell "Tri2".
|
||||
#define UTRIE2_SIG 0x54726932
|
||||
#define UTRIE2_OE_SIG 0x32697254
|
||||
|
||||
|
@ -145,6 +143,9 @@ struct UNewTrie2 {
|
|||
int32_t index1[UNEWTRIE2_INDEX_1_LENGTH];
|
||||
int32_t index2[UNEWTRIE2_MAX_INDEX_2_LENGTH];
|
||||
uint32_t *data;
|
||||
#ifdef UCPTRIE_DEBUG
|
||||
UMutableCPTrie *t3;
|
||||
#endif
|
||||
|
||||
uint32_t initialValue, errorValue;
|
||||
int32_t index2Length, dataCapacity, dataLength;
|
||||
|
|
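The signature constants in this diff are the ASCII bytes of "Trie" and "Tri2" read as one 32-bit integer, and each _OE_ variant is the same value with its bytes reversed. A minimal sketch of the version check that utrie2_getVersion() performs on an already-loaded signature word follows; byteSwap32 and trieVersionFromSignature are hypothetical helper names, and the length and alignment checks of the real function are left out.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kTrieSig  = 0x54726965;  // "Trie" (UTRIE_SIG)
constexpr uint32_t kTrie2Sig = 0x54726932;  // "Tri2" (UTRIE2_SIG)

// Reverses the four bytes of a 32-bit value.
constexpr uint32_t byteSwap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

// The opposite-endian constants from the diff are exactly the swapped signatures.
static_assert(byteSwap32(kTrieSig) == 0x65697254, "UTRIE_OE_SIG");
static_assert(byteSwap32(kTrie2Sig) == 0x32697254, "UTRIE2_OE_SIG");

// Returns 2 for a UTrie2 signature, 1 for a UTrie signature, 0 otherwise.
// Opposite-endian signatures are accepted only if anyEndianOk is true.
int trieVersionFromSignature(uint32_t signature, bool anyEndianOk) {
    if (signature == kTrie2Sig) { return 2; }
    if (anyEndianOk && signature == byteSwap32(kTrie2Sig)) { return 2; }
    if (signature == kTrieSig) { return 1; }
    if (anyEndianOk && signature == byteSwap32(kTrieSig)) { return 1; }
    return 0;
}
```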
344
icu4c/source/common/utrie_swap.cpp
Normal file
|
@ -0,0 +1,344 @@
|
|||
// © 2018 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
// utrie_swap.cpp
|
||||
// created: 2018aug08 Markus W. Scherer
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#include "cmemory.h"
|
||||
#include "ucptrie_impl.h"
|
||||
#include "udataswp.h"
|
||||
#include "utrie.h"
|
||||
#include "utrie2_impl.h"
|
||||
|
||||
// These functions for swapping different generations of ICU code point tries are here
|
||||
// so that their implementation files need not depend on swapper code,
|
||||
// need not depend on each other, and so that other swapper code
|
||||
// need not depend on other trie code.
|
||||
|
||||
namespace {
|
||||
|
||||
constexpr int32_t ASCII_LIMIT = 0x80;
|
||||
|
||||
} // namespace
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
const UTrieHeader *inTrie;
|
||||
UTrieHeader trie;
|
||||
int32_t size;
|
||||
UBool dataIs32;
|
||||
|
||||
if(pErrorCode==NULL || U_FAILURE(*pErrorCode)) {
|
||||
return 0;
|
||||
}
|
||||
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
|
||||
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* setup and swapping */
|
||||
if(length>=0 && (uint32_t)length<sizeof(UTrieHeader)) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
inTrie=(const UTrieHeader *)inData;
|
||||
trie.signature=ds->readUInt32(inTrie->signature);
|
||||
trie.options=ds->readUInt32(inTrie->options);
|
||||
trie.indexLength=udata_readInt32(ds, inTrie->indexLength);
|
||||
trie.dataLength=udata_readInt32(ds, inTrie->dataLength);
|
||||
|
||||
if( trie.signature!=UTRIE_SIG ||
|
||||
(trie.options&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_SHIFT ||
|
||||
((trie.options>>UTRIE_OPTIONS_INDEX_SHIFT)&UTRIE_OPTIONS_SHIFT_MASK)!=UTRIE_INDEX_SHIFT ||
|
||||
trie.indexLength<UTRIE_BMP_INDEX_LENGTH ||
|
||||
(trie.indexLength&(UTRIE_SURROGATE_BLOCK_COUNT-1))!=0 ||
|
||||
trie.dataLength<UTRIE_DATA_BLOCK_LENGTH ||
|
||||
(trie.dataLength&(UTRIE_DATA_GRANULARITY-1))!=0 ||
|
||||
((trie.options&UTRIE_OPTIONS_LATIN1_IS_LINEAR)!=0 && trie.dataLength<(UTRIE_DATA_BLOCK_LENGTH+0x100))
|
||||
) {
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie */
|
||||
return 0;
|
||||
}
|
||||
|
||||
dataIs32=(UBool)((trie.options&UTRIE_OPTIONS_DATA_IS_32_BIT)!=0);
|
||||
size=sizeof(UTrieHeader)+trie.indexLength*2+trie.dataLength*(dataIs32?4:2);
|
||||
|
||||
if(length>=0) {
|
||||
UTrieHeader *outTrie;
|
||||
|
||||
if(length<size) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
outTrie=(UTrieHeader *)outData;
|
||||
|
||||
/* swap the header */
|
||||
ds->swapArray32(ds, inTrie, sizeof(UTrieHeader), outTrie, pErrorCode);
|
||||
|
||||
/* swap the index and the data */
|
||||
if(dataIs32) {
|
||||
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
|
||||
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, trie.dataLength*4,
|
||||
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
|
||||
} else {
|
||||
ds->swapArray16(ds, inTrie+1, (trie.indexLength+trie.dataLength)*2, outTrie+1, pErrorCode);
|
||||
}
|
||||
}
|
||||
|
||||
return size;
|
||||
}
|
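The size arithmetic and the "preflight" convention used by utrie_swap() (a negative length means: only compute the size, do not touch outData) can be sketched in isolation. The function name serializedSize is hypothetical, the 16-byte header is an assumption matching four 32-bit header fields, and the sketch returns -1 where the real code sets U_INDEX_OUTOFBOUNDS_ERROR.

```cpp
#include <cassert>
#include <cstdint>

// Assumed header size: signature, options, indexLength, dataLength, 4 bytes each.
constexpr int32_t kHeaderSize = 16;

// Returns the serialized size in bytes, or -1 if the caller's buffer is too small.
// capacity<0 means "preflight": compute the size without requiring a buffer.
int32_t serializedSize(int32_t indexLength, int32_t dataLength,
                       bool dataIs32, int32_t capacity) {
    assert(indexLength >= 0 && dataLength >= 0);
    // The index is always 16-bit; the data is 16-bit or 32-bit.
    int32_t size = kHeaderSize + indexLength * 2 + dataLength * (dataIs32 ? 4 : 2);
    if (capacity >= 0 && capacity < size) { return -1; }
    return size;
}
```

A caller would typically preflight with capacity=-1, allocate the reported number of bytes, and then call again with the real capacity.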
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie2_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
const UTrie2Header *inTrie;
|
||||
UTrie2Header trie;
|
||||
int32_t dataLength, size;
|
||||
UTrie2ValueBits valueBits;
|
||||
|
||||
if(U_FAILURE(*pErrorCode)) {
|
||||
return 0;
|
||||
}
|
||||
if(ds==NULL || inData==NULL || (length>=0 && outData==NULL)) {
|
||||
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* setup and swapping */
|
||||
if(length>=0 && length<(int32_t)sizeof(UTrie2Header)) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
inTrie=(const UTrie2Header *)inData;
|
||||
trie.signature=ds->readUInt32(inTrie->signature);
|
||||
trie.options=ds->readUInt16(inTrie->options);
|
||||
trie.indexLength=ds->readUInt16(inTrie->indexLength);
|
||||
trie.shiftedDataLength=ds->readUInt16(inTrie->shiftedDataLength);
|
||||
|
||||
valueBits=(UTrie2ValueBits)(trie.options&UTRIE2_OPTIONS_VALUE_BITS_MASK);
|
||||
dataLength=(int32_t)trie.shiftedDataLength<<UTRIE2_INDEX_SHIFT;
|
||||
|
||||
if( trie.signature!=UTRIE2_SIG ||
|
||||
valueBits<0 || UTRIE2_COUNT_VALUE_BITS<=valueBits ||
|
||||
trie.indexLength<UTRIE2_INDEX_1_OFFSET ||
|
||||
dataLength<UTRIE2_DATA_START_OFFSET
|
||||
) {
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UTrie2 */
|
||||
return 0;
|
||||
}
|
||||
|
||||
size=sizeof(UTrie2Header)+trie.indexLength*2;
|
||||
switch(valueBits) {
|
||||
case UTRIE2_16_VALUE_BITS:
|
||||
size+=dataLength*2;
|
||||
break;
|
||||
case UTRIE2_32_VALUE_BITS:
|
||||
size+=dataLength*4;
|
||||
break;
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
if(length>=0) {
|
||||
UTrie2Header *outTrie;
|
||||
|
||||
if(length<size) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
outTrie=(UTrie2Header *)outData;
|
||||
|
||||
/* swap the header */
|
||||
ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
|
||||
ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);
|
||||
|
||||
/* swap the index and the data */
|
||||
switch(valueBits) {
|
||||
case UTRIE2_16_VALUE_BITS:
|
||||
ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
|
||||
break;
|
||||
case UTRIE2_32_VALUE_BITS:
|
||||
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
|
||||
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
|
||||
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
|
||||
break;
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
return size;
|
||||
}
|
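ucptrie_swap() unpacks three fields from the 16-bit options word and combines four of its bits with the 16-bit dataLength field into a 20-bit data length. The sketch below mirrors those shifts; the mask values are assumptions (this diff does not show their definitions) chosen so that the "<< 4" lands the high bits directly above the low 16, and DecodedOptions and decodeOptions are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed mask values; the real ones are the UCPTRIE_OPTIONS_* constants.
constexpr uint16_t kDataLengthMask = 0xf000;  // bits 15..12: data length bits 19..16
constexpr uint16_t kReservedMask   = 0x38;    // bits 5..3: must be zero
constexpr uint16_t kValueBitsMask  = 7;       // bits 2..0: value width

struct DecodedOptions {
    int type;                // (options >> 6) & 3
    int valueWidth;          // options & value-bits mask
    int32_t dataLength;      // 20-bit length assembled from two fields
    bool reservedBitsClear;  // false means "not a valid header"
};

DecodedOptions decodeOptions(uint16_t options, uint16_t dataLengthLow16) {
    DecodedOptions d;
    d.type = (options >> 6) & 3;
    d.valueWidth = options & kValueBitsMask;
    // Bits 15..12 shifted left by 4 become bits 19..16 of the full length.
    d.dataLength = ((int32_t)(options & kDataLengthMask) << 4) | dataLengthLow16;
    d.reservedBitsClear = (options & kReservedMask) == 0;
    return d;
}
```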
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
ucptrie_swap(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
const UCPTrieHeader *inTrie;
|
||||
UCPTrieHeader trie;
|
||||
int32_t dataLength, size;
|
||||
UCPTrieValueWidth valueWidth;
|
||||
|
||||
if(U_FAILURE(*pErrorCode)) {
|
||||
return 0;
|
||||
}
|
||||
if(ds==nullptr || inData==nullptr || (length>=0 && outData==nullptr)) {
|
||||
*pErrorCode=U_ILLEGAL_ARGUMENT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* setup and swapping */
|
||||
if(length>=0 && length<(int32_t)sizeof(UCPTrieHeader)) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
inTrie=(const UCPTrieHeader *)inData;
|
||||
trie.signature=ds->readUInt32(inTrie->signature);
|
||||
trie.options=ds->readUInt16(inTrie->options);
|
||||
trie.indexLength=ds->readUInt16(inTrie->indexLength);
|
||||
trie.dataLength = ds->readUInt16(inTrie->dataLength);
|
||||
|
||||
UCPTrieType type = (UCPTrieType)((trie.options >> 6) & 3);
|
||||
valueWidth = (UCPTrieValueWidth)(trie.options & UCPTRIE_OPTIONS_VALUE_BITS_MASK);
|
||||
dataLength = ((int32_t)(trie.options & UCPTRIE_OPTIONS_DATA_LENGTH_MASK) << 4) | trie.dataLength;
|
||||
|
||||
int32_t minIndexLength = type == UCPTRIE_TYPE_FAST ?
|
||||
UCPTRIE_BMP_INDEX_LENGTH : UCPTRIE_SMALL_INDEX_LENGTH;
|
||||
if( trie.signature!=UCPTRIE_SIG ||
|
||||
type > UCPTRIE_TYPE_SMALL ||
|
||||
(trie.options & UCPTRIE_OPTIONS_RESERVED_MASK) != 0 ||
|
||||
valueWidth > UCPTRIE_VALUE_BITS_8 ||
|
||||
trie.indexLength < minIndexLength ||
|
||||
dataLength < ASCII_LIMIT
|
||||
) {
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR; /* not a UCPTrie */
|
||||
return 0;
|
||||
}
|
||||
|
||||
size=sizeof(UCPTrieHeader)+trie.indexLength*2;
|
||||
switch(valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
size+=dataLength*2;
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
size+=dataLength*4;
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
size+=dataLength;
|
||||
break;
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
if(length>=0) {
|
||||
UCPTrieHeader *outTrie;
|
||||
|
||||
if(length<size) {
|
||||
*pErrorCode=U_INDEX_OUTOFBOUNDS_ERROR;
|
||||
return 0;
|
||||
}
|
||||
|
||||
outTrie=(UCPTrieHeader *)outData;
|
||||
|
||||
/* swap the header */
|
||||
ds->swapArray32(ds, &inTrie->signature, 4, &outTrie->signature, pErrorCode);
|
||||
ds->swapArray16(ds, &inTrie->options, 12, &outTrie->options, pErrorCode);
|
||||
|
||||
/* swap the index and the data */
|
||||
switch(valueWidth) {
|
||||
case UCPTRIE_VALUE_BITS_16:
|
||||
ds->swapArray16(ds, inTrie+1, (trie.indexLength+dataLength)*2, outTrie+1, pErrorCode);
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_32:
|
||||
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
|
||||
ds->swapArray32(ds, (const uint16_t *)(inTrie+1)+trie.indexLength, dataLength*4,
|
||||
(uint16_t *)(outTrie+1)+trie.indexLength, pErrorCode);
|
||||
break;
|
||||
case UCPTRIE_VALUE_BITS_8:
|
||||
ds->swapArray16(ds, inTrie+1, trie.indexLength*2, outTrie+1, pErrorCode);
|
||||
if(inTrie!=outTrie) {
|
||||
uprv_memmove((outTrie+1)+trie.indexLength, (inTrie+1)+trie.indexLength, dataLength);
|
||||
}
|
||||
break;
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
return size;
|
||||
}
|
||||
|
||||
namespace {
|
||||
|
||||
/**
|
||||
* Gets the trie version from 32-bit-aligned memory containing the serialized form
|
||||
* of a UTrie (version 1), a UTrie2 (version 2), or a UCPTrie (version 3).
|
||||
*
|
||||
* @param data a pointer to 32-bit-aligned memory containing the serialized form of a trie
|
||||
* @param length the number of bytes available at data;
|
||||
* can be more than necessary (see return value)
|
||||
* @param anyEndianOk If FALSE, only platform-endian serialized forms are recognized.
|
||||
* If TRUE, opposite-endian serialized forms are recognized as well.
|
||||
* @return the trie version of the serialized form, or 0 if it is not
|
||||
* recognized as a serialized trie
|
||||
*/
|
||||
int32_t
|
||||
getVersion(const void *data, int32_t length, UBool anyEndianOk) {
|
||||
uint32_t signature;
|
||||
if(length<16 || data==nullptr || (U_POINTER_MASK_LSB(data, 3)!=0)) {
|
||||
return 0;
|
||||
}
|
||||
signature=*(const uint32_t *)data;
|
||||
if(signature==UCPTRIE_SIG) {
|
||||
return 3;
|
||||
}
|
||||
if(anyEndianOk && signature==UCPTRIE_OE_SIG) {
|
||||
return 3;
|
||||
}
|
||||
if(signature==UTRIE2_SIG) {
|
||||
return 2;
|
||||
}
|
||||
if(anyEndianOk && signature==UTRIE2_OE_SIG) {
|
||||
return 2;
|
||||
}
|
||||
if(signature==UTRIE_SIG) {
|
||||
return 1;
|
||||
}
|
||||
if(anyEndianOk && signature==UTRIE_OE_SIG) {
|
||||
return 1;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
U_CAPI int32_t U_EXPORT2
|
||||
utrie_swapAnyVersion(const UDataSwapper *ds,
|
||||
const void *inData, int32_t length, void *outData,
|
||||
UErrorCode *pErrorCode) {
|
||||
if(U_FAILURE(*pErrorCode)) { return 0; }
|
||||
switch(getVersion(inData, length, TRUE)) {
|
||||
case 1:
|
||||
return utrie_swap(ds, inData, length, outData, pErrorCode);
|
||||
case 2:
|
||||
return utrie2_swap(ds, inData, length, outData, pErrorCode);
|
||||
case 3:
|
||||
return ucptrie_swap(ds, inData, length, outData, pErrorCode);
|
||||
default:
|
||||
*pErrorCode=U_INVALID_FORMAT_ERROR;
|
||||
return 0;
|
||||
}
|
||||
}
|
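The dispatch in utrie_swapAnyVersion() depends on recognizing a trie version from the first 32-bit word of the serialized form. A minimal standalone sketch of that idea follows. It is not the ICU implementation: the helper names are hypothetical, the "Trie" and "Tri2" signature values are taken from the removed GetVersionTest, and the "Tri3" value for UCPTrie is an assumption of this sketch.

```cpp
#include <cstdint>

// Signature words. The first two appear in the removed GetVersionTest;
// the "Tri3" value for UCPTrie is an assumption of this sketch.
constexpr uint32_t kTrieSig    = 0x54726965;  // "Trie", version 1
constexpr uint32_t kTrie2Sig   = 0x54726932;  // "Tri2", version 2
constexpr uint32_t kUCPTrieSig = 0x54726933;  // "Tri3", version 3 (assumed)

// Reverses the byte order of a 32-bit value, giving the opposite-endian signature.
constexpr uint32_t byteSwap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0xff00u) | ((x << 8) & 0xff0000u) | (x << 24);
}

// Hypothetical helper: maps a signature word to a trie version (1..3),
// or returns 0 if the signature is not recognized.
int32_t versionFromSignature(uint32_t signature, bool anyEndianOk) {
    const uint32_t sigs[] = { kTrieSig, kTrie2Sig, kUCPTrieSig };
    for (int32_t i = 0; i < 3; ++i) {
        if (signature == sigs[i] ||
                (anyEndianOk && signature == byteSwap32(sigs[i]))) {
            return i + 1;
        }
    }
    return 0;
}
```

The real getVersion() additionally rejects null pointers, buffers shorter than 16 bytes, and misaligned data before it reads the signature.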
|
@ -557,7 +557,10 @@ UTS46::processUnicode(const UnicodeString &src,
|
|||
destArray=dest.getBuffer();
|
||||
destLength+=newLength-labelLength;
|
||||
labelLimit=labelStart+=newLength+1;
|
||||
} else if(0xdf<=c && c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
|
||||
continue;
|
||||
} else if(c<0xdf) {
|
||||
// pass
|
||||
} else if(c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
|
||||
info.isTransDiff=TRUE;
|
||||
if(doMapDevChars) {
|
||||
destLength=mapDevChars(dest, labelStart, labelLimit, errorCode);
|
||||
|
@ -565,15 +568,23 @@ UTS46::processUnicode(const UnicodeString &src,
|
|||
return dest;
|
||||
}
|
||||
destArray=dest.getBuffer();
|
||||
// Do not increment labelLimit in case c was removed.
|
||||
// All deviation characters have been mapped, no need to check for them again.
|
||||
doMapDevChars=FALSE;
|
||||
} else {
|
||||
++labelLimit;
|
||||
// Do not increment labelLimit in case c was removed.
|
||||
continue;
|
||||
}
|
||||
} else if(U16_IS_SURROGATE(c)) {
|
||||
if(U16_IS_SURROGATE_LEAD(c) ?
|
||||
(labelLimit+1)==destLength || !U16_IS_TRAIL(destArray[labelLimit+1]) :
|
||||
labelLimit==labelStart || !U16_IS_LEAD(destArray[labelLimit-1])) {
|
||||
// Map an unpaired surrogate to U+FFFD before normalization so that when
|
||||
// that removes characters we do not turn two unpaired ones into a pair.
|
||||
info.labelErrors|=UIDNA_ERROR_DISALLOWED;
|
||||
dest.setCharAt(labelLimit, 0xfffd);
|
||||
destArray=dest.getBuffer();
|
||||
}
|
||||
} else {
|
||||
++labelLimit;
|
||||
}
|
||||
++labelLimit;
|
||||
}
|
||||
// Permit an empty label at the end (0<labelStart==labelLimit==destLength is ok)
|
||||
// but not an empty label elsewhere nor a completely empty domain name.
|
||||
|
|
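The UTS46::processUnicode() hunk above maps each unpaired surrogate to U+FFFD before normalization, so that removing characters in between cannot join two unpaired surrogates into a pair. A simplified standalone sketch of that replacement step follows. It works on a std::u16string rather than a UnicodeString, it does not track labels or error flags, and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Local equivalents in spirit of U16_IS_LEAD / U16_IS_TRAIL.
inline bool isLead(char16_t c)  { return (c & 0xfc00) == 0xd800; }
inline bool isTrail(char16_t c) { return (c & 0xfc00) == 0xdc00; }

// Hypothetical helper: replaces each unpaired surrogate in s with U+FFFD
// and returns the number of code units that were replaced.
int replaceUnpairedSurrogates(std::u16string &s) {
    int replaced = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t c = s[i];
        if (isLead(c)) {
            if (i + 1 < s.size() && isTrail(s[i + 1])) {
                ++i;  // well-formed pair: skip the trail surrogate as well
            } else {
                s[i] = 0xfffd;
                ++replaced;
            }
        } else if (isTrail(c)) {
            // A trail surrogate reached here was not consumed by a preceding lead.
            s[i] = 0xfffd;
            ++replaced;
        }
    }
    return replaced;
}
```

For example, the sequence D801 E000 DFFE would become FFFD E000 FFFD, so dropping E000 afterwards can no longer produce the supplementary code point U+107FE.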
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
|
@ -4305,7 +4305,7 @@ D7A4..D7AF >FFFD # NA <reserved-D7A4>..<reserved-D7AF>
|
|||
D7C7..D7CA >FFFD # NA <reserved-D7C7>..<reserved-D7CA>
|
||||
# D7CB..D7FB valid # 5.2 HANGUL JONGSEONG NIEUN-RIEUL..HANGUL JONGSEONG PHIEUPH-THIEUTH
|
||||
D7FC..D7FF >FFFD # NA <reserved-D7FC>..<reserved-D7FF>
|
||||
D800..DFFF >FFFD # 2.0 <surrogate-D800>..<surrogate-DFFF>
|
||||
# D800..DFFF >FFFD # 2.0 <surrogate-D800>..<surrogate-DFFF>
|
||||
E000..F8FF >FFFD # 1.1 <private-use-E000>..<private-use-F8FF>
|
||||
F900 >8C48 # 1.1 CJK COMPATIBILITY IDEOGRAPH-F900
|
||||
F901 >66F4 # 1.1 CJK COMPATIBILITY IDEOGRAPH-F901
|
||||
|
|
|
@ -20,7 +20,7 @@
|
|||
#include "unicode/uspoof.h"
|
||||
#include "unicode/uscript.h"
|
||||
#include "unicode/udata.h"
|
||||
|
||||
#include "udataswp.h"
|
||||
#include "utrie2.h"
|
||||
|
||||
#if !UCONFIG_NO_NORMALIZATION
|
||||
|
|
|
@ -48,7 +48,7 @@ cnmdptst.o cnormtst.o cnumtst.o crelativedateformattest.o crestst.o creststn.o c
|
|||
cucdapi.o cucdtst.o custrtst.o cstrcase.o cutiltst.o nucnvtst.o nccbtst.o bocu1tst.o \
|
||||
cbiditst.o cbididat.o eurocreg.o udatatst.o utf16tst.o utransts.o \
|
||||
ncnvfbts.o ncnvtst.o putiltst.o cstrtest.o udatpg_test.o utf8tst.o \
|
||||
stdnmtst.o usrchtst.o custrtrn.o sorttest.o trietest.o trie2test.o usettest.o \
|
||||
stdnmtst.o usrchtst.o custrtrn.o sorttest.o trietest.o trie2test.o ucptrietest.o usettest.o \
|
||||
uenumtst.o utmstest.o currtest.o \
|
||||
idnatest.o nfsprep.o spreptst.o sprpdata.o \
|
||||
hpmufn.o tracetst.o reapits.o uregiontest.o ulistfmttest.o\
|
||||
|
|
|
@ -182,6 +182,7 @@
|
|||
<ClCompile Include="sorttest.c" />
|
||||
<ClCompile Include="trie2test.c" />
|
||||
<ClCompile Include="trietest.c" />
|
||||
<ClCompile Include="ucptrietest.c" />
|
||||
<ClCompile Include="uenumtst.c" />
|
||||
<ClCompile Include="bocu1tst.c" />
|
||||
<ClCompile Include="ccapitst.c" />
|
||||
|
@ -284,4 +285,4 @@
|
|||
<Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
|
||||
<ImportGroup Label="ExtensionTargets">
|
||||
</ImportGroup>
|
||||
</Project>
|
||||
</Project>
|
||||
|
|
|
@ -123,6 +123,9 @@
|
|||
<ClCompile Include="trietest.c">
|
||||
<Filter>collections</Filter>
|
||||
</ClCompile>
|
||||
<ClCompile Include="ucptrietest.c">
|
||||
<Filter>collections</Filter>
|
||||
</ClCompile>
|
||||
<ClCompile Include="uenumtst.c">
|
||||
<Filter>collections</Filter>
|
||||
</ClCompile>
|
||||
|
@ -417,4 +420,4 @@
|
|||
<Filter>sprep & idna</Filter>
|
||||
</ClInclude>
|
||||
</ItemGroup>
|
||||
</Project>
|
||||
</Project>
|
||||
|
|
|
@ -27,6 +27,7 @@ void addHashtableTest(TestNode** root);
|
|||
void addCStringTest(TestNode** root);
|
||||
void addTrieTest(TestNode** root);
|
||||
void addTrie2Test(TestNode** root);
|
||||
void addUCPTrieTest(TestNode** root);
|
||||
void addEnumerationTest(TestNode** root);
|
||||
void addPosixTest(TestNode** root);
|
||||
void addSortTest(TestNode** root);
|
||||
|
@ -38,6 +39,7 @@ void addUtility(TestNode** root)
|
|||
addCStringTest(root);
|
||||
addTrieTest(root);
|
||||
addTrie2Test(root);
|
||||
addUCPTrieTest(root);
|
||||
addLocaleTest(root);
|
||||
addCLDRTest(root);
|
||||
addUnicodeTest(root);
|
||||
|
|
|
@ -421,7 +421,7 @@ testTrieUTF8(const char *testName,
|
|||
prevCP=c;
|
||||
--c; /* end of the range */
|
||||
U8_APPEND_UNSAFE(s, length, c);
|
||||
if(U_IS_SURROGATE(prevCP)) {
|
||||
if(U_IS_SURROGATE(c)) {
|
||||
// A surrogate byte sequence counts as 3 single-byte errors.
|
||||
values[countValues++]=errorValue;
|
||||
values[countValues++]=errorValue;
|
||||
|
@ -1287,31 +1287,6 @@ GrowDataArrayTest(void) {
|
|||
|
||||
/* versions 1 and 2 --------------------------------------------------------- */
|
||||
|
||||
static void
|
||||
GetVersionTest(void) {
|
||||
uint32_t data[4];
|
||||
if( /* version 1 */
|
||||
(data[0]=0x54726965, 1!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
|
||||
(data[0]=0x54726965, 1!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
|
||||
(data[0]=0x65697254, 0!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
|
||||
(data[0]=0x65697254, 1!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
|
||||
/* version 2 */
|
||||
(data[0]=0x54726932, 2!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
|
||||
(data[0]=0x54726932, 2!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
|
||||
(data[0]=0x32697254, 0!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
|
||||
(data[0]=0x32697254, 2!=utrie2_getVersion(data, sizeof(data), TRUE)) ||
|
||||
/* illegal arguments */
|
||||
(data[0]=0x54726932, 0!=utrie2_getVersion(NULL, sizeof(data), FALSE)) ||
|
||||
(data[0]=0x54726932, 0!=utrie2_getVersion(data, 3, FALSE)) ||
|
||||
(data[0]=0x54726932, 0!=utrie2_getVersion((char *)data+1, sizeof(data), FALSE)) ||
|
||||
/* unknown signature values */
|
||||
(data[0]=0x11223344, 0!=utrie2_getVersion(data, sizeof(data), FALSE)) ||
|
||||
(data[0]=0x54726933, 0!=utrie2_getVersion(data, sizeof(data), FALSE))
|
||||
) {
|
||||
log_err("error: utrie2_getVersion() is not working as expected\n");
|
||||
}
|
||||
}
|
||||
|
||||
static UNewTrie *
|
||||
makeNewTrie1WithRanges(const char *testName,
|
||||
const SetRange setRanges[], int32_t countSetRanges,
|
||||
|
@ -1455,6 +1430,5 @@ addTrie2Test(TestNode** root) {
|
|||
addTest(root, &DummyTrieTest, "tsutil/trie2test/DummyTrieTest");
|
||||
addTest(root, &FreeBlocksTest, "tsutil/trie2test/FreeBlocksTest");
|
||||
addTest(root, &GrowDataArrayTest, "tsutil/trie2test/GrowDataArrayTest");
|
||||
addTest(root, &GetVersionTest, "tsutil/trie2test/GetVersionTest");
|
||||
addTest(root, &Trie12ConversionTest, "tsutil/trie2test/Trie12ConversionTest");
|
||||
}
|
||||
|
|
1506
icu4c/source/test/cintltst/ucptrietest.c
Normal file
File diff suppressed because it is too large
|
@ -633,6 +633,29 @@ BasicNormalizerTest::TestPreviousNext(const UChar *src, int32_t srcLength,
|
|||
const char *moves,
|
||||
UNormalizationMode mode,
|
||||
const char *name) {
|
||||
// Sanity check non-iterative normalization.
|
||||
{
|
||||
IcuTestErrorCode errorCode(*this, "TestPreviousNext");
|
||||
UnicodeString result;
|
||||
Normalizer::normalize(UnicodeString(src, srcLength), mode, 0, result, errorCode);
|
||||
if (errorCode.isFailure()) {
|
||||
dataerrln("error: non-iterative normalization of %s failed: %s",
|
||||
name, errorCode.errorName());
|
||||
errorCode.reset();
|
||||
return;
|
||||
}
|
||||
// UnicodeString::fromUTF32(expect, expectLength)
|
||||
// would turn unpaired surrogates into U+FFFD.
|
||||
for (int32_t i = 0, j = 0; i < result.length(); ++j) {
|
||||
UChar32 c = result.char32At(i);
|
||||
if (c != expect[j]) {
|
||||
errln("error: non-iterative normalization of %s did not yield the expected result",
|
||||
name);
|
||||
}
|
||||
i += U16_LENGTH(c);
|
||||
}
|
||||
}
|
||||
|
||||
// iterators
|
||||
Normalizer iter(src, srcLength, mode);
|
||||
|
||||
|
@ -1432,9 +1455,14 @@ struct StringPair { const char *input, *expected; };
|
|||
void
|
||||
BasicNormalizerTest::TestCustomComp() {
|
||||
static const StringPair pairs[]={
|
||||
{ "\\uD801\\uE000\\uDFFE", "" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
// ICU 63 normalization with UCPTrie requires inert surrogate code points.
|
||||
// { "\\uD801\\uE000\\uDFFE", "" },
|
||||
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
|
||||
|
||||
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE002\\U000110B9\\u0327\\u0345" },
|
||||
{ "\\uE010\\U000F0011\\uE012", "\\uE011\\uE012" },
|
||||
{ "\\uE010\\U000F0011\\U000F0011\\uE012", "\\uE011\\U000F0010" },
|
||||
|
@ -1462,9 +1490,14 @@ BasicNormalizerTest::TestCustomComp() {
|
|||
void
|
||||
BasicNormalizerTest::TestCustomFCC() {
|
||||
static const StringPair pairs[]={
|
||||
{ "\\uD801\\uE000\\uDFFE", "" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
// ICU 63 normalization with UCPTrie requires inert surrogate code points.
|
||||
// { "\\uD801\\uE000\\uDFFE", "" },
|
||||
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
|
||||
|
||||
// The following expected result is different from CustomComp
|
||||
// because of only-contiguous composition.
|
||||
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE001\\U000110B9\\u0327\\u0308\\u0345" },
|
||||
|
|
|
@ -17,17 +17,20 @@ include $(top_builddir)/icudefs.mk
|
|||
subdir = test/perf/normperf
|
||||
|
||||
## Extra files to remove for 'make clean'
|
||||
CLEANFILES = *~ $(DEPS)
|
||||
CLEANFILES = *~ $(DEPS) $(SIMPLE_DEPS)
|
||||
|
||||
## Target information
|
||||
TARGET = normperf
|
||||
SIMPLE = simplenormperf
|
||||
|
||||
CPPFLAGS += -I$(top_srcdir)/common -I$(top_srcdir)/tools/toolutil -I$(top_srcdir)/tools/ctestfw
|
||||
LIBS = $(LIBCTESTFW) $(LIBICUI18N) $(LIBICUUC) $(LIBICUTOOLUTIL) $(DEFAULT_LIBS) $(LIB_M)
|
||||
|
||||
OBJECTS = normperf.o
|
||||
SIMPLE_OBJ = simplenormperf.o
|
||||
|
||||
DEPS = $(OBJECTS:.o=.d)
|
||||
SIMPLE_DEPS = $(SIMPLE_OBJ:.o=.d)
|
||||
|
||||
## List of phony targets
|
||||
.PHONY : all all-local install install-local clean clean-local \
|
||||
|
@ -44,7 +47,7 @@ distclean : distclean-local
|
|||
dist: dist-local
|
||||
check: all check-local
|
||||
|
||||
all-local: $(TARGET)
|
||||
all-local: $(TARGET) $(SIMPLE)
|
||||
|
||||
install-local:
|
||||
|
||||
|
@ -52,7 +55,7 @@ dist-local:
|
|||
|
||||
clean-local:
|
||||
test -z "$(CLEANFILES)" || $(RMV) $(CLEANFILES)
|
||||
$(RMV) $(OBJECTS) $(TARGET)
|
||||
$(RMV) $(OBJECTS) $(SIMPLE_OBJ) $(TARGET) $(SIMPLE)
|
||||
|
||||
distclean-local: clean-local
|
||||
$(RMV) Makefile
|
||||
|
@ -67,16 +70,21 @@ $(TARGET) : $(OBJECTS)
|
|||
$(LINK.cc) -o $@ $^ $(LIBS)
|
||||
$(POST_BUILD_STEP)
|
||||
|
||||
$(SIMPLE) : $(SIMPLE_OBJ)
|
||||
$(LINK.cc) -o $@ $^ $(LIBS)
|
||||
$(POST_BUILD_STEP)
|
||||
|
||||
invoke:
|
||||
ICU_DATA=$${ICU_DATA:-$(top_builddir)/data/} TZ=PST8PDT $(INVOKE) $(INVOCATION)
|
||||
|
||||
ifeq (,$(MAKECMDGOALS))
|
||||
-include $(DEPS)
|
||||
-include $(SIMPLE_DEPS)
|
||||
else
|
||||
ifneq ($(patsubst %clean,,$(MAKECMDGOALS)),)
|
||||
ifneq ($(patsubst %install,,$(MAKECMDGOALS)),)
|
||||
-include $(DEPS)
|
||||
-include $(SIMPLE_DEPS)
|
||||
endif
|
||||
endif
|
||||
endif
|
||||
|
||||
|
|
352
icu4c/source/test/perf/normperf/simplenormperf.cpp
Normal file
|
@ -0,0 +1,352 @@
|
|||
// © 2018 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html
|
||||
|
||||
// simplenormperf.cpp
|
||||
// created: 2018mar15 Markus W. Scherer
|
||||
|
||||
#include <stdio.h>
|
||||
#include <string>
|
||||
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/bytestream.h"
|
||||
#include "unicode/normalizer2.h"
|
||||
#include "unicode/stringpiece.h"
|
||||
#include "unicode/unistr.h"
|
||||
#include "unicode/utf8.h"
|
||||
#include "unicode/utimer.h"
|
||||
#include "cmemory.h"
|
||||
|
||||
using icu::Normalizer2;
|
||||
using icu::UnicodeString;
|
||||
|
||||
namespace {
|
||||
|
||||
// Strings with commonly occurring BMP characters.
|
||||
class CommonChars {
|
||||
public:
|
||||
static UnicodeString getMixed(int32_t minLength) {
|
||||
return extend(UnicodeString(latin1).append(japanese).append(arabic), minLength);
|
||||
}
|
||||
static UnicodeString getLatin1(int32_t minLength) { return extend(latin1, minLength); }
|
||||
static UnicodeString getLowercaseLatin1(int32_t minLength) { return extend(lowercaseLatin1, minLength); }
|
||||
static UnicodeString getASCII(int32_t minLength) { return extend(ascii, minLength); }
|
||||
static UnicodeString getJapanese(int32_t minLength) { return extend(japanese, minLength); }
|
||||
|
||||
// Returns an array of UTF-8 offsets, one per code point.
|
||||
// Assumes all BMP characters.
|
||||
static int32_t *toUTF8WithOffsets(const UnicodeString &s16, std::string &s8, int32_t &numCodePoints) {
|
||||
s8.clear();
|
||||
s8.reserve(s16.length());
|
||||
s16.toUTF8String(s8);
|
||||
const char *s = s8.data();
|
||||
int32_t length = s8.length();
|
||||
int32_t *offsets = new int32_t[length + 1];
|
||||
int32_t numCP = 0;
|
||||
for (int32_t i = 0; i < length;) {
|
||||
offsets[numCP++] = i;
|
||||
U8_FWD_1(s, i, length);
|
||||
}
|
||||
offsets[numCP] = length;
|
||||
numCodePoints = numCP;
|
||||
return offsets;
|
||||
}
|
||||
|
||||
private:
|
||||
static UnicodeString extend(const UnicodeString &s, int32_t minLength) {
|
||||
UnicodeString result(s);
|
||||
while (result.length() < minLength) {
|
||||
UnicodeString twice = result + result;
|
||||
result = std::move(twice);
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
static const UChar *const latin1;
|
||||
static const UChar *const lowercaseLatin1;
|
||||
static const UChar *const ascii;
|
||||
static const UChar *const japanese;
|
||||
static const UChar *const arabic;
|
||||
};
|
||||
|
||||
const UChar *const CommonChars::latin1 =
|
||||
// Goethe’s Bergschloß in normal sentence case.
|
||||
u"Da droben auf jenem Berge, da steht ein altes Schloß, "
|
||||
u"wo hinter Toren und Türen sonst lauerten Ritter und Roß.\n"
|
||||
u"Verbrannt sind Türen und Tore, und überall ist es so still; "
|
||||
u"das alte verfallne Gemäuer durchklettr ich, wie ich nur will.\n"
|
||||
u"Hierneben lag ein Keller, so voll von köstlichem Wein; "
|
||||
u"nun steiget nicht mehr mit Krügen die Kellnerin heiter hinein.\n"
|
||||
u"Sie setzt den Gästen im Saale nicht mehr die Becher umher, "
|
||||
u"sie füllt zum Heiligen Mahle dem Pfaffen das Fläschchen nicht mehr.\n"
|
||||
u"Sie reicht dem lüsternen Knappen nicht mehr auf dem Gange den Trank, "
|
||||
u"und nimmt für flüchtige Gabe nicht mehr den flüchtigen Dank.\n"
|
||||
u"Denn alle Balken und Decken, sie sind schon lange verbrannt, "
|
||||
u"und Trepp und Gang und Kapelle in Schutt und Trümmer verwandt.\n"
|
||||
u"Doch als mit Zither und Flasche nach diesen felsigen Höhn "
|
||||
u"ich an dem heitersten Tage mein Liebchen steigen gesehn,\n"
|
||||
u"da drängte sich frohes Behagen hervor aus verödeter Ruh, "
|
||||
u"da gings wie in alten Tagen recht feierlich wieder zu.\n"
|
||||
u"Als wären für stattliche Gäste die weitesten Räume bereit, "
|
||||
u"als käm ein Pärchen gegangen aus jener tüchtigen Zeit.\n"
|
||||
u"Als stünd in seiner Kapelle der würdige Pfaffe schon da "
|
||||
u"und fragte: Wollt ihr einander? Wir aber lächelten: Ja!\n"
|
||||
u"Und tief bewegten Gesänge des Herzens innigsten Grund, "
|
||||
u"Es zeugte, statt der Menge, der Echo schallender Mund.\n"
|
||||
u"Und als sich gegen Abend im stillen alles verlor,"
|
||||
u"da blickte die glühende Sonne zum schroffen Gipfel empor.\n"
|
||||
u"Und Knapp und Kellnerin glänzen als Herren weit und breit; "
|
||||
u"sie nimmt sich zum Kredenzen und er zum Danke sich Zeit.\n";
|
||||
|
||||
const UChar *const CommonChars::lowercaseLatin1 =
|
||||
// Goethe’s Bergschloß in all lowercase
|
||||
u"da droben auf jenem berge, da steht ein altes schloß, "
|
||||
u"wo hinter toren und türen sonst lauerten ritter und roß.\n"
|
||||
u"verbrannt sind türen und tore, und überall ist es so still; "
|
||||
u"das alte verfallne gemäuer durchklettr ich, wie ich nur will.\n"
|
||||
u"hierneben lag ein keller, so voll von köstlichem wein; "
|
||||
u"nun steiget nicht mehr mit krügen die kellnerin heiter hinein.\n"
|
||||
u"sie setzt den gästen im saale nicht mehr die becher umher, "
|
||||
u"sie füllt zum heiligen mahle dem pfaffen das fläschchen nicht mehr.\n"
|
||||
u"sie reicht dem lüsternen knappen nicht mehr auf dem gange den trank, "
|
||||
u"und nimmt für flüchtige gabe nicht mehr den flüchtigen dank.\n"
|
||||
u"denn alle balken und decken, sie sind schon lange verbrannt, "
|
||||
u"und trepp und gang und kapelle in schutt und trümmer verwandt.\n"
|
||||
u"doch als mit zither und flasche nach diesen felsigen höhn "
|
||||
u"ich an dem heitersten tage mein liebchen steigen gesehn,\n"
|
||||
u"da drängte sich frohes behagen hervor aus verödeter ruh, "
|
||||
u"da gings wie in alten tagen recht feierlich wieder zu.\n"
|
||||
u"als wären für stattliche gäste die weitesten räume bereit, "
|
||||
u"als käm ein pärchen gegangen aus jener tüchtigen zeit.\n"
|
||||
u"als stünd in seiner kapelle der würdige pfaffe schon da "
|
||||
u"und fragte: wollt ihr einander? wir aber lächelten: ja!\n"
|
||||
u"und tief bewegten gesänge des herzens innigsten grund, "
|
||||
u"es zeugte, statt der menge, der echo schallender mund.\n"
|
||||
u"und als sich gegen abend im stillen alles verlor,"
|
||||
u"da blickte die glühende sonne zum schroffen gipfel empor.\n"
|
||||
u"und knapp und kellnerin glänzen als herren weit und breit; "
|
||||
u"sie nimmt sich zum kredenzen und er zum danke sich zeit.\n";
|
||||
|
||||
const UChar *const CommonChars::ascii =
|
||||
// Goethe’s Bergschloß in normal sentence case but ASCII-fied
|
||||
u"Da droben auf jenem Berge, da steht ein altes Schloss, "
|
||||
u"wo hinter Toren und Tueren sonst lauerten Ritter und Ross.\n"
|
||||
u"Verbrannt sind Tueren und Tore, und ueberall ist es so still; "
|
||||
u"das alte verfallne Gemaeuer durchklettr ich, wie ich nur will.\n"
|
||||
u"Hierneben lag ein Keller, so voll von koestlichem Wein; "
|
||||
u"nun steiget nicht mehr mit Kruegen die Kellnerin heiter hinein.\n"
|
||||
u"Sie setzt den Gaesten im Saale nicht mehr die Becher umher, "
|
||||
u"sie fuellt zum Heiligen Mahle dem Pfaffen das Flaeschchen nicht mehr.\n"
|
||||
u"Sie reicht dem luesternen Knappen nicht mehr auf dem Gange den Trank, "
|
||||
u"und nimmt fuer fluechtige Gabe nicht mehr den fluechtigen Dank.\n"
|
||||
u"Denn alle Balken und Decken, sie sind schon lange verbrannt, "
|
||||
u"und Trepp und Gang und Kapelle in Schutt und Truemmer verwandt.\n"
|
||||
u"Doch als mit Zither und Flasche nach diesen felsigen Hoehn "
|
||||
u"ich an dem heitersten Tage mein Liebchen steigen gesehn,\n"
|
||||
u"da draengte sich frohes Behagen hervor aus veroedeter Ruh, "
|
||||
u"da gings wie in alten Tagen recht feierlich wieder zu.\n"
|
||||
u"Als waeren fuer stattliche Gaeste die weitesten Raeume bereit, "
|
||||
u"als kaem ein Paerchen gegangen aus jener tuechtigen Zeit.\n"
|
||||
u"Als stuend in seiner Kapelle der wuerdige Pfaffe schon da "
|
||||
u"und fragte: Wollt ihr einander? Wir aber laechelten: Ja!\n"
|
||||
u"Und tief bewegten Gesaenge des Herzens innigsten Grund, "
|
||||
u"Es zeugte, statt der Menge, der Echo schallender Mund.\n"
|
||||
u"Und als sich gegen Abend im stillen alles verlor,"
|
||||
u"da blickte die gluehende Sonne zum schroffen Gipfel empor.\n"
|
||||
u"Und Knapp und Kellnerin glaenzen als Herren weit und breit; "
|
||||
u"sie nimmt sich zum Kredenzen und er zum Danke sich Zeit.\n";
|
||||
|
||||
const UChar *const CommonChars::japanese =
|
||||
// Ame ni mo makezu = Be not Defeated by the Rain, by Kenji Miyazawa.
|
||||
u"雨にもまけず風にもまけず雪にも夏の暑さにもまけぬ"
|
||||
u"丈夫なからだをもち慾はなく決して瞋らず"
|
||||
u"いつもしずかにわらっている一日に玄米四合と"
|
||||
u"味噌と少しの野菜をたべあらゆることを"
|
||||
u"じぶんをかんじょうにいれずによくみききしわかり"
|
||||
u"そしてわすれず野原の松の林の蔭の"
|
||||
u"小さな萱ぶきの小屋にいて東に病気のこどもあれば"
|
||||
u"行って看病してやり西につかれた母あれば"
|
||||
u"行ってその稲の束を負い南に死にそうな人あれば"
|
||||
u"行ってこわがらなくてもいいといい"
|
||||
u"北にけんかやそしょうがあれば"
|
||||
u"つまらないからやめろといいひでりのときはなみだをながし"
|
||||
u"さむさのなつはおろおろあるきみんなにでくのぼうとよばれ"
|
||||
u"ほめられもせずくにもされずそういうものにわたしはなりたい";
|
||||
|
||||
const UChar *const CommonChars::arabic =
|
||||
// Some Arabic for variety. "What is Unicode?"
|
||||
// http://www.unicode.org/standard/translations/arabic.html
|
||||
u"تتعامل الحواسيب بالأسام مع الأرقام فقط، "
|
||||
u"و تخزن الحروف و المحارف "
|
||||
u"الأخرى بتخصيص رقم لكل واحد "
|
||||
u"منها. قبل اختراع يونيكود كان هناك ";
|
||||
|
||||
// TODO: class BenchmarkPerCodePoint?
|
||||
|
||||
class Operation {
|
||||
public:
|
||||
Operation() {}
|
||||
virtual ~Operation();
|
||||
virtual double call(int32_t iterations, int32_t pieceLength) = 0;
|
||||
|
||||
protected:
|
||||
UTimer startTime;
|
||||
};
|
||||
|
||||
Operation::~Operation() {}
|
||||
|
||||
const int32_t kLengths[] = { 5, 12, 30, 100, 1000, 10000 };
|
||||
|
||||
int32_t getMaxLength() { return kLengths[UPRV_LENGTHOF(kLengths) - 1]; }
|
||||
|
||||
// Returns seconds per code point.
|
||||
double measure(Operation &op, int32_t pieceLength) {
|
||||
// Increase the number of iterations until we use at least one second.
|
||||
int32_t iterations = 1;
|
||||
for (;;) {
|
||||
double seconds = op.call(iterations, pieceLength);
|
||||
if (seconds >= 1) {
|
||||
if (iterations > 1) {
|
||||
return seconds / (iterations * pieceLength);
|
||||
} else {
|
||||
// Run it once more, to avoid measuring only the warm-up.
|
||||
return op.call(1, pieceLength) / (iterations * pieceLength);
|
||||
}
|
||||
}
|
||||
if (seconds < 0.01) {
|
||||
iterations *= 10;
|
||||
} else if (seconds < 0.55) {
|
||||
iterations *= 1.1 / seconds;
|
||||
} else {
|
||||
iterations *= 2;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void benchmark(const char *name, Operation &op) {
|
||||
for (int32_t i = 0; i < UPRV_LENGTHOF(kLengths); ++i) {
|
||||
int32_t pieceLength = kLengths[i];
|
||||
double secPerCp = measure(op, pieceLength);
|
||||
printf("%s %6d %12f ns/cp\n", name, (int)pieceLength, secPerCp * 1000000000);
|
||||
}
|
||||
puts("");
|
||||
}
|
||||
|
||||
class NormalizeUTF16 : public Operation {
|
||||
public:
|
||||
NormalizeUTF16(const Normalizer2 &n2, const UnicodeString &text) :
|
||||
norm2(n2), src(text), s(src.getBuffer()) {}
|
||||
virtual ~NormalizeUTF16();
|
||||
virtual double call(int32_t iterations, int32_t pieceLength);
|
||||
|
||||
private:
|
||||
const Normalizer2 &norm2;
|
||||
UnicodeString src;
|
||||
const UChar *s;
|
||||
UnicodeString dest;
|
||||
};
|
||||
|
||||
NormalizeUTF16::~NormalizeUTF16() {}
|
||||
|
||||
// Assumes all BMP characters.
|
||||
double NormalizeUTF16::call(int32_t iterations, int32_t pieceLength) {
|
||||
int32_t start = 0;
|
||||
int32_t limit = src.length() - pieceLength;
|
||||
UnicodeString piece;
|
||||
UErrorCode errorCode = U_ZERO_ERROR;
|
||||
utimer_getTime(&startTime);
|
||||
for (int32_t i = 0; i < iterations; ++i) {
|
||||
piece.setTo(FALSE, s + start, pieceLength);
|
||||
norm2.normalize(piece, dest, errorCode);
|
||||
start = (start + pieceLength) % limit;
|
||||
}
|
||||
return utimer_getElapsedSeconds(&startTime);
|
||||
}
|
||||
|
||||
class NormalizeUTF8 : public Operation {
|
||||
public:
|
||||
NormalizeUTF8(const Normalizer2 &n2, const UnicodeString &text) : norm2(n2), sink(&dest) {
|
||||
offsets = CommonChars::toUTF8WithOffsets(text, src, numCodePoints);
|
||||
s = src.data();
|
||||
}
|
||||
virtual ~NormalizeUTF8();
|
||||
virtual double call(int32_t iterations, int32_t pieceLength);
|
||||
|
||||
private:
|
||||
const Normalizer2 &norm2;
|
||||
std::string src;
|
||||
const char *s;
|
||||
int32_t *offsets;
|
||||
int32_t numCodePoints;
|
||||
std::string dest;
|
||||
icu::StringByteSink<std::string> sink;
|
||||
};
|
||||
|
||||
NormalizeUTF8::~NormalizeUTF8() {
|
||||
delete[] offsets;
|
||||
}
|
||||
|
||||
double NormalizeUTF8::call(int32_t iterations, int32_t pieceLength) {
|
||||
int32_t start = 0;
|
||||
int32_t limit = numCodePoints - pieceLength;
|
||||
UErrorCode errorCode = U_ZERO_ERROR;
|
||||
utimer_getTime(&startTime);
|
||||
for (int32_t i = 0; i < iterations; ++i) {
|
||||
int32_t start8 = offsets[start];
|
||||
int32_t limit8 = offsets[start + pieceLength];
|
||||
icu::StringPiece piece(s + start8, limit8 - start8);
|
||||
norm2.normalizeUTF8(0, piece, sink, nullptr, errorCode);
|
||||
start = (start + pieceLength) % limit;
|
||||
}
|
||||
return utimer_getElapsedSeconds(&startTime);
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
extern int main(int /*argc*/, const char * /*argv*/[]) {
|
||||
// More than the longest piece length so that we read from different parts of the string
|
||||
// for that piece length.
|
||||
int32_t maxLength = getMaxLength() * 10;
|
||||
UErrorCode errorCode = U_ZERO_ERROR;
|
||||
const Normalizer2 *nfc = Normalizer2::getNFCInstance(errorCode);
|
||||
const Normalizer2 *nfkc_cf = Normalizer2::getNFKCCasefoldInstance(errorCode);
|
||||
if (U_FAILURE(errorCode)) {
|
||||
fprintf(stderr,
|
||||
"simplenormperf: failed to get Normalizer2 instances - %s\n",
|
||||
u_errorName(errorCode));
|
||||
}
|
||||
{
|
||||
// Base line: Should remain in the fast loop without trie lookups.
|
||||
NormalizeUTF16 op(*nfc, CommonChars::getLatin1(maxLength));
|
||||
benchmark("NFC/UTF-16/latin1", op);
|
||||
}
|
||||
{
|
||||
// Base line 2: Read UTF-8, trie lookups, but should have nothing to do.
|
||||
NormalizeUTF8 op(*nfc, CommonChars::getJapanese(maxLength));
|
||||
benchmark("NFC/UTF-8/japanese", op);
|
||||
}
|
||||
{
|
||||
NormalizeUTF16 op(*nfkc_cf, CommonChars::getMixed(maxLength));
|
||||
benchmark("NFKC_CF/UTF-16/mixed", op);
|
||||
}
|
||||
{
|
||||
NormalizeUTF16 op(*nfkc_cf, CommonChars::getLowercaseLatin1(maxLength));
|
||||
benchmark("NFKC_CF/UTF-16/lowercaseLatin1", op);
|
||||
}
|
||||
{
|
||||
NormalizeUTF16 op(*nfkc_cf, CommonChars::getJapanese(maxLength));
|
||||
benchmark("NFKC_CF/UTF-16/japanese", op);
|
||||
}
|
||||
{
|
||||
NormalizeUTF8 op(*nfkc_cf, CommonChars::getMixed(maxLength));
|
||||
benchmark("NFKC_CF/UTF-8/mixed", op);
|
||||
}
|
||||
{
|
||||
NormalizeUTF8 op(*nfkc_cf, CommonChars::getLowercaseLatin1(maxLength));
|
||||
benchmark("NFKC_CF/UTF-8/lowercaseLatin1", op);
|
||||
}
|
||||
{
|
||||
NormalizeUTF8 op(*nfkc_cf, CommonChars::getJapanese(maxLength));
|
||||
benchmark("NFKC_CF/UTF-8/japanese", op);
|
||||
}
|
||||
return 0;
|
||||
}
|
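The `NormalizeUTF8::call()` loop above looks up UTF-8 byte offsets by code point index and then advances a sliding window with `(start + pieceLength) % limit`. The following is a minimal sketch of that bookkeeping, assuming well-formed UTF-8; `buildOffsets` and `nextStart` are hypothetical names for illustration and are not part of simplenormperf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: offsets[i] is the byte offset of code point i;
// the final entry is the total byte length. Assumes well-formed UTF-8.
std::vector<int32_t> buildOffsets(const std::string &utf8) {
    std::vector<int32_t> offsets;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        // Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        if ((static_cast<unsigned char>(utf8[i]) & 0xc0) != 0x80) {
            offsets.push_back(static_cast<int32_t>(i));
        }
    }
    offsets.push_back(static_cast<int32_t>(utf8.size()));
    return offsets;
}

// Advances the window start the same way the benchmark loop does.
// Requires limit > 0, that is, more code points than pieceLength.
int32_t nextStart(int32_t start, int32_t pieceLength, int32_t limit) {
    return (start + pieceLength) % limit;
}
```

Because `start` always stays below `limit = numCodePoints - pieceLength`, the index `start + pieceLength` never runs past the last entry of the offsets table.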
16
icu4c/source/test/testdata/testnorm.txt
vendored
|
@@ -44,9 +44,10 @@
 0360..0361:234
 0362:233
 0363..036F:230
-D802:2 # surrogates with non-zero combining classes
-D803:3
-D804:4
+# ICU 63 normalization with UCPTrie requires inert surrogate code points.
+# D802:2 # surrogates with non-zero combining classes
+# D803:3
+# D804:4
 110B9:9
 110BA:7
 
|
@@ -58,10 +59,11 @@ D804:4
 00C4=0041 0308
 00C5=0041 030A
 00C7=0043 0327
-D800>D7FF # surrogates with mappings, and mappings to empty strings
-D801>
-DFFE>
-DFFF>FFFF
+# ICU 63 normalization with UCPTrie requires inert surrogate code points.
+# D800>D7FF # surrogates with mappings, and mappings to empty strings
+# D801>
+# DFFE>
+# DFFF>FFFF
 E000>
 E001=61 338 # composition with trail<=33FF and composite>7FFF
 E002=E001 308 # recursive mapping needs reordering
|
|
|
@@ -266,6 +266,11 @@ void parseFile(std::ifstream &f, Normalizer2DataBuilder &builder) {
             fprintf(stderr, "gennorm2 error: parsing code point range from %s\n", line);
             exit(errorCode.reset());
         }
+        if (endCP >= 0xd800 && startCP <= 0xdfff) {
+            fprintf(stderr, "gennorm2 error: value or mapping for surrogate code points: %s\n",
+                    line);
+            exit(U_ILLEGAL_ARGUMENT_ERROR);
+        }
         delimiter=u_skipWhitespace(delimiter);
         if(*delimiter==':') {
             const char *s=u_skipWhitespace(delimiter+1);
|
|
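The new parser check rejects any input range that intersects U+D800..U+DFFF, using the usual overlap test for two closed intervals. A minimal standalone sketch of that test follows; `overlapsSurrogates` is a hypothetical name for illustration and does not exist in gennorm2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of gennorm2: true if the closed range
// [startCP, endCP] contains at least one surrogate code point.
// Two closed intervals [a, b] and [c, d] intersect iff b >= c && a <= d.
constexpr bool overlapsSurrogates(int32_t startCP, int32_t endCP) {
    return endCP >= 0xd800 && startCP <= 0xdfff;
}
```

The same condition also catches a range that merely spans the surrogates, such as 0000..10FFFF, which a pair of endpoint checks would miss.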
|
@ -29,7 +29,9 @@
|
|||
#include "unicode/errorcode.h"
|
||||
#include "unicode/localpointer.h"
|
||||
#include "unicode/putil.h"
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "unicode/udata.h"
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#include "unicode/uniset.h"
|
||||
#include "unicode/unistr.h"
|
||||
#include "unicode/usetiter.h"
|
||||
|
@ -41,7 +43,6 @@
|
|||
#include "norms.h"
|
||||
#include "toolutil.h"
|
||||
#include "unewdata.h"
|
||||
#include "utrie2.h"
|
||||
#include "uvectr32.h"
|
||||
#include "writesrc.h"
|
||||
|
||||
|
@@ -58,8 +59,8 @@ static UDataInfo dataInfo={
     0,
 
     { 0x4e, 0x72, 0x6d, 0x32 }, /* dataFormat="Nrm2" */
-    { 3, 0, 0, 0 }, /* formatVersion */
-    { 10, 0, 0, 0 } /* dataVersion (Unicode version) */
+    { 4, 0, 0, 0 }, /* formatVersion */
+    { 11, 0, 0, 0 } /* dataVersion (Unicode version) */
 };
 
 U_NAMESPACE_BEGIN
|
@ -94,14 +95,14 @@ const HangulIterator::Range HangulIterator::ranges[4]={
|
|||
Normalizer2DataBuilder::Normalizer2DataBuilder(UErrorCode &errorCode) :
|
||||
norms(errorCode),
|
||||
phase(0), overrideHandling(OVERRIDE_PREVIOUS), optimization(OPTIMIZE_NORMAL),
|
||||
norm16Trie(nullptr), norm16TrieLength(0) {
|
||||
norm16TrieBytes(nullptr), norm16TrieLength(0) {
|
||||
memset(unicodeVersion, 0, sizeof(unicodeVersion));
|
||||
memset(indexes, 0, sizeof(indexes));
|
||||
memset(smallFCD, 0, sizeof(smallFCD));
|
||||
}
|
||||
|
||||
Normalizer2DataBuilder::~Normalizer2DataBuilder() {
|
||||
utrie2_close(norm16Trie);
|
||||
delete[] norm16TrieBytes;
|
||||
}
|
||||
|
||||
void
|
||||
|
@@ -407,11 +408,13 @@ void Normalizer2DataBuilder::postProcess(Norm &norm) {
 
 class Norm16Writer : public Norms::Enumerator {
 public:
-    Norm16Writer(Norms &n, Normalizer2DataBuilder &b) : Norms::Enumerator(n), builder(b) {}
+    Norm16Writer(UMutableCPTrie *trie, Norms &n, Normalizer2DataBuilder &b) :
+            Norms::Enumerator(n), builder(b), norm16Trie(trie) {}
     void rangeHandler(UChar32 start, UChar32 end, Norm &norm) U_OVERRIDE {
-        builder.writeNorm16(start, end, norm);
+        builder.writeNorm16(norm16Trie, start, end, norm);
     }
     Normalizer2DataBuilder &builder;
+    UMutableCPTrie *norm16Trie;
 };
 
 void Normalizer2DataBuilder::setSmallFCD(UChar32 c) {
|
@ -419,7 +422,7 @@ void Normalizer2DataBuilder::setSmallFCD(UChar32 c) {
|
|||
smallFCD[lead>>8]|=(uint8_t)1<<((lead>>5)&7);
|
||||
}
|
||||
|
||||
void Normalizer2DataBuilder::writeNorm16(UChar32 start, UChar32 end, Norm &norm) {
|
||||
void Normalizer2DataBuilder::writeNorm16(UMutableCPTrie *norm16Trie, UChar32 start, UChar32 end, Norm &norm) {
|
||||
if((norm.leadCC|norm.trailCC)!=0) {
|
||||
for(UChar32 c=start; c<=end; ++c) {
|
||||
setSmallFCD(c);
|
||||
|
@ -484,7 +487,7 @@ void Normalizer2DataBuilder::writeNorm16(UChar32 start, UChar32 end, Norm &norm)
|
|||
norm16|=Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER;
|
||||
}
|
||||
IcuToolErrorCode errorCode("gennorm2/writeNorm16()");
|
||||
utrie2_setRange32(norm16Trie, start, end, (uint32_t)norm16, TRUE, errorCode);
|
||||
umutablecptrie_setRange(norm16Trie, start, end, (uint32_t)norm16, errorCode);
|
||||
|
||||
// Set the minimum code points for real data lookups in the quick check loops.
|
||||
UBool isDecompNo=
|
||||
|
@ -502,13 +505,13 @@ void Normalizer2DataBuilder::writeNorm16(UChar32 start, UChar32 end, Norm &norm)
|
|||
}
|
||||
}
|
||||
|
||||
void Normalizer2DataBuilder::setHangulData() {
|
||||
void Normalizer2DataBuilder::setHangulData(UMutableCPTrie *norm16Trie) {
|
||||
HangulIterator hi;
|
||||
const HangulIterator::Range *range;
|
||||
// Check that none of the Hangul/Jamo code points have data.
|
||||
while((range=hi.nextRange())!=NULL) {
|
||||
for(UChar32 c=range->start; c<=range->end; ++c) {
|
||||
if(utrie2_get32(norm16Trie, c)>Normalizer2Impl::INERT) {
|
||||
if(umutablecptrie_get(norm16Trie, c)>Normalizer2Impl::INERT) {
|
||||
fprintf(stderr,
|
||||
"gennorm2 error: "
|
||||
"illegal mapping/composition/ccc data for Hangul or Jamo U+%04lX\n",
|
||||
|
@ -524,13 +527,13 @@ void Normalizer2DataBuilder::setHangulData() {
|
|||
if(Hangul::JAMO_V_BASE<indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]) {
|
||||
indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]=Hangul::JAMO_V_BASE;
|
||||
}
|
||||
utrie2_setRange32(norm16Trie, Hangul::JAMO_L_BASE, Hangul::JAMO_L_END,
|
||||
Normalizer2Impl::JAMO_L, TRUE, errorCode);
|
||||
utrie2_setRange32(norm16Trie, Hangul::JAMO_V_BASE, Hangul::JAMO_V_END,
|
||||
Normalizer2Impl::JAMO_VT, TRUE, errorCode);
|
||||
umutablecptrie_setRange(norm16Trie, Hangul::JAMO_L_BASE, Hangul::JAMO_L_END,
|
||||
Normalizer2Impl::JAMO_L, errorCode);
|
||||
umutablecptrie_setRange(norm16Trie, Hangul::JAMO_V_BASE, Hangul::JAMO_V_END,
|
||||
Normalizer2Impl::JAMO_VT, errorCode);
|
||||
// JAMO_T_BASE+1: not U+11A7
|
||||
utrie2_setRange32(norm16Trie, Hangul::JAMO_T_BASE+1, Hangul::JAMO_T_END,
|
||||
Normalizer2Impl::JAMO_VT, TRUE, errorCode);
|
||||
umutablecptrie_setRange(norm16Trie, Hangul::JAMO_T_BASE+1, Hangul::JAMO_T_END,
|
||||
Normalizer2Impl::JAMO_VT, errorCode);
|
||||
|
||||
// Hangul LV encoded as minYesNo
|
||||
uint32_t lv=indexes[Normalizer2Impl::IX_MIN_YES_NO];
|
||||
|
@@ -542,49 +545,16 @@ void Normalizer2DataBuilder::setHangulData() {
     }
     // Set the first LV, then write all other Hangul syllables as LVT,
     // then overwrite the remaining LV.
-    // The UTrie2 should be able to compact this into 7 32-item blocks
-    // because JAMO_T_COUNT is 28 and the UTrie2 granularity is 4.
-    // (7*32=8*28 smallest common multiple)
-    utrie2_set32(norm16Trie, Hangul::HANGUL_BASE, lv, errorCode);
-    utrie2_setRange32(norm16Trie, Hangul::HANGUL_BASE+1, Hangul::HANGUL_END,
-                      lvt, TRUE, errorCode);
+    umutablecptrie_set(norm16Trie, Hangul::HANGUL_BASE, lv, errorCode);
+    umutablecptrie_setRange(norm16Trie, Hangul::HANGUL_BASE+1, Hangul::HANGUL_END, lvt, errorCode);
     UChar32 c=Hangul::HANGUL_BASE;
     while((c+=Hangul::JAMO_T_COUNT)<=Hangul::HANGUL_END) {
-        utrie2_set32(norm16Trie, c, lv, errorCode);
+        umutablecptrie_set(norm16Trie, c, lv, errorCode);
     }
     errorCode.assertSuccess();
 }
 
-namespace {
-
-struct Norm16Summary {
-    uint32_t maxNorm16;
-    // ANDing values yields 0 bits where any value has a 0.
-    // Used for worst-case HAS_COMP_BOUNDARY_AFTER.
-    uint32_t andedNorm16;
-};
-
-}  // namespace
-
-U_CDECL_BEGIN
-
-static UBool U_CALLCONV
-enumRangeMaxValue(const void *context, UChar32 /*start*/, UChar32 /*end*/, uint32_t value) {
-    Norm16Summary *p=(Norm16Summary *)context;
-    if(value>p->maxNorm16) {
-        p->maxNorm16=value;
-    }
-    p->andedNorm16&=value;
-    return TRUE;
-}
-
-U_CDECL_END
-
-void Normalizer2DataBuilder::processData() {
-    IcuToolErrorCode errorCode("gennorm2/processData()");
-    norm16Trie=utrie2_open(Normalizer2Impl::INERT, Normalizer2Impl::INERT, errorCode);
-    errorCode.assertSuccess();
-
+LocalUCPTriePointer Normalizer2DataBuilder::processData() {
     // Build composition lists before recursive decomposition,
     // so that we still have the raw, pair-wise mappings.
     CompositionBuilder compBuilder(norms);
|
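The Hangul step in the hunk above writes one LV value, fills the rest of the syllable block with the LVT value, and then overwrites every 28th code point with LV again. A minimal sketch of the resulting layout, using a plain vector instead of a trie; `fillHangul` is a hypothetical helper, and the constants are the standard Hangul syllable block bounds (U+AC00..U+D7A3) and the trailing-consonant count of 28.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int32_t HANGUL_BASE = 0xac00;
constexpr int32_t HANGUL_END = 0xd7a3;
constexpr int32_t JAMO_T_COUNT = 28;

// Hypothetical helper: returns one value per Hangul syllable.
// Every syllable starts as LVT; each syllable at a multiple of
// JAMO_T_COUNT from the base has no trailing consonant and gets LV.
std::vector<uint32_t> fillHangul(uint32_t lv, uint32_t lvt) {
    std::vector<uint32_t> values(HANGUL_END - HANGUL_BASE + 1, lvt);
    for (int32_t c = HANGUL_BASE; c <= HANGUL_END; c += JAMO_T_COUNT) {
        values[c - HANGUL_BASE] = lv;
    }
    return values;
}
```

Writing the large LVT range first and then patching the LV positions needs one range call plus 399 single sets, rather than 399 short range calls.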
@@ -652,13 +622,19 @@ void Normalizer2DataBuilder::processData() {
     indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]=0x110000;
     indexes[Normalizer2Impl::IX_MIN_LCCC_CP]=0x110000;
 
+    IcuToolErrorCode errorCode("gennorm2/processData()");
+    UMutableCPTrie *norm16Trie = umutablecptrie_open(
+        Normalizer2Impl::INERT, Normalizer2Impl::INERT, errorCode);
+    errorCode.assertSuccess();
+
     // Map each code point to its norm16 value,
     // including the properties that fit directly,
     // and the offset to the "extra data" if necessary.
-    Norm16Writer norm16Writer(norms, *this);
+    Norm16Writer norm16Writer(norm16Trie, norms, *this);
     norms.enumRanges(norm16Writer);
+    // TODO: iterate via getRange() instead of callback?
 
-    setHangulData();
+    setHangulData(norm16Trie);
 
     // Look for the "worst" norm16 value of any supplementary code point
     // corresponding to a lead surrogate, and set it as that surrogate's value.
|
@@ -670,22 +646,63 @@ void Normalizer2DataBuilder::processData() {
     // and select the best value that only breaks the composition and/or decomposition
     // inner loops if necessary.
     // However, that seems like overkill for an optimization for supplementary characters.
-    for(UChar lead=0xd800; lead<0xdc00; ++lead) {
-        uint32_t surrogateCPNorm16=utrie2_get32(norm16Trie, lead);
-        Norm16Summary summary={ surrogateCPNorm16, surrogateCPNorm16 };
-        utrie2_enumForLeadSurrogate(norm16Trie, lead, NULL, enumRangeMaxValue, &summary);
-        uint32_t norm16=summary.maxNorm16;
-        if(norm16>=(uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO] &&
-                norm16>(uint32_t)indexes[Normalizer2Impl::IX_MIN_NO_NO]) {
-            // Set noNo ("worst" value) if it got into "less-bad" maybeYes or ccc!=0.
-            // Otherwise it might end up at something like JAMO_VT which stays in
-            // the inner decomposition quick check loop.
-            norm16=(uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO]-1;
+    //
+    // First check that surrogate code *points* are inert.
+    // The parser should have rejected values/mappings for them.
+    uint32_t value;
+    UChar32 end = umutablecptrie_getRange(norm16Trie, 0xd800, UCPTRIE_RANGE_NORMAL, 0,
+                                          nullptr, nullptr, &value);
+    if (value != Normalizer2Impl::INERT || end < 0xdfff) {
+        fprintf(stderr,
+                "gennorm2 error: not all surrogate code points are inert: U+d800..U+%04x=%lx\n",
+                (int)end, (long)value);
+        exit(U_INTERNAL_PROGRAM_ERROR);
+    }
+    uint32_t maxNorm16 = 0;
+    // ANDing values yields 0 bits where any value has a 0.
+    // Used for worst-case HAS_COMP_BOUNDARY_AFTER.
+    uint32_t andedNorm16 = 0;
+    end = 0;
+    for (UChar32 start = 0x10000;;) {
+        if (start > end) {
+            end = umutablecptrie_getRange(norm16Trie, start, UCPTRIE_RANGE_NORMAL, 0,
+                                          nullptr, nullptr, &value);
+            if (end < 0) { break; }
+        }
+        if ((start & 0x3ff) == 0) {
+            // Data for a new lead surrogate.
+            maxNorm16 = andedNorm16 = value;
+        } else {
+            if (value > maxNorm16) {
+                maxNorm16 = value;
+            }
+            andedNorm16 &= value;
+        }
+        // Intersect each range with the code points for one lead surrogate.
+        UChar32 leadEnd = start | 0x3ff;
+        if (leadEnd <= end) {
+            // End of the supplementary block for a lead surrogate.
+            if (maxNorm16 >= (uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO]) {
+                // Set noNo ("worst" value) if it got into "less-bad" maybeYes or ccc!=0.
+                // Otherwise it might end up at something like JAMO_VT which stays in
+                // the inner decomposition quick check loop.
+                maxNorm16 = (uint32_t)indexes[Normalizer2Impl::IX_LIMIT_NO_NO];
+            }
+            maxNorm16 =
+                (maxNorm16 & ~Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER)|
+                (andedNorm16 & Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER);
+            if (maxNorm16 != Normalizer2Impl::INERT) {
+                umutablecptrie_set(norm16Trie, U16_LEAD(start), maxNorm16, errorCode);
+            }
+            if (value == Normalizer2Impl::INERT) {
+                // Potentially skip inert supplementary blocks for several lead surrogates.
+                start = (end + 1) & ~0x3ff;
+            } else {
+                start = leadEnd + 1;
+            }
+        } else {
+            start = end + 1;
         }
-        norm16=
-            (norm16&~Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER)|
-            (summary.andedNorm16&Normalizer2Impl::HAS_COMP_BOUNDARY_AFTER);
-        utrie2_set32ForLeadSurrogateCodeUnit(norm16Trie, lead, norm16, errorCode);
     }
 
     // Adjust supplementary minimum code points to break quick check loops at their lead surrogates.
|
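The rewritten loop in the hunk above walks the value ranges of the supplementary code points and intersects each range with the 1024-code-point block that belongs to one lead surrogate, tracking the maximum and the bitwise AND of the values per block. The following is a simplified sketch of that intersection under stated assumptions: ranges are given as a sorted, gap-free vector starting at U+10000, and the inert-block skipping and noNo clamping of the real code are left out. `Range`, `Summary` and `summarizePerLead` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One run of code points sharing a value; runs are contiguous and the
// first one is assumed to start at U+10000.
struct Range { int32_t end; uint32_t value; };

struct Summary { uint32_t maxValue; uint32_t andedValue; };

// UTF-16 lead surrogate of a supplementary code point.
inline uint16_t leadSurrogate(int32_t c) {
    return static_cast<uint16_t>((c >> 10) + 0xd7c0);
}

std::map<uint16_t, Summary> summarizePerLead(const std::vector<Range> &ranges) {
    std::map<uint16_t, Summary> result;
    int32_t start = 0x10000;
    for (const Range &r : ranges) {
        while (start <= r.end) {
            uint16_t lead = leadSurrogate(start);
            auto it = result.find(lead);
            if (it == result.end()) {
                // First value seen for this lead surrogate.
                result[lead] = Summary{r.value, r.value};
            } else {
                if (r.value > it->second.maxValue) { it->second.maxValue = r.value; }
                it->second.andedValue &= r.value;
            }
            // Clip the range to the block of this lead surrogate.
            int32_t leadEnd = start | 0x3ff;
            start = (leadEnd < r.end ? leadEnd : r.end) + 1;
        }
    }
    return result;
}
```

The AND accumulates a worst-case view of a flag bit: the bit survives only if every code point under the lead surrogate has it set.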
@@ -705,14 +722,19 @@ void Normalizer2DataBuilder::processData() {
         indexes[Normalizer2Impl::IX_MIN_LCCC_CP]=U16_LEAD(minCP);
     }
 
-    utrie2_freeze(norm16Trie, UTRIE2_16_VALUE_BITS, errorCode);
-    norm16TrieLength=utrie2_serialize(norm16Trie, NULL, 0, errorCode);
+    LocalUCPTriePointer builtTrie(
+        umutablecptrie_buildImmutable(norm16Trie, UCPTRIE_TYPE_FAST, UCPTRIE_VALUE_BITS_16, errorCode));
+    norm16TrieLength=ucptrie_toBinary(builtTrie.getAlias(), nullptr, 0, errorCode);
     if(errorCode.get()!=U_BUFFER_OVERFLOW_ERROR) {
-        fprintf(stderr, "gennorm2 error: unable to freeze/serialize the normalization trie - %s\n",
+        fprintf(stderr, "gennorm2 error: unable to build/serialize the normalization trie - %s\n",
                 errorCode.errorName());
         exit(errorCode.reset());
     }
+    umutablecptrie_close(norm16Trie);
     errorCode.reset();
+    norm16TrieBytes=new uint8_t[norm16TrieLength];
+    ucptrie_toBinary(builtTrie.getAlias(), norm16TrieBytes, norm16TrieLength, errorCode);
+    errorCode.assertSuccess();
 
     int32_t offset=(int32_t)sizeof(indexes);
     indexes[Normalizer2Impl::IX_NORM_TRIE_OFFSET]=offset;
|
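The serialization code in the hunk above uses a preflight pattern: the first call passes a null buffer with capacity 0 and is expected to report a buffer overflow together with the required length, and the second call fills a buffer of exactly that size. The sketch below imitates the pattern with a hypothetical `toBinary` stand-in and a made-up `Status` enum; it is not the ICU API and only illustrates the call sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

enum class Status { Ok, BufferOverflow };

// Hypothetical stand-in for a serializer that supports preflighting:
// it always returns the required length and copies only if the
// destination capacity is sufficient.
int32_t toBinary(const std::vector<uint8_t> &src, uint8_t *dest, int32_t capacity,
                 Status &status) {
    int32_t length = static_cast<int32_t>(src.size());
    if (capacity < length) {
        status = Status::BufferOverflow;
    } else {
        if (length > 0) { std::memcpy(dest, src.data(), src.size()); }
        status = Status::Ok;
    }
    return length;
}

// Preflight with no buffer, then allocate exactly and serialize for real.
std::vector<uint8_t> serialize(const std::vector<uint8_t> &src) {
    Status status = Status::Ok;
    int32_t length = toBinary(src, nullptr, 0, status);  // preflight
    std::vector<uint8_t> bytes(static_cast<std::size_t>(length));
    toBinary(src, bytes.data(), length, status);
    assert(status == Status::Ok);
    return bytes;
}
```

As in the hunk, the overflow reported by the preflight call is the expected outcome for non-empty data, so the caller clears it before the second call instead of treating it as a failure.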
@ -750,16 +772,13 @@ void Normalizer2DataBuilder::processData() {
|
|||
u_versionFromString(unicodeVersion, U_UNICODE_VERSION);
|
||||
}
|
||||
memcpy(dataInfo.dataVersion, unicodeVersion, 4);
|
||||
return builtTrie;
|
||||
}
|
||||
|
||||
void Normalizer2DataBuilder::writeBinaryFile(const char *filename) {
|
||||
processData();
|
||||
|
||||
IcuToolErrorCode errorCode("gennorm2/writeBinaryFile()");
|
||||
LocalArray<uint8_t> norm16TrieBytes(new uint8_t[norm16TrieLength]);
|
||||
utrie2_serialize(norm16Trie, norm16TrieBytes.getAlias(), norm16TrieLength, errorCode);
|
||||
errorCode.assertSuccess();
|
||||
|
||||
UNewDataMemory *pData=
|
||||
udata_create(NULL, NULL, filename, &dataInfo,
|
||||
haveCopyright ? U_COPYRIGHT_STRING : NULL, errorCode);
|
||||
|
@ -769,7 +788,7 @@ void Normalizer2DataBuilder::writeBinaryFile(const char *filename) {
|
|||
exit(errorCode.reset());
|
||||
}
|
||||
udata_writeBlock(pData, indexes, sizeof(indexes));
|
||||
udata_writeBlock(pData, norm16TrieBytes.getAlias(), norm16TrieLength);
|
||||
udata_writeBlock(pData, norm16TrieBytes, norm16TrieLength);
|
||||
udata_writeUString(pData, toUCharPtr(extraData.getBuffer()), extraData.length());
|
||||
udata_writeBlock(pData, smallFCD, sizeof(smallFCD));
|
||||
int32_t writtenSize=udata_finish(pData, errorCode);
|
||||
|
@ -787,7 +806,7 @@ void Normalizer2DataBuilder::writeBinaryFile(const char *filename) {
|
|||
|
||||
void
|
||||
Normalizer2DataBuilder::writeCSourceFile(const char *filename) {
|
||||
processData();
|
||||
LocalUCPTriePointer norm16Trie = processData();
|
||||
|
||||
IcuToolErrorCode errorCode("gennorm2/writeCSourceFile()");
|
||||
const char *basename=findBasename(filename);
|
||||
|
@ -797,10 +816,7 @@ Normalizer2DataBuilder::writeCSourceFile(const char *filename) {
|
|||
if(extension!=NULL) {
|
||||
dataName.truncate((int32_t)(extension-basename));
|
||||
}
|
||||
errorCode.assertSuccess();
|
||||
|
||||
LocalArray<uint8_t> norm16TrieBytes(new uint8_t[norm16TrieLength]);
|
||||
utrie2_serialize(norm16Trie, norm16TrieBytes.getAlias(), norm16TrieLength, errorCode);
|
||||
const char *name=dataName.data();
|
||||
errorCode.assertSuccess();
|
||||
|
||||
FILE *f=usrc_create(path.data(), basename, "icu/source/tools/gennorm2/n2builder.cpp");
|
||||
|
@@ -808,43 +824,31 @@ Normalizer2DataBuilder::writeCSourceFile(const char *filename) {
         fprintf(stderr, "gennorm2/writeCSourceFile() error: unable to create the output file %s\n",
                 filename);
         exit(U_FILE_ACCESS_ERROR);
-        return;
     }
     fputs("#ifdef INCLUDED_FROM_NORMALIZER2_CPP\n\n", f);
-    char line[100];
-    sprintf(line, "static const UVersionInfo %s_formatVersion={", dataName.data());
+
+    char line[100], line2[100], line3[100];
+    sprintf(line, "static const UVersionInfo %s_formatVersion={", name);
     usrc_writeArray(f, line, dataInfo.formatVersion, 8, 4, "};\n");
-    sprintf(line, "static const UVersionInfo %s_dataVersion={", dataName.data());
+    sprintf(line, "static const UVersionInfo %s_dataVersion={", name);
     usrc_writeArray(f, line, dataInfo.dataVersion, 8, 4, "};\n\n");
-    sprintf(line, "static const int32_t %s_indexes[Normalizer2Impl::IX_COUNT]={\n",
-            dataName.data());
-    usrc_writeArray(f,
-        line,
-        indexes, 32, Normalizer2Impl::IX_COUNT,
-        "\n};\n\n");
-    sprintf(line, "static const uint16_t %s_trieIndex[%%ld]={\n", dataName.data());
-    usrc_writeUTrie2Arrays(f,
-        line, NULL,
-        norm16Trie,
-        "\n};\n\n");
-    sprintf(line, "static const uint16_t %s_extraData[%%ld]={\n", dataName.data());
-    usrc_writeArray(f,
-        line,
-        extraData.getBuffer(), 16, extraData.length(),
-        "\n};\n\n");
-    sprintf(line, "static const uint8_t %s_smallFCD[%%ld]={\n", dataName.data());
-    usrc_writeArray(f,
-        line,
-        smallFCD, 8, sizeof(smallFCD),
-        "\n};\n\n");
-    sprintf(line, "static const UTrie2 %s_trie={\n", dataName.data());
-    char line2[100];
-    sprintf(line2, "%s_trieIndex", dataName.data());
-    usrc_writeUTrie2Struct(f,
-        line,
-        norm16Trie, line2, NULL,
-        "};\n");
-    fputs("\n#endif  // INCLUDED_FROM_NORMALIZER2_CPP\n", f);
+    sprintf(line, "static const int32_t %s_indexes[Normalizer2Impl::IX_COUNT]={\n", name);
+    usrc_writeArray(f, line, indexes, 32, Normalizer2Impl::IX_COUNT, "\n};\n\n");
+
+    sprintf(line, "static const uint16_t %s_trieIndex[%%ld]={\n", name);
+    sprintf(line2, "static const uint16_t %s_trieData[%%ld]={\n", name);
+    usrc_writeUCPTrieArrays(f, line, line2, norm16Trie.getAlias(), "\n};\n\n");
+    sprintf(line, "static const UCPTrie %s_trie={\n", name);
+    sprintf(line2, "%s_trieIndex", name);
+    sprintf(line3, "%s_trieData", name);
+    usrc_writeUCPTrieStruct(f, line, norm16Trie.getAlias(), line2, line3, "};\n\n");
+
+    sprintf(line, "static const uint16_t %s_extraData[%%ld]={\n", name);
+    usrc_writeArray(f, line, extraData.getBuffer(), 16, extraData.length(), "\n};\n\n");
+    sprintf(line, "static const uint8_t %s_smallFCD[%%ld]={\n", name);
+    usrc_writeArray(f, line, smallFCD, 8, sizeof(smallFCD), "\n};\n\n");
+
+    fputs("#endif  // INCLUDED_FROM_NORMALIZER2_CPP\n", f);
     fclose(f);
 }
 
|
|
|
@ -24,10 +24,10 @@
|
|||
#if !UCONFIG_NO_NORMALIZATION
|
||||
|
||||
#include "unicode/errorcode.h"
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#include "unicode/unistr.h"
|
||||
#include "normalizer2impl.h" // for IX_COUNT
|
||||
#include "toolutil.h"
|
||||
#include "utrie2.h"
|
||||
#include "norms.h"
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
@ -95,9 +95,9 @@ private:
|
|||
return indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]-
|
||||
((2*Normalizer2Impl::MAX_DELTA+1)<<Normalizer2Impl::DELTA_SHIFT);
|
||||
}
|
||||
void writeNorm16(UChar32 start, UChar32 end, Norm &norm);
|
||||
void setHangulData();
|
||||
void processData();
|
||||
void writeNorm16(UMutableCPTrie *norm16Trie, UChar32 start, UChar32 end, Norm &norm);
|
||||
void setHangulData(UMutableCPTrie *norm16Trie);
|
||||
LocalUCPTriePointer processData();
|
||||
|
||||
Norms norms;
|
||||
|
||||
|
@ -107,7 +107,7 @@ private:
|
|||
Optimization optimization;
|
||||
|
||||
int32_t indexes[Normalizer2Impl::IX_COUNT];
|
||||
UTrie2 *norm16Trie;
|
||||
uint8_t *norm16TrieBytes;
|
||||
int32_t norm16TrieLength;
|
||||
UnicodeString extraData;
|
||||
uint8_t smallFCD[0x100];
|
||||
|
|
|
@ -12,12 +12,12 @@
|
|||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include "unicode/errorcode.h"
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#include "unicode/unistr.h"
|
||||
#include "unicode/utf16.h"
|
||||
#include "normalizer2impl.h"
|
||||
#include "norms.h"
|
||||
#include "toolutil.h"
|
||||
#include "utrie2.h"
|
||||
#include "uvectr32.h"
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
@ -67,7 +67,7 @@ UChar32 Norm::combine(UChar32 trail) const {
|
|||
}
|
||||
|
||||
Norms::Norms(UErrorCode &errorCode) {
|
||||
normTrie=utrie2_open(0, 0, &errorCode);
|
||||
normTrie = umutablecptrie_open(0, 0, &errorCode);
|
||||
normMem=utm_open("gennorm2 normalization structs", 10000, 0x110100, sizeof(Norm));
|
||||
// Default "inert" Norm struct at index 0. Practically immutable.
|
||||
norms=allocNorm();
|
||||
|
@ -75,7 +75,7 @@ Norms::Norms(UErrorCode &errorCode) {
|
|||
}
|
||||
|
||||
Norms::~Norms() {
|
||||
utrie2_close(normTrie);
|
||||
umutablecptrie_close(normTrie);
|
||||
int32_t normsLength=utm_countItems(normMem);
|
||||
for(int32_t i=1; i<normsLength; ++i) {
|
||||
delete norms[i].mapping;
|
||||
|
@ -92,7 +92,7 @@ Norm *Norms::allocNorm() {
|
|||
}
|
||||
|
||||
Norm *Norms::getNorm(UChar32 c) {
|
||||
uint32_t i=utrie2_get32(normTrie, c);
|
||||
uint32_t i = umutablecptrie_get(normTrie, c);
|
||||
if(i==0) {
|
||||
return nullptr;
|
||||
}
|
||||
|
@ -100,7 +100,7 @@ Norm *Norms::getNorm(UChar32 c) {
|
|||
}
|
||||
|
||||
const Norm *Norms::getNorm(UChar32 c) const {
|
||||
uint32_t i=utrie2_get32(normTrie, c);
|
||||
uint32_t i = umutablecptrie_get(normTrie, c);
|
||||
if(i==0) {
|
||||
return nullptr;
|
||||
}
|
||||
|
@ -108,18 +108,18 @@ const Norm *Norms::getNorm(UChar32 c) const {
|
|||
}
|
||||
|
||||
const Norm &Norms::getNormRef(UChar32 c) const {
|
||||
return norms[utrie2_get32(normTrie, c)];
|
||||
return norms[umutablecptrie_get(normTrie, c)];
|
||||
}
|
||||
|
||||
Norm *Norms::createNorm(UChar32 c) {
|
||||
uint32_t i=utrie2_get32(normTrie, c);
|
||||
uint32_t i=umutablecptrie_get(normTrie, c);
|
||||
if(i!=0) {
|
||||
return norms+i;
|
||||
} else {
|
||||
/* allocate Norm */
|
||||
Norm *p=allocNorm();
|
||||
IcuToolErrorCode errorCode("gennorm2/createNorm()");
|
||||
utrie2_set32(normTrie, c, (uint32_t)(p-norms), errorCode);
|
||||
umutablecptrie_set(normTrie, c, (uint32_t)(p - norms), errorCode);
|
||||
return p;
|
||||
}
|
||||
}
|
||||
|
@@ -153,28 +153,20 @@ UBool Norms::combinesWithCCBetween(const Norm &norm, uint8_t lowCC, int32_t high
     return FALSE;
 }
 
-U_CDECL_BEGIN
-
-static UBool U_CALLCONV
-enumRangeHandler(const void *context, UChar32 start, UChar32 end, uint32_t value) {
-    return ((Norms::Enumerator *)context)->rangeHandler(start, end, value);
-}
-
-U_CDECL_END
-
 void Norms::enumRanges(Enumerator &e) {
-    utrie2_enum(normTrie, nullptr, enumRangeHandler, &e);
+    UChar32 start = 0, end;
+    uint32_t i;
+    while ((end = umutablecptrie_getRange(normTrie, start, UCPTRIE_RANGE_NORMAL, 0,
+                                          nullptr, nullptr, &i)) >= 0) {
+        if (i > 0) {
+            e.rangeHandler(start, end, norms[i]);
+        }
+        start = end + 1;
+    }
 }
 
 Norms::Enumerator::~Enumerator() {}
 
-UBool Norms::Enumerator::rangeHandler(UChar32 start, UChar32 end, uint32_t value) {
-    if(value!=0) {
-        rangeHandler(start, end, norms.getNormRefByIndex(value));
-    }
-    return TRUE;
-}
-
 void CompositionBuilder::rangeHandler(UChar32 start, UChar32 end, Norm &norm) {
     if(norm.mappingType!=Norm::ROUND_TRIP) { return; }
     if(start!=end) {
|
|
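The new `Norms::enumRanges()` replaces a C callback with a pull-style loop: ask for the range that begins at `start`, handle it, and continue at `end + 1` until the lookup reports a negative end. A minimal sketch of the same loop shape over a toy array-backed map; `ToyMap`, `Run` and `nonZeroRuns` are hypothetical and only mimic the "take a start, return the end of the range" contract.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a code point map, indexed from 0.
struct ToyMap {
    std::vector<uint32_t> values;

    // Returns the last index of the run of equal values beginning at start
    // and stores that value; returns -1 if start is out of range.
    int32_t getRange(int32_t start, uint32_t &value) const {
        int32_t size = static_cast<int32_t>(values.size());
        if (start < 0 || start >= size) { return -1; }
        value = values[start];
        int32_t end = start;
        while (end + 1 < size && values[end + 1] == value) { ++end; }
        return end;
    }
};

struct Run { int32_t start; int32_t end; uint32_t value; };

// Collects all runs with a non-zero value, skipping the default value 0.
std::vector<Run> nonZeroRuns(const ToyMap &map) {
    std::vector<Run> runs;
    int32_t start = 0, end;
    uint32_t value = 0;
    while ((end = map.getRange(start, value)) >= 0) {
        if (value != 0) { runs.push_back(Run{start, end, value}); }
        start = end + 1;
    }
    return runs;
}
```

With this shape the caller keeps its own loop state, so no `void *context` cast or C-linkage trampoline is needed.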
|
@ -15,12 +15,12 @@
|
|||
#if !UCONFIG_NO_NORMALIZATION
|
||||
|
||||
#include "unicode/errorcode.h"
|
||||
#include "unicode/umutablecptrie.h"
|
||||
#include "unicode/uniset.h"
|
||||
#include "unicode/unistr.h"
|
||||
#include "unicode/utf16.h"
|
||||
#include "normalizer2impl.h"
|
||||
#include "toolutil.h"
|
||||
#include "utrie2.h"
|
||||
#include "uvectr32.h"
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
@ -176,8 +176,6 @@ public:
|
|||
virtual ~Enumerator();
|
||||
/** Called for enumerated value!=0. */
|
||||
virtual void rangeHandler(UChar32 start, UChar32 end, Norm &norm) = 0;
|
||||
/** @internal Public only for C callback. */
|
||||
UBool rangeHandler(UChar32 start, UChar32 end, uint32_t value);
|
||||
protected:
|
||||
Norms &norms;
|
||||
};
|
||||
|
@ -190,7 +188,7 @@ private:
|
|||
Norms(const Norms &other) = delete;
|
||||
Norms &operator=(const Norms &other) = delete;
|
||||
|
||||
UTrie2 *normTrie;
|
||||
UMutableCPTrie *normTrie;
|
||||
UToolMemory *normMem;
|
||||
Norm *norms;
|
||||
};
|
||||
|
|
|
@ -1018,6 +1018,11 @@ addCollation(ParseState* state, TableResource *result, const char *collationTyp
|
|||
icu::CollationInfo::printReorderRanges(
|
||||
*t->data, t->settings->reorderCodes, t->settings->reorderCodesLength);
|
||||
}
|
||||
#if 0 // debugging output
|
||||
} else {
|
||||
printf("%s~%s collation tailoring part sizes:\n", state->filename, collationType);
|
||||
icu::CollationInfo::printSizes(totalSize, indexes);
|
||||
#endif
|
||||
}
|
||||
struct SResource *collationBin = bin_open(state->bundle, "%%CollationBin", totalSize, dest, NULL, NULL, status);
|
||||
result->add(collationBin, line, *status);
|
||||
|
|
|
@ -243,7 +243,7 @@ uprops_swap(const UDataSwapper *ds,
|
|||
* swap the main properties UTrie
|
||||
* PT serialized properties trie, see utrie.h (byte size: 4*(i0-16))
|
||||
*/
|
||||
utrie2_swapAnyVersion(ds,
|
||||
utrie_swapAnyVersion(ds,
|
||||
inData32+UPROPS_INDEX_COUNT,
|
||||
4*(dataIndexes[UPROPS_PROPS32_INDEX]-UPROPS_INDEX_COUNT),
|
||||
outData32+UPROPS_INDEX_COUNT,
|
||||
|
@ -274,7 +274,7 @@ uprops_swap(const UDataSwapper *ds,
|
|||
* swap the additional UTrie
|
||||
* i3 additionalTrieIndex; -- 32-bit unit index to the additional trie for more properties
|
||||
*/
|
||||
utrie2_swapAnyVersion(ds,
|
||||
utrie_swapAnyVersion(ds,
|
||||
inData32+dataIndexes[UPROPS_ADDITIONAL_TRIE_INDEX],
|
||||
4*(dataIndexes[UPROPS_ADDITIONAL_VECTORS_INDEX]-dataIndexes[UPROPS_ADDITIONAL_TRIE_INDEX]),
|
||||
outData32+dataIndexes[UPROPS_ADDITIONAL_TRIE_INDEX],
|
||||
|
@ -391,7 +391,7 @@ ucase_swap(const UDataSwapper *ds,
|
|||
|
||||
/* swap the UTrie */
|
||||
count=indexes[UCASE_IX_TRIE_SIZE];
|
||||
utrie2_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
|
||||
utrie_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
|
||||
offset+=count;
|
||||
|
||||
/* swap the uint16_t exceptions[] and unfold[] */
|
||||
|
@ -493,7 +493,7 @@ ubidi_swap(const UDataSwapper *ds,
|
|||
|
||||
/* swap the UTrie */
|
||||
count=indexes[UBIDI_IX_TRIE_SIZE];
|
||||
utrie2_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
|
||||
utrie_swapAnyVersion(ds, inBytes+offset, count, outBytes+offset, pErrorCode);
|
||||
offset+=count;
|
||||
|
||||
/* swap the uint32_t mirrors[] */
|
||||
|
|
|
@ -22,6 +22,7 @@
|
|||
#include <time.h>
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/putil.h"
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "utrie2.h"
|
||||
#include "cstring.h"
|
||||
#include "writesrc.h"
|
||||
|
@ -228,6 +229,52 @@ usrc_writeUTrie2Struct(FILE *f,
|
|||
}
|
||||
}
|
||||
|
||||
U_CAPI void U_EXPORT2
|
||||
usrc_writeUCPTrieArrays(FILE *f,
|
||||
const char *indexPrefix, const char *dataPrefix,
|
||||
const UCPTrie *pTrie,
|
||||
const char *postfix) {
|
||||
usrc_writeArray(f, indexPrefix, pTrie->index, 16, pTrie->indexLength, postfix);
|
||||
int32_t width=
|
||||
pTrie->valueWidth==UCPTRIE_VALUE_BITS_16 ? 16 :
|
||||
pTrie->valueWidth==UCPTRIE_VALUE_BITS_32 ? 32 :
|
||||
pTrie->valueWidth==UCPTRIE_VALUE_BITS_8 ? 8 : 0;
|
||||
usrc_writeArray(f, dataPrefix, pTrie->data.ptr0, width, pTrie->dataLength, postfix);
|
||||
}
|
||||
|
||||
U_CAPI void U_EXPORT2
|
||||
usrc_writeUCPTrieStruct(FILE *f,
|
||||
const char *prefix,
|
||||
const UCPTrie *pTrie,
|
||||
const char *indexName, const char *dataName,
|
||||
const char *postfix) {
|
||||
if(prefix!=NULL) {
|
||||
fputs(prefix, f);
|
||||
}
|
||||
fprintf(
|
||||
f,
|
||||
" %s,\n" // index
|
||||
" { %s },\n", // data (union)
|
||||
indexName,
|
||||
dataName);
|
||||
fprintf(
|
||||
f,
|
||||
" %ld, %ld,\n" // indexLength, dataLength
|
||||
" 0x%lx, 0x%x,\n" // highStart, shifted12HighStart
|
||||
" %d, %d,\n" // type, valueWidth
|
||||
" 0, 0,\n" // reserved32, reserved16
|
||||
" 0x%x, 0x%lx,\n" // index3NullOffset, dataNullOffset
|
||||
" 0x%lx,\n", // nullValue
|
||||
(long)pTrie->indexLength, (long)pTrie->dataLength,
|
||||
(long)pTrie->highStart, pTrie->shifted12HighStart,
|
||||
pTrie->type, pTrie->valueWidth,
|
||||
pTrie->index3NullOffset, (long)pTrie->dataNullOffset,
|
||||
(long)pTrie->nullValue);
|
||||
if(postfix!=NULL) {
|
||||
fputs(postfix, f);
|
||||
}
|
||||
}
|
||||
|
||||
U_CAPI void U_EXPORT2
|
||||
usrc_writeArrayOfMostlyInvChars(FILE *f,
|
||||
const char *prefix,
|
||||
|
|
|
@ -23,6 +23,7 @@
|
|||
|
||||
#include <stdio.h>
|
||||
#include "unicode/utypes.h"
|
||||
#include "unicode/ucptrie.h"
|
||||
#include "utrie2.h"
|
||||
|
||||
/**
|
||||
|
@ -75,6 +76,27 @@ usrc_writeUTrie2Struct(FILE *f,
|
|||
const char *indexName, const char *dataName,
|
||||
const char *postfix);
|
||||
|
||||
/**
|
||||
* Calls usrc_writeArray() for the index and data arrays of a UCPTrie.
|
||||
*/
|
||||
U_CAPI void U_EXPORT2
|
||||
usrc_writeUCPTrieArrays(FILE *f,
|
||||
const char *indexPrefix, const char *dataPrefix,
|
||||
const UCPTrie *pTrie,
|
||||
const char *postfix);
|
||||
|
||||
/**
|
||||
* Writes the UCPTrie struct values.
|
||||
* The {} and declaration etc. need to be included in prefix/postfix or
|
||||
* printed before and after the array contents.
|
||||
*/
|
||||
U_CAPI void U_EXPORT2
|
||||
usrc_writeUCPTrieStruct(FILE *f,
|
||||
const char *prefix,
|
||||
const UCPTrie *pTrie,
|
||||
const char *indexName, const char *dataName,
|
||||
const char *postfix);
|
||||
|
||||
/**
|
||||
* Writes the contents of an array of mostly invariant characters.
|
||||
* Characters 0..0x1f are printed as numbers,
|
||||
|
|
|
@ -652,6 +652,15 @@ public final class ICUBinary {
|
|||
}
|
||||
}
|
||||
|
||||
public static byte[] getBytes(ByteBuffer bytes, int length, int additionalSkipLength) {
|
||||
byte[] dest = new byte[length];
|
||||
bytes.get(dest);
|
||||
if (additionalSkipLength > 0) {
|
||||
skipBytes(bytes, additionalSkipLength);
|
||||
}
|
||||
return dest;
|
||||
}
|
||||
|
||||
public static String getString(ByteBuffer bytes, int length, int additionalSkipLength) {
|
||||
CharSequence cs = bytes.asCharBuffer();
|
||||
String s = cs.subSequence(0, length).toString();
|
||||
|
|
|
@ -12,11 +12,13 @@ package com.ibm.icu.impl;
|
|||
import java.io.IOException;
|
||||
import java.nio.ByteBuffer;
|
||||
import java.util.ArrayList;
|
||||
import java.util.Iterator;
|
||||
|
||||
import com.ibm.icu.text.UTF16;
|
||||
import com.ibm.icu.text.UnicodeSet;
|
||||
import com.ibm.icu.util.CodePointMap;
|
||||
import com.ibm.icu.util.CodePointTrie;
|
||||
import com.ibm.icu.util.ICUUncheckedIOException;
|
||||
import com.ibm.icu.util.MutableCodePointTrie;
|
||||
import com.ibm.icu.util.VersionInfo;
|
||||
|
||||
/**
|
||||
|
@@ -180,8 +182,7 @@ public final class Normalizer2Impl {
                 insert(c, cc);
             }
         }
-        // s must be in NFD, otherwise change the implementation.
-        public void append(CharSequence s, int start, int limit,
+        public void append(CharSequence s, int start, int limit, boolean isNFD,
                 int leadCC, int trailCC) {
             if(start==limit) {
                 return;
|
@@ -202,8 +203,11 @@ public final class Normalizer2Impl {
                     c=Character.codePointAt(s, start);
                     start+=Character.charCount(c);
                     if(start<limit) {
-                        // s must be in NFD, otherwise we need to use getCC().
-                        leadCC=getCCFromYesOrMaybe(impl.getNorm16(c));
+                        if (isNFD) {
+                            leadCC = getCCFromYesOrMaybe(impl.getNorm16(c));
+                        } else {
+                            leadCC = impl.getCC(impl.getNorm16(c));
+                        }
                     } else {
                         leadCC=trailCC;
                     }
|
@ -359,6 +363,24 @@ public final class Normalizer2Impl {
|
|||
// TODO: Propose widening UTF16 methods that take char to take int.
|
||||
// TODO: Propose widening UTF16 methods that take String to take CharSequence.
|
||||
public static final class UTF16Plus {
|
||||
/**
|
||||
* Is this code point a lead surrogate (U+d800..U+dbff)?
|
||||
* @param c code unit or code point
|
||||
* @return true or false
|
||||
*/
|
||||
public static boolean isLeadSurrogate(int c) { return (c & 0xfffffc00) == 0xd800; }
|
||||
/**
|
||||
* Is this code point a trail surrogate (U+dc00..U+dfff)?
|
||||
* @param c code unit or code point
|
||||
* @return true or false
|
||||
*/
|
||||
public static boolean isTrailSurrogate(int c) { return (c & 0xfffffc00) == 0xdc00; }
|
||||
/**
|
||||
* Is this code point a surrogate (U+d800..U+dfff)?
|
||||
* @param c code unit or code point
|
||||
* @return true or false
|
||||
*/
|
||||
public static boolean isSurrogate(int c) { return (c & 0xfffff800) == 0xd800; }
|
||||
/**
|
||||
* Assuming c is a surrogate code point (UTF16.isSurrogate(c)),
|
||||
* is it a lead surrogate?
|
||||
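The three mask tests added to UTF16Plus above can be tried in isolation. The following is a minimal standalone sketch, assuming nothing beyond the JDK; `SurrogateMasks` is a hypothetical class name used only for illustration and is not part of ICU. It shows why the masks cover all 32 bits: a supplementary code point such as U+1D800 must not be mistaken for a surrogate just because its low 16 bits look like one.

```java
// Standalone sketch of the bit-mask surrogate tests; SurrogateMasks is a
// hypothetical name for illustration, not an ICU class.
final class SurrogateMasks {
    // Lead surrogates U+D800..U+DBFF: clearing the low 10 bits leaves 0xD800.
    static boolean isLeadSurrogate(int c) { return (c & 0xfffffc00) == 0xd800; }

    // Trail surrogates U+DC00..U+DFFF: clearing the low 10 bits leaves 0xDC00.
    static boolean isTrailSurrogate(int c) { return (c & 0xfffffc00) == 0xdc00; }

    // Any surrogate U+D800..U+DFFF: clearing the low 11 bits leaves 0xD800.
    static boolean isSurrogate(int c) { return (c & 0xfffff800) == 0xd800; }

    public static void main(String[] args) {
        int[] samples = {0xd7ff, 0xd800, 0xdbff, 0xdc00, 0xdfff, 0xe000, 0x1d800};
        for (int c : samples) {
            System.out.printf("U+%04X lead=%b trail=%b surrogate=%b%n",
                    c, isLeadSurrogate(c), isTrailSurrogate(c), isSurrogate(c));
        }
    }
}
```

Because the argument is an `int`, the same test works for a code unit widened from `char` and for a full code point, which is what the "code unit or code point" parameter description relies on.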
|
@@ -420,7 +442,7 @@ public final class Normalizer2Impl {
     private static final class IsAcceptable implements ICUBinary.Authenticate {
         @Override
         public boolean isDataVersionAcceptable(byte version[]) {
-            return version[0]==3;
+            return version[0]==4;
         }
     }
     private static final IsAcceptable IS_ACCEPTABLE = new IsAcceptable();
|
@@ -457,8 +479,9 @@ public final class Normalizer2Impl {
             // Read the normTrie.
             int offset=inIndexes[IX_NORM_TRIE_OFFSET];
             int nextOffset=inIndexes[IX_EXTRA_DATA_OFFSET];
-            normTrie=Trie2_16.createFromSerialized(bytes);
-            int trieLength=normTrie.getSerializedLength();
+            int triePosition = bytes.position();
+            normTrie = CodePointTrie.Fast16.fromBinary(bytes);
+            int trieLength = bytes.position() - triePosition;
             if(trieLength>(nextOffset-offset)) {
                 throw new ICUUncheckedIOException("Normalizer2 data: not enough bytes for normTrie");
             }
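In the hunk above, the serialized trie length is no longer asked of the trie object but derived from how far the ByteBuffer position moved during parsing. A minimal sketch of that pattern follows, assuming only `java.nio.ByteBuffer`. `ConsumedLength` and `parseRecord` are hypothetical stand-ins: `parseRecord` plays the role of a parser like `fromBinary()` and simply reads a 4-byte length field and skips that many payload bytes.

```java
import java.nio.ByteBuffer;

// Sketch of measuring consumed bytes via the buffer position.
// ConsumedLength and parseRecord are hypothetical, for illustration only.
final class ConsumedLength {
    // Stand-in parser: reads a 4-byte payload length, then skips the payload.
    static void parseRecord(ByteBuffer bytes) {
        int payloadLength = bytes.getInt();
        bytes.position(bytes.position() + payloadLength);
    }

    // Returns how many bytes the parser consumed, without the parser reporting it.
    static int consumedBy(ByteBuffer bytes) {
        int before = bytes.position();
        parseRecord(bytes);
        return bytes.position() - before;
    }

    public static void main(String[] args) {
        ByteBuffer bytes = ByteBuffer.allocate(16);
        bytes.putInt(6);      // payload length field
        bytes.position(0);    // rewind to the start of the record
        System.out.println(consumedBy(bytes));  // 4-byte header + 6-byte payload
    }
}
```

The caller can then compare the consumed length against the space the surrounding file format reserves, as the `trieLength>(nextOffset-offset)` check does.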
|
@ -487,46 +510,46 @@ public final class Normalizer2Impl {
|
|||
return load(ICUBinary.getRequiredData(name));
|
||||
}
|
||||
|
||||
private void enumLcccRange(int start, int end, int norm16, UnicodeSet set) {
|
||||
if (norm16 > MIN_NORMAL_MAYBE_YES && norm16 != JAMO_VT) {
|
||||
set.add(start, end);
|
||||
} else if (minNoNoCompNoMaybeCC <= norm16 && norm16 < limitNoNo) {
|
||||
int fcd16=getFCD16(start);
|
||||
if(fcd16>0xff) { set.add(start, end); }
|
||||
}
|
||||
}
|
||||
|
||||
private void enumNorm16PropertyStartsRange(int start, int end, int value, UnicodeSet set) {
|
||||
/* add the start code point to the USet */
|
||||
set.add(start);
|
||||
if(start!=end && isAlgorithmicNoNo(value) && (value & DELTA_TCCC_MASK) > DELTA_TCCC_1) {
|
||||
// Range of code points with same-norm16-value algorithmic decompositions.
|
||||
// They might have different non-zero FCD16 values.
|
||||
int prevFCD16=getFCD16(start);
|
||||
while(++start<=end) {
|
||||
int fcd16=getFCD16(start);
|
||||
if(fcd16!=prevFCD16) {
|
||||
set.add(start);
|
||||
prevFCD16=fcd16;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public void addLcccChars(UnicodeSet set) {
|
||||
Iterator<Trie2.Range> trieIterator=normTrie.iterator();
|
||||
Trie2.Range range;
|
||||
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
|
||||
enumLcccRange(range.startCodePoint, range.endCodePoint, range.value, set);
|
||||
int start = 0;
|
||||
CodePointMap.Range range = new CodePointMap.Range();
|
||||
while (normTrie.getRange(start, CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, INERT,
|
||||
null, range)) {
|
||||
int end = range.getEnd();
|
||||
int norm16 = range.getValue();
|
||||
if (norm16 > MIN_NORMAL_MAYBE_YES && norm16 != JAMO_VT) {
|
||||
set.add(start, end);
|
||||
} else if (minNoNoCompNoMaybeCC <= norm16 && norm16 < limitNoNo) {
|
||||
int fcd16 = getFCD16(start);
|
||||
if (fcd16 > 0xff) { set.add(start, end); }
|
||||
}
|
||||
start = end + 1;
|
||||
}
|
||||
}
|
||||
|
||||
public void addPropertyStarts(UnicodeSet set) {
|
||||
/* add the start code point of each same-value range of each trie */
|
||||
Iterator<Trie2.Range> trieIterator=normTrie.iterator();
|
||||
Trie2.Range range;
|
||||
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
|
||||
enumNorm16PropertyStartsRange(range.startCodePoint, range.endCodePoint, range.value, set);
|
||||
// Add the start code point of each same-value range of the trie.
|
||||
int start = 0;
|
||||
CodePointMap.Range range = new CodePointMap.Range();
|
||||
while (normTrie.getRange(start, CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, INERT,
|
||||
null, range)) {
|
||||
int end = range.getEnd();
|
||||
int value = range.getValue();
|
||||
set.add(start);
|
||||
if (start != end && isAlgorithmicNoNo(value) &&
|
||||
(value & DELTA_TCCC_MASK) > DELTA_TCCC_1) {
|
||||
// Range of code points with same-norm16-value algorithmic decompositions.
|
||||
// They might have different non-zero FCD16 values.
|
||||
int prevFCD16 = getFCD16(start);
|
||||
while (++start <= end) {
|
||||
int fcd16 = getFCD16(start);
|
||||
if (fcd16 != prevFCD16) {
|
||||
set.add(start);
|
||||
prevFCD16 = fcd16;
|
||||
}
|
||||
}
|
||||
}
|
||||
start = end + 1;
|
||||
}
|
||||
|
||||
/* add Hangul LV syllables and LV+1 because of skippables */
|
||||
|
@ -538,20 +561,21 @@ public final class Normalizer2Impl {
|
|||
}
|
||||
|
||||
public void addCanonIterPropertyStarts(UnicodeSet set) {
|
||||
/* add the start code point of each same-value range of the canonical iterator data trie */
|
||||
// Add the start code point of each same-value range of the canonical iterator data trie.
|
||||
ensureCanonIterData();
|
||||
// currently only used for the SEGMENT_STARTER property
|
||||
Iterator<Trie2.Range> trieIterator=canonIterData.iterator(segmentStarterMapper);
|
||||
Trie2.Range range;
|
||||
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
|
||||
/* add the start code point to the USet */
|
||||
set.add(range.startCodePoint);
|
||||
// Currently only used for the SEGMENT_STARTER property.
|
||||
int start = 0;
|
||||
CodePointMap.Range range = new CodePointMap.Range();
|
||||
while (canonIterData.getRange(start, segmentStarterMapper, range)) {
|
||||
set.add(start);
|
||||
start = range.getEnd() + 1;
|
||||
}
|
||||
}
|
||||
private static final Trie2.ValueMapper segmentStarterMapper=new Trie2.ValueMapper() {
|
||||
private static final CodePointMap.ValueFilter segmentStarterMapper =
|
||||
new CodePointMap.ValueFilter() {
|
||||
@Override
|
||||
public int map(int in) {
|
||||
return in&CANON_NOT_SEGMENT_STARTER;
|
||||
public int apply(int value) {
|
||||
return value & CANON_NOT_SEGMENT_STARTER;
|
||||
}
|
||||
};
|
||||
|
||||
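The rewritten loops above all follow the same shape: ask for the range that begins at `start`, use it, then continue at `end + 1`, optionally passing a value filter so that ranges are merged on the filtered value only. The sketch below imitates that shape over a plain `int[]` so it runs without ICU. `RangeLoopSketch`, `rangeEnd` and `rangeStarts` are hypothetical names, and `IntUnaryOperator` stands in for `CodePointMap.ValueFilter`; none of this is the ICU implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Toy model of the getRange() loop; all names here are hypothetical.
final class RangeLoopSketch {
    // values[c] is the stored value of "code point" c.
    // Returns the inclusive end of the run starting at start whose
    // filtered values are all equal.
    static int rangeEnd(int[] values, int start, IntUnaryOperator filter) {
        int value = filter.applyAsInt(values[start]);
        int end = start;
        while (end + 1 < values.length && filter.applyAsInt(values[end + 1]) == value) {
            ++end;
        }
        return end;
    }

    // Collects the start of each same-filtered-value range,
    // advancing with start = end + 1 as the loops in the diff do.
    static List<Integer> rangeStarts(int[] values, IntUnaryOperator filter) {
        List<Integer> starts = new ArrayList<>();
        int start = 0;
        while (start < values.length) {
            starts.add(start);
            start = rangeEnd(values, start, filter) + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, 8, 9, 9, 2};
        // Keep only bit 3, similar in spirit to masking with one flag bit.
        System.out.println(rangeStarts(values, v -> v & 8));
        // Identity filter: every change of stored value starts a new range.
        System.out.println(rangeStarts(values, v -> v));
    }
}
```

Filtering before comparing is what lets a caller that cares about a single flag bit see far fewer, longer ranges than the raw trie values would produce.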
|
@ -574,12 +598,14 @@ public final class Normalizer2Impl {
|
|||
*/
|
||||
public synchronized Normalizer2Impl ensureCanonIterData() {
|
||||
if(canonIterData==null) {
|
||||
Trie2Writable newData=new Trie2Writable(0, 0);
|
||||
MutableCodePointTrie mutableTrie = new MutableCodePointTrie(0, 0);
|
||||
canonStartSets=new ArrayList<UnicodeSet>();
|
||||
Iterator<Trie2.Range> trieIterator=normTrie.iterator();
|
||||
Trie2.Range range;
|
||||
while(trieIterator.hasNext() && !(range=trieIterator.next()).leadSurrogate) {
|
||||
final int norm16=range.value;
|
||||
int start = 0;
|
||||
CodePointMap.Range range = new CodePointMap.Range();
|
||||
while (normTrie.getRange(start, CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, INERT,
|
||||
null, range)) {
|
||||
final int end = range.getEnd();
|
||||
final int norm16 = range.getValue();
|
||||
if(isInert(norm16) || (minYesNo<=norm16 && norm16<minNoNo)) {
|
||||
// Inert, or 2-way mapping (including Hangul syllable).
|
||||
// We do not write a canonStartSet for any yesNo character.
|
||||
|
@ -587,10 +613,11 @@ public final class Normalizer2Impl {
|
|||
// starter's compositions list, and the other characters in
|
||||
// 2-way mappings get CANON_NOT_SEGMENT_STARTER set because they are
|
||||
// "maybe" characters.
|
||||
start = end + 1;
|
||||
continue;
|
||||
}
|
||||
for(int c=range.startCodePoint; c<=range.endCodePoint; ++c) {
|
||||
final int oldValue=newData.get(c);
|
||||
for (int c = start; c <= end; ++c) {
|
||||
final int oldValue = mutableTrie.get(c);
|
||||
int newValue=oldValue;
|
||||
if(isMaybeOrNonZeroCC(norm16)) {
|
||||
// not a segment starter if it occurs in a decomposition or has cc!=0
|
||||
|
@ -608,7 +635,7 @@ public final class Normalizer2Impl {
|
|||
if (isDecompNoAlgorithmic(norm16_2)) {
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c2 = mapAlgorithmic(c2, norm16_2);
|
||||
norm16_2 = getNorm16(c2);
|
||||
norm16_2 = getRawNorm16(c2);
|
||||
// No compatibility mappings for the CanonicalIterator.
|
||||
assert(!(isHangulLV(norm16_2) || isHangulLVT(norm16_2)));
|
||||
}
|
||||
|
@ -628,36 +655,43 @@ public final class Normalizer2Impl {
|
|||
// add c to first code point's start set
|
||||
int limit=mapping+length;
|
||||
c2=extraData.codePointAt(mapping);
|
||||
addToStartSet(newData, c, c2);
|
||||
addToStartSet(mutableTrie, c, c2);
|
||||
// Set CANON_NOT_SEGMENT_STARTER for each remaining code point of a
|
||||
// one-way mapping. A 2-way mapping is possible here after
|
||||
// intermediate algorithmic mapping.
|
||||
if(norm16_2>=minNoNo) {
|
||||
while((mapping+=Character.charCount(c2))<limit) {
|
||||
c2=extraData.codePointAt(mapping);
|
||||
int c2Value=newData.get(c2);
|
||||
int c2Value = mutableTrie.get(c2);
|
||||
if((c2Value&CANON_NOT_SEGMENT_STARTER)==0) {
|
||||
newData.set(c2, c2Value|CANON_NOT_SEGMENT_STARTER);
|
||||
mutableTrie.set(c2, c2Value|CANON_NOT_SEGMENT_STARTER);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// c decomposed to c2 algorithmically; c has cc==0
|
||||
addToStartSet(newData, c, c2);
|
||||
addToStartSet(mutableTrie, c, c2);
|
||||
}
|
||||
}
|
||||
if(newValue!=oldValue) {
|
||||
newData.set(c, newValue);
|
||||
mutableTrie.set(c, newValue);
|
||||
}
|
||||
}
|
||||
start = end + 1;
|
||||
}
|
||||
canonIterData=newData.toTrie2_32();
|
||||
canonIterData = mutableTrie.buildImmutable(
|
||||
CodePointTrie.Type.SMALL, CodePointTrie.ValueWidth.BITS_32);
|
||||
}
|
||||
return this;
|
||||
}
|
||||
|
||||
public int getNorm16(int c) { return normTrie.get(c); }
|
||||
// The trie stores values for lead surrogate code *units*.
|
||||
// Surrogate code *points* are inert.
|
||||
public int getNorm16(int c) {
|
||||
return UTF16Plus.isLeadSurrogate(c) ? INERT : normTrie.get(c);
|
||||
}
|
||||
public int getRawNorm16(int c) { return normTrie.get(c); }
|
||||
|
||||
public int getCompQuickCheck(int norm16) {
|
||||
if(norm16<minNoNo || MIN_YES_YES_WITH_CC<=norm16) {
|
||||
|
@ -730,7 +764,7 @@ public final class Normalizer2Impl {
|
|||
}
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c=mapAlgorithmic(c, norm16);
|
||||
norm16=getNorm16(c);
|
||||
norm16 = getRawNorm16(c);
|
||||
}
|
||||
}
|
||||
if(norm16<=minYesNo || isHangulLVT(norm16)) {
|
||||
|
@ -763,7 +797,7 @@ public final class Normalizer2Impl {
|
|||
// Maps to an isCompYesAndZeroCC.
|
||||
decomp=c=mapAlgorithmic(c, norm16);
|
||||
// The mapping might decompose further.
|
||||
norm16 = getNorm16(c);
|
||||
norm16 = getRawNorm16(c);
|
||||
}
|
||||
if (norm16 < minYesNo) {
|
||||
if(decomp<0) {
|
||||
|
@ -857,7 +891,7 @@ public final class Normalizer2Impl {
|
|||
set.add(value);
|
||||
}
|
||||
if((canonValue&CANON_HAS_COMPOSITIONS)!=0) {
|
||||
int norm16=getNorm16(c);
|
||||
int norm16 = getRawNorm16(c);
|
||||
if(norm16==JAMO_L) {
|
||||
int syllable=Hangul.HANGUL_BASE+(c-Hangul.JAMO_L_BASE)*Hangul.JAMO_VT_COUNT;
|
||||
set.add(syllable, syllable+Hangul.JAMO_VT_COUNT-1);
|
||||
|
@ -975,27 +1009,23 @@ public final class Normalizer2Impl {
|
|||
// count code units below the minimum or with irrelevant data for the quick check
|
||||
for(prevSrc=src; src!=limit;) {
|
||||
if( (c=s.charAt(src))<minNoCP ||
|
||||
isMostDecompYesAndZeroCC(norm16=normTrie.getFromU16SingleLead((char)c))
|
||||
isMostDecompYesAndZeroCC(norm16=normTrie.bmpGet(c))
|
||||
) {
|
||||
++src;
|
||||
} else if(!UTF16.isSurrogate((char)c)) {
|
||||
} else if (!UTF16Plus.isLeadSurrogate(c)) {
|
||||
break;
|
||||
} else {
|
||||
char c2;
|
||||
if(UTF16Plus.isSurrogateLead(c)) {
|
||||
if((src+1)!=limit && Character.isLowSurrogate(c2=s.charAt(src+1))) {
|
||||
c=Character.toCodePoint((char)c, c2);
|
||||
if ((src + 1) != limit && Character.isLowSurrogate(c2 = s.charAt(src + 1))) {
|
||||
c = Character.toCodePoint((char)c, c2);
|
||||
norm16 = normTrie.suppGet(c);
|
||||
if (isMostDecompYesAndZeroCC(norm16)) {
|
||||
src += 2;
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevSrc<src && Character.isHighSurrogate(c2=s.charAt(src-1))) {
|
||||
--src;
|
||||
c=Character.toCodePoint(c2, (char)c);
|
||||
}
|
||||
}
|
||||
if(isMostDecompYesAndZeroCC(norm16=getNorm16(c))) {
|
||||
src+=Character.charCount(c);
|
||||
} else {
|
||||
break;
|
||||
++src; // unpaired lead surrogate: inert
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -1055,7 +1085,7 @@ public final class Normalizer2Impl {
|
|||
c=Character.codePointAt(s, src);
|
||||
cc=getCC(getNorm16(c));
|
||||
};
|
||||
buffer.append(s, 0, src, firstCC, prevCC);
|
||||
buffer.append(s, 0, src, false, firstCC, prevCC);
|
||||
buffer.append(s, src, limit);
|
||||
}
|
||||
|
||||
|
@ -1083,28 +1113,22 @@ public final class Normalizer2Impl {
|
|||
return true;
|
||||
}
|
||||
if( (c=s.charAt(src))<minNoMaybeCP ||
|
||||
isCompYesAndZeroCC(norm16=normTrie.getFromU16SingleLead((char)c))
|
||||
isCompYesAndZeroCC(norm16=normTrie.bmpGet(c))
|
||||
) {
|
||||
++src;
|
||||
} else {
|
||||
prevSrc = src++;
|
||||
if(!UTF16.isSurrogate((char)c)) {
|
||||
if (!UTF16Plus.isLeadSurrogate(c)) {
|
||||
break;
|
||||
} else {
|
||||
char c2;
|
||||
if(UTF16Plus.isSurrogateLead(c)) {
|
||||
if(src!=limit && Character.isLowSurrogate(c2=s.charAt(src))) {
|
||||
++src;
|
||||
c=Character.toCodePoint((char)c, c2);
|
||||
if (src != limit && Character.isLowSurrogate(c2 = s.charAt(src))) {
|
||||
++src;
|
||||
c = Character.toCodePoint((char)c, c2);
|
||||
norm16 = normTrie.suppGet(c);
|
||||
if (!isCompYesAndZeroCC(norm16)) {
|
||||
break;
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevBoundary<prevSrc && Character.isHighSurrogate(c2=s.charAt(prevSrc-1))) {
|
||||
--prevSrc;
|
||||
c=Character.toCodePoint(c2, (char)c);
|
||||
}
|
||||
}
|
||||
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -1325,28 +1349,22 @@ public final class Normalizer2Impl {
|
|||
return (src<<1)|qcResult; // "yes" or "maybe"
|
||||
}
|
||||
if( (c=s.charAt(src))<minNoMaybeCP ||
|
||||
isCompYesAndZeroCC(norm16=normTrie.getFromU16SingleLead((char)c))
|
||||
isCompYesAndZeroCC(norm16=normTrie.bmpGet(c))
|
||||
) {
|
||||
++src;
|
||||
} else {
|
||||
prevSrc = src++;
|
||||
if(!UTF16.isSurrogate((char)c)) {
|
||||
if (!UTF16Plus.isLeadSurrogate(c)) {
|
||||
break;
|
||||
} else {
|
||||
char c2;
|
||||
if(UTF16Plus.isSurrogateLead(c)) {
|
||||
if(src!=limit && Character.isLowSurrogate(c2=s.charAt(src))) {
|
||||
++src;
|
||||
c=Character.toCodePoint((char)c, c2);
|
||||
if (src != limit && Character.isLowSurrogate(c2 = s.charAt(src))) {
|
||||
++src;
|
||||
c = Character.toCodePoint((char)c, c2);
|
||||
norm16 = normTrie.suppGet(c);
|
||||
if (!isCompYesAndZeroCC(norm16)) {
|
||||
break;
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevBoundary<prevSrc && Character.isHighSurrogate(c2=s.charAt(prevSrc-1))) {
|
||||
--prevSrc;
|
||||
c=Character.toCodePoint(c2, (char)c);
|
||||
}
|
||||
}
|
||||
if(!isCompYesAndZeroCC(norm16=getNorm16(c))) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -1468,17 +1486,10 @@ public final class Normalizer2Impl {
|
|||
prevFCD16=0;
|
||||
++src;
|
||||
} else {
|
||||
if(UTF16.isSurrogate((char)c)) {
|
||||
if (UTF16Plus.isLeadSurrogate(c)) {
|
||||
char c2;
|
||||
if(UTF16Plus.isSurrogateLead(c)) {
|
||||
if((src+1)!=limit && Character.isLowSurrogate(c2=s.charAt(src+1))) {
|
||||
c=Character.toCodePoint((char)c, c2);
|
||||
}
|
||||
} else /* trail surrogate */ {
|
||||
if(prevSrc<src && Character.isHighSurrogate(c2=s.charAt(src-1))) {
|
||||
--src;
|
||||
c=Character.toCodePoint(c2, (char)c);
|
||||
}
|
||||
if ((src + 1) != limit && Character.isLowSurrogate(c2 = s.charAt(src + 1))) {
|
||||
c = Character.toCodePoint((char)c, c2);
|
||||
}
|
||||
}
|
||||
if((fcd16=getFCD16FromNormData(c))<=0xff) {
|
||||
|
@ -1810,7 +1821,7 @@ public final class Normalizer2Impl {
|
|||
}
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c=mapAlgorithmic(c, norm16);
|
||||
norm16=getNorm16(c);
|
||||
norm16 = getRawNorm16(c);
|
||||
}
|
||||
if (norm16 < minYesNo) {
|
||||
// c does not decompose
|
||||
|
@ -1831,7 +1842,7 @@ public final class Normalizer2Impl {
|
|||
leadCC=0;
|
||||
}
|
||||
++mapping; // skip over the firstUnit
|
||||
buffer.append(extraData, mapping, mapping+length, leadCC, trailCC);
|
||||
buffer.append(extraData, mapping, mapping+length, true, leadCC, trailCC);
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -1921,7 +1932,7 @@ public final class Normalizer2Impl {
|
|||
}
|
||||
int composite=compositeAndFwd>>1;
|
||||
if((compositeAndFwd&1)!=0) {
|
||||
addComposites(getCompositionsListForComposite(getNorm16(composite)), set);
|
||||
addComposites(getCompositionsListForComposite(getRawNorm16(composite)), set);
|
||||
}
|
||||
set.add(composite);
|
||||
} while((firstUnit&COMP_1_LAST_TUPLE)==0);
|
||||
|
@ -2045,7 +2056,7 @@ public final class Normalizer2Impl {
|
|||
// Is the composite a starter that combines forward?
|
||||
if((compositeAndFwd&1)!=0) {
|
||||
compositionsList=
|
||||
getCompositionsListForComposite(getNorm16(composite));
|
||||
getCompositionsListForComposite(getRawNorm16(composite));
|
||||
} else {
|
||||
compositionsList=-1;
|
||||
}
|
||||
|
@ -2083,7 +2094,7 @@ public final class Normalizer2Impl {
|
|||
}
|
||||
|
||||
public int composePair(int a, int b) {
|
||||
int norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16=0
|
||||
int norm16=getNorm16(a); // maps an out-of-range 'a' to inert norm16
|
||||
int list;
|
||||
if(isInert(norm16)) {
|
||||
return -1;
|
||||
|
@ -2220,19 +2231,19 @@ public final class Normalizer2Impl {
|
|||
return getFCD16(Character.codePointBefore(s, p));
|
||||
}
|
||||
|
||||
private void addToStartSet(Trie2Writable newData, int origin, int decompLead) {
|
||||
int canonValue=newData.get(decompLead);
|
||||
private void addToStartSet(MutableCodePointTrie mutableTrie, int origin, int decompLead) {
|
||||
int canonValue = mutableTrie.get(decompLead);
|
||||
if((canonValue&(CANON_HAS_SET|CANON_VALUE_MASK))==0 && origin!=0) {
|
||||
// origin is the first character whose decomposition starts with
|
||||
// the character for which we are setting the value.
|
||||
newData.set(decompLead, canonValue|origin);
|
||||
mutableTrie.set(decompLead, canonValue|origin);
|
||||
} else {
|
||||
// origin is not the first character, or it is U+0000.
|
||||
UnicodeSet set;
|
||||
if((canonValue&CANON_HAS_SET)==0) {
|
||||
int firstOrigin=canonValue&CANON_VALUE_MASK;
|
||||
canonValue=(canonValue&~CANON_VALUE_MASK)|CANON_HAS_SET|canonStartSets.size();
|
||||
newData.set(decompLead, canonValue);
|
||||
mutableTrie.set(decompLead, canonValue);
|
||||
canonStartSets.add(set=new UnicodeSet());
|
||||
if(firstOrigin!=0) {
|
||||
set.add(firstOrigin);
|
||||
|
@ -2263,12 +2274,12 @@ public final class Normalizer2Impl {
|
|||
private int centerNoNoDelta;
|
||||
private int minMaybeYes;
|
||||
|
||||
private Trie2_16 normTrie;
|
||||
private CodePointTrie.Fast16 normTrie;
|
||||
private String maybeYesCompositions;
|
||||
private String extraData; // mappings and/or compositions for yesYes, yesNo & noNo characters
|
||||
private byte[] smallFCD; // [0x100] one bit per 32 BMP code points, set if any FCD!=0
|
||||
|
||||
private Trie2_32 canonIterData;
|
||||
private CodePointTrie canonIterData;
|
||||
private ArrayList<UnicodeSet> canonStartSets;
|
||||
|
||||
// bits in canonIterData
|
||||
|
|
|
@ -10,6 +10,7 @@ package com.ibm.icu.impl;
|
|||
|
||||
import java.util.EnumSet;
|
||||
|
||||
import com.ibm.icu.impl.Normalizer2Impl.UTF16Plus;
|
||||
import com.ibm.icu.lang.UCharacter;
|
||||
import com.ibm.icu.lang.UCharacterCategory;
|
||||
import com.ibm.icu.lang.UCharacterDirection;
|
||||
|
@ -223,19 +224,31 @@ public final class UTS46 extends IDNA {
|
|||
promoteAndResetLabelErrors(info);
|
||||
destLength+=newLength-labelLength;
|
||||
labelLimit=labelStart+=newLength+1;
|
||||
} else if(0xdf<=c && c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
|
||||
continue;
|
||||
} else if(c<0xdf) {
|
||||
// pass
|
||||
} else if(c<=0x200d && (c==0xdf || c==0x3c2 || c>=0x200c)) {
|
||||
setTransitionalDifferent(info);
|
||||
if(doMapDevChars) {
|
||||
destLength=mapDevChars(dest, labelStart, labelLimit);
|
||||
// Do not increment labelLimit in case c was removed.
|
||||
// All deviation characters have been mapped, no need to check for them again.
|
||||
doMapDevChars=false;
|
||||
} else {
|
||||
++labelLimit;
|
||||
// Do not increment labelLimit in case c was removed.
|
||||
continue;
|
||||
}
|
||||
} else if(Character.isSurrogate(c)) {
|
||||
if(UTF16Plus.isSurrogateLead(c) ?
|
||||
(labelLimit+1)==destLength ||
|
||||
!Character.isLowSurrogate(dest.charAt(labelLimit+1)) :
|
||||
labelLimit==labelStart ||
|
||||
!Character.isHighSurrogate(dest.charAt(labelLimit-1))) {
|
||||
// Map an unpaired surrogate to U+FFFD before normalization so that when
|
||||
// that removes characters we do not turn two unpaired ones into a pair.
|
||||
addLabelError(info, Error.DISALLOWED);
|
||||
dest.setCharAt(labelLimit, '\ufffd');
|
||||
}
|
||||
} else {
|
||||
++labelLimit;
|
||||
}
|
||||
++labelLimit;
|
||||
}
|
||||
// Permit an empty label at the end (0<labelStart==labelLimit==destLength is ok)
|
||||
// but not an empty label elsewhere nor a completely empty domain name.
|
||||
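The UTS46 change above replaces unpaired surrogates with U+FFFD before normalization, so that later removal of characters cannot bring two unpaired halves together into a valid pair. A minimal standalone sketch of the pairing check follows, using only the JDK. `UnpairedSurrogates` and `replaceUnpaired` are hypothetical names, and the sketch scans a whole `StringBuilder` rather than a single label, so it is not the UTS46 code itself.

```java
// Sketch of replacing unpaired surrogates in place; names are hypothetical.
final class UnpairedSurrogates {
    // Replaces every unpaired surrogate code unit with U+FFFD and
    // returns how many code units were replaced.
    static int replaceUnpaired(StringBuilder dest) {
        int count = 0;
        for (int i = 0; i < dest.length(); ++i) {
            char c = dest.charAt(i);
            if (!Character.isSurrogate(c)) {
                continue;
            }
            boolean unpaired;
            if (Character.isHighSurrogate(c)) {
                // A lead surrogate needs a trail surrogate right after it.
                unpaired = (i + 1) == dest.length()
                        || !Character.isLowSurrogate(dest.charAt(i + 1));
            } else {
                // A trail surrogate needs a lead surrogate right before it.
                unpaired = i == 0 || !Character.isHighSurrogate(dest.charAt(i - 1));
            }
            if (unpaired) {
                dest.setCharAt(i, '\ufffd');
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("a\ud800b\udc00\ud83d\ude00");
        int replaced = replaceUnpaired(sb);
        System.out.println(replaced + " unpaired surrogate(s) replaced");
    }
}
```

Replacing in place keeps the string length unchanged, which is why the loop in the diff can keep advancing `labelLimit` by one after the substitution.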
|
|
460
icu4j/main/classes/core/src/com/ibm/icu/util/CodePointMap.java
Normal file
|
@ -0,0 +1,460 @@
|
|||
// © 2018 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html#License
|
||||
|
||||
// created: 2018may10 Markus W. Scherer
|
||||
|
||||
package com.ibm.icu.util;
|
||||
|
||||
import java.util.Iterator;
|
||||
import java.util.NoSuchElementException;
|
||||
|
||||
/**
|
||||
* Abstract map from Unicode code points (U+0000..U+10FFFF) to integer values.
|
||||
* This does not implement java.util.Map.
|
||||
*
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public abstract class CodePointMap implements Iterable<CodePointMap.Range> {
|
||||
/**
|
||||
* Selectors for how getRange() should report value ranges overlapping with surrogates.
|
||||
* Most users should use NORMAL.
|
||||
*
|
||||
* @see #getRange
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public enum RangeOption {
|
||||
/**
|
||||
* getRange() enumerates all same-value ranges as stored in the trie.
|
||||
* Most users should use this option.
|
||||
*
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
NORMAL,
|
||||
/**
|
||||
* getRange() enumerates all same-value ranges as stored in the trie,
|
||||
* except that lead surrogates (U+D800..U+DBFF) are treated as having the
|
||||
* surrogateValue, which is passed to getRange() as a separate parameter.
|
||||
* The surrogateValue is not transformed via filter().
|
||||
* See {@link Character#isHighSurrogate}.
|
||||
*
|
||||
* <p>Most users should use NORMAL instead.
|
||||
*
|
||||
* <p>This option is useful for tries that map surrogate code *units* to
|
||||
* special values optimized for UTF-16 string processing
|
||||
* or for special error behavior for unpaired surrogates,
|
||||
* but those values are not to be associated with the lead surrogate code *points*.
|
||||
*
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
FIXED_LEAD_SURROGATES,
|
||||
/**
|
||||
* getRange() enumerates all same-value ranges as stored in the trie,
|
||||
* except that all surrogates (U+D800..U+DFFF) are treated as having the
|
||||
* surrogateValue, which is passed to getRange() as a separate parameter.
|
||||
* The surrogateValue is not transformed via filter().
|
||||
* See {@link Character#isSurrogate}.
|
||||
*
|
||||
* <p>Most users should use NORMAL instead.
|
||||
*
|
||||
* <p>This option is useful for tries that map surrogate code *units* to
|
||||
* special values optimized for UTF-16 string processing
|
||||
* or for special error behavior for unpaired surrogates,
|
||||
* but those values are not to be associated with the surrogate code *points*.
|
||||
*
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
FIXED_ALL_SURROGATES
|
||||
}
|
||||
|
||||
/**
|
||||
* Callback function interface: Modifies a trie value.
|
||||
* Optionally called by getRange().
|
||||
* The modified value will be returned by the getRange() function.
|
||||
*
|
||||
* <p>Can be used to ignore some of the value bits,
|
||||
* make a filter for one of several values,
|
||||
* return a value index computed from the trie value, etc.
|
||||
*
|
||||
* @see #getRange
|
||||
* @see #iterator
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public interface ValueFilter {
|
||||
/**
|
||||
* Modifies the trie value.
|
||||
*
|
||||
* @param value trie value
|
||||
* @return modified value
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public int apply(int value);
|
||||
}
|
||||
|
||||
/**
|
||||
* Range iteration result data.
|
||||
* Code points from start to end map to the same value.
|
||||
* The value may have been modified by {@link ValueFilter#apply(int)},
|
||||
* or it may be the surrogateValue if a RangeOption other than "normal" was used.
|
||||
*
|
||||
* @see #getRange
|
||||
* @see #iterator
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public static final class Range {
|
||||
private int start;
|
||||
private int end;
|
||||
private int value;
|
||||
|
||||
/**
|
||||
* Constructor. Sets start and end to -1 and value to 0.
|
||||
*
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public Range() {
|
||||
start = end = -1;
|
||||
value = 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* @return the start code point
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public int getStart() { return start; }
|
||||
/**
|
||||
* @return the (inclusive) end code point
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public int getEnd() { return end; }
|
||||
/**
|
||||
* @return the range value
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public int getValue() { return value; }
|
||||
/**
|
||||
* Sets the range. When using {@link #iterator()},
|
||||
* iteration will resume after the newly set end.
|
||||
*
|
||||
* @param start new start code point
|
||||
* @param end new end code point
|
||||
* @param value new value
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public void set(int start, int end, int value) {
|
||||
this.start = start;
|
||||
this.end = end;
|
||||
this.value = value;
|
||||
}
|
||||
}
|
||||
|
||||
private final class RangeIterator implements Iterator<Range> {
|
||||
private Range range = new Range();
|
||||
|
||||
@Override
|
||||
public boolean hasNext() {
|
||||
return -1 <= range.end && range.end < 0x10ffff;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Range next() {
|
||||
if (getRange(range.end + 1, null, range)) {
|
||||
return range;
|
||||
} else {
|
||||
throw new NoSuchElementException();
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public final void remove() {
|
||||
throw new UnsupportedOperationException();
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Iterates over code points of a string and fetches trie values.
|
||||
* This does not implement java.util.Iterator.
|
||||
*
|
||||
* <pre>
|
||||
* void onString(CodePointMap map, CharSequence s, int start) {
|
||||
* CodePointMap.StringIterator iter = map.stringIterator(s, start);
|
||||
* while (iter.next()) {
|
||||
* int end = iter.getIndex(); // code point from between start and end
|
||||
* useValue(s, start, end, iter.getCodePoint(), iter.getValue());
|
||||
* start = end;
|
||||
* }
|
||||
* }
|
||||
* </pre>
|
||||
*
|
||||
* <p>This class is not intended for public subclassing.
|
||||
*
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public class StringIterator {
|
||||
/**
|
||||
* @internal
|
||||
* @deprecated This API is ICU internal only.
|
||||
*/
|
||||
@Deprecated
|
||||
protected CharSequence s;
|
||||
/**
|
||||
* @internal
|
||||
* @deprecated This API is ICU internal only.
|
||||
*/
|
||||
@Deprecated
|
||||
protected int sIndex;
|
||||
/**
|
||||
* @internal
|
||||
* @deprecated This API is ICU internal only.
|
||||
*/
|
||||
@Deprecated
|
||||
protected int c;
|
||||
/**
|
||||
* @internal
|
||||
* @deprecated This API is ICU internal only.
|
||||
*/
|
||||
@Deprecated
|
||||
protected int value;
|
||||
|
||||
/**
|
||||
* @internal
|
||||
* @deprecated This API is ICU internal only.
|
||||
*/
|
||||
@Deprecated
|
||||
protected StringIterator(CharSequence s, int sIndex) {
|
||||
this.s = s;
|
||||
this.sIndex = sIndex;
|
||||
c = -1;
|
||||
value = 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Resets the iterator to a new string and/or a new string index.
|
||||
*
|
||||
* @param s string to iterate over
|
||||
* @param sIndex string index where the iteration will start
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public void reset(CharSequence s, int sIndex) {
|
||||
this.s = s;
|
||||
this.sIndex = sIndex;
|
||||
c = -1;
|
||||
value = 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Reads the next code point, post-increments the string index,
|
||||
* and gets a value from the trie.
|
||||
* Sets the trie error value if the code point is an unpaired surrogate.
|
||||
*
|
||||
* @return true if the string index was not yet at the end of the string;
|
||||
* otherwise the iterator did not advance
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public boolean next() {
|
||||
if (sIndex >= s.length()) {
|
||||
return false;
|
||||
}
|
||||
c = Character.codePointAt(s, sIndex);
|
||||
sIndex += Character.charCount(c);
|
||||
value = get(c);
|
||||
return true;
|
||||
}
|
||||
|
||||
/**
|
||||
* Reads the previous code point, pre-decrements the string index,
|
||||
* and gets a value from the trie.
|
||||
* Sets the trie error value if the code point is an unpaired surrogate.
|
||||
*
|
||||
* @return true if the string index was not yet at the start of the string;
|
||||
* otherwise the iterator did not advance
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public boolean previous() {
|
||||
if (sIndex <= 0) {
|
||||
return false;
|
||||
}
|
||||
c = Character.codePointBefore(s, sIndex);
|
||||
sIndex -= Character.charCount(c);
|
||||
value = get(c);
|
||||
return true;
|
||||
}
|
||||
/**
|
||||
* @return the string index
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public final int getIndex() { return sIndex; }
|
||||
/**
|
||||
* @return the code point
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public final int getCodePoint() { return c; }
|
||||
/**
|
||||
* @return the trie value,
|
||||
* or the trie error value if the code point is an unpaired surrogate
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public final int getValue() { return value; }
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns the value for a code point as stored in the trie, with range checking.
|
||||
* Returns the trie error value if c is not in the range 0..U+10FFFF.
|
||||
*
|
||||
* @param c the code point
|
||||
* @return the trie value,
|
||||
* or the trie error value if the code point is not in the range 0..U+10FFFF
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public abstract int get(int c);
|
||||
|
||||
/**
|
||||
* Sets the range object to a range of code points beginning with the start parameter.
|
||||
* The range end is the last code point such that
|
||||
* all those from start to there have the same value.
|
||||
* Returns false if start is not 0..U+10FFFF.
|
||||
* Can be used to efficiently iterate over all same-value ranges in a trie.
|
||||
*
|
||||
* <p>If the {@link ValueFilter} parameter is not null, then
|
||||
* the value to be delivered is passed through that filter, and the range end is the end
|
||||
* of the range where all values are modified to the same actual value.
|
||||
* The value is unchanged if that parameter is null.
|
||||
*
|
||||
* <p>Example:
|
||||
* <pre>
|
||||
* int start = 0;
|
||||
* CodePointMap.Range range = new CodePointMap.Range();
|
||||
* while (trie.getRange(start, null, range)) {
|
||||
* int end = range.getEnd();
|
||||
* int value = range.getValue();
|
||||
* // Work with the range start..end and its value.
|
||||
* start = end + 1;
|
||||
* }
|
||||
* </pre>
|
||||
*
|
||||
* @param start range start
|
||||
* @param filter an object that may modify the trie data value,
|
||||
* or null if the values from the trie are to be used unmodified
|
||||
* @param range the range object that will be set to the code point range and value
|
||||
* @return true if start is 0..U+10FFFF; otherwise no new range is fetched
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public abstract boolean getRange(int start, ValueFilter filter, Range range);
|
||||
|
||||
/**
|
||||
* Sets the range object to a range of code points beginning with the start parameter.
|
||||
* The range end is the last code point such that
|
||||
* all those from start to there have the same value.
|
||||
* Returns false if start is not 0..U+10FFFF.
|
||||
*
|
||||
* <p>Same as the simpler {@link #getRange(int, ValueFilter, Range)} but optionally
|
||||
* modifies the range if it overlaps with surrogate code points.
|
||||
*
|
||||
* @param start range start
|
||||
* @param option defines whether surrogates are treated normally,
|
||||
* or as having the surrogateValue; usually {@link RangeOption#NORMAL}
|
||||
* @param surrogateValue value for surrogates; ignored if option=={@link RangeOption#NORMAL}
|
||||
* @param filter an object that may modify the trie data value,
|
||||
* or null if the values from the trie are to be used unmodified
|
||||
* @param range the range object that will be set to the code point range and value
|
||||
* @return true if start is 0..U+10FFFF; otherwise no new range is fetched
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public boolean getRange(int start, RangeOption option, int surrogateValue,
|
||||
ValueFilter filter, Range range) {
|
||||
assert option != null;
|
||||
if (!getRange(start, filter, range)) {
|
||||
return false;
|
||||
}
|
||||
if (option == RangeOption.NORMAL) {
|
||||
return true;
|
||||
}
|
||||
int surrEnd = option == RangeOption.FIXED_ALL_SURROGATES ? 0xdfff : 0xdbff;
|
||||
int end = range.end;
|
||||
if (end < 0xd7ff || start > surrEnd) {
|
||||
return true;
|
||||
}
|
||||
// The range overlaps with surrogates, or ends just before the first one.
|
||||
if (range.value == surrogateValue) {
|
||||
if (end >= surrEnd) {
|
||||
// Surrogates followed by a non-surrValue range,
|
||||
// or surrogates are part of a larger surrValue range.
|
||||
return true;
|
||||
}
|
||||
} else {
|
||||
if (start <= 0xd7ff) {
|
||||
range.end = 0xd7ff; // Non-surrValue range ends before surrValue surrogates.
|
||||
return true;
|
||||
}
|
||||
// Start is a surrogate with a non-surrValue code *unit* value.
|
||||
// Return a surrValue code *point* range.
|
||||
range.value = surrogateValue;
|
||||
if (end > surrEnd) {
|
||||
range.end = surrEnd; // Surrogate range ends before non-surrValue rest of range.
|
||||
return true;
|
||||
}
|
||||
}
|
||||
// See if the surrValue surrogate range can be merged with
|
||||
// an immediately following range.
|
||||
if (getRange(surrEnd + 1, filter, range) && range.value == surrogateValue) {
|
||||
range.start = start;
|
||||
return true;
|
||||
}
|
||||
range.start = start;
|
||||
range.end = surrEnd;
|
||||
range.value = surrogateValue;
|
||||
return true;
|
||||
}
|
||||
|
||||
/**
|
||||
* Convenience iterator over same-trie-value code point ranges.
|
||||
* Same as looping over all ranges with {@link #getRange(int, ValueFilter, Range)}
|
||||
* without filtering.
|
||||
* Adjacent ranges have different trie values.
|
||||
*
|
||||
* <p>The iterator always returns the same Range object.
|
||||
*
|
||||
* @return a Range iterator
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
@Override
|
||||
public Iterator<Range> iterator() {
|
||||
return new RangeIterator();
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns an iterator (not a java.util.Iterator) over code points of a string
|
||||
* for fetching trie values.
|
||||
*
|
||||
* @param s string to iterate over
|
||||
* @param sIndex string index where the iteration will start
|
||||
* @return the iterator
|
||||
* @draft ICU 63
|
||||
* @provisional This API might change or be removed in a future release.
|
||||
*/
|
||||
public StringIterator stringIterator(CharSequence s, int sIndex) {
|
||||
return new StringIterator(s, sIndex);
|
||||
}
|
||||
}
|
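The getRange() Javadoc above describes stepping through same-value ranges with start = end + 1. The following is a minimal sketch of that loop which does not depend on ICU: the class RangeLoopSketch, its methods rangeEnd() and allRanges(), and the toy lambda map are hypothetical stand-ins, and the linear scan in rangeEnd() is only for illustration (a real trie would find range ends from its index blocks rather than code point by code point).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Hypothetical stand-in for a code point map; this is not the ICU class. */
final class RangeLoopSketch {
    static final int MAX_CP = 0x10ffff;

    /**
     * Returns the last code point, from start on, that maps to the same value
     * as start, or -1 if start is not in 0..U+10FFFF.
     */
    static int rangeEnd(IntUnaryOperator map, int start) {
        if (start < 0 || start > MAX_CP) {
            return -1;
        }
        int value = map.applyAsInt(start);
        int end = start;
        while (end < MAX_CP && map.applyAsInt(end + 1) == value) {
            ++end;
        }
        return end;
    }

    /** Collects {start, end, value} triples that together cover 0..U+10FFFF. */
    static List<int[]> allRanges(IntUnaryOperator map) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = rangeEnd(map, start)) >= 0) {
            ranges.add(new int[] { start, end, map.applyAsInt(start) });
            start = end + 1;  // same stepping as in the Javadoc example
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Toy data: value 0x34 for U+0040..U+00E6, value 0 everywhere else.
        IntUnaryOperator toy = c -> (0x40 <= c && c < 0xe7) ? 0x34 : 0;
        for (int[] r : allRanges(toy)) {
            System.out.printf("U+%04X..U+%04X -> 0x%x%n", r[0], r[1], r[2]);
        }
    }
}
```

The loop terminates because the start after the last range is 0x110000, which is out of range, mirroring how getRange() returns false for a start outside 0..U+10FFFF.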
1271
icu4j/main/classes/core/src/com/ibm/icu/util/CodePointTrie.java
Normal file
File diff suppressed because it is too large
4
icu4j/main/shared/data/icudata.jar
Executable file → Normal file
|
@ -1,3 +1,3 @@
|
|||
version https://git-lfs.github.com/spec/v1
|
||||
oid sha256:70c249360d5cc010c75203f5add8040cbcc4f33229e1d82d34b6185d69832143
|
||||
size 12510210
|
||||
oid sha256:a8be41753876c867630b4e740d692e0ae7ced119086a22cd4844ea7bf174d6f7
|
||||
size 12509408
|
||||
|
|
|
@ -1,3 +1,3 @@
|
|||
version https://git-lfs.github.com/spec/v1
|
||||
oid sha256:93a0bf4221a173b33aeda78f4646092caad816a6832310a89278de249ec18634
|
||||
oid sha256:55923dda88f8bf3affc2cf6d774a92a49e5fbc4be5583769bfe90fc7f319d2b1
|
||||
size 92857
|
||||
|
|
4
icu4j/main/shared/data/testdata.jar
Executable file → Normal file
|
@ -1,3 +1,3 @@
|
|||
version https://git-lfs.github.com/spec/v1
|
||||
oid sha256:47978ca4c19730c3d4387d9058679115dbf1e21964b993a889a38680fd3dfe47
|
||||
size 813186
|
||||
oid sha256:0d399ead8487d2beff526c723212022ba354501bb3777481f16b53241d24a8d1
|
||||
size 813119
|
||||
|
|
|
@ -2632,9 +2632,14 @@ public class BasicTest extends TestFmwk {
|
|||
@Test
|
||||
public void TestCustomComp() {
|
||||
String [][] pairs={
|
||||
{ "\\uD801\\uE000\\uDFFE", "" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
// ICU 63 normalization with CodePointTrie requires inert surrogate code points.
|
||||
// { "\\uD801\\uE000\\uDFFE", "" },
|
||||
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
|
||||
|
||||
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE002\\U000110B9\\u0327\\u0345" },
|
||||
{ "\\uE010\\U000F0011\\uE012", "\\uE011\\uE012" },
|
||||
{ "\\uE010\\U000F0011\\U000F0011\\uE012", "\\uE011\\U000F0010" },
|
||||
|
@ -2661,9 +2666,14 @@ public class BasicTest extends TestFmwk {
|
|||
@Test
|
||||
public void TestCustomFCC() {
|
||||
String[][] pairs={
|
||||
{ "\\uD801\\uE000\\uDFFE", "" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
// ICU 63 normalization with CodePointTrie requires inert surrogate code points.
|
||||
// { "\\uD801\\uE000\\uDFFE", "" },
|
||||
// { "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD7FF\\uFFFF" },
|
||||
// { "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD7FF\\U000107FE\\uFFFF" },
|
||||
{ "\\uD801\\uE000\\uDFFE", "\\uD801\\uDFFE" },
|
||||
{ "\\uD800\\uD801\\uE000\\uDFFE\\uDFFF", "\\uD800\\uD801\\uDFFE\\uDFFF" },
|
||||
{ "\\uD800\\uD801\\uDFFE\\uDFFF", "\\uD800\\U000107FE\\uDFFF" },
|
||||
|
||||
// The following expected result is different from CustomComp
|
||||
// because of only-contiguous composition.
|
||||
{ "\\uE001\\U000110B9\\u0345\\u0308\\u0327", "\\uE001\\U000110B9\\u0327\\u0308\\u0345" },
|
||||
|
|
|
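The updated expectations in TestCustomComp and TestCustomFCC keep unpaired surrogates in the output and combine only properly paired ones. As a rough illustration of how code point iteration sees such input, the sketch below walks the third test string with the standard Character.codePointAt() and Character.charCount() calls, the same calls that CodePointMap.StringIterator.next() uses. The class SurrogateWalkSketch and its codePoints() method are hypothetical, and the sketch performs no normalization; it only shows which code points a forward iteration delivers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the code points of a string, stdlib only. */
final class SurrogateWalkSketch {
    static List<Integer> codePoints(CharSequence s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            // An unpaired surrogate is returned as its own code unit value.
            int c = Character.codePointAt(s, i);
            i += Character.charCount(c);
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        // Lead D800 is unpaired, D801+DFFE form a pair, trail DFFF is unpaired.
        for (int c : codePoints("\uD800\uD801\uDFFE\uDFFF")) {
            System.out.printf("U+%04X%n", c);
        }
    }
}
```

The pair D801 DFFE decodes to U+107FE, which matches the \U000107FE in the expected string of that test case, while D800 and DFFF come through as single surrogate code points.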
@ -0,0 +1,985 @@
|
|||
// © 2018 and later: Unicode, Inc. and others.
|
||||
// License & terms of use: http://www.unicode.org/copyright.html#License
|
||||
|
||||
// created: 2018jul10 Markus W. Scherer
|
||||
|
||||
// This is a fairly straight port from cintltst/ucptrietest.c.
|
||||
// It wants to remain close to the C code, rather than be completely colloquial Java.
|
||||
|
||||
package com.ibm.icu.dev.test.util;
|
||||
|
||||
import java.io.ByteArrayOutputStream;
|
||||
import java.nio.ByteBuffer;
|
||||
import java.util.Arrays;
|
||||
|
||||
import org.junit.Test;
|
||||
import org.junit.runner.RunWith;
|
||||
import org.junit.runners.JUnit4;
|
||||
|
||||
import com.ibm.icu.dev.test.TestFmwk;
|
||||
import com.ibm.icu.impl.Normalizer2Impl.UTF16Plus;
|
||||
import com.ibm.icu.util.CodePointMap;
|
||||
import com.ibm.icu.util.CodePointTrie;
|
||||
import com.ibm.icu.util.MutableCodePointTrie;
|
||||
|
||||
@RunWith(JUnit4.class)
|
||||
public final class CodePointTrieTest extends TestFmwk {
|
||||
/* Values for setting possibly overlapping, out-of-order ranges of values */
|
||||
private static class SetRange {
|
||||
SetRange(int start, int limit, int value) {
|
||||
this.start = start;
|
||||
this.limit = limit;
|
||||
this.value = value;
|
||||
}
|
||||
|
||||
final int start, limit;
|
||||
final int value;
|
||||
}
|
||||
|
||||
// Returned from getSpecialValues(). Values extracted from an array of CheckRange.
|
||||
private static class SpecialValues {
|
||||
SpecialValues(int i, int initialValue, int errorValue) {
|
||||
this.i = i;
|
||||
this.initialValue = initialValue;
|
||||
this.errorValue = errorValue;
|
||||
}
|
||||
|
||||
final int i;
|
||||
final int initialValue;
|
||||
final int errorValue;
|
||||
}
|
||||
|
||||
/*
|
||||
* Values for testing:
|
||||
* value is set from the previous boundary's limit to before
|
||||
* this boundary's limit
|
||||
*
|
||||
* There must be an entry with limit 0 and the initialValue.
|
||||
* It may be preceded by an entry with negative limit and the errorValue.
|
||||
*/
|
||||
private static class CheckRange {
|
||||
CheckRange(int limit, int value) {
|
||||
this.limit = limit;
|
||||
this.value = value;
|
||||
}
|
||||
|
||||
final int limit;
|
||||
final int value;
|
||||
}
|
||||
|
||||
private static int skipSpecialValues(CheckRange checkRanges[]) {
|
||||
int i;
|
||||
for(i=0; i<checkRanges.length && checkRanges[i].limit<=0; ++i) {}
|
||||
return i;
|
||||
}
|
||||
|
||||
private static SpecialValues getSpecialValues(CheckRange checkRanges[]) {
|
||||
int i=0;
|
||||
int initialValue, errorValue;
|
||||
if(i<checkRanges.length && checkRanges[i].limit<0) {
|
||||
errorValue=checkRanges[i++].value;
|
||||
} else {
|
||||
errorValue=0xad;
|
||||
}
|
||||
if(i<checkRanges.length && checkRanges[i].limit==0) {
|
||||
initialValue=checkRanges[i++].value;
|
||||
} else {
|
||||
initialValue=0;
|
||||
}
|
||||
return new SpecialValues(i, initialValue, errorValue);
|
||||
}
|
||||
|
||||
/* ucptrie_enum() callback, modifies a value */
|
||||
private static class TestValueFilter implements CodePointMap.ValueFilter {
|
||||
@Override
|
||||
public int apply(int value) {
|
||||
return value ^ 0x5555;
|
||||
}
|
||||
}
|
||||
private static final TestValueFilter testFilter = new TestValueFilter();
|
||||
|
||||
private boolean
|
||||
doCheckRange(String name, String variant,
|
||||
int start, boolean getRangeResult, CodePointMap.Range range,
|
||||
int expEnd, int expValue) {
|
||||
if (!getRangeResult) {
|
||||
if (expEnd >= 0) {
|
||||
fail(String.format( // log_err(
|
||||
"error: %s getRanges (%s) fails to deliver range [U+%04x..U+%04x].0x%x\n",
|
||||
name, variant, start, expEnd, expValue));
|
||||
}
|
||||
return false;
|
||||
}
|
||||
if (expEnd < 0) {
|
||||
fail(String.format(
|
||||
"error: %s getRanges (%s) delivers unexpected range [U+%04x..U+%04x].0x%x\n",
|
||||
name, variant, range.getStart(), range.getEnd(), range.getValue()));
|
||||
return false;
|
||||
}
|
||||
if (range.getStart() != start || range.getEnd() != expEnd || range.getValue() != expValue) {
|
||||
fail(String.format(
|
||||
"error: %s getRanges (%s) delivers wrong range [U+%04x..U+%04x].0x%x " +
|
||||
"instead of [U+%04x..U+%04x].0x%x\n",
|
||||
name, variant, range.getStart(), range.getEnd(), range.getValue(),
|
||||
start, expEnd, expValue));
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
// Test iteration starting from various UTF-8/16 and trie structure boundaries.
|
||||
// Also test starting partway through lead & trail surrogates for fixed-surrogate-value options,
|
||||
// and partway through supplementary code points.
|
||||
private static int iterStarts[] = {
|
||||
0, 0x7f, 0x80, 0x7ff, 0x800, 0xfff, 0x1000,
|
||||
0xd7ff, 0xd800, 0xd888, 0xdddd, 0xdfff, 0xe000,
|
||||
0xffff, 0x10000, 0x12345, 0x10ffff, 0x110000
|
||||
};
|
||||
|
||||
private void
|
||||
testTrieGetRanges(String testName, CodePointMap trie,
|
||||
CodePointMap.RangeOption option, int surrValue,
|
||||
CheckRange checkRanges[]) {
|
||||
String typeName = trie instanceof MutableCodePointTrie ? "mutableTrie" : "trie";
|
||||
CodePointMap.Range range = new CodePointMap.Range();
|
||||
for (int s = 0; s < iterStarts.length; ++s) {
|
||||
int start = iterStarts[s];
|
||||
int i, i0;
|
||||
int expEnd;
|
||||
int expValue;
|
||||
boolean getRangeResult;
|
||||
// No need to go from each iteration start to the very end.
|
||||
int innerLoopCount;
|
||||
|
||||
String name = String.format("%s/%s(%s) min=U+%04x", typeName, option, testName, start);
|
||||
|
||||
// Skip over special values and low ranges.
|
||||
for (i = 0; i < checkRanges.length && checkRanges[i].limit <= start; ++i) {}
|
||||
i0 = i;
|
||||
// without value handler
|
||||
for (innerLoopCount = 0;; ++i, start = range.getEnd() + 1) {
|
||||
if (i < checkRanges.length) {
|
||||
expEnd = checkRanges[i].limit - 1;
|
||||
expValue = checkRanges[i].value;
|
||||
} else {
|
||||
expEnd = -1;
|
||||
expValue = 0x5005;
|
||||
}
|
||||
getRangeResult = option != CodePointMap.RangeOption.NORMAL ?
|
||||
trie.getRange(start, option, surrValue, null, range) :
|
||||
trie.getRange(start, null, range);
|
||||
if (!doCheckRange(name, "without value handler",
|
||||
start, getRangeResult, range, expEnd, expValue)) {
|
||||
break;
|
||||
}
|
||||
if (s != 0 && ++innerLoopCount == 5) { break; }
|
||||
}
|
||||
// with value handler
|
||||
for (i = i0, start = iterStarts[s], innerLoopCount = 0;;
|
||||
++i, start = range.getEnd() + 1) {
|
||||
if (i < checkRanges.length) {
|
||||
expEnd = checkRanges[i].limit - 1;
|
||||
expValue = checkRanges[i].value ^ 0x5555;
|
||||
} else {
|
||||
expEnd = -1;
|
||||
expValue = 0x5005;
|
||||
}
|
||||
getRangeResult = trie.getRange(start, option, surrValue ^ 0x5555, testFilter, range);
|
||||
if (!doCheckRange(name, "with value handler",
|
||||
start, getRangeResult, range, expEnd, expValue)) {
|
||||
break;
|
||||
}
|
||||
if (s != 0 && ++innerLoopCount == 5) { break; }
|
||||
}
|
||||
// C also tests without value (with a NULL value pointer),
|
||||
// but that does not apply to Java.
|
||||
}
|
||||
}
|
||||
|
||||
// Note: There is much less to do here in polymorphic Java than in C
|
||||
// where we have many specialized macros in addition to generic functions.
|
||||
private void
|
||||
testTrieGetters(String testName, CodePointTrie trie,
|
||||
CodePointTrie.Type type, CodePointTrie.ValueWidth valueWidth,
|
||||
CheckRange checkRanges[]) {
|
||||
int value, value2;
|
||||
int start, limit;
|
||||
int i;
|
||||
int countErrors=0;
|
||||
|
||||
CodePointTrie.Fast fastTrie =
|
||||
type == CodePointTrie.Type.FAST ? (CodePointTrie.Fast)trie : null;
|
||||
String typeName = "trie";
|
||||
|
||||
SpecialValues specials = getSpecialValues(checkRanges);
|
||||
|
||||
start=0;
|
||||
for(i=specials.i; i<checkRanges.length; ++i) {
|
||||
limit=checkRanges[i].limit;
|
||||
value=checkRanges[i].value;
|
||||
|
||||
while(start<limit) {
|
||||
if (start <= 0x7f) {
|
||||
value2 = trie.asciiGet(start);
|
||||
if (value != value2) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).fromASCII(U+%04x)==0x%x instead of 0x%x\n",
|
||||
typeName, testName, start, value2, value));
|
||||
++countErrors;
|
||||
}
|
||||
}
|
||||
if (fastTrie != null) {
|
||||
if(start<=0xffff) {
|
||||
value2 = fastTrie.bmpGet(start);
|
||||
if(value!=value2) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).fromBMP(U+%04x)==0x%x instead of 0x%x\n",
|
||||
typeName, testName, start, value2, value));
|
||||
++countErrors;
|
||||
}
|
||||
} else {
|
||||
value2 = fastTrie.suppGet(start);
|
||||
if(value!=value2) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).fromSupp(U+%04x)==0x%x instead of 0x%x\n",
|
||||
typeName, testName, start, value2, value));
|
||||
++countErrors;
|
||||
}
|
||||
}
|
||||
}
|
||||
value2 = trie.get(start);
|
||||
if(value!=value2) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).get(U+%04x)==0x%x instead of 0x%x\n",
|
||||
typeName, testName, start, value2, value));
|
||||
++countErrors;
|
||||
}
|
||||
++start;
|
||||
if(countErrors>10) {
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/* test errorValue */
|
||||
value = trie.get(-1);
|
||||
value2 = trie.get(0x110000);
|
||||
if(value!=specials.errorValue || value2!=specials.errorValue) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).get(out of range) != errorValue\n",
|
||||
typeName, testName));
|
||||
}
|
||||
}
|
||||
|
||||
private void
|
||||
testBuilderGetters(String testName, MutableCodePointTrie mutableTrie, CheckRange checkRanges[]) {
|
||||
int value, value2;
|
||||
int start, limit;
|
||||
int i;
|
||||
int countErrors=0;
|
||||
|
||||
String typeName = "mutableTrie";
|
||||
|
||||
SpecialValues specials=getSpecialValues(checkRanges);
|
||||
|
||||
start=0;
|
||||
for(i=specials.i; i<checkRanges.length; ++i) {
|
||||
limit=checkRanges[i].limit;
|
||||
value=checkRanges[i].value;
|
||||
|
||||
while(start<limit) {
|
||||
value2=mutableTrie.get(start);
|
||||
if(value!=value2) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).get(U+%04x)==0x%x instead of 0x%x\n",
|
||||
typeName, testName, start, value2, value));
|
||||
++countErrors;
|
||||
}
|
||||
++start;
|
||||
if(countErrors>10) {
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/* test errorValue */
|
||||
value=mutableTrie.get(-1);
|
||||
value2=mutableTrie.get(0x110000);
|
||||
if(value!=specials.errorValue || value2!=specials.errorValue) {
|
||||
fail(String.format(
|
||||
"error: %s(%s).get(out of range) != errorValue\n",
|
||||
typeName, testName));
|
||||
}
|
||||
}
|
||||
|
||||
private static boolean ACCIDENTAL_SURROGATE_PAIR(CharSequence s, int cp) {
|
||||
return s.length() > 0 &&
|
||||
Character.isHighSurrogate(s.charAt(s.length() - 1)) &&
|
||||
UTF16Plus.isTrailSurrogate(cp);
|
||||
}
|
||||
|
||||
private void
|
||||
testTrieUTF16(String testName,
|
||||
CodePointTrie trie, CodePointTrie.ValueWidth valueWidth,
|
||||
CheckRange checkRanges[]) {
|
||||
StringBuilder s = new StringBuilder();
|
||||
int[] values = new int[16000];
|
||||
|
||||
int errorValue = trie.get(-1);
|
||||
int value, expected;
|
||||
int prevCP, c, c2;
|
||||
int i, sIndex, countValues;
|
||||
|
||||
/* write a string */
|
||||
prevCP=0;
|
||||
countValues=0;
|
||||
for(i=skipSpecialValues(checkRanges); i<checkRanges.length; ++i) {
|
||||
value=checkRanges[i].value;
|
||||
/* write three code points */
|
||||
if(!ACCIDENTAL_SURROGATE_PAIR(s, prevCP)) {
|
||||
s.appendCodePoint(prevCP); /* start of the range */
|
||||
values[countValues++]=value;
|
||||
}
|
||||
c=checkRanges[i].limit;
|
||||
prevCP=(prevCP+c)/2; /* middle of the range */
|
||||
if(!ACCIDENTAL_SURROGATE_PAIR(s, prevCP)) {
|
||||
s.appendCodePoint(prevCP);
|
||||
values[countValues++]=value;
|
||||
}
|
||||
prevCP=c;
|
||||
--c; /* end of the range */
|
||||
if(!ACCIDENTAL_SURROGATE_PAIR(s, c)) {
|
||||
s.appendCodePoint(c);
|
||||
values[countValues++]=value;
|
||||
}
|
||||
}
|
||||
CodePointMap.StringIterator si = trie.stringIterator(s, 0);
|
||||
|
||||
/* try forward */
|
||||
sIndex = 0;
|
||||
i=0;
|
||||
while (sIndex < s.length()) {
|
||||
c2 = s.codePointAt(sIndex);
|
||||
sIndex += Character.charCount(c2);
|
||||
assertTrue("next() at " + si.getIndex(), si.next());
|
||||
c = si.getCodePoint();
|
||||
value = si.getValue();
|
||||
expected = UTF16Plus.isSurrogate(c) ? errorValue : values[i];
|
||||
if(value!=expected) {
|
||||
fail(String.format(
|
||||
"error: wrong value from UCPTRIE_NEXT(%s)(U+%04x): 0x%x instead of 0x%x\n",
|
||||
testName, c, value, expected));
|
||||
}
|
||||
if(c!=c2) {
|
||||
fail(String.format(
|
||||
"error: wrong code point from UCPTRIE_NEXT(%s): U+%04x != U+%04x\n",
|
||||
testName, c, c2));
|
||||
continue;
|
||||
}
|
||||
++i;
|
||||
}
|
||||
assertFalse("next() at the end", si.next());
|
||||
|
||||
/* try backward */
|
||||
sIndex = s.length();
|
||||
i=countValues;
|
||||
while (sIndex > 0) {
|
||||
--i;
|
||||
c2 = s.codePointBefore(sIndex);
|
||||
sIndex -= Character.charCount(c2);
|
||||
assertTrue("previous() at " + si.getIndex(), si.previous());
|
||||
c = si.getCodePoint();
|
||||
value = si.getValue();
|
||||
expected = UTF16Plus.isSurrogate(c) ? errorValue : values[i];
|
||||
if(value!=expected) {
|
||||
fail(String.format(
|
||||
"error: wrong value from UCPTRIE_PREV(%s)(U+%04x): 0x%x instead of 0x%x\n",
|
||||
testName, c, value, expected));
|
||||
}
|
||||
if(c!=c2) {
|
||||
fail(String.format(
|
||||
"error: wrong code point from UCPTRIE_PREV(%s): U+%04x != U+%04x\n",
|
||||
testName, c, c2));
|
||||
}
|
||||
}
|
||||
assertFalse("previous() at the start", si.previous());
|
||||
}
|
||||
|
||||
private void
|
||||
testTrie(String testName, CodePointTrie trie,
|
||||
CodePointTrie.Type type, CodePointTrie.ValueWidth valueWidth,
|
||||
CheckRange checkRanges[]) {
|
||||
testTrieGetters(testName, trie, type, valueWidth, checkRanges);
|
||||
testTrieGetRanges(testName, trie, CodePointMap.RangeOption.NORMAL, 0, checkRanges);
|
||||
if (type == CodePointTrie.Type.FAST) {
|
||||
testTrieUTF16(testName, trie, valueWidth, checkRanges);
|
||||
// Java: no testTrieUTF8(testName, trie, valueWidth, checkRanges);
|
||||
}
|
||||
}
|
||||
|
||||
private void
|
||||
testBuilder(String testName, MutableCodePointTrie mutableTrie, CheckRange checkRanges[]) {
|
||||
testBuilderGetters(testName, mutableTrie, checkRanges);
|
||||
testTrieGetRanges(testName, mutableTrie, CodePointMap.RangeOption.NORMAL, 0, checkRanges);
|
||||
}
|
||||
|
||||
private void
|
||||
testTrieSerialize(String testName, MutableCodePointTrie mutableTrie,
|
||||
CodePointTrie.Type type, CodePointTrie.ValueWidth valueWidth, boolean withSwap,
|
||||
CheckRange checkRanges[]) {
|
||||
CodePointTrie trie;
|
||||
int length1;
|
||||
|
||||
/* clone the trie so that the caller can reuse the original */
|
||||
mutableTrie = mutableTrie.clone();
|
||||
|
||||
/*
|
||||
* This is not a loop, but simply a block that we can exit with "break"
|
||||
* when something goes wrong.
|
||||
*/
|
||||
do {
|
||||
trie = mutableTrie.buildImmutable(type, valueWidth);
|
||||
ByteArrayOutputStream os = new ByteArrayOutputStream();
|
||||
length1=trie.toBinary(os);
|
||||
assertEquals(testName + ".toBinary() length", os.size(), length1);
|
||||
ByteBuffer storage = ByteBuffer.wrap(os.toByteArray());
|
||||
// Java: no preflighting
|
||||
|
||||
testTrie(testName, trie, type, valueWidth, checkRanges);
|
||||
trie=null;
|
||||
|
||||
// Java: There is no code for "swapping" the endianness of data.
|
||||
// withSwap is unused.
|
||||
|
||||
trie = CodePointTrie.fromBinary(type, valueWidth, storage);
|
||||
if(type != trie.getType()) {
|
||||
fail(String.format(
|
||||
"error: trie serialization (%s) did not preserve trie type\n", testName));
|
||||
break;
|
||||
}
|
||||
if(valueWidth != trie.getValueWidth()) {
|
||||
fail(String.format(
|
||||
"error: trie serialization (%s) did not preserve data value width\n", testName));
|
||||
break;
|
||||
}
|
||||
if(os.size()!=storage.position()) {
|
||||
fail(String.format(
|
||||
"error: trie serialization (%s) lengths different: " +
|
||||
"serialize vs. unserialize\n", testName));
|
||||
break;
|
||||
}
|
||||
|
||||
{
|
||||
storage.rewind();
|
||||
CodePointTrie any = CodePointTrie.fromBinary(null, null, storage);
|
||||
if (type != any.getType()) {
|
||||
fail(String.format(
|
||||
"error: ucptrie_openFromBinary(" +
|
||||
"UCPTRIE_TYPE_ANY, UCPTRIE_VALUE_BITS_ANY).getType() wrong\n"));
|
||||
}
|
||||
if (valueWidth != any.getValueWidth()) {
|
||||
fail(String.format(
|
||||
"error: ucptrie_openFromBinary(" +
|
||||
"UCPTRIE_TYPE_ANY, UCPTRIE_VALUE_BITS_ANY).getValueWidth() wrong\n"));
|
||||
}
|
||||
}
|
||||
|
||||
testTrie(testName, trie, type, valueWidth, checkRanges);
|
||||
{
|
||||
/* make a mutable trie from an immutable one */
|
||||
int value, value2;
|
||||
MutableCodePointTrie mutable2 = MutableCodePointTrie.fromCodePointMap(trie);
|
||||
|
||||
value=mutable2.get(0xa1);
|
||||
mutable2.set(0xa1, 789);
|
||||
value2=mutable2.get(0xa1);
|
||||
mutable2.set(0xa1, value);
|
||||
if(value2!=789) {
|
||||
fail(String.format(
|
||||
"error: modifying a mutableTrie-from-UCPTrie (%s) failed\n",
|
||||
testName));
|
||||
}
|
||||
testBuilder(testName, mutable2, checkRanges);
|
||||
}
|
||||
} while(false);
|
||||
}
|
||||
|
||||
private MutableCodePointTrie
|
||||
testTrieSerializeAllValueWidth(String testName,
|
||||
MutableCodePointTrie mutableTrie, boolean withClone,
|
||||
CheckRange checkRanges[]) {
|
||||
int oredValues = 0;
|
||||
int i;
|
||||
for (i = 0; i < checkRanges.length; ++i) {
|
||||
oredValues |= checkRanges[i].value;
|
||||
}
|
||||
|
||||
testBuilder(testName, mutableTrie, checkRanges);
|
||||
|
||||
if (oredValues <= 0xffff) {
|
||||
String name = testName + ".16";
|
||||
testTrieSerialize(name, mutableTrie,
|
||||
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_16, withClone,
|
||||
checkRanges);
|
||||
}
|
||||
|
||||
String name = testName + ".32";
|
||||
testTrieSerialize(name, mutableTrie,
|
||||
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_32, withClone,
|
||||
checkRanges);
|
||||
|
||||
if (oredValues <= 0xff) {
|
||||
name = testName + ".8";
|
||||
testTrieSerialize(name, mutableTrie,
|
||||
CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_8, withClone,
|
||||
checkRanges);
|
||||
}
|
||||
|
||||
if (oredValues <= 0xffff) {
|
||||
name = testName + ".small16";
|
||||
testTrieSerialize(name, mutableTrie,
|
||||
CodePointTrie.Type.SMALL, CodePointTrie.ValueWidth.BITS_16, withClone,
|
||||
checkRanges);
|
||||
}
|
||||
|
||||
return mutableTrie;
|
||||
}
|
||||
|
||||
    private MutableCodePointTrie
    makeTrieWithRanges(String testName, boolean withClone,
                       SetRange setRanges[], CheckRange checkRanges[]) {
        MutableCodePointTrie mutableTrie;
        int value;
        int start, limit;
        int i;

        System.out.println("\ntesting Trie " + testName);
        SpecialValues specials = getSpecialValues(checkRanges);
        mutableTrie = new MutableCodePointTrie(specials.initialValue, specials.errorValue);

        /* set values from setRanges[] */
        for(i=0; i<setRanges.length; ++i) {
            if(withClone && i==setRanges.length/2) {
                /* switch to a clone in the middle of setting values */
                MutableCodePointTrie clone = mutableTrie.clone();
                mutableTrie = clone;
            }
            start=setRanges[i].start;
            limit=setRanges[i].limit;
            value=setRanges[i].value;
            if ((limit - start) == 1) {
                mutableTrie.set(start, value);
            } else {
                mutableTrie.setRange(start, limit-1, value);
            }
        }

        return mutableTrie;
    }

    private void
    testTrieRanges(String testName, boolean withClone, SetRange setRanges[], CheckRange checkRanges[]) {
        MutableCodePointTrie mutableTrie = makeTrieWithRanges(
            testName, withClone, setRanges, checkRanges);
        if (mutableTrie != null) {
            mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, withClone, checkRanges);
        }
    }

    /* test data ----------------------------------------------------------------*/

    /* set consecutive ranges, even with value 0 */
    private static final SetRange
    setRanges1[]={
        new SetRange(0,        0x40,     0),
        new SetRange(0x40,     0xe7,     0x34),
        new SetRange(0xe7,     0x3400,   0),
        new SetRange(0x3400,   0x9fa6,   0x61),
        new SetRange(0x9fa6,   0xda9e,   0x31),
        new SetRange(0xdada,   0xeeee,   0xff),
        new SetRange(0xeeee,   0x11111,  1),
        new SetRange(0x11111,  0x44444,  0x61),
        new SetRange(0x44444,  0x60003,  0),
        new SetRange(0xf0003,  0xf0004,  0xf),
        new SetRange(0xf0004,  0xf0006,  0x10),
        new SetRange(0xf0006,  0xf0007,  0x11),
        new SetRange(0xf0007,  0xf0040,  0x12),
        new SetRange(0xf0040,  0x110000, 0)
    };

    private static final CheckRange
    checkRanges1[]={
        new CheckRange(0,        0),
        new CheckRange(0x40,     0),
        new CheckRange(0xe7,     0x34),
        new CheckRange(0x3400,   0),
        new CheckRange(0x9fa6,   0x61),
        new CheckRange(0xda9e,   0x31),
        new CheckRange(0xdada,   0),
        new CheckRange(0xeeee,   0xff),
        new CheckRange(0x11111,  1),
        new CheckRange(0x44444,  0x61),
        new CheckRange(0xf0003,  0),
        new CheckRange(0xf0004,  0xf),
        new CheckRange(0xf0006,  0x10),
        new CheckRange(0xf0007,  0x11),
        new CheckRange(0xf0040,  0x12),
        new CheckRange(0x110000, 0)
    };

    /* set some interesting overlapping ranges */
    private static final SetRange
    setRanges2[]={
        new SetRange(0x21,     0x7f,     0x5555),
        new SetRange(0x2f800,  0x2fedc,  0x7a),
        new SetRange(0x72,     0xdd,     3),
        new SetRange(0xdd,     0xde,     4),
        new SetRange(0x201,    0x240,    6),  /* 3 consecutive blocks with the same pattern but */
        new SetRange(0x241,    0x280,    6),  /* discontiguous value ranges, testing iteration */
        new SetRange(0x281,    0x2c0,    6),
        new SetRange(0x2f987,  0x2fa98,  5),
        new SetRange(0x2f777,  0x2f883,  0),
        new SetRange(0x2fedc,  0x2ffaa,  1),
        new SetRange(0x2ffaa,  0x2ffab,  2),
        new SetRange(0x2ffbb,  0x2ffc0,  7)
    };

    private static final CheckRange
    checkRanges2[]={
        new CheckRange(0,        0),
        new CheckRange(0x21,     0),
        new CheckRange(0x72,     0x5555),
        new CheckRange(0xdd,     3),
        new CheckRange(0xde,     4),
        new CheckRange(0x201,    0),
        new CheckRange(0x240,    6),
        new CheckRange(0x241,    0),
        new CheckRange(0x280,    6),
        new CheckRange(0x281,    0),
        new CheckRange(0x2c0,    6),
        new CheckRange(0x2f883,  0),
        new CheckRange(0x2f987,  0x7a),
        new CheckRange(0x2fa98,  5),
        new CheckRange(0x2fedc,  0x7a),
        new CheckRange(0x2ffaa,  1),
        new CheckRange(0x2ffab,  2),
        new CheckRange(0x2ffbb,  0),
        new CheckRange(0x2ffc0,  7),
        new CheckRange(0x110000, 0)
    };

    /* use a non-zero initial value */
    private static final SetRange
    setRanges3[]={
        new SetRange(0x31,     0xa4,     1),
        new SetRange(0x3400,   0x6789,   2),
        new SetRange(0x8000,   0x89ab,   9),
        new SetRange(0x9000,   0xa000,   4),
        new SetRange(0xabcd,   0xbcde,   3),
        new SetRange(0x55555,  0x110000, 6),  /* highStart<U+ffff with non-initialValue */
        new SetRange(0xcccc,   0x55555,  6)
    };

    private static final CheckRange
    checkRanges3[]={
        new CheckRange(0,        9),  /* non-zero initialValue */
        new CheckRange(0x31,     9),
        new CheckRange(0xa4,     1),
        new CheckRange(0x3400,   9),
        new CheckRange(0x6789,   2),
        new CheckRange(0x9000,   9),
        new CheckRange(0xa000,   4),
        new CheckRange(0xabcd,   9),
        new CheckRange(0xbcde,   3),
        new CheckRange(0xcccc,   9),
        new CheckRange(0x110000, 6)
    };

    /* empty or single-value tries, testing highStart==0 */
    private static final SetRange
    setRangesEmpty[]={
        // new SetRange(0,        0,        0),  /* need some values for it to compile */
    };

    private static final CheckRange
    checkRangesEmpty[]={
        new CheckRange(0,        3),
        new CheckRange(0x110000, 3)
    };

    private static final SetRange
    setRangesSingleValue[]={
        new SetRange(0, 0x110000, 5),
    };

    private static final CheckRange
    checkRangesSingleValue[]={
        new CheckRange(0,        3),
        new CheckRange(0x110000, 5)
    };

    @Test
    public void TrieTestSet1() {
        testTrieRanges("set1", false, setRanges1, checkRanges1);
    }

    @Test
    public void TrieTestSet2Overlap() {
        testTrieRanges("set2-overlap", false, setRanges2, checkRanges2);
    }

    @Test
    public void TrieTestSet3Initial9() {
        testTrieRanges("set3-initial-9", false, setRanges3, checkRanges3);
    }

    @Test
    public void TrieTestSetEmpty() {
        testTrieRanges("set-empty", false, setRangesEmpty, checkRangesEmpty);
    }

    @Test
    public void TrieTestSetSingleValue() {
        testTrieRanges("set-single-value", false, setRangesSingleValue, checkRangesSingleValue);
    }

    @Test
    public void TrieTestSet2OverlapWithClone() {
        testTrieRanges("set2-overlap.withClone", true, setRanges2, checkRanges2);
    }

    /* test mutable-trie memory management -------------------------------------- */

    @Test
    public void FreeBlocksTest() {
        final CheckRange
        checkRanges[]={
            new CheckRange(0,        1),
            new CheckRange(0x740,    1),
            new CheckRange(0x780,    2),
            new CheckRange(0x880,    3),
            new CheckRange(0x110000, 1)
        };
        String testName="free-blocks";

        MutableCodePointTrie mutableTrie;
        int i;

        mutableTrie=new MutableCodePointTrie(1, 0xad);

        /*
         * Repeatedly set overlapping same-value ranges to stress the free-data-block management.
         * If it fails, it will overflow the data array.
         */
        for(i=0; i<(0x120000>>4)/2; ++i) {  // 4=UCPTRIE_SHIFT_3
            mutableTrie.setRange(0x740, 0x840-1, 1);
            mutableTrie.setRange(0x780, 0x880-1, 1);
            mutableTrie.setRange(0x740, 0x840-1, 2);
            mutableTrie.setRange(0x780, 0x880-1, 3);
        }
        /* make blocks that will be free during compaction */
        mutableTrie.setRange(0x1000, 0x3000-1, 2);
        mutableTrie.setRange(0x2000, 0x4000-1, 3);
        mutableTrie.setRange(0x1000, 0x4000-1, 1);

        mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, false, checkRanges);
    }

    @Test
    public void GrowDataArrayTest() {
        final CheckRange
        checkRanges[]={
            new CheckRange(0,        1),
            new CheckRange(0x720,    2),
            new CheckRange(0x7a0,    3),
            new CheckRange(0x8a0,    4),
            new CheckRange(0x110000, 5)
        };
        String testName="grow-data";

        MutableCodePointTrie mutableTrie;
        int i;

        mutableTrie=new MutableCodePointTrie(1, 0xad);

        /*
         * Use set(), not setRange(), to write non-initialValue data
         * (umutablecptrie_set() vs. umutablecptrie_setRange() in the C API).
         * Should grow/reallocate the data array to a sufficient length.
         */
        for(i=0; i<0x1000; ++i) {
            mutableTrie.set(i, 2);
        }
        for(i=0x720; i<0x1100; ++i) {  /* some overlap */
            mutableTrie.set(i, 3);
        }
        for(i=0x7a0; i<0x900; ++i) {
            mutableTrie.set(i, 4);
        }
        for(i=0x8a0; i<0x110000; ++i) {
            mutableTrie.set(i, 5);
        }

        mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, false, checkRanges);
    }

    @Test
    public void ManyAllSameBlocksTest() {
        String testName="many-all-same";

        MutableCodePointTrie mutableTrie;
        int i;
        CheckRange[] checkRanges = new CheckRange[(0x110000 >> 12) + 1];

        mutableTrie = new MutableCodePointTrie(0xff33, 0xad);
        checkRanges[0] = new CheckRange(0, 0xff33);  // initialValue

        // Many all-same-value blocks.
        for (i = 0; i < 0x110000; i += 0x1000) {
            int value = i >> 12;
            mutableTrie.setRange(i, i + 0xfff, value);
            checkRanges[value + 1] = new CheckRange(i + 0x1000, value);
        }
        for (i = 0; i < 0x110000; i += 0x1000) {
            int expected = i >> 12;
            int v0 = mutableTrie.get(i);
            int vfff = mutableTrie.get(i + 0xfff);
            if (v0 != expected || vfff != expected) {
                fail(String.format(
                        "error: MutableCodePointTrie U+%04x unexpected value\n", i));
            }
        }

        mutableTrie = testTrieSerializeAllValueWidth(testName, mutableTrie, false, checkRanges);
    }

    @Test
    public void MuchDataTest() {
        String testName="much-data";

        MutableCodePointTrie mutableTrie;
        int r, c;
        CheckRange[] checkRanges = new CheckRange[(0x10000 >> 6) + (0x10240 >> 4) + 10];

        mutableTrie = new MutableCodePointTrie(0xff33, 0xad);
        checkRanges[0] = new CheckRange(0, 0xff33);  // initialValue
        r = 1;

        // Add much data that does not compact well,
        // to get more than 128k data values after compaction.
        for (c = 0; c < 0x10000; c += 0x40) {
            int value = c >> 4;
            mutableTrie.setRange(c, c + 0x3f, value);
            checkRanges[r++] = new CheckRange(c + 0x40, value);
        }
        checkRanges[r++] = new CheckRange(0x20000, 0xff33);
        for (c = 0x20000; c < 0x30230; c += 0x10) {
            int value = c >> 4;
            mutableTrie.setRange(c, c + 0xf, value);
            checkRanges[r++] = new CheckRange(c + 0x10, value);
        }
        mutableTrie.setRange(0x30230, 0x30233, 0x3023);
        checkRanges[r++] = new CheckRange(0x30234, 0x3023);
        mutableTrie.setRange(0x30234, 0xdffff, 0x5005);
        checkRanges[r++] = new CheckRange(0xe0000, 0x5005);
        mutableTrie.setRange(0xe0000, 0x10ffff, 0x9009);
        checkRanges[r++] = new CheckRange(0x110000, 0x9009);

        checkRanges = Arrays.copyOf(checkRanges, r);
        testBuilder(testName, mutableTrie, checkRanges);
        testTrieSerialize("much-data.16", mutableTrie,
                          CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_16, false,
                          checkRanges);
    }

    private void testGetRangesFixedSurr(String testName, MutableCodePointTrie mutableTrie,
            CodePointMap.RangeOption option, CheckRange checkRanges[]) {
        testTrieGetRanges(testName, mutableTrie, option, 5, checkRanges);
        MutableCodePointTrie clone = mutableTrie.clone();
        CodePointTrie trie =
                clone.buildImmutable(CodePointTrie.Type.FAST, CodePointTrie.ValueWidth.BITS_16);
        testTrieGetRanges(testName, trie, option, 5, checkRanges);
    }

    @Test
    public void TrieTestGetRangesFixedSurr() {
        final SetRange
        setRangesFixedSurr[]={
            new SetRange(0xd000, 0xd7ff, 5),
            new SetRange(0xd7ff, 0xe001, 3),
            new SetRange(0xe001, 0xf900, 5),
        };

        final CheckRange
        checkRangesFixedLeadSurr1[]={
            new CheckRange(0,        0),
            new CheckRange(0xd000,   0),
            new CheckRange(0xd7ff,   5),
            new CheckRange(0xd800,   3),
            new CheckRange(0xdc00,   5),
            new CheckRange(0xe001,   3),
            new CheckRange(0xf900,   5),
            new CheckRange(0x110000, 0)
        };

        final CheckRange
        checkRangesFixedAllSurr1[]={
            new CheckRange(0,        0),
            new CheckRange(0xd000,   0),
            new CheckRange(0xd7ff,   5),
            new CheckRange(0xd800,   3),
            new CheckRange(0xe000,   5),
            new CheckRange(0xe001,   3),
            new CheckRange(0xf900,   5),
            new CheckRange(0x110000, 0)
        };

        final CheckRange
        checkRangesFixedLeadSurr3[]={
            new CheckRange(0,        0),
            new CheckRange(0xd000,   0),
            new CheckRange(0xdc00,   5),
            new CheckRange(0xe001,   3),
            new CheckRange(0xf900,   5),
            new CheckRange(0x110000, 0)
        };

        final CheckRange
        checkRangesFixedAllSurr3[]={
            new CheckRange(0,        0),
            new CheckRange(0xd000,   0),
            new CheckRange(0xe000,   5),
            new CheckRange(0xe001,   3),
            new CheckRange(0xf900,   5),
            new CheckRange(0x110000, 0)
        };

        final CheckRange
        checkRangesFixedSurr4[]={
            new CheckRange(0,        0),
            new CheckRange(0xd000,   0),
            new CheckRange(0xf900,   5),
            new CheckRange(0x110000, 0)
        };

        MutableCodePointTrie mutableTrie = makeTrieWithRanges(
                "fixedSurr", false, setRangesFixedSurr, checkRangesFixedLeadSurr1);
        testGetRangesFixedSurr("fixedLeadSurr1", mutableTrie,
                CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, checkRangesFixedLeadSurr1);
        testGetRangesFixedSurr("fixedAllSurr1", mutableTrie,
                CodePointMap.RangeOption.FIXED_ALL_SURROGATES, checkRangesFixedAllSurr1);
        // Setting a range in the middle of lead surrogates makes no difference.
        mutableTrie.setRange(0xd844, 0xd899, 5);
        testGetRangesFixedSurr("fixedLeadSurr2", mutableTrie,
                CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, checkRangesFixedLeadSurr1);
        // Bridge the gap before the lead surrogates.
        mutableTrie.set(0xd7ff, 5);
        testGetRangesFixedSurr("fixedLeadSurr3", mutableTrie,
                CodePointMap.RangeOption.FIXED_LEAD_SURROGATES, checkRangesFixedLeadSurr3);
        testGetRangesFixedSurr("fixedAllSurr3", mutableTrie,
                CodePointMap.RangeOption.FIXED_ALL_SURROGATES, checkRangesFixedAllSurr3);
        // Bridge the gap after the trail surrogates.
        mutableTrie.set(0xe000, 5);
        testGetRangesFixedSurr("fixedSurr4", mutableTrie,
                CodePointMap.RangeOption.FIXED_ALL_SURROGATES, checkRangesFixedSurr4);
    }
}
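
Note on the test tables above: each `SetRange` appears to be a `(start, limit, value)` triple with an exclusive limit, applied in order so that later ranges overwrite earlier ones, and each `CheckRange` a `(limit, value)` pair giving the end of one run of equal values. The following is a minimal sketch of that relationship that does not use ICU at all. The class name `CheckRangeSketch`, the method `deriveCheckRanges`, and the flat `int[]` model are assumptions made for illustration, not part of the test or of the ICU4J API. It also omits the leading `(0, initialValue)` entry that the check tables carry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, ICU-free model of the SetRange -> CheckRange relationship. */
public class CheckRangeSketch {
    /** One past the last code point, U+10FFFF. */
    static final int LIMIT = 0x110000;

    /** {start, limit (exclusive), value}, copied from setRanges2 above. */
    static final int[][] SET_RANGES_2 = {
        {0x21,    0x7f,    0x5555},
        {0x2f800, 0x2fedc, 0x7a},
        {0x72,    0xdd,    3},
        {0xdd,    0xde,    4},
        {0x201,   0x240,   6},
        {0x241,   0x280,   6},
        {0x281,   0x2c0,   6},
        {0x2f987, 0x2fa98, 5},
        {0x2f777, 0x2f883, 0},
        {0x2fedc, 0x2ffaa, 1},
        {0x2ffaa, 0x2ffab, 2},
        {0x2ffbb, 0x2ffc0, 7},
    };

    /**
     * Applies the set ranges in order to a flat array (later ranges overwrite
     * earlier ones), then returns {limit, value} pairs, one per run of equal values.
     */
    static List<int[]> deriveCheckRanges(int initialValue, int[][] setRanges) {
        int[] values = new int[LIMIT];
        Arrays.fill(values, initialValue);
        for (int[] r : setRanges) {
            Arrays.fill(values, r[0], r[1], r[2]);  // toIndex is exclusive
        }
        List<int[]> result = new ArrayList<>();
        int value = values[0];
        for (int c = 1; c < LIMIT; ++c) {
            if (values[c] != value) {
                result.add(new int[] {c, value});  // run ended just before c
                value = values[c];
            }
        }
        result.add(new int[] {LIMIT, value});
        return result;
    }

    public static void main(String[] args) {
        for (int[] r : deriveCheckRanges(0, SET_RANGES_2)) {
            System.out.printf("new CheckRange(0x%x, 0x%x)%n", r[0], r[1]);
        }
    }
}
```

Under these assumptions, the derived pairs for `SET_RANGES_2` with initial value 0 line up with the entries of `checkRanges2` after its first `(0, 0)` entry. A flat array is used only because it makes the overwrite semantics obvious; the real trie stores the same mapping in compacted blocks.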