Mirror of https://github.com/unicode-org/icu.git (synced 2025-04-04 13:05:31 +00:00)
ICU-22707 gennorm2 & C++ norm2impl support MaybeNo
- .nrm formatVersion 5 - updated data format doc & design doc
Parent: d0e43d6943
Commit: 94ef2757a3
16 changed files with 1134 additions and 919 deletions
|
@ -109,15 +109,21 @@ Per starter that combines forward, old and new data stores a linear, sorted list
|
|||
* Canonical ordering requires ccc data.
|
||||
* Composition only combines the most recent starter with one other character, if such a mapping is defined. It only ever combines one pair into one composite per step.
|
||||
* Every composition is the reverse of a corresponding decomposition. That is, a decomposition can either be a one-way mapping (from one code point to a sequence of one or more others but not back from that sequence to the original), or it can be a two-way mapping (from one code point to a pair of others, and back).
|
||||
* Note: Custom mappings may also map some characters away, that is, to an empty string. The ICU 4.2 implementation is not prepared to handle such a case because it does not occur in standard Unicode normalization. This will need to be supported for custom tables.
|
||||
* Note: Unicode NFKC\_Casefold and UTS #46 map each Default\_Ignorable\_Code\_Point to an empty string.
|
||||
* Note: Mappings may also map some characters away, that is, to an empty string. The ICU 4.2 implementation was not prepared to handle such a case.
|
||||
This needs to be supported for custom tables.
|
||||
For example, Unicode NFKC\_Casefold and UTS #46 map each Default\_Ignorable\_Code\_Point to an empty string.
|
||||
* A starter is defined as a character with ccc=0.
|
||||
* Only a starter can combine-forward, but most starters don't. (The set of compositions/2-way mappings in standard Unicode normalization increases only slowly.)
|
||||
* A composite (result of combining a pair of characters) must have ccc=0, or else the result of composition may not be in canonical order because there is not another reordering step.
|
||||
* A composite can combine-forward. The composition algorithm tries to combine the new composite with following characters. (For example, base characters with two diacritics, and Hangul LVT.)
|
||||
* The ICU implementation recomposes starting from a fully decomposed sequence. Therefore, the lookup value needs to indicate combines-forward only for characters that do not have a mapping. The composition table result then indicates whether a composite combines-forward, and the index to the combined mapping+composition data is then found via the index from the composite's lookup result.
|
||||
* ICU 49 composePair() needs to know whether the first character combines forward even if it is a composite. formatVersion 2 separates the YesNo range into two parts accordingly, adding the yesNoMappingsOnly threshold.
|
||||
* A composite cannot combine-back because the composition algorithm does not try to combine an earlier starter with the new composite.
|
||||
* A composite itself cannot combine-back because the composition algorithm does not try to combine
|
||||
an earlier starter with the new composite.
|
||||
However, when a character has a two-way mapping which starts with a combine-back character,
|
||||
then the composite needs to be marked as combine-back (NF*C_QC=Maybe)
|
||||
so that normalization and the quick check work properly.
|
||||
Such characters occur in Unicode 16 for the first time.
|
||||
* The algorithm allows for a character to both combine-back and combine-forward.
|
||||
Such characters occur in Unicode 16 for the first time.
|
||||
* Hangul syllables are algorithmically decomposed into Jamos, and algorithmically recomposed from them. The actual mappings are not stored in the table.
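
The algorithmic Hangul handling mentioned in the last bullet can be sketched as follows. This is a minimal illustration based on the arithmetic in the Unicode Standard (conjoining Jamo behavior), not ICU's own `Hangul` class; the names `decomposeHangul` and `composeHangul` and the `k...` constants are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Constants from the Unicode Standard's Hangul syllable arithmetic.
constexpr char32_t kSBase = 0xAC00, kLBase = 0x1100, kVBase = 0x1161, kTBase = 0x11A7;
constexpr int kLCount = 19, kVCount = 21, kTCount = 28;
constexpr int kNCount = kVCount * kTCount;  // 588
constexpr int kSCount = kLCount * kNCount;  // 11172

inline bool isHangulSyllable(char32_t c) {
    return c >= kSBase && c < kSBase + kSCount;
}

// Hypothetical helper: writes 2 or 3 Jamos into out and returns the count.
inline int decomposeHangul(char32_t s, std::array<char32_t, 3> &out) {
    assert(isHangulSyllable(s));
    int sIndex = static_cast<int>(s - kSBase);
    out[0] = kLBase + sIndex / kNCount;              // leading consonant (L)
    out[1] = kVBase + (sIndex % kNCount) / kTCount;  // vowel (V)
    int tIndex = sIndex % kTCount;
    out[2] = (tIndex == 0) ? 0 : kTBase + tIndex;    // optional trailing consonant (T)
    return (tIndex == 0) ? 2 : 3;
}

// Hypothetical helper: recomposes L + V (+ optional T; pass 0 for none).
inline char32_t composeHangul(char32_t l, char32_t v, char32_t t = 0) {
    int lIndex = static_cast<int>(l - kLBase);
    int vIndex = static_cast<int>(v - kVBase);
    int tIndex = (t == 0) ? 0 : static_cast<int>(t - kTBase);
    return kSBase + (lIndex * kVCount + vIndex) * kTCount + tIndex;
}
```

Because both directions are pure arithmetic, no per-syllable mapping needs to be stored in the table.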
|
||||
|
@ -144,16 +150,22 @@ A simple mapping to one code point can be stored directly in the lookup value, w
|
|||
|
||||
ICU does not allow tailoring of Hangul/Jamo mappings and compositions, except to make the relevant characters completely inert.
|
||||
|
||||
MaybeNo is both forbidden and irrelevant:
|
||||
MaybeNo is possible, and Unicode 16 adds the first such characters.
|
||||
|
||||
* If a character has a one-way mapping, it has NoNo quick check values.
|
||||
* If it has a two-way mapping, then it is a composite, but the Unicode composition would not try to combine it with a preceding character.
|
||||
* Composition sees NFD, so it sees no characters with mappings.
|
||||
* However, if a two-way mapping starts with a "Maybe" character (combines-back),
|
||||
then the composite must also be marked as combines-back, that is, MaybeNo rather than YesNo.
|
||||
* The character and some surrounding ones need to be decomposed,
|
||||
and composition may combine the first character in the mapping with a previous starter,
|
||||
in which case the original composite would not occur in the result.
|
||||
|
||||
* Forbidden: If it has a one-way mapping, it has NoNo quick check values. If it has a two-way mapping, then it is a composite, but the Unicode composition would not try to combine it with a preceding character.
|
||||
* Irrelevant: Composition sees NFD, so it sees no characters with mappings. A combines-backward character would never combine with anything.
|
||||
|
||||
NoYes is impossible: If it has no mapping, it will occur in NFC.
|
||||
|
||||
A YesNo only ever decomposes into a YesYes+MaybeYes sequence or a YesNo+MaybeYes sequence. That is, a YesNo's decomposition (A=B+C) decomposes further if and only if the first (B) of its two components has a decomposition.
|
||||
|
||||
A YesNo always has ccc=lccc=0.
|
||||
A YesNo or MaybeNo always has ccc=lccc=0.
|
||||
|
||||
Only a starter can combine-forward, therefore no character can have ccc≠0 and combine forward.
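
The quick check combinations described above can be summarized in a small classifier. This is only a sketch of the rules as stated in this section; `Mapping`, `CharProps`, and `quickCheckClass` are hypothetical names and do not exist in ICU.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of one character's normalization properties.
enum class Mapping { None, OneWay, TwoWay };

struct CharProps {
    Mapping mapping;
    bool combinesBack;                // the character itself combines with a preceding starter
    bool firstOfMappingCombinesBack;  // TwoWay only: the mapping starts with a combine-back character
};

// Returns the combined quick check class (composition value, then decomposition value).
inline std::string quickCheckClass(const CharProps &p) {
    // Only a two-way mapping can start with a combine-back character in this model.
    assert(p.mapping == Mapping::TwoWay || !p.firstOfMappingCombinesBack);
    switch (p.mapping) {
    case Mapping::OneWay:
        return "NoNo";     // a one-way mapping never survives composition
    case Mapping::TwoWay:
        // The composite must be marked combine-back if its mapping starts with one.
        return p.firstOfMappingCombinesBack ? "MaybeNo" : "YesNo";
    case Mapping::None:
        return p.combinesBack ? "MaybeYes" : "YesYes";
    }
    return "";
}
```

Note that NoYes never appears as a result, matching the statement that a character without a mapping always occurs in NFC.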
|
||||
|
||||
|
@ -161,11 +173,9 @@ A NoNo can have any of its components decompose further, but this is only visibl
|
|||
|
||||
NoNo with combine-forward is impossible: A one-way mapping prevents composition (which starts from NFD where there are no decomposable characters).
|
||||
|
||||
### Per-character lookup values, .nrm formatVersion 3
|
||||
### Per-character lookup values, .nrm formatVersion 3+
|
||||
|
||||
Since ICU 60
|
||||
|
||||
Changes from version 2:
|
||||
Changes from version 2 to 3 (ICU 60):
|
||||
|
||||
* 16-bit value bit 0 used for has-composition-boundary-after, ccc & indexes shifted left by 1.
|
||||
* 16-bit values for delta mappings carry tccc data in bits 2..1.
|
||||
|
@ -178,7 +188,16 @@ Changes from version 2:
|
|||
* The extraData firstUnit bit 5 is no longer necessary (norm16 bit 0 used instead of firstUnit `MAPPING_NO_COMP_BOUNDARY_AFTER`), is reserved again, and always set to 0.
|
||||
* A mapping to an empty string has explicit lccc=1 and tccc=255 values.
|
||||
|
||||
Possible combinations and their encoding:
|
||||
Changes from version 3 to 4 (ICU 63):
|
||||
|
||||
Switch to UCPTrie=CodePointTrie. No more explicit mappings for surrogate code points.
|
||||
|
||||
Changes from version 4 to 5 (ICU 76):
|
||||
|
||||
Support for MaybeNo characters. Addition of two new ranges between "algorithmic NoNo" and MaybeYes,
|
||||
with thresholds minMaybeNo and minMaybeNoCombinesFwd.
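
As a rough illustration of the two new ranges, the following sketch classifies a norm16 value by range checks against the thresholds named above. The struct, enum, and function names are hypothetical, and any threshold values passed in are made up for illustration; real values come from the .nrm data file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the thresholds at the top of the norm16 value space.
struct Thresholds {
    uint16_t limitNoNo;
    uint16_t minMaybeNo;             // 8-aligned, start of the MaybeNo range
    uint16_t minMaybeNoCombinesFwd;  // MaybeNo characters that also combine-forward
    uint16_t minMaybeYes;
};

enum class UpperRange {
    BelowLimitNoNo,       // the Yes/No ranges with stored data, not covered here
    AlgorithmicNoNo,      // delta mappings
    MaybeNo,              // 2-way mapping & combine-back
    MaybeNoCombinesFwd,   // 2-way mapping, combine-back & combine-forward
    MaybeYesOrNonZeroCC   // everything at or above minMaybeYes
};

inline UpperRange classifyUpperRange(uint16_t norm16, const Thresholds &t) {
    assert(t.limitNoNo <= t.minMaybeNo);
    assert(t.minMaybeNo <= t.minMaybeNoCombinesFwd);
    assert(t.minMaybeNoCombinesFwd <= t.minMaybeYes);
    assert((t.minMaybeNo & 7) == 0);  // 8-aligned for the delta bit fields
    if (norm16 >= t.minMaybeYes) { return UpperRange::MaybeYesOrNonZeroCC; }
    if (norm16 >= t.minMaybeNoCombinesFwd) { return UpperRange::MaybeNoCombinesFwd; }
    if (norm16 >= t.minMaybeNo) { return UpperRange::MaybeNo; }
    if (norm16 >= t.limitNoNo) { return UpperRange::AlgorithmicNoNo; }
    return UpperRange::BelowLimitNoNo;
}
```

When a data file has no MaybeNo characters, the three upper thresholds can coincide and the two new ranges are simply empty.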
|
||||
|
||||
#### Possible combinations and their encoding
|
||||
|
||||
_The rows of the table, from bottom to top, are encoded with increasing 16-bit "norm16" values as noted in the last column. Per-row and per-row-group properties are determined via norm16 range checks._
|
||||
|
||||
|
@ -352,10 +371,41 @@ _The rows of the table, from bottom to top, are encoded with increasing 16-bit "
|
|||
</td>
|
||||
<td style="width:60px">U+1611E GURUNG KHEMA VOWEL SIGN AA</td>
|
||||
<td style="width:456px;height:31px">
|
||||
≥minMaybeYes which is 8-aligned
|
||||
≥minMaybeYes
|
||||
<br />
|
||||
index into composition table
|
||||
index into maybeData composition table
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="background-color:rgb(255,242,204);width:71px;height:31px">Maybe</td>
|
||||
<td style="background-color:rgb(244,204,204);width:71px;height:47px">No</td>
|
||||
<td style="width:52px;height:31px">0</td>
|
||||
<td style="width:75px;height:31px">no</td>
|
||||
<td style="width:74px;height:31px">yes</td>
|
||||
<td style="width:476px;height:31px">
|
||||
Has 2-way mapping, both combine-back & combine-fwd
|
||||
</td>
|
||||
<td style="width:60px">U+16121 GURUNG KHEMA VOWEL SIGN U</td>
|
||||
<td style="width:456px;height:31px">
|
||||
≥minMaybeNoCombinesFwd
|
||||
<br />
|
||||
index into maybeData decomp+comp table
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="background-color:rgb(255,242,204);width:71px;height:31px">Maybe</td>
|
||||
<td style="background-color:rgb(244,204,204);width:71px;height:47px">No</td>
|
||||
<td style="width:52px;height:31px">0</td>
|
||||
<td style="width:75px;height:31px">no</td>
|
||||
<td style="width:74px;height:31px">no</td>
|
||||
<td style="width:476px;height:31px">
|
||||
Has 2-way mapping & combine-back
|
||||
</td>
|
||||
<td style="width:60px">U+16126 GURUNG KHEMA VOWEL SIGN O</td>
|
||||
<td style="width:456px;height:31px">
|
||||
≥minMaybeNo which is 8-aligned
|
||||
<br />
|
||||
index into maybeData decomposition table
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
|
@ -384,9 +434,9 @@ _The rows of the table, from bottom to top, are encoded with increasing 16-bit "
|
|||
</td>
|
||||
<td style="width:60px">A</td>
|
||||
<td style="width:456px;height:47px">
|
||||
≥minNoNoDelta=minMaybeYes-((2*maxDelta+1)<<3)
|
||||
≥minNoNoDelta=minMaybeNo-((2*maxDelta+1)<<3)
|
||||
<br />
|
||||
delta=0 is at minMaybeYes-((maxDelta-1)<<3); it must not be used
|
||||
delta=0 is at minMaybeNo-((maxDelta-1)<<3); it must not be used
|
||||
<br />
|
||||
bits 2..1: tccc=0 or 1 or >1
|
||||
</td>
|
||||
|
@ -1157,15 +1207,28 @@ The minYesNoMappingsOnly distinction was added in ICU 49, .nrm formatVersion 2.0
|
|||
|
||||
### Additional data indexed by the trie value
|
||||
|
||||
(**Implemented in ICU 4.4, .nrm formatVersion 1.0. Modified in ICU 49, .nrm formatVersion 2.0 and in ICU 60, .nrm formatVersion 3.0.**)
|
||||
ICU | formatVersion
|
||||
--- | -------------
|
||||
4.4 | 1
|
||||
49 | 2
|
||||
60 | 3
|
||||
63 | 4
|
||||
76 | 5
|
||||
|
||||
"Extra data" per code point, if it has a mapping or if it combines-forward, is stored in 16-bit-unit arrays. The character's lookup value is an index into one of these arrays. It is probably handy to have two arrays, so that indexes can be allocated independently for the two ranges of 16-bit lookup values that are indexes into extra data.
|
||||
"Extra data" per code point, if it has a mapping or if it combines-forward, is stored in a 16-bit-unit array with many per-character data sections. The character's lookup value, or part of its bits, is an index to one of these sections.
|
||||
|
||||
* One array with composition lists for MaybeYes characters which don't also have a mapping.
|
||||
* Usually, MaybeYes characters don't have composition lists, so this array will usually be empty.
|
||||
* One array with
|
||||
* Composition lists for YesYes characters which don't also have a mapping
|
||||
* Mappings and optional composition lists for YesNo characters which do have a mapping
|
||||
* Composition lists for YesYes and MaybeYes characters which combine-forward but
|
||||
don't also have a mapping
|
||||
* Mappings and composition lists for YesNo and MaybeNo characters which have a two-way mapping
|
||||
* Only mappings for characters which have a one-way mapping
|
||||
|
||||
In formatVersions 4 and below, the composition lists for MaybeYes characters were stored before
|
||||
the data for other characters.
|
||||
|
||||
In formatVersion 5, the data for MaybeNo and MaybeYes characters is stored after
|
||||
the data for other characters.
|
||||
|
||||
There is no data in these arrays corresponding to the gap between limitNoNo and minMaybeNo.
|
||||
|
||||
Threshold values like minYesNo depend on the mapping data.
|
||||
|
||||
|
@ -1176,7 +1239,8 @@ Mapping to an empty string is encoded as a regular mapping with length 0.
|
|||
* formatVersion 3 stores explicit worst-case values lccc=1 and tccc=255.
|
||||
* formatVersion 1 & 2 store ccc=lccc=tccc=0, and the worst-case values are computed at runtime.
|
||||
|
||||
If both a mapping and a composition list are stored for a character (only possible for YesNo), the mapping comes first.
|
||||
If both a mapping and a composition list are stored for a character (for YesNo & MaybeNo),
|
||||
the mapping comes first.
|
||||
|
||||
* In formatVersion 2+, the trie value thresholds indicate whether there is a composition list.
|
||||
* In formatVersion 1, a bit in the first word indicates that there is a composition list.
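
When a mapping and a composition list are stored together, the list can be reached by skipping the mapping's first unit plus its length. The sketch below mirrors the pointer arithmetic used in `composePair()`; the function name is hypothetical, and the mask value `0x1f` is an assumption about ICU's `MAPPING_LENGTH_MASK` rather than something stated in this document.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the low 5 bits of the first unit hold the mapping length.
constexpr uint16_t kMappingLengthMask = 0x1f;

// Hypothetical helper: given a pointer to a character's extra data that starts
// with a mapping, returns a pointer to the composition list that follows it.
inline const uint16_t *compositionListAfterMapping(const uint16_t *data) {
    assert(data != nullptr);
    return data          // mapping pointer
        + 1              // skip the first unit, which holds the mapping length
        + (*data & kMappingLengthMask);  // skip the mapping itself
}
```

Whether a composition list is present at all is decided by the caller, via the trie value thresholds in formatVersion 2+.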
|
||||
|
@ -1225,9 +1289,9 @@ Optional composition list
|
|||
* Second unit bits 15..6 contain the combining-back code point's bits 9..0
|
||||
* The remaining second/third unit bits are the same as for the previous case
|
||||
|
||||
In the ICU implementation, it is ok to not store the ccc value directly in the lookup value for NoNo characters. When the quick check fails with YesNo, NoNo or MaybeYes, the surrounding sequence is decomposed, which does not use the original characters' ccc values. Composition then sees only YesYes and MaybeYes characters which do have their ccc values in the lookup value.
|
||||
In the ICU implementation, it is ok to not store the ccc value directly in the lookup value for NoNo characters. When the quick check fails with YesNo, MaybeNo, NoNo or MaybeYes, the surrounding sequence is decomposed, which does not use the original characters' ccc values. Composition then sees only YesYes and MaybeYes characters which do have their ccc values in the lookup value.
|
||||
|
||||
A composite that combines-forward has quick check flags YesNo, has a mapping, has ccc=0 (it's a starter) and lccc=0 (it composes from a starter plus another character) and has a composition list (it combines-forward).
|
||||
A composite that combines-forward has quick check flags YesNo or MaybeNo, has a mapping, has ccc=0 (it's a starter) and lccc=0 (it composes from a starter plus another character) and has a composition list (it combines-forward).
|
||||
|
||||
Old vs. new: The old composition data uses combine-forward and combine-back indexes stored in the extra data next to the mapping. In the new data structure, the combine-forward index is replaced by appending the composition list after the mapping, and the combine-back index is replaced by searching in the list for the back-combining code point itself.
|
||||
|
||||
|
|
|
@ -63,7 +63,7 @@ LoadedNormalizer2Impl::isAcceptable(void * /*context*/,
|
|||
pInfo->dataFormat[1]==0x72 &&
|
||||
pInfo->dataFormat[2]==0x6d &&
|
||||
pInfo->dataFormat[3]==0x32 &&
|
||||
pInfo->formatVersion[0]==4
|
||||
pInfo->formatVersion[0]==5
|
||||
) {
|
||||
// Normalizer2Impl *me=(Normalizer2Impl *)context;
|
||||
// uprv_memcpy(me->dataVersion, pInfo->dataVersion, 4);
|
||||
|
|
File diff suppressed because it is too large
|
@ -284,7 +284,7 @@ UBool ReorderingBuffer::append(const char16_t *s, int32_t length, UBool isNFD,
|
|||
U16_NEXT(s, i, length, c);
|
||||
if(i<length) {
|
||||
if (isNFD) {
|
||||
leadCC = Normalizer2Impl::getCCFromYesOrMaybe(impl.getRawNorm16(c));
|
||||
leadCC = Normalizer2Impl::getCCFromYesOrMaybeYes(impl.getRawNorm16(c));
|
||||
} else {
|
||||
leadCC = impl.getCC(impl.getNorm16(c));
|
||||
}
|
||||
|
@ -392,7 +392,7 @@ uint8_t ReorderingBuffer::previousCC() {
|
|||
--codePointStart;
|
||||
c=U16_GET_SUPPLEMENTARY(c2, c);
|
||||
}
|
||||
return impl.getCCFromYesOrMaybeCP(c);
|
||||
return impl.getCCFromYesOrMaybeYesCP(c);
|
||||
}
|
||||
|
||||
// Inserts c somewhere before the last character.
|
||||
|
@ -440,15 +440,14 @@ Normalizer2Impl::init(const int32_t *inIndexes, const UCPTrie *inTrie,
|
|||
minNoNoCompNoMaybeCC = static_cast<uint16_t>(inIndexes[IX_MIN_NO_NO_COMP_NO_MAYBE_CC]);
|
||||
minNoNoEmpty = static_cast<uint16_t>(inIndexes[IX_MIN_NO_NO_EMPTY]);
|
||||
limitNoNo = static_cast<uint16_t>(inIndexes[IX_LIMIT_NO_NO]);
|
||||
minMaybeNo = static_cast<uint16_t>(inIndexes[IX_MIN_MAYBE_NO]);
|
||||
minMaybeNoCombinesFwd = static_cast<uint16_t>(inIndexes[IX_MIN_MAYBE_NO_COMBINES_FWD]);
|
||||
minMaybeYes = static_cast<uint16_t>(inIndexes[IX_MIN_MAYBE_YES]);
|
||||
U_ASSERT((minMaybeYes & 7) == 0); // 8-aligned for noNoDelta bit fields
|
||||
centerNoNoDelta = (minMaybeYes >> DELTA_SHIFT) - MAX_DELTA - 1;
|
||||
U_ASSERT((minMaybeNo & 7) == 0); // 8-aligned for noNoDelta bit fields
|
||||
centerNoNoDelta = (minMaybeNo >> DELTA_SHIFT) - MAX_DELTA - 1;
|
||||
|
||||
normTrie=inTrie;
|
||||
|
||||
maybeYesCompositions=inExtraData;
|
||||
extraData=maybeYesCompositions+((MIN_NORMAL_MAYBE_YES-minMaybeYes)>>OFFSET_SHIFT);
|
||||
|
||||
extraData=inExtraData;
|
||||
smallFCD=inSmallFCD;
|
||||
}
|
||||
|
||||
|
@ -650,7 +649,7 @@ Normalizer2Impl::decompose(const char16_t *src, const char16_t *limit,
|
|||
}
|
||||
} else {
|
||||
if(isDecompYes(norm16)) {
|
||||
uint8_t cc=getCCFromYesOrMaybe(norm16);
|
||||
uint8_t cc=getCCFromYesOrMaybeYes(norm16);
|
||||
if(prevCC<=cc || cc==0) {
|
||||
prevCC=cc;
|
||||
if(cc<=1) {
|
||||
|
@ -702,12 +701,13 @@ UBool Normalizer2Impl::decompose(UChar32 c, uint16_t norm16,
|
|||
UErrorCode &errorCode) const {
|
||||
// get the decomposition and the lead and trail cc's
|
||||
if (norm16 >= limitNoNo) {
|
||||
if (isMaybeOrNonZeroCC(norm16)) {
|
||||
return buffer.append(c, getCCFromYesOrMaybe(norm16), errorCode);
|
||||
if (isMaybeYesOrNonZeroCC(norm16)) {
|
||||
return buffer.append(c, getCCFromYesOrMaybeYes(norm16), errorCode);
|
||||
} else if (norm16 < minMaybeNo) {
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c=mapAlgorithmic(c, norm16);
|
||||
norm16=getRawNorm16(c);
|
||||
}
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
c=mapAlgorithmic(c, norm16);
|
||||
norm16=getRawNorm16(c);
|
||||
}
|
||||
if (norm16 < minYesNo) {
|
||||
// c does not decompose
|
||||
|
@ -718,7 +718,7 @@ UBool Normalizer2Impl::decompose(UChar32 c, uint16_t norm16,
|
|||
return buffer.appendZeroCC(jamos, jamos+Hangul::decompose(c, jamos), errorCode);
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getData(norm16);
|
||||
uint16_t firstUnit=*mapping;
|
||||
int32_t length=firstUnit&MAPPING_LENGTH_MASK;
|
||||
uint8_t leadCC, trailCC;
|
||||
|
@ -787,9 +787,9 @@ Normalizer2Impl::decomposeUTF8(uint32_t options,
|
|||
}
|
||||
|
||||
// Medium-fast path: Quick check.
|
||||
if (isMaybeOrNonZeroCC(norm16)) {
|
||||
if (isMaybeYesOrNonZeroCC(norm16)) {
|
||||
// Does not decompose.
|
||||
uint8_t cc = getCCFromYesOrMaybe(norm16);
|
||||
uint8_t cc = getCCFromYesOrMaybeYes(norm16);
|
||||
if (prevCC <= cc || cc == 0) {
|
||||
prevCC = cc;
|
||||
if (cc <= 1) {
|
||||
|
@ -836,7 +836,7 @@ Normalizer2Impl::decomposeUTF8(uint32_t options,
|
|||
}
|
||||
// We already know there was a change if the original character decomposed;
|
||||
// otherwise compare.
|
||||
if (isMaybeOrNonZeroCC(norm16) && buffer.equals(prevBoundary, src)) {
|
||||
if (isMaybeYesOrNonZeroCC(norm16) && buffer.equals(prevBoundary, src)) {
|
||||
if (!ByteSinkUtil::appendUnchanged(prevBoundary, src,
|
||||
*sink, options, edits, errorCode)) {
|
||||
break;
|
||||
|
@ -867,9 +867,9 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
// Get the decomposition and the lead and trail cc's.
|
||||
UChar32 c = U_SENTINEL;
|
||||
if (norm16 >= limitNoNo) {
|
||||
if (isMaybeOrNonZeroCC(norm16)) {
|
||||
if (isMaybeYesOrNonZeroCC(norm16)) {
|
||||
// No comp boundaries around this character.
|
||||
uint8_t cc = getCCFromYesOrMaybe(norm16);
|
||||
uint8_t cc = getCCFromYesOrMaybeYes(norm16);
|
||||
if (cc == 0 && stopAt == STOP_AT_DECOMP_BOUNDARY) {
|
||||
return prevSrc;
|
||||
}
|
||||
|
@ -881,14 +881,15 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
return src;
|
||||
}
|
||||
continue;
|
||||
} else if (norm16 < minMaybeNo) {
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
if (stopAt != STOP_AT_LIMIT) {
|
||||
return prevSrc;
|
||||
}
|
||||
c = codePointFromValidUTF8(prevSrc, src);
|
||||
c = mapAlgorithmic(c, norm16);
|
||||
norm16 = getRawNorm16(c);
|
||||
}
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
if (stopAt != STOP_AT_LIMIT) {
|
||||
return prevSrc;
|
||||
}
|
||||
c = codePointFromValidUTF8(prevSrc, src);
|
||||
c = mapAlgorithmic(c, norm16);
|
||||
norm16 = getRawNorm16(c);
|
||||
} else if (stopAt != STOP_AT_LIMIT && norm16 < minNoNoCompNoMaybeCC) {
|
||||
return prevSrc;
|
||||
}
|
||||
|
@ -918,7 +919,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
}
|
||||
} else {
|
||||
// The character decomposes, get everything from the variable-length extra data.
|
||||
const uint16_t *mapping = getMapping(norm16);
|
||||
const uint16_t *mapping = getData(norm16);
|
||||
uint16_t firstUnit = *mapping;
|
||||
int32_t length = firstUnit & MAPPING_LENGTH_MASK;
|
||||
uint8_t trailCC = (uint8_t)(firstUnit >> 8);
|
||||
|
@ -946,7 +947,7 @@ Normalizer2Impl::decomposeShort(const uint8_t *src, const uint8_t *limit,
|
|||
const char16_t *
|
||||
Normalizer2Impl::getDecomposition(UChar32 c, char16_t buffer[4], int32_t &length) const {
|
||||
uint16_t norm16;
|
||||
if(c<minDecompNoCP || isMaybeOrNonZeroCC(norm16=getNorm16(c))) {
|
||||
if(c<minDecompNoCP || isMaybeYesOrNonZeroCC(norm16=getNorm16(c))) {
|
||||
// c does not decompose
|
||||
return nullptr;
|
||||
}
|
||||
|
@ -968,7 +969,7 @@ Normalizer2Impl::getDecomposition(UChar32 c, char16_t buffer[4], int32_t &length
|
|||
return buffer;
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getData(norm16);
|
||||
length=*mapping&MAPPING_LENGTH_MASK;
|
||||
return (const char16_t *)mapping+1;
|
||||
}
|
||||
|
@ -995,7 +996,7 @@ Normalizer2Impl::getRawDecomposition(UChar32 c, char16_t buffer[30], int32_t &le
|
|||
return buffer;
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getData(norm16);
|
||||
uint16_t firstUnit=*mapping;
|
||||
int32_t mLength=firstUnit&MAPPING_LENGTH_MASK; // length of normal mapping
|
||||
if(firstUnit&MAPPING_HAS_RAW_MAPPING) {
|
||||
|
@ -1070,7 +1071,7 @@ UBool Normalizer2Impl::norm16HasDecompBoundaryBefore(uint16_t norm16) const {
|
|||
return norm16 <= MIN_NORMAL_MAYBE_YES || norm16 == JAMO_VT;
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getDataForYesOrNo(norm16);
|
||||
uint16_t firstUnit=*mapping;
|
||||
// true if leadCC==0 (hasFCDBoundaryBefore())
|
||||
return (firstUnit&MAPPING_HAS_CCC_LCCC_WORD)==0 || (*(mapping-1)&0xff00)==0;
|
||||
|
@ -1091,14 +1092,15 @@ UBool Normalizer2Impl::norm16HasDecompBoundaryAfter(uint16_t norm16) const {
|
|||
return true;
|
||||
}
|
||||
if (norm16 >= limitNoNo) {
|
||||
if (isMaybeOrNonZeroCC(norm16)) {
|
||||
if (isMaybeYesOrNonZeroCC(norm16)) {
|
||||
return norm16 <= MIN_NORMAL_MAYBE_YES || norm16 == JAMO_VT;
|
||||
} else if (norm16 < minMaybeNo) {
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
return (norm16 & DELTA_TCCC_MASK) <= DELTA_TCCC_1;
|
||||
}
|
||||
// Maps to an isCompYesAndZeroCC.
|
||||
return (norm16 & DELTA_TCCC_MASK) <= DELTA_TCCC_1;
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getData(norm16);
|
||||
uint16_t firstUnit=*mapping;
|
||||
// decomp after-boundary: same as hasFCDBoundaryAfter(),
|
||||
// fcd16<=1 || trailCC==0
|
||||
|
@ -1240,7 +1242,7 @@ void Normalizer2Impl::recompose(ReorderingBuffer &buffer, int32_t recomposeStart
|
|||
|
||||
for(;;) {
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, p, limit, c, norm16);
|
||||
cc=getCCFromYesOrMaybe(norm16);
|
||||
cc=getCCFromYesOrMaybeYes(norm16);
|
||||
if( // this character combines backward and
|
||||
isMaybe(norm16) &&
|
||||
// we have seen a starter that combines forward and
|
||||
|
@ -1414,17 +1416,22 @@ Normalizer2Impl::composePair(UChar32 a, UChar32 b) const {
|
|||
}
|
||||
} else {
|
||||
// 'a' has a compositions list in extraData
|
||||
list=getMapping(norm16);
|
||||
list=getDataForYesOrNo(norm16);
|
||||
if(norm16>minYesNo) { // composite 'a' has both mapping & compositions list
|
||||
list+= // mapping pointer
|
||||
1+ // +1 to skip the first unit with the mapping length
|
||||
(*list&MAPPING_LENGTH_MASK); // + mapping length
|
||||
}
|
||||
}
|
||||
} else if(norm16<minMaybeYes || MIN_NORMAL_MAYBE_YES<=norm16) {
|
||||
} else if(norm16<minMaybeNoCombinesFwd || MIN_NORMAL_MAYBE_YES<=norm16) {
|
||||
return U_SENTINEL;
|
||||
} else {
|
||||
list=getCompositionsListForMaybe(norm16);
|
||||
list=getDataForMaybe(norm16);
|
||||
if(norm16<minMaybeYes) { // composite 'a' has both mapping & compositions list
|
||||
list+= // mapping pointer
|
||||
1+ // +1 to skip the first unit with the mapping length
|
||||
(*list&MAPPING_LENGTH_MASK); // + mapping length
|
||||
}
|
||||
}
|
||||
if(b<0 || 0x10ffff<b) { // combine(list, b) requires a valid code point b
|
||||
return U_SENTINEL;
|
||||
|
@ -1502,12 +1509,12 @@ Normalizer2Impl::compose(const char16_t *src, const char16_t *limit,
|
|||
}
|
||||
// isCompYesAndZeroCC(norm16) is false, that is, norm16>=minNoNo.
|
||||
// The current character is either a "noNo" (has a mapping)
|
||||
// or a "maybeYes" (combines backward)
|
||||
// or a "maybeYes" / "maybeNo" (combines backward)
|
||||
// or a "yesYes" with ccc!=0.
|
||||
// It is not a Hangul syllable or Jamo L because those have "yes" properties.
|
||||
|
||||
// Medium-fast path: Handle cases that do not require full decomposition and recomposition.
|
||||
if (!isMaybeOrNonZeroCC(norm16)) { // minNoNo <= norm16 < minMaybeYes
|
||||
if (norm16 < minMaybeNo) { // minNoNo <= norm16 < minMaybeNo
|
||||
if (!doCompose) {
|
||||
return false;
|
||||
}
|
||||
|
@ -1534,7 +1541,7 @@ Normalizer2Impl::compose(const char16_t *src, const char16_t *limit,
|
|||
if (prevBoundary != prevSrc && !buffer.appendZeroCC(prevBoundary, prevSrc, errorCode)) {
|
||||
break;
|
||||
}
|
||||
const char16_t *mapping = reinterpret_cast<const char16_t *>(getMapping(norm16));
|
||||
const char16_t *mapping = reinterpret_cast<const char16_t *>(getDataForYesOrNo(norm16));
|
||||
int32_t length = *mapping++ & MAPPING_LENGTH_MASK;
|
||||
if(!buffer.appendZeroCC(mapping, mapping + length, errorCode)) {
|
||||
break;
|
||||
|
@ -1763,7 +1770,7 @@ Normalizer2Impl::composeQuickCheck(const char16_t *src, const char16_t *limit,
|
|||
}
|
||||
// isCompYesAndZeroCC(norm16) is false, that is, norm16>=minNoNo.
|
||||
// The current character is either a "noNo" (has a mapping)
|
||||
// or a "maybeYes" (combines backward)
|
||||
// or a "maybeYes" / "maybeNo" (combines backward)
|
||||
// or a "yesYes" with ccc!=0.
|
||||
// It is not a Hangul syllable or Jamo L because those have "yes" properties.
|
||||
|
||||
|
@ -1784,8 +1791,9 @@ Normalizer2Impl::composeQuickCheck(const char16_t *src, const char16_t *limit,
|
|||
}
|
||||
}
|
||||
|
||||
if(isMaybeOrNonZeroCC(norm16)) {
|
||||
uint8_t cc=getCCFromYesOrMaybe(norm16);
|
||||
if (norm16 >= minMaybeNo) {
|
||||
uint16_t fcd16 = getFCD16FromMaybeOrNonZeroCC(norm16);
|
||||
uint8_t cc = fcd16 >> 8;
|
||||
if (onlyContiguous /* FCC */ && cc != 0 &&
|
||||
getTrailCCFromCompYesAndZeroCC(prevNorm16) > cc) {
|
||||
// The [prevBoundary..prevSrc[ character
|
||||
|
@ -1806,11 +1814,12 @@ Normalizer2Impl::composeQuickCheck(const char16_t *src, const char16_t *limit,
|
|||
if (src == limit) {
|
||||
return src;
|
||||
}
|
||||
uint8_t prevCC = cc;
|
||||
uint8_t prevCC = fcd16;
|
||||
nextSrc = src;
|
||||
UCPTRIE_FAST_U16_NEXT(normTrie, UCPTRIE_16, nextSrc, limit, c, norm16);
|
||||
if (isMaybeOrNonZeroCC(norm16)) {
|
||||
cc = getCCFromYesOrMaybe(norm16);
|
||||
if (norm16 >= minMaybeNo) {
|
||||
fcd16 = getFCD16FromMaybeOrNonZeroCC(norm16);
|
||||
cc = fcd16 >> 8;
|
||||
if (!(prevCC <= cc || cc == 0)) {
|
||||
break;
|
||||
}
|
||||
|
@ -1903,12 +1912,12 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
|
|||
}
|
||||
// isCompYesAndZeroCC(norm16) is false, that is, norm16>=minNoNo.
|
||||
// The current character is either a "noNo" (has a mapping)
|
||||
// or a "maybeYes" (combines backward)
|
||||
// or a "maybeYes" / "maybeNo" (combines backward)
|
||||
// or a "yesYes" with ccc!=0.
|
||||
// It is not a Hangul syllable or Jamo L because those have "yes" properties.
|
||||
|
||||
// Medium-fast path: Handle cases that do not require full decomposition and recomposition.
|
||||
if (!isMaybeOrNonZeroCC(norm16)) { // minNoNo <= norm16 < minMaybeYes
|
||||
if (norm16 < minMaybeNo) { // minNoNo <= norm16 < minMaybeNo
|
||||
if (sink == nullptr) {
|
||||
return false;
|
||||
}
|
||||
|
@ -1937,7 +1946,7 @@ Normalizer2Impl::composeUTF8(uint32_t options, UBool onlyContiguous,
|
|||
*sink, options, edits, errorCode)) {
|
||||
break;
|
||||
}
|
||||
const uint16_t *mapping = getMapping(norm16);
|
||||
const uint16_t *mapping = getDataForYesOrNo(norm16);
|
||||
int32_t length = *mapping++ & MAPPING_LENGTH_MASK;
|
||||
if (!ByteSinkUtil::appendChange(prevSrc, src, (const char16_t *)mapping, length,
|
||||
*sink, edits, errorCode)) {
|
||||
|
@ -2245,7 +2254,7 @@ uint16_t Normalizer2Impl::getFCD16FromNormData(UChar32 c) const {
|
|||
return norm16|(norm16<<8);
|
||||
} else if(norm16>=minMaybeYes) {
|
||||
return 0;
|
||||
} else { // isDecompNoAlgorithmic(norm16)
|
||||
} else if(norm16<minMaybeNo) { // isDecompNoAlgorithmic(norm16)
|
||||
uint16_t deltaTrailCC = norm16 & DELTA_TCCC_MASK;
|
||||
if (deltaTrailCC <= DELTA_TCCC_1) {
|
||||
return deltaTrailCC >> OFFSET_SHIFT;
|
||||
|
@ -2260,7 +2269,7 @@ uint16_t Normalizer2Impl::getFCD16FromNormData(UChar32 c) const {
|
|||
return 0;
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getData(norm16);
|
||||
uint16_t firstUnit=*mapping;
|
||||
norm16=firstUnit>>8; // tccc
|
||||
if(firstUnit&MAPPING_HAS_CCC_LCCC_WORD) {
|
||||
|
@ -2272,6 +2281,23 @@ uint16_t Normalizer2Impl::getFCD16FromNormData(UChar32 c) const {
|
|||
#pragma optimize( "", on )
|
||||
#endif
|
||||
|
||||
uint16_t Normalizer2Impl::getFCD16FromMaybeOrNonZeroCC(uint16_t norm16) const {
|
||||
U_ASSERT(norm16 >= minMaybeNo);
|
||||
if (norm16 >= MIN_NORMAL_MAYBE_YES) {
|
||||
// combining mark
|
||||
norm16 = getCCFromNormalYesOrMaybe(norm16);
|
||||
return norm16 | (norm16<<8);
|
||||
} else if (norm16 >= minMaybeYes) {
|
||||
return 0;
|
||||
}
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping = getDataForMaybe(norm16);
|
||||
uint16_t firstUnit = *mapping;
|
||||
// maybeNo has lccc = 0
|
||||
U_ASSERT((firstUnit & MAPPING_HAS_CCC_LCCC_WORD) == 0 || (*(mapping - 1) & 0xff00) == 0);
|
||||
return firstUnit >> 8; // tccc
|
||||
}
|
||||
|
||||
// Dual functionality:
|
||||
// buffer!=nullptr: normalize
|
||||
// buffer==nullptr: isNormalized/quickCheck/spanQuickCheckYes
|
||||
|
@ -2575,9 +2601,11 @@ void InitCanonIterData::doInit(Normalizer2Impl *impl, UErrorCode &errorCode) {
|
|||
void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, const uint16_t norm16,
|
||||
CanonIterData &newData,
|
||||
UErrorCode &errorCode) const {
|
||||
if(isInert(norm16) || (minYesNo<=norm16 && norm16<minNoNo)) {
|
||||
if(isInert(norm16) ||
|
||||
(minYesNo<=norm16 && norm16<minNoNo) ||
|
||||
(minMaybeNo<=norm16 && norm16<minMaybeYes)) {
|
||||
// Inert, or 2-way mapping (including Hangul syllable).
|
||||
// We do not write a canonStartSet for any yesNo character.
|
||||
// We do not write a canonStartSet for any yesNo/maybeNo character.
|
||||
// Composites from 2-way mappings are added at runtime from the
|
||||
// starter's compositions list, and the other characters in
|
||||
// 2-way mappings get CANON_NOT_SEGMENT_STARTER set because they are
|
||||
|
@ -2587,7 +2615,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
|
|||
for(UChar32 c=start; c<=end; ++c) {
|
||||
uint32_t oldValue = umutablecptrie_get(newData.mutableTrie, c);
|
||||
uint32_t newValue=oldValue;
|
||||
if(isMaybeOrNonZeroCC(norm16)) {
|
||||
if(isMaybeYesOrNonZeroCC(norm16)) {
|
||||
// not a segment starter if it occurs in a decomposition or has cc!=0
|
||||
newValue|=CANON_NOT_SEGMENT_STARTER;
|
||||
if(norm16<MIN_NORMAL_MAYBE_YES) {
|
||||
|
@ -2609,7 +2637,7 @@ void Normalizer2Impl::makeCanonIterDataFromNorm16(UChar32 start, UChar32 end, co
|
|||
}
|
||||
if (norm16_2 > minYesNo) {
|
||||
// c decomposes, get everything from the variable-length extra data
|
||||
const uint16_t *mapping=getMapping(norm16_2);
|
||||
const uint16_t *mapping=getDataForYesOrNo(norm16_2);
|
||||
uint16_t firstUnit=*mapping;
|
||||
int32_t length=firstUnit&MAPPING_LENGTH_MASK;
|
||||
if((firstUnit&MAPPING_HAS_CCC_LCCC_WORD)!=0) {
|
||||
|
@ -2728,7 +2756,7 @@ unorm2_swap(const UDataSwapper *ds,
|
|||
pInfo->dataFormat[1]==0x72 &&
|
||||
pInfo->dataFormat[2]==0x6d &&
|
||||
pInfo->dataFormat[3]==0x32 &&
|
||||
(1<=formatVersion0 && formatVersion0<=4)
|
||||
(1<=formatVersion0 && formatVersion0<=5)
|
||||
)) {
|
||||
udata_printError(ds, "unorm2_swap(): data format %02x.%02x.%02x.%02x (format version %02x) is not recognized as Normalizer2 data\n",
|
||||
pInfo->dataFormat[0], pInfo->dataFormat[1],
|
||||
|
@ -2747,8 +2775,10 @@ unorm2_swap(const UDataSwapper *ds,
|
|||
minIndexesLength=Normalizer2Impl::IX_MIN_MAYBE_YES+1;
|
||||
} else if(formatVersion0==2) {
|
||||
minIndexesLength=Normalizer2Impl::IX_MIN_YES_NO_MAPPINGS_ONLY+1;
|
||||
} else {
|
||||
} else if(formatVersion0<=4) {
|
||||
minIndexesLength=Normalizer2Impl::IX_MIN_LCCC_CP+1;
|
||||
} else {
|
||||
minIndexesLength=Normalizer2Impl::IX_MIN_MAYBE_NO_COMBINES_FWD+1;
|
||||
}
|
||||
|
||||
if(length>=0) {
|
||||
|
|
|
@ -241,7 +241,7 @@ private:
|
|||
* Low-level implementation of the Unicode Normalization Algorithm.
|
||||
* For the data structure and details see the documentation at the end of
|
||||
* this normalizer2impl.h and in the design doc at
|
||||
* https://icu.unicode.org/design/normalization/custom
|
||||
* https://unicode-org.github.io/icu/design/normalization/custom.html
|
||||
*/
|
||||
class U_COMMON_API Normalizer2Impl : public UObject {
|
||||
public:
|
||||
|
@ -271,14 +271,14 @@ public:
|
|||
UNormalizationCheckResult getCompQuickCheck(uint16_t norm16) const {
|
||||
if(norm16<minNoNo || MIN_YES_YES_WITH_CC<=norm16) {
|
||||
return UNORM_YES;
|
||||
} else if(minMaybeYes<=norm16) {
|
||||
} else if(minMaybeNo<=norm16) {
|
||||
return UNORM_MAYBE;
|
||||
} else {
|
||||
return UNORM_NO;
|
||||
}
|
||||
}
|
||||
UBool isAlgorithmicNoNo(uint16_t norm16) const { return limitNoNo<=norm16 && norm16<minMaybeYes; }
|
||||
UBool isCompNo(uint16_t norm16) const { return minNoNo<=norm16 && norm16<minMaybeYes; }
|
||||
UBool isAlgorithmicNoNo(uint16_t norm16) const { return limitNoNo<=norm16 && norm16<minMaybeNo; }
|
||||
UBool isCompNo(uint16_t norm16) const { return minNoNo<=norm16 && norm16<minMaybeNo; }
|
||||
UBool isDecompYes(uint16_t norm16) const { return norm16<minYesNo || minMaybeYes<=norm16; }
|
||||
|
||||
uint8_t getCC(uint16_t norm16) const {
|
||||
|
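Reviewer note: the hunk above moves the lower bound of the "Maybe" quick-check range from minMaybeYes down to minMaybeNo. The following is a minimal sketch of that range test, not the ICU implementation. `QuickCheck`, `CompThresholds` and `compQuickCheck` are hypothetical names, and the threshold values used with it are illustrative rather than taken from a real .nrm file.

```cpp
#include <cassert>
#include <cstdint>

enum class QuickCheck { No, Yes, Maybe };

// Hypothetical bundle of the thresholds that the range test reads.
struct CompThresholds {
    uint16_t minNoNo;          // below this: composition quick check is Yes
    uint16_t minMaybeNo;       // from here up to minYesYesWithCC: Maybe
    uint16_t minYesYesWithCC;  // from here up: Yes again (ccc!=0, does not combine back)
};

// Same order of comparisons as getCompQuickCheck() after this change.
QuickCheck compQuickCheck(const CompThresholds &t, uint16_t norm16) {
    if (norm16 < t.minNoNo || t.minYesYesWithCC <= norm16) {
        return QuickCheck::Yes;
    }
    if (t.minMaybeNo <= norm16) {
        return QuickCheck::Maybe;
    }
    return QuickCheck::No;
}
```

With these comparisons, a MaybeNo value (minMaybeNo<=norm16<minMaybeYes) is reported as Maybe, whereas a test against minMaybeYes alone would have reported it as No.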
@ -293,12 +293,12 @@ public:
|
|||
static uint8_t getCCFromNormalYesOrMaybe(uint16_t norm16) {
|
||||
return (uint8_t)(norm16 >> OFFSET_SHIFT);
|
||||
}
|
||||
static uint8_t getCCFromYesOrMaybe(uint16_t norm16) {
|
||||
static uint8_t getCCFromYesOrMaybeYes(uint16_t norm16) {
|
||||
return norm16>=MIN_NORMAL_MAYBE_YES ? getCCFromNormalYesOrMaybe(norm16) : 0;
|
||||
}
|
||||
uint8_t getCCFromYesOrMaybeCP(UChar32 c) const {
|
||||
uint8_t getCCFromYesOrMaybeYesCP(UChar32 c) const {
|
||||
if (c < minCompNoMaybeCP) { return 0; }
|
||||
return getCCFromYesOrMaybe(getNorm16(c));
|
||||
return getCCFromYesOrMaybeYes(getNorm16(c));
|
||||
}
|
||||
|
||||
/**
|
||||
|
@ -369,6 +369,8 @@ public:
|
|||
/** Returns the FCD value from the regular normalization data. */
|
||||
uint16_t getFCD16FromNormData(UChar32 c) const;
|
||||
|
||||
uint16_t getFCD16FromMaybeOrNonZeroCC(uint16_t norm16) const;
|
||||
|
||||
/**
|
||||
* Gets the decomposition for one code point.
|
||||
* @param c code point
|
||||
|
@ -450,7 +452,13 @@ public:
|
|||
|
||||
IX_MIN_LCCC_CP,
|
||||
IX_RESERVED19,
|
||||
IX_COUNT
|
||||
|
||||
/** Two-way mappings; each starts with a character that combines backward. */
|
||||
IX_MIN_MAYBE_NO, // 20
|
||||
/** Two-way mappings & compositions. */
|
||||
IX_MIN_MAYBE_NO_COMBINES_FWD,
|
||||
|
||||
IX_COUNT // 22
|
||||
};
|
||||
|
||||
enum {
|
||||
|
@ -541,7 +549,8 @@ public:
|
|||
uint16_t norm16=getNorm16(c);
|
||||
return isCompYesAndZeroCC(norm16) &&
|
||||
(norm16 & HAS_COMP_BOUNDARY_AFTER) != 0 &&
|
||||
(!onlyContiguous || isInert(norm16) || *getMapping(norm16) <= 0x1ff);
|
||||
(!onlyContiguous || isInert(norm16) || *getDataForYesOrNo(norm16) <= 0x1ff);
|
||||
// The last check fetches the mapping's first unit and checks tccc<=1.
|
||||
}
|
||||
|
||||
UBool hasFCDBoundaryBefore(UChar32 c) const { return hasDecompBoundaryBefore(c); }
|
||||
|
@ -551,8 +560,8 @@ private:
|
|||
friend class InitCanonIterData;
|
||||
friend class LcccContext;
|
||||
|
||||
UBool isMaybe(uint16_t norm16) const { return minMaybeYes<=norm16 && norm16<=JAMO_VT; }
|
||||
UBool isMaybeOrNonZeroCC(uint16_t norm16) const { return norm16>=minMaybeYes; }
|
||||
UBool isMaybe(uint16_t norm16) const { return minMaybeNo<=norm16 && norm16<=JAMO_VT; }
|
||||
UBool isMaybeYesOrNonZeroCC(uint16_t norm16) const { return norm16>=minMaybeYes; }
|
||||
static UBool isInert(uint16_t norm16) { return norm16==INERT; }
|
||||
static UBool isJamoL(uint16_t norm16) { return norm16==JAMO_L; }
|
||||
static UBool isJamoVT(uint16_t norm16) { return norm16==JAMO_VT; }
|
||||
|
@ -566,7 +575,7 @@ private:
|
|||
// return norm16>=MIN_YES_YES_WITH_CC || norm16<minNoNo;
|
||||
// }
|
||||
// UBool isCompYesOrMaybe(uint16_t norm16) const {
|
||||
// return norm16<minNoNo || minMaybeYes<=norm16;
|
||||
// return norm16<minNoNo || minMaybeNo<=norm16;
|
||||
// }
|
||||
// UBool hasZeroCCFromDecompYes(uint16_t norm16) const {
|
||||
// return norm16<=MIN_NORMAL_MAYBE_YES || norm16==JAMO_VT;
|
||||
|
@ -579,12 +588,12 @@ private:
|
|||
/**
|
||||
* A little faster and simpler than isDecompYesAndZeroCC() but does not include
|
||||
* the MaybeYes which combine-forward and have ccc=0.
|
||||
* (Standard Unicode 10 normalization does not have such characters.)
|
||||
*/
|
||||
UBool isMostDecompYesAndZeroCC(uint16_t norm16) const {
|
||||
return norm16<minYesNo || norm16==MIN_NORMAL_MAYBE_YES || norm16==JAMO_VT;
|
||||
}
|
||||
UBool isDecompNoAlgorithmic(uint16_t norm16) const { return norm16>=limitNoNo; }
|
||||
/** Since formatVersion 5: same as isAlgorithmicNoNo() */
|
||||
UBool isDecompNoAlgorithmic(uint16_t norm16) const { return limitNoNo<=norm16 && norm16<minMaybeNo; }
|
||||
|
||||
// For use with isCompYes().
|
||||
// Perhaps the compiler can combine the two tests for MIN_YES_YES_WITH_CC.
|
||||
|
@ -592,7 +601,7 @@ private:
|
|||
// return norm16>=MIN_YES_YES_WITH_CC ? getCCFromNormalYesOrMaybe(norm16) : 0;
|
||||
// }
|
||||
uint8_t getCCFromNoNo(uint16_t norm16) const {
|
||||
const uint16_t *mapping=getMapping(norm16);
|
||||
const uint16_t *mapping=getDataForYesOrNo(norm16);
|
||||
if(*mapping&MAPPING_HAS_CCC_LCCC_WORD) {
|
||||
return (uint8_t)*(mapping-1);
|
||||
} else {
|
||||
|
@ -605,7 +614,7 @@ private:
|
|||
return 0; // yesYes and Hangul LV have ccc=tccc=0
|
||||
} else {
|
||||
// For Hangul LVT we harmlessly fetch a firstUnit with tccc=0 here.
|
||||
return (uint8_t)(*getMapping(norm16)>>8); // tccc from yesNo
|
||||
return (uint8_t)(*getDataForYesOrNo(norm16)>>8); // tccc from yesNo
|
||||
}
|
||||
}
|
||||
uint8_t getPreviousTrailCC(const char16_t *start, const char16_t *p) const;
|
||||
|
@ -619,28 +628,33 @@ private:
|
|||
return (norm16>>DELTA_SHIFT)-centerNoNoDelta;
|
||||
}
|
||||
|
||||
// Requires minYesNo<norm16<limitNoNo.
|
||||
const uint16_t *getMapping(uint16_t norm16) const { return extraData+(norm16>>OFFSET_SHIFT); }
|
||||
const uint16_t *getDataForYesOrNo(uint16_t norm16) const {
|
||||
return extraData+(norm16>>OFFSET_SHIFT);
|
||||
}
|
||||
const uint16_t *getDataForMaybe(uint16_t norm16) const {
|
||||
return extraData+((norm16-minMaybeNo+limitNoNo)>>OFFSET_SHIFT);
|
||||
}
|
||||
const uint16_t *getData(uint16_t norm16) const {
|
||||
if(norm16>=minMaybeNo) {
|
||||
norm16=norm16-minMaybeNo+limitNoNo;
|
||||
}
|
||||
return extraData+(norm16>>OFFSET_SHIFT);
|
||||
}
|
||||
const uint16_t *getCompositionsListForDecompYes(uint16_t norm16) const {
|
||||
if(norm16<JAMO_L || MIN_NORMAL_MAYBE_YES<=norm16) {
|
||||
return nullptr;
|
||||
} else if(norm16<minMaybeYes) {
|
||||
return getMapping(norm16); // for yesYes; if Jamo L: harmless empty list
|
||||
} else {
|
||||
return maybeYesCompositions+((norm16-minMaybeYes)>>OFFSET_SHIFT);
|
||||
// if yesYes: if Jamo L: harmless empty list
|
||||
return getData(norm16);
|
||||
}
|
||||
}
|
||||
const uint16_t *getCompositionsListForComposite(uint16_t norm16) const {
|
||||
// A composite has both mapping & compositions list.
|
||||
const uint16_t *list=getMapping(norm16);
|
||||
const uint16_t *list=getData(norm16);
|
||||
return list+ // mapping pointer
|
||||
1+ // +1 to skip the first unit with the mapping length
|
||||
(*list&MAPPING_LENGTH_MASK); // + mapping length
|
||||
}
|
||||
const uint16_t *getCompositionsListForMaybe(uint16_t norm16) const {
|
||||
// minMaybeYes<=norm16<MIN_NORMAL_MAYBE_YES
|
||||
return maybeYesCompositions+((norm16-minMaybeYes)>>OFFSET_SHIFT);
|
||||
}
|
||||
/**
|
||||
* @param c code point must have compositions
|
||||
* @return compositions list pointer
|
||||
|
@ -692,11 +706,13 @@ private:
|
|||
/** For FCC: Given norm16 HAS_COMP_BOUNDARY_AFTER, does it have tccc<=1? */
|
||||
UBool isTrailCC01ForCompBoundaryAfter(uint16_t norm16) const {
|
||||
return isInert(norm16) || (isDecompNoAlgorithmic(norm16) ?
|
||||
(norm16 & DELTA_TCCC_MASK) <= DELTA_TCCC_1 : *getMapping(norm16) <= 0x1ff);
|
||||
(norm16 & DELTA_TCCC_MASK) <= DELTA_TCCC_1 : *getDataForYesOrNo(norm16) <= 0x1ff);
|
||||
}
|
||||
|
||||
const char16_t *findPreviousCompBoundary(const char16_t *start, const char16_t *p, UBool onlyContiguous) const;
|
||||
const char16_t *findNextCompBoundary(const char16_t *p, const char16_t *limit, UBool onlyContiguous) const;
|
||||
const char16_t *findPreviousCompBoundary(const char16_t *start, const char16_t *p,
|
||||
UBool onlyContiguous) const;
|
||||
const char16_t *findNextCompBoundary(const char16_t *p, const char16_t *limit,
|
||||
UBool onlyContiguous) const;
|
||||
|
||||
const char16_t *findPreviousFCDBoundary(const char16_t *start, const char16_t *p) const;
|
||||
const char16_t *findNextFCDBoundary(const char16_t *p, const char16_t *limit) const;
|
||||
|
@ -723,11 +739,12 @@ private:
|
|||
uint16_t minNoNoEmpty;
|
||||
uint16_t limitNoNo;
|
||||
uint16_t centerNoNoDelta;
|
||||
uint16_t minMaybeNo;
|
||||
uint16_t minMaybeNoCombinesFwd;
|
||||
uint16_t minMaybeYes;
|
||||
|
||||
const UCPTrie *normTrie;
|
||||
const uint16_t *maybeYesCompositions;
|
||||
const uint16_t *extraData; // mappings and/or compositions for yesYes, yesNo & noNo characters
|
||||
const uint16_t *extraData; // mappings and/or compositions
|
||||
const uint8_t *smallFCD; // [0x100] one bit per 32 BMP code points, set if any FCD!=0
|
||||
|
||||
UInitOnce fCanonIterDataInitOnce {};
|
||||
|
@ -785,7 +802,7 @@ unorm_getFCD16(UChar32 c);
|
|||
|
||||
/**
|
||||
* Format of Normalizer2 .nrm data files.
|
||||
* Format version 4.0.
|
||||
* Format version 5.0.
|
||||
*
|
||||
* Normalizer2 .nrm data files provide data for the Unicode Normalization algorithms.
|
||||
* ICU ships with data files for standard Unicode Normalization Forms
|
||||
|
@ -807,7 +824,7 @@ unorm_getFCD16(UChar32 c);
|
|||
* Constants are defined as enum values of the Normalizer2Impl class.
|
||||
*
|
||||
* Many details of the data structures are described in the design doc
|
||||
* which is at https://icu.unicode.org/design/normalization/custom
|
||||
* which is at https://unicode-org.github.io/icu/design/normalization/custom.html
|
||||
*
|
||||
* int32_t indexes[indexesLength]; -- indexesLength=indexes[IX_NORM_TRIE_OFFSET]/4;
|
||||
*
|
||||
|
@ -829,7 +846,9 @@ unorm_getFCD16(UChar32 c);
|
|||
*
|
||||
* The next eight indexes are thresholds of 16-bit trie values for ranges of
|
||||
* values indicating multiple normalization properties.
|
||||
* They are listed here in threshold order, not in the order they are stored in the indexes.
|
||||
* Format version 5 adds the two minMaybeNo* threshold indexes.
|
||||
* The thresholds are listed here in threshold order,
|
||||
* not in the order they are stored in the indexes.
|
||||
* minYesNo=indexes[IX_MIN_YES_NO];
|
||||
* minYesNoMappingsOnly=indexes[IX_MIN_YES_NO_MAPPINGS_ONLY];
|
||||
* minNoNo=indexes[IX_MIN_NO_NO];
|
||||
|
@ -837,6 +856,8 @@ unorm_getFCD16(UChar32 c);
|
|||
* minNoNoCompNoMaybeCC=indexes[IX_MIN_NO_NO_COMP_NO_MAYBE_CC];
|
||||
* minNoNoEmpty=indexes[IX_MIN_NO_NO_EMPTY];
|
||||
* limitNoNo=indexes[IX_LIMIT_NO_NO];
|
||||
* minMaybeNo=indexes[IX_MIN_MAYBE_NO];
|
||||
* minMaybeNoCombinesFwd=indexes[IX_MIN_MAYBE_NO_COMBINES_FWD];
|
||||
* minMaybeYes=indexes[IX_MIN_MAYBE_YES];
|
||||
* See the normTrie description below and the design doc for details.
|
||||
*
|
||||
|
@ -845,13 +866,14 @@ unorm_getFCD16(UChar32 c);
|
|||
* The trie holds the main normalization data. Each code point is mapped to a 16-bit value.
|
||||
* Rather than using independent bits in the value (which would require more than 16 bits),
|
||||
* information is extracted primarily via range checks.
|
||||
* Except, format version 3 uses bit 0 for hasCompBoundaryAfter().
|
||||
* Except, format version 3+ uses bit 0 for hasCompBoundaryAfter().
|
||||
* For example, a 16-bit value norm16 in the range minYesNo<=norm16<minNoNo
|
||||
* means that the character has NF*C_QC=Yes and NF*D_QC=No properties,
|
||||
* which means it has a two-way (round-trip) decomposition mapping.
|
||||
* Values in the range 2<=norm16<limitNoNo are also directly indexes into the extraData
|
||||
* Values in the ranges 2<=norm16<limitNoNo and minMaybeNo<=norm16<minMaybeYes
|
||||
* are also directly indexes into the extraData
|
||||
* pointing to mappings, compositions lists, or both.
|
||||
* Value norm16==INERT (0 in versions 1 & 2, 1 in version 3)
|
||||
* Value norm16==INERT (0 in versions 1 & 2, 1 in version 3+)
|
||||
* means that the character is normalization-inert, that is,
|
||||
* it does not have a mapping, does not participate in composition, has a zero
|
||||
* canonical combining class, and forms a boundary where text before it and after it
|
||||
|
@ -870,32 +892,38 @@ unorm_getFCD16(UChar32 c);
|
|||
* When the lead surrogate unit's value exceeds the quick check minimum during processing,
|
||||
* the properties for the full supplementary code point need to be looked up.
|
||||
*
|
||||
* uint16_t maybeYesCompositions[MIN_NORMAL_MAYBE_YES-minMaybeYes];
|
||||
* uint16_t extraData[];
|
||||
*
|
||||
* There is only one byte offset for the end of these two arrays.
|
||||
* The split between them is given by the constant and variable mentioned above.
|
||||
* In version 3, the difference must be shifted right by OFFSET_SHIFT.
|
||||
* The extraData array contains many per-character data sections.
|
||||
* Each section contains mappings and/or composition lists.
|
||||
* The norm16 value of each character that has such data is directly an index to
|
||||
* a section of the extraData array.
|
||||
*
|
||||
* The maybeYesCompositions array contains compositions lists for characters that
|
||||
* combine both forward (as starters in composition pairs)
|
||||
* and backward (as trailing characters in composition pairs).
|
||||
* Such characters occur in Unicode 16 for the first time.
|
||||
* If there are no such characters, then minMaybeYes==MIN_NORMAL_MAYBE_YES
|
||||
* and the maybeYesCompositions array is empty.
|
||||
* If there are such characters, then minMaybeYes is subtracted from their norm16 values
|
||||
* to get the index into this array.
|
||||
*
|
||||
* The extraData array contains compositions lists for "YesYes" characters,
|
||||
* followed by mappings and optional compositions lists for "YesNo" characters,
|
||||
* followed by only mappings for "NoNo" characters.
|
||||
* (Referring to pairs of NFC/NFD quick check values.)
|
||||
* The norm16 values of those characters are directly indexes into the extraData array.
|
||||
* In version 3, the norm16 values must be shifted right by OFFSET_SHIFT
|
||||
* In version 3+, the norm16 values must be shifted right by OFFSET_SHIFT
|
||||
* for accessing extraData.
|
||||
*
|
||||
* The data structures for compositions lists and mappings are described in the design doc.
|
||||
*
|
||||
* In version 4 and below, the composition lists for MaybeYes characters were stored before
|
||||
* the data for other characters.
|
||||
* This sub-array had a length of MIN_NORMAL_MAYBE_YES-minMaybeYes.
|
||||
* In version 3 & 4, the difference must be shifted right by OFFSET_SHIFT.
|
||||
*
|
||||
* In version 5, the data for MaybeNo and MaybeYes characters is stored after
|
||||
* the data for other characters.
|
||||
*
|
||||
* If there are no MaybeNo and no MaybeYes characters,
|
||||
* then minMaybeYes==minMaybeNo==MIN_NORMAL_MAYBE_YES.
|
||||
* If there are such characters, then minMaybeNo is subtracted from their norm16 values
|
||||
* to get the index into the extraData.
|
||||
* In version 4 and below, the data index for Yes* and No* characters needs to be
|
||||
* offset by the length of the MaybeYes data.
|
||||
* In version 5, the data index for Maybe* characters needs to be offset by limitNoNo.
|
||||
*
|
||||
* Version 5 is the first to support MaybeNo characters, and
|
||||
* adds the minMaybeNo and minMaybeNoCombinesFwd thresholds and
|
||||
* the corresponding sections of the extraData.
|
||||
*
|
||||
* uint8_t smallFCD[0x100]; -- new in format version 2
|
||||
*
|
||||
* This is a bit set to help speed up FCD value lookups in the absence of a full
|
||||
|
@ -935,7 +963,7 @@ unorm_getFCD16(UChar32 c);
|
|||
* to make room for two bits (three values) indicating whether the tccc is 0, 1, or greater.
|
||||
* See DELTA_TCCC_MASK etc.
|
||||
* This helps with fetching tccc/FCD values and FCC hasCompBoundaryAfter().
|
||||
* minMaybeYes is 8-aligned so that the DELTA_TCCC_MASK bits can be tested directly.
|
||||
* minMaybeNo is 8-aligned so that the DELTA_TCCC_MASK bits can be tested directly.
|
||||
*
|
||||
* - Algorithmic mappings are only used for mapping to "comp yes and ccc=0" characters,
|
||||
* and ASCII characters are mapped algorithmically only to other ASCII characters.
|
||||
|
@ -981,6 +1009,23 @@ unorm_getFCD16(UChar32 c);
|
|||
* gennorm2 now has to reject mappings for surrogate code points.
|
||||
* UTS #46 maps unpaired surrogates to U+FFFD in code rather than via its
|
||||
* custom normalization data file.
|
||||
*
|
||||
* Changes from format version 4 to format version 5 (ICU 76) ------------------
|
||||
*
|
||||
* Unicode 16 adds the first MaybeYes characters which combine both backward and forward,
|
||||
* taking this formerly theoretical data structure into reality.
|
||||
*
|
||||
* Unicode 16 also adds the first characters that have two-way mappings whose first characters
|
||||
* combine backward. In order for normalization and the quick check to work properly,
|
||||
* these composite characters also must be marked as NFC_QC=Maybe,
|
||||
* corresponding to "combines back", although the composites themselves do not combine backward.
|
||||
* Format version 5 adds two new ranges between "algorithmic NoNo" and MaybeYes,
|
||||
* with thresholds minMaybeNo and minMaybeNoCombinesFwd,
|
||||
* and indexes[IX_MIN_MAYBE_NO] and indexes[IX_MIN_MAYBE_NO_COMBINES_FWD],
|
||||
* and corresponding mappings and composition lists in the extraData.
|
||||
*
|
||||
* Format version 5 moves the data for Maybe* characters from the start of the extraData array
|
||||
* to its end.
|
||||
*/
|
||||
|
||||
#endif /* !UCONFIG_NO_NORMALIZATION */
|
||||
|
|
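Reviewer note: the format documentation above says that in version 5 the data index for Maybe* characters is offset by limitNoNo. The sketch below restates the arithmetic of getData() in isolation, under stated assumptions. `ExtraDataLayout` and `indexFor` are hypothetical names, and OFFSET_SHIFT is assumed to be 1, which is consistent with the builder writing `norm.offset*2` into norm16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the two Normalizer2Impl thresholds that getData() reads.
struct ExtraDataLayout {
    // Assumption: OFFSET_SHIFT is 1 (bit 0 of norm16 is not part of the offset).
    static constexpr int OFFSET_SHIFT = 1;
    uint16_t limitNoNo;   // end of the Yes*/No* data, expressed as a norm16 value
    uint16_t minMaybeNo;  // first norm16 value of the Maybe* ranges

    // Returns the extraData index, in uint16_t units, for a norm16 value.
    std::size_t indexFor(uint16_t norm16) const {
        uint32_t value = norm16;
        if (value >= minMaybeNo) {
            // Maybe* data is stored after all Yes*/No* data.
            value = value - minMaybeNo + limitNoNo;
        }
        return value >> OFFSET_SHIFT;
    }
};
```

For example, with limitNoNo=0x200 and minMaybeNo=0xF000 (illustrative values), norm16=0xF006 maps to index (0x006+0x200)>>1 = 0x103, that is, three units past the end of the Yes*/No* data.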
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
|
@ -138,7 +138,7 @@ void ExtraData::writeCompositions(UChar32 c, const Norm &norm, UnicodeString &da
|
|||
const CompositionPair &pair=pairs[i];
|
||||
// 22 bits for the composite character and whether it combines forward.
|
||||
UChar32 compositeAndFwd=pair.composite<<1;
|
||||
if(norms.getNormRef(pair.composite).compositions!=nullptr) {
|
||||
if(norms.getNormRef(pair.composite).combinesFwd()) {
|
||||
compositeAndFwd|=1; // The composite character also combines-forward.
|
||||
}
|
||||
// Encode most pairs in two units and some in three.
|
||||
|
@ -231,6 +231,15 @@ void ExtraData::writeExtraData(UChar32 c, Norm &norm) {
|
|||
// if they have different raw mappings.
|
||||
norm.offset=writeNoNoMapping(c, norm, noNoMappingsEmpty, previousNoNoMappingsEmpty);
|
||||
break;
|
||||
case Norm::MAYBE_NO_MAPPING_ONLY:
|
||||
norm.offset=maybeNoMappingsOnly.length()+
|
||||
writeMapping(c, norm, maybeNoMappingsOnly);
|
||||
break;
|
||||
case Norm::MAYBE_NO_COMBINES_FWD:
|
||||
norm.offset=maybeNoMappingsAndCompositions.length()+
|
||||
writeMapping(c, norm, maybeNoMappingsAndCompositions);
|
||||
writeCompositions(c, norm, maybeNoMappingsAndCompositions);
|
||||
break;
|
||||
case Norm::MAYBE_YES_COMBINES_FWD:
|
||||
norm.offset=maybeYesCompositions.length();
|
||||
writeCompositions(c, norm, maybeYesCompositions);
|
||||
|
|
|
@ -32,6 +32,8 @@ public:
|
|||
|
||||
void rangeHandler(UChar32 start, UChar32 end, Norm &norm) override;
|
||||
|
||||
UnicodeString maybeNoMappingsOnly;
|
||||
UnicodeString maybeNoMappingsAndCompositions;
|
||||
UnicodeString maybeYesCompositions;
|
||||
UnicodeString yesYesCompositions;
|
||||
UnicodeString yesNoMappingsAndCompositions;
|
||||
|
@ -44,15 +46,17 @@ public:
|
|||
private:
|
||||
/**
|
||||
* Requires norm.hasMapping().
|
||||
* Returns the offset of the "first unit" from the beginning of the extraData for c.
|
||||
* Returns the offset of the "first unit" from the beginning of the extraData for c,
|
||||
* not from the beginning of the dataString.
|
||||
* That is the same as the length of the optional data
|
||||
* for the raw mapping and the ccc/lccc word.
|
||||
*/
|
||||
int32_t writeMapping(UChar32 c, const Norm &norm, UnicodeString &dataString);
|
||||
/** Returns the full offset into the dataString of the "first unit" for c. */
|
||||
int32_t writeNoNoMapping(UChar32 c, const Norm &norm,
|
||||
UnicodeString &dataString, Hashtable &previousMappings);
|
||||
UBool setNoNoDelta(UChar32 c, Norm &norm) const;
|
||||
/** Requires norm.compositions!=nullptr. */
|
||||
/** Requires norm.combinesFwd(). */
|
||||
void writeCompositions(UChar32 c, const Norm &norm, UnicodeString &dataString);
|
||||
void writeExtraData(UChar32 c, Norm &norm);
|
||||
|
||||
|
|
|
@ -59,8 +59,8 @@ static UDataInfo dataInfo={
|
|||
0,
|
||||
|
||||
{ 0x4e, 0x72, 0x6d, 0x32 }, /* dataFormat="Nrm2" */
|
||||
{ 4, 0, 0, 0 }, /* formatVersion */
|
||||
{ 11, 0, 0, 0 } /* dataVersion (Unicode version) */
|
||||
{ 5, 0, 0, 0 }, /* formatVersion */
|
||||
{ 16, 0, 0, 0 } /* dataVersion (Unicode version) */
|
||||
};
|
||||
|
||||
U_NAMESPACE_BEGIN
|
||||
|
@ -254,7 +254,7 @@ UBool Normalizer2DataBuilder::mappingHasCompBoundaryAfter(const BuilderReorderin
|
|||
// characters following this mapping are possible.
|
||||
const Norm *starterNorm=norms.getNorm(starter);
|
||||
if(i==lastStarterIndex &&
|
||||
(starterNorm==nullptr || starterNorm->compositions==nullptr)) {
|
||||
(starterNorm==nullptr || !starterNorm->combinesFwd())) {
|
||||
return true; // The last starter does not combine forward.
|
||||
}
|
||||
uint8_t prevCC=0;
|
||||
|
@ -270,14 +270,14 @@ UBool Normalizer2DataBuilder::mappingHasCompBoundaryAfter(const BuilderReorderin
|
|||
// The starter combines with c into a composite replacement starter.
|
||||
starterNorm=norms.getNorm(starter);
|
||||
if(i>=lastStarterIndex &&
|
||||
(starterNorm==nullptr || starterNorm->compositions==nullptr)) {
|
||||
(starterNorm==nullptr || !starterNorm->combinesFwd())) {
|
||||
return true; // The composite does not combine further.
|
||||
}
|
||||
// Keep prevCC because we "removed" the combining mark.
|
||||
} else if(cc==0) {
|
||||
starterNorm=norms.getNorm(c);
|
||||
if(i==lastStarterIndex &&
|
||||
(starterNorm==nullptr || starterNorm->compositions==nullptr)) {
|
||||
(starterNorm==nullptr || !starterNorm->combinesFwd())) {
|
||||
return true; // The new starter does not combine forward.
|
||||
}
|
||||
prevCC=0;
|
||||
|
@ -354,19 +354,38 @@ void Normalizer2DataBuilder::postProcess(Norm &norm) {
|
|||
|
||||
norm.hasCompBoundaryBefore=
|
||||
!buffer.isEmpty() && norm.leadCC==0 && !norms.combinesBack(buffer.charAt(0));
|
||||
// No comp-boundary-after when norm.combinesBack:
|
||||
// MaybeNo character whose first mapping character may combine-back,
|
||||
// in which case we would not recompose to this character,
|
||||
// and may need more context.
|
||||
norm.hasCompBoundaryAfter=
|
||||
norm.compositions==nullptr && mappingHasCompBoundaryAfter(buffer, norm.mappingType);
|
||||
!norm.combinesBack && !norm.combinesFwd() &&
|
||||
mappingHasCompBoundaryAfter(buffer, norm.mappingType);
|
||||
|
||||
if(norm.combinesBack) {
|
||||
norm.error="combines-back and decomposes, not possible in Unicode normalization";
|
||||
if(norm.mappingType!=Norm::ROUND_TRIP) {
|
||||
// One-way mappings don't get NFC_QC=Maybe, and
|
||||
// should not have gotten combinesBack set.
|
||||
norm.error="combines-back and has a one-way mapping, "
|
||||
"not possible in Unicode normalization";
|
||||
} else if(norm.combinesFwd()) {
|
||||
// Earlier code checked ccc=0.
|
||||
norm.type=Norm::MAYBE_NO_COMBINES_FWD;
|
||||
} else if(norm.cc==0) {
|
||||
norm.type=Norm::MAYBE_NO_MAPPING_ONLY;
|
||||
} else {
|
||||
norm.error="combines-back and decomposes with ccc!=0, "
|
||||
"not possible in Unicode normalization";
|
||||
// ... because we don't reorder again after composition.
|
||||
}
|
||||
} else if(norm.mappingType==Norm::ROUND_TRIP) {
|
||||
if(norm.compositions!=nullptr) {
|
||||
if(norm.combinesFwd()) {
|
||||
norm.type=Norm::YES_NO_COMBINES_FWD;
|
||||
} else {
|
||||
norm.type=Norm::YES_NO_MAPPING_ONLY;
|
||||
}
|
||||
} else { // one-way mapping
|
||||
if(norm.compositions!=nullptr) {
|
||||
if(norm.combinesFwd()) {
|
||||
norm.error="combines-forward and has a one-way mapping, "
|
||||
"not possible in Unicode normalization";
|
||||
} else if(buffer.isEmpty()) {
|
||||
|
@ -386,16 +405,16 @@ void Normalizer2DataBuilder::postProcess(Norm &norm) {
|
|||
norm.hasCompBoundaryBefore=
|
||||
norm.cc==0 && !norm.combinesBack;
|
||||
norm.hasCompBoundaryAfter=
|
||||
norm.cc==0 && !norm.combinesBack && norm.compositions==nullptr;
|
||||
norm.cc==0 && !norm.combinesBack && !norm.combinesFwd();
|
||||
|
||||
if(norm.combinesBack) {
|
||||
if(norm.compositions!=nullptr) {
|
||||
if(norm.combinesFwd()) {
|
||||
// Earlier code checked ccc=0.
|
||||
norm.type=Norm::MAYBE_YES_COMBINES_FWD;
|
||||
} else {
|
||||
norm.type=Norm::MAYBE_YES_SIMPLE; // any ccc
|
||||
}
|
||||
} else if(norm.compositions!=nullptr) {
|
||||
} else if(norm.combinesFwd()) {
|
||||
// Earlier code checked ccc=0.
|
||||
norm.type=Norm::YES_YES_COMBINES_FWD;
|
||||
} else if(norm.cc!=0) {
|
||||
|
@ -469,6 +488,12 @@ void Normalizer2DataBuilder::writeNorm16(UMutableCPTrie *norm16Trie, UChar32 sta
|
|||
norm16=getMinNoNoDelta()+offset;
|
||||
break;
|
||||
}
|
||||
case Norm::MAYBE_NO_MAPPING_ONLY:
|
||||
norm16=indexes[Normalizer2Impl::IX_MIN_MAYBE_NO]+norm.offset*2;
|
||||
break;
|
||||
case Norm::MAYBE_NO_COMBINES_FWD:
|
||||
norm16=indexes[Normalizer2Impl::IX_MIN_MAYBE_NO_COMBINES_FWD]+norm.offset*2;
|
||||
break;
|
||||
case Norm::MAYBE_YES_COMBINES_FWD:
|
||||
norm16=indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]+norm.offset*2;
|
||||
break;
|
||||
|
@ -593,17 +618,26 @@ LocalUCPTriePointer Normalizer2DataBuilder::processData() {
|
|||
extraData.append(extra.noNoMappingsEmpty);
|
||||
indexes[Normalizer2Impl::IX_LIMIT_NO_NO]=extraData.length()*2;
|
||||
|
||||
// Pad the maybeYesCompositions length to a multiple of 4,
|
||||
int32_t maybeDataLength=
|
||||
extra.maybeNoMappingsOnly.length()+
|
||||
extra.maybeNoMappingsAndCompositions.length()+
|
||||
extra.maybeYesCompositions.length();
|
||||
int32_t minMaybeNo=Normalizer2Impl::MIN_NORMAL_MAYBE_YES-maybeDataLength*2;
|
||||
// Adjust minMaybeNo down to 8-align it,
|
||||
// so that NO_NO_DELTA bits 2..1 can be used without subtracting the center.
|
||||
while(extra.maybeYesCompositions.length()&3) {
|
||||
extra.maybeYesCompositions.append((char16_t)0);
|
||||
}
|
||||
extraData.insert(0, extra.maybeYesCompositions);
|
||||
indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]=
|
||||
Normalizer2Impl::MIN_NORMAL_MAYBE_YES-
|
||||
extra.maybeYesCompositions.length()*2;
|
||||
minMaybeNo&=~7;
|
||||
|
||||
// Pad to even length for 4-byte alignment of following data.
|
||||
int32_t index=minMaybeNo;
|
||||
indexes[Normalizer2Impl::IX_MIN_MAYBE_NO]=index;
|
||||
extraData.append(extra.maybeNoMappingsOnly);
|
||||
index+=extra.maybeNoMappingsOnly.length()*2;
|
||||
indexes[Normalizer2Impl::IX_MIN_MAYBE_NO_COMBINES_FWD]=index;
|
||||
extraData.append(extra.maybeNoMappingsAndCompositions);
|
||||
index+=extra.maybeNoMappingsAndCompositions.length()*2;
|
||||
indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]=index;
|
||||
extraData.append(extra.maybeYesCompositions);
|
||||
|
||||
// Pad the extraData to even length for 4-byte alignment of following data.
|
||||
if(extraData.length()&1) {
|
||||
extraData.append((char16_t)0);
|
||||
}
|
||||
|
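Reviewer note: the new code in processData() derives the three Maybe* thresholds from the lengths of the three data strings. The sketch below mirrors that arithmetic so it can be checked by hand. `MaybeThresholds` and `computeMaybeThresholds` are hypothetical names, and the value passed for MIN_NORMAL_MAYBE_YES in the example is an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for the three thresholds written into indexes[].
struct MaybeThresholds {
    int32_t minMaybeNo;
    int32_t minMaybeNoCombinesFwd;
    int32_t minMaybeYes;
};

// Lengths are in uint16_t units; thresholds are norm16 values (units*2).
MaybeThresholds computeMaybeThresholds(int32_t minNormalMaybeYes,
                                       int32_t mappingsOnlyLength,
                                       int32_t mappingsAndCompositionsLength,
                                       int32_t yesCompositionsLength) {
    int32_t maybeDataLength = mappingsOnlyLength +
                              mappingsAndCompositionsLength +
                              yesCompositionsLength;
    int32_t minMaybeNo = minNormalMaybeYes - maybeDataLength * 2;
    minMaybeNo &= ~7;  // adjust down to a multiple of 8

    MaybeThresholds t;
    t.minMaybeNo = minMaybeNo;
    t.minMaybeNoCombinesFwd = t.minMaybeNo + mappingsOnlyLength * 2;
    t.minMaybeYes = t.minMaybeNoCombinesFwd + mappingsAndCompositionsLength * 2;
    return t;
}
```

With minNormalMaybeYes=0xFC00 and lengths 3, 5 and 2, this yields 0xFBE8, 0xFBEE and 0xFBF8. Because minMaybeNo is rounded down, the MaybeYes data in this example ends at 0xFBFC, leaving a small unused gap below 0xFC00.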
@@ -753,18 +787,34 @@ LocalUCPTriePointer Normalizer2DataBuilder::processData() {
         printf("size of 16-bit extra data: %5ld uint16_t\n", (long)extraData.length());
         printf("size of small-FCD data: %5ld bytes\n", (long)sizeof(smallFCD));
         printf("size of binary data file contents: %5ld bytes\n", (long)totalSize);
-        printf("minDecompNoCodePoint: U+%04lX\n", (long)indexes[Normalizer2Impl::IX_MIN_DECOMP_NO_CP]);
-        printf("minCompNoMaybeCodePoint: U+%04lX\n", (long)indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]);
-        printf("minLcccCodePoint: U+%04lX\n", (long)indexes[Normalizer2Impl::IX_MIN_LCCC_CP]);
-        printf("minYesNo: (with compositions) 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_YES_NO]);
-        printf("minYesNoMappingsOnly: 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_YES_NO_MAPPINGS_ONLY]);
-        printf("minNoNo: (comp-normalized) 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_NO_NO]);
-        printf("minNoNoCompBoundaryBefore: 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE]);
-        printf("minNoNoCompNoMaybeCC: 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_NO_NO_COMP_NO_MAYBE_CC]);
-        printf("minNoNoEmpty: 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_NO_NO_EMPTY]);
-        printf("limitNoNo: 0x%04x\n", (int)indexes[Normalizer2Impl::IX_LIMIT_NO_NO]);
-        printf("minNoNoDelta: 0x%04x\n", (int)minNoNoDelta);
-        printf("minMaybeYes: 0x%04x\n", (int)indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]);
+        printf("minDecompNoCodePoint: U+%04lX\n",
+               (long)indexes[Normalizer2Impl::IX_MIN_DECOMP_NO_CP]);
+        printf("minCompNoMaybeCodePoint: U+%04lX\n",
+               (long)indexes[Normalizer2Impl::IX_MIN_COMP_NO_MAYBE_CP]);
+        printf("minLcccCodePoint: U+%04lX\n",
+               (long)indexes[Normalizer2Impl::IX_MIN_LCCC_CP]);
+        printf("minYesNo: (with compositions) 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_YES_NO]);
+        printf("minYesNoMappingsOnly: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_YES_NO_MAPPINGS_ONLY]);
+        printf("minNoNo: (comp-normalized) 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_NO_NO]);
+        printf("minNoNoCompBoundaryBefore: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE]);
+        printf("minNoNoCompNoMaybeCC: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_NO_NO_COMP_NO_MAYBE_CC]);
+        printf("minNoNoEmpty: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_NO_NO_EMPTY]);
+        printf("limitNoNo: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_LIMIT_NO_NO]);
+        printf("minNoNoDelta: 0x%04x\n",
+               (int)minNoNoDelta);
+        printf("minMaybeNo: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_MAYBE_NO]);
+        printf("minMaybeNoCombinesFwd: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_MAYBE_NO_COMBINES_FWD]);
+        printf("minMaybeYes: 0x%04x\n",
+               (int)indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]);
     }
 
     UVersionInfo nullVersion={ 0, 0, 0, 0 };

@@ -92,7 +92,7 @@ private:
 
     void setSmallFCD(UChar32 c);
     int32_t getMinNoNoDelta() const {
-        return indexes[Normalizer2Impl::IX_MIN_MAYBE_YES]-
+        return indexes[Normalizer2Impl::IX_MIN_MAYBE_NO]-
             ((2*Normalizer2Impl::MAX_DELTA+1)<<Normalizer2Impl::DELTA_SHIFT);
     }
     void writeNorm16(UMutableCPTrie *norm16Trie, UChar32 start, UChar32 end, Norm &norm);

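The arithmetic in `getMinNoNoDelta()` can be checked on its own. This is a minimal sketch under stated assumptions: the diff names `Normalizer2Impl::MAX_DELTA` and `Normalizer2Impl::DELTA_SHIFT` but does not show their values, so `kMaxDelta` and `kDeltaShift` below are assumed constants for illustration, and `minNoNoDelta` is a hypothetical free function.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only; the diff does not show them.
constexpr int32_t kMaxDelta = 0x40;
constexpr int32_t kDeltaShift = 3;

// Start of the algorithmic-delta range, placed directly below the start of
// the MaybeNo range. 2*kMaxDelta+1 is presumably one slot per delta in
// -kMaxDelta..+kMaxDelta, each scaled by 1<<kDeltaShift.
constexpr int32_t minNoNoDelta(int32_t minMaybeNo) {
    return minMaybeNo - ((2 * kMaxDelta + 1) << kDeltaShift);
}
```

With these assumed constants the reserved span is `(2*0x40+1)<<3 = 0x408`, so a MaybeNo start of `0xf000` would give `0xebf8`. The one-line change in the hunk reads as anchoring that span to `IX_MIN_MAYBE_NO` because the new MaybeNo range now sits between the delta values and `IX_MIN_MAYBE_YES`.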
@@ -245,6 +245,14 @@ void Decomposer::rangeHandler(UChar32 start, UChar32 end, Norm &norm) {
                 exit(U_INVALID_FORMAT_ERROR);
             }
             const Norm &cNorm=norms.getNormRef(c);
+            if(norm.mappingType==Norm::ROUND_TRIP && prev==0 &&
+                    !norm.combinesBack && cNorm.combinesBack) {
+                // If a two-way mapping starts with an NFC_QC=Maybe character,
+                // then mark the composite as NFC_QC=Maybe as well,
+                // so that we trigger decomposition and recomposition.
+                norm.combinesBack=true;
+                didDecompose|=true;
+            }
             if(cNorm.hasMapping()) {
                 if(norm.mappingType==Norm::ROUND_TRIP) {
                     if(prev==0) {

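The condition added in this hunk can be read as a small propagation rule. The sketch below is a heavily simplified illustration, not the gennorm2 data model: `SimpleNorm`, `MappingType`, and `propagateCombinesBack` are hypothetical names, and `isFirst` stands in for the `prev==0` test, which is assumed here to mean that the first character of the mapping is being examined.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for gennorm2's per-character data.
enum class MappingType { NONE, ROUND_TRIP, ONE_WAY };

struct SimpleNorm {
    MappingType mappingType = MappingType::NONE;
    bool combinesBack = false;  // roughly: NFC_QC=Maybe
};

// If a two-way mapping starts with a character that combines backward,
// mark the composite as combining backward too.
// Returns true when the composite was newly marked, mirroring the way the
// hunk sets didDecompose so that another pass is made.
inline bool propagateCombinesBack(SimpleNorm &composite,
                                  const SimpleNorm &first,
                                  bool isFirst) {
    if (composite.mappingType == MappingType::ROUND_TRIP && isFirst &&
            !composite.combinesBack && first.combinesBack) {
        composite.combinesBack = true;
        return true;
    }
    return false;
}
```

The `!composite.combinesBack` guard makes the rule idempotent: a second call on the same composite reports no change, which is what lets a fixed-point loop terminate.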
@@ -70,6 +70,7 @@ struct Norm {
         }
     }
 
+    bool combinesFwd() const { return compositions!=nullptr; }
     const CompositionPair *getCompositionPairs(int32_t &length) const {
         if(compositions==nullptr) {
             length=0;
@@ -97,7 +98,7 @@ struct Norm {
      * Set after most processing is done.
      *
      * Corresponds to the rows in the chart on
-     * https://icu.unicode.org/design/normalization/custom
+     * https://unicode-org.github.io/icu/design/normalization/custom.html
      * in numerical (but reverse visual) order.
      *
      * YES_NO means composition quick check=yes, decomposition QC=no -- etc.
@@ -123,10 +124,14 @@ struct Norm {
         NO_NO_EMPTY,
         /** Has an algorithmic one-way mapping to a single code point. */
         NO_NO_DELTA,
+        /** Has a two-way mapping which starts with a character that combines backward. */
+        MAYBE_NO_MAPPING_ONLY,
         /**
-         * Combines both backward and forward, has compositions.
-         * Allowed, but not normally used.
+         * Has a two-way mapping which starts with a character that combines backward.
+         * Also combines forward.
          */
+        MAYBE_NO_COMBINES_FWD,
+        /** Combines both backward and forward, has compositions. */
         MAYBE_YES_COMBINES_FWD,
         /** Combines only backward. */
         MAYBE_YES_SIMPLE,

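The four "Maybe" enum comments above suggest a two-flag classification for characters that combine backward. The following sketch only restates those comments as code; `MaybeType` and `classifyMaybe` are hypothetical names, and the real gennorm2 logic may consider more properties than the two flags used here.

```cpp
#include <cassert>

// Mirrors the four Maybe* enumerators documented in the hunk.
enum class MaybeType {
    MAYBE_NO_MAPPING_ONLY,
    MAYBE_NO_COMBINES_FWD,
    MAYBE_YES_COMBINES_FWD,
    MAYBE_YES_SIMPLE
};

// Assumed precondition: the character combines backward.
// hasTwoWayMapping: it has a two-way mapping that starts with a character
// which combines backward. combinesFwd: it has compositions of its own.
constexpr MaybeType classifyMaybe(bool hasTwoWayMapping, bool combinesFwd) {
    return hasTwoWayMapping
        ? (combinesFwd ? MaybeType::MAYBE_NO_COMBINES_FWD
                       : MaybeType::MAYBE_NO_MAPPING_ONLY)
        : (combinesFwd ? MaybeType::MAYBE_YES_COMBINES_FWD
                       : MaybeType::MAYBE_YES_SIMPLE);
}
```

Read this way, the "No" in `MAYBE_NO_*` corresponds to the presence of a mapping, and the `combinesFwd()` accessor added to `struct Norm` supplies the second flag.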