mirror of
https://github.com/unicode-org/icu.git
synced 2025-04-13 08:53:20 +00:00
ICU-1955 doc update
X-SVN-Rev: 8924
parent
6939f0b1bb
commit
d2500d9618
4 changed files with 2487 additions and 2503 deletions
|
@ -5,8 +5,8 @@
|
|||
*******************************************************************************
|
||||
*
|
||||
* $Source: /xsrl/Nsvn/icu/icu4j/src/com/ibm/icu/text/Attic/BOSCU.java,v $
|
||||
* $Date: 2002/06/20 01:21:18 $
|
||||
* $Revision: 1.2 $
|
||||
* $Date: 2002/06/22 07:23:45 $
|
||||
* $Revision: 1.3 $
|
||||
*
|
||||
*******************************************************************************
|
||||
*/
|
||||
|
@ -17,366 +17,367 @@ import com.ibm.icu.impl.UnicodeCharacterIterator;
|
|||
/**
|
||||
* <p>Binary Ordered Compression Scheme for Unicode</p>
|
||||
*
|
||||
* <p>Specific application:<br>
|
||||
* Encode a Unicode string for the identical level of a sort key.<br>
|
||||
* Restrictions:
|
||||
* <ul>
|
||||
* <li> byte stream (unsigned 8-bit bytes)
|
||||
* <li> lexical order of the identical-level run must be the same as code
|
||||
* point order for the string
|
||||
* <li> avoid byte values 0, 1, 2
|
||||
* </ul>
|
||||
* </p>
|
||||
* <p>(Syn Wee: reference a paper if we have one on our site)</p>
|
||||
* <p>BOCU is used to compress Unicode text into a stream of unsigned
|
||||
* bytes. For many kinds of text the compression compares favorably
|
||||
* to UTF-8, and for some kinds of text (such as CJK) it does better.
|
||||
* The resulting bytes will compare in the same order as the original
|
||||
* code points. The byte stream does not contain the values 0, 1, or
|
||||
* 2. (Syn Wee, I don't understand the comment later in the source
|
||||
* about these values being used in sort keys, can you explain?)</p>
|
||||
*
|
||||
* <p>Unlike a UTF encoding, BOCU-compressed text is not suitable for
|
||||
* random access.</p>
|
||||
*
|
||||
* <p>Method: Slope Detection<br>
|
||||
* Remember the previous code point (initial 0).
|
||||
* For each cp in the string, encode the difference to the previous one.
|
||||
* </p>
|
||||
* <p>With a compact encoding of differences, this yields good results for
|
||||
* small scripts and UTF-like results otherwise.
|
||||
* </p>
|
||||
* <p>Encoding of differences:<br>
|
||||
* <ul>
|
||||
* <li>Similar to a UTF, encoding the length of the byte sequence in the lead
|
||||
* bytes.
|
||||
* <li> Does not need to be friendly for decoding or random access
|
||||
* (trail byte values may overlap with lead/single byte values).
|
||||
* <li> The signedness must be encoded as the most significant part.
|
||||
* </ul>
|
||||
* </p>
|
||||
* <p>We encode differences with few bytes if their absolute values are small.
|
||||
* For correct ordering, we must treat the entire value range -10ffff..+10ffff
|
||||
* in ascending order, which forbids encoding the sign and the absolute value
|
||||
* separately.
|
||||
* Instead, we split the lead byte range in the middle and encode non-negative
|
||||
* values going up and negative values going down.
|
||||
* </p>
|
||||
* <p>For very small absolute values, the difference is added to a middle byte
|
||||
* value for single-byte encoded differences.
|
||||
* For somewhat larger absolute values, the difference is divided by the number
|
||||
* of byte values available, the modulo is used for one trail byte, and the
|
||||
* remainder is added to a lead byte avoiding the single-byte range.
|
||||
* For large absolute values, the difference is similarly encoded in three
|
||||
* bytes.
|
||||
* </p>
|
||||
* <p>This encoding does not use byte values 0, 1, 2, but uses all other byte
|
||||
* values for lead/single bytes so that the middle range of single bytes is as
|
||||
* large as possible.
|
||||
* </p>
|
||||
* <p>Note that the lead byte ranges overlap some, but that the sequences as a
|
||||
* whole are well ordered. I.e., even if the lead byte is the same for
|
||||
* sequences of different lengths, the trail bytes establish correct order.
|
||||
* It would be possible to encode slightly larger ranges for each length (>1)
|
||||
* by subtracting the lower bound of the range. However, that would also slow
|
||||
* down the calculation.
|
||||
* </p>
|
||||
* <p>For the actual string encoding, an optimization moves the previous code
|
||||
* point value to the middle of its Unicode script block to minimize the
|
||||
* differences in same-script text runs.
|
||||
* </p>
|
||||
* <p>Method: Slope Detection<br> Remember the previous code point
|
||||
* (initial 0). For each code point in the string, encode the
|
||||
* difference with the previous one. Similar to a UTF, the length of
|
||||
* the byte sequence is encoded in the lead bytes. Unlike a UTF, the
|
||||
* trail byte values may overlap with lead/single byte values. The
|
||||
* signedness of the difference must be encoded as the most
|
||||
* significant part.</p>
|
||||
*
|
||||
* <p>We encode differences with few bytes if their absolute values
|
||||
* are small. For correct ordering, we must treat the entire value
|
||||
* range -10ffff..+10ffff in ascending order, which forbids encoding
|
||||
* the sign and the absolute value separately. Instead, we split the
|
||||
* lead byte range in the middle and encode non-negative values going
|
||||
* up and negative values going down.</p>
|
||||
*
|
||||
* <p>For very small absolute values, the difference is added to a
|
||||
* middle byte value for single-byte encoded differences. For
|
||||
* somewhat larger absolute values, the difference is divided by the
|
||||
* number of byte values available, the modulo is used for one trail
|
||||
* byte, and the remainder is added to a lead byte avoiding the
|
||||
* single-byte range. For large absolute values, the difference is
|
||||
* similarly encoded in three bytes. (Syn Wee, I need examples
|
||||
* here.)</p>
|
||||
*
|
||||
* <p>BOCU does not use byte values 0, 1, or 2, but uses all other
|
||||
* byte values for lead and single bytes, so that the middle range of
|
||||
* single bytes is as large as possible.</p>
|
||||
*
|
||||
* <p>Note that the lead byte ranges overlap some, but that the
|
||||
* sequences as a whole are well ordered. I.e., even if the lead byte
|
||||
* is the same for sequences of different lengths, the trail bytes
|
||||
* establish correct order. It would be possible to encode slightly
|
||||
* larger ranges for each length (>1) by subtracting the lower bound
|
||||
* of the range. However, that would also slow down the calculation.
|
||||
* (Syn Wee, need an example).</p>
|
||||
*
|
||||
* <p>For the actual string encoding, an optimization moves the
|
||||
* previous code point value to the middle of its Unicode script block
|
||||
* to minimize the differences in same-script text runs. (Syn Wee,
|
||||
* need an example.)</p>
|
||||
*
|
||||
* @author Syn Wee Quek
|
||||
* @since release 2.2, May 3rd 2002
|
||||
* @draft 2.2
|
||||
*/
|
||||
* @draft 2.2 */
|
||||
public class BOSCU
|
||||
{
|
||||
// public constructors --------------------------------------------------
|
||||
// public constructors --------------------------------------------------
|
||||
|
||||
// public methods -------------------------------------------------------
|
||||
|
||||
/**
|
||||
* <p>Encode the code points of a string as a sequence of byte-encoded
|
||||
* differences (slope detection), preserving lexical order.</p>
|
||||
* <p>Optimize the difference-taking for runs of Unicode text within
|
||||
* small scripts:<br>
|
||||
* Most small scripts are allocated within aligned 128-blocks of Unicode
|
||||
* code points. Lexical order is preserved if "prev" is always moved
|
||||
* into the middle of such a block.</p>
|
||||
* <p>Additionally, "prev" is moved from anywhere in the Unihan area into
|
||||
* the middle of that area.</p>
|
||||
* <p>Note that the identical-level run in a sort key is generated from
|
||||
* NFD text - there are never Hangul characters included.</p>
|
||||
* @param source text source
|
||||
* @param buffer output buffer
|
||||
* @param offset to start writing to
|
||||
* @return end offset where the writing stops
|
||||
*/
|
||||
public static int writeIdenticalLevelRun(String source, byte buffer[],
|
||||
int offset)
|
||||
{
|
||||
int prev = 0;
|
||||
UnicodeCharacterIterator iterator = new UnicodeCharacterIterator(source);
|
||||
int codepoint = iterator.nextCodePoint();
|
||||
while (codepoint != UnicodeCharacterIterator.DONE_CODEPOINT) {
|
||||
if (prev < 0x4e00 || prev >= 0xa000) {
|
||||
prev = (prev & ~0x7f) - SLOPE_REACH_NEG_1_;
|
||||
}
|
||||
else {
|
||||
// Unihan U+4e00..U+9fa5:
|
||||
// double-bytes down from the upper end
|
||||
prev = 0x9fff - SLOPE_REACH_POS_2_;
|
||||
}
|
||||
|
||||
offset = writeDiff(codepoint - prev, buffer, offset);
|
||||
prev = codepoint;
|
||||
codepoint = iterator.nextCodePoint();
|
||||
}
|
||||
return offset;
|
||||
}
|
||||
|
||||
/**
|
||||
* How many bytes would writeIdenticalLevelRun() write?
|
||||
* @param source text source string
|
||||
* @return the length of the BOSCU result
|
||||
*/
|
||||
public static int lengthOfIdenticalLevelRun(String source)
|
||||
{
|
||||
int prev = 0;
|
||||
int result = 0;
|
||||
UnicodeCharacterIterator iterator = new UnicodeCharacterIterator(source);
|
||||
int codepoint = iterator.nextCodePoint();
|
||||
while (codepoint != UnicodeCharacterIterator.DONE_CODEPOINT) {
|
||||
if (prev < 0x4e00 || prev >= 0xa000) {
|
||||
prev = (prev & ~0x7f) - SLOPE_REACH_NEG_1_;
|
||||
}
|
||||
else {
|
||||
// Unihan U+4e00..U+9fa5:
|
||||
// double-bytes down from the upper end
|
||||
prev = 0x9fff - SLOPE_REACH_POS_2_;
|
||||
}
|
||||
|
||||
result += lengthOfDiff(codepoint - prev);
|
||||
prev = codepoint;
|
||||
codepoint = iterator.nextCodePoint();
|
||||
}
|
||||
return result;
|
||||
}
|
||||
// public methods -------------------------------------------------------
|
||||
|
||||
/**
|
||||
* <p>(Syn Wee-- I think this should be renamed to 'compress')</p>
|
||||
* <p>Encode the code points of a string as a sequence of bytes,
|
||||
* preserving lexical order.</p>
|
||||
*
|
||||
* @param source text source
|
||||
* @param buffer output buffer
|
||||
* @param offset to start writing to
|
||||
* @return end offset where the writing stopped
|
||||
*/
|
||||
public static int writeIdenticalLevelRun(String source, byte buffer[],
|
||||
int offset)
|
||||
{
|
||||
// (Syn Wee - this is a public function so comments of this nature don't
|
||||
// really belong in the documentation, I think. So I moved them.)
|
||||
// Optimize the difference-taking for runs of Unicode text within
|
||||
// small scripts.
|
||||
// Most small scripts are allocated within aligned 128-blocks of Unicode
|
||||
// code points. Lexical order is preserved if "prev" is always moved
|
||||
// into the middle of such a block.
|
||||
// Additionally, "prev" is moved from anywhere in the Unihan area into
|
||||
// the middle of that area.
|
||||
// Note that the identical-level run in a sort key is generated from
|
||||
// NFD text - there are never Hangul characters included.
|
||||
|
||||
// public setter methods -------------------------------------------------
|
||||
|
||||
int prev = 0;
|
||||
UnicodeCharacterIterator iterator = new UnicodeCharacterIterator(source);
|
||||
int codepoint = iterator.nextCodePoint();
|
||||
while (codepoint != UnicodeCharacterIterator.DONE_CODEPOINT) {
|
||||
if (prev < 0x4e00 || prev >= 0xa000) {
|
||||
prev = (prev & ~0x7f) - SLOPE_REACH_NEG_1_;
|
||||
}
|
||||
else {
|
||||
// Unihan U+4e00..U+9fa5:
|
||||
// double-bytes down from the upper end
|
||||
prev = 0x9fff - SLOPE_REACH_POS_2_;
|
||||
}
|
||||
|
||||
offset = writeDiff(codepoint - prev, buffer, offset);
|
||||
prev = codepoint;
|
||||
codepoint = iterator.nextCodePoint();
|
||||
}
|
||||
return offset;
|
||||
}
|
||||
|
||||
/**
|
||||
* <p>(Syn Wee, I think this should be renamed getCompressedLength).</p>
|
||||
* Return the number of bytes that writeIdenticalLevelRun() would write.
|
||||
* @param source text source string
|
||||
* @return the length of the BOCU result
|
||||
*/
|
||||
public static int lengthOfIdenticalLevelRun(String source)
|
||||
{
|
||||
int prev = 0;
|
||||
int result = 0;
|
||||
UnicodeCharacterIterator iterator = new UnicodeCharacterIterator(source);
|
||||
int codepoint = iterator.nextCodePoint();
|
||||
while (codepoint != UnicodeCharacterIterator.DONE_CODEPOINT) {
|
||||
if (prev < 0x4e00 || prev >= 0xa000) {
|
||||
prev = (prev & ~0x7f) - SLOPE_REACH_NEG_1_;
|
||||
}
|
||||
else {
|
||||
// Unihan U+4e00..U+9fa5:
|
||||
// double-bytes down from the upper end
|
||||
prev = 0x9fff - SLOPE_REACH_POS_2_;
|
||||
}
|
||||
|
||||
result += lengthOfDiff(codepoint - prev);
|
||||
prev = codepoint;
|
||||
codepoint = iterator.nextCodePoint();
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
// public setter methods -------------------------------------------------
|
||||
|
||||
// public getter methods ------------------------------------------------
|
||||
|
||||
// public other methods -------------------------------------------------
|
||||
|
||||
// public other methods -------------------------------------------------
|
||||
|
||||
// protected constructor ------------------------------------------------
|
||||
|
||||
// protected data members ------------------------------------------------
|
||||
// protected data members ------------------------------------------------
|
||||
|
||||
// protected methods -----------------------------------------------------
|
||||
|
||||
// private data members --------------------------------------------------
|
||||
// private data members --------------------------------------------------
|
||||
|
||||
/**
|
||||
* Do not use byte values 0, 1, 2 because they are separators in sort keys.
|
||||
*/
|
||||
private static final int SLOPE_MIN_ = 3;
|
||||
private static final int SLOPE_MAX_ = 0xff;
|
||||
private static final int SLOPE_MIDDLE_ = 0x81;
|
||||
private static final int SLOPE_TAIL_COUNT_ = SLOPE_MAX_ - SLOPE_MIN_ + 1;
|
||||
private static final int SLOPE_MAX_BYTES_ = 4;
|
||||
private static final int SLOPE_MIN_ = 3;
|
||||
private static final int SLOPE_MAX_ = 0xff;
|
||||
private static final int SLOPE_MIDDLE_ = 0x81;
|
||||
private static final int SLOPE_TAIL_COUNT_ = SLOPE_MAX_ - SLOPE_MIN_ + 1;
|
||||
private static final int SLOPE_MAX_BYTES_ = 4;
|
||||
|
||||
/**
|
||||
* Number of lead bytes:
|
||||
* 1 middle byte for 0
|
||||
* 2*80=160 single bytes for !=0
|
||||
* 2*42=84 for double-byte values
|
||||
* 2*3=6 for 3-byte values
|
||||
* 2*1=2 for 4-byte values
|
||||
*
|
||||
* The sum must be <=SLOPE_TAIL_COUNT.
|
||||
*
|
||||
* Why these numbers?
|
||||
* - There should be >=128 single-byte values to cover 128-blocks
|
||||
* with small scripts.
|
||||
* - There should be >=20902 single/double-byte values to cover Unihan.
|
||||
* - It helps CJK Extension B some if there are 3-byte values that cover
|
||||
* the distance between them and Unihan.
|
||||
* This also helps to jump among distant places in the BMP.
|
||||
* - Four-byte values are necessary to cover the rest of Unicode.
|
||||
*
|
||||
* Symmetrical lead byte counts are for convenience.
|
||||
* With an equal distribution of even and odd differences there is also
|
||||
* no advantage to asymmetrical lead byte counts.
|
||||
*/
|
||||
private static final int SLOPE_SINGLE_ = 80;
|
||||
private static final int SLOPE_LEAD_2_ = 42;
|
||||
private static final int SLOPE_LEAD_3_ = 3;
|
||||
private static final int SLOPE_LEAD_4_ = 1;
|
||||
/**
|
||||
* Number of lead bytes:
|
||||
* 1 middle byte for 0
|
||||
* 2*80=160 single bytes for !=0
|
||||
* 2*42=84 for double-byte values
|
||||
* 2*3=6 for 3-byte values
|
||||
* 2*1=2 for 4-byte values
|
||||
*
|
||||
* The sum must be <=SLOPE_TAIL_COUNT.
|
||||
*
|
||||
* Why these numbers?
|
||||
* - There should be >=128 single-byte values to cover 128-blocks
|
||||
* with small scripts.
|
||||
* - There should be >=20902 single/double-byte values to cover Unihan.
|
||||
* - It helps CJK Extension B some if there are 3-byte values that cover
|
||||
* the distance between them and Unihan.
|
||||
* This also helps to jump among distant places in the BMP.
|
||||
* - Four-byte values are necessary to cover the rest of Unicode.
|
||||
*
|
||||
* Symmetrical lead byte counts are for convenience.
|
||||
* With an equal distribution of even and odd differences there is also
|
||||
* no advantage to asymmetrical lead byte counts.
|
||||
*/
|
||||
private static final int SLOPE_SINGLE_ = 80;
|
||||
private static final int SLOPE_LEAD_2_ = 42;
|
||||
private static final int SLOPE_LEAD_3_ = 3;
|
||||
private static final int SLOPE_LEAD_4_ = 1;
|
||||
|
||||
/**
|
||||
* The difference value range for single-byters.
|
||||
*/
|
||||
private static final int SLOPE_REACH_POS_1_ = SLOPE_SINGLE_;
|
||||
private static final int SLOPE_REACH_NEG_1_ = (-SLOPE_SINGLE_);
|
||||
/**
|
||||
* The difference value range for single-byters.
|
||||
*/
|
||||
private static final int SLOPE_REACH_POS_1_ = SLOPE_SINGLE_;
|
||||
private static final int SLOPE_REACH_NEG_1_ = (-SLOPE_SINGLE_);
|
||||
|
||||
/**
|
||||
* The difference value range for double-byters.
|
||||
*/
|
||||
private static final int SLOPE_REACH_POS_2_ =
|
||||
SLOPE_LEAD_2_ * SLOPE_TAIL_COUNT_ + SLOPE_LEAD_2_ - 1;
|
||||
private static final int SLOPE_REACH_NEG_2_ = (-SLOPE_REACH_POS_2_ - 1);
|
||||
/**
|
||||
* The difference value range for double-byters.
|
||||
*/
|
||||
private static final int SLOPE_REACH_POS_2_ =
|
||||
SLOPE_LEAD_2_ * SLOPE_TAIL_COUNT_ + SLOPE_LEAD_2_ - 1;
|
||||
private static final int SLOPE_REACH_NEG_2_ = (-SLOPE_REACH_POS_2_ - 1);
|
||||
|
||||
/**
|
||||
* The difference value range for 3-byters.
|
||||
*/
|
||||
private static final int SLOPE_REACH_POS_3_ = SLOPE_LEAD_3_
|
||||
* SLOPE_TAIL_COUNT_
|
||||
* SLOPE_TAIL_COUNT_
|
||||
+ (SLOPE_LEAD_3_ - 1)
|
||||
* SLOPE_TAIL_COUNT_ +
|
||||
(SLOPE_TAIL_COUNT_ - 1);
|
||||
private static final int SLOPE_REACH_NEG_3_ = (-SLOPE_REACH_POS_3_ - 1);
|
||||
/**
|
||||
* The difference value range for 3-byters.
|
||||
*/
|
||||
private static final int SLOPE_REACH_POS_3_ = SLOPE_LEAD_3_
|
||||
* SLOPE_TAIL_COUNT_
|
||||
* SLOPE_TAIL_COUNT_
|
||||
+ (SLOPE_LEAD_3_ - 1)
|
||||
* SLOPE_TAIL_COUNT_ +
|
||||
(SLOPE_TAIL_COUNT_ - 1);
|
||||
private static final int SLOPE_REACH_NEG_3_ = (-SLOPE_REACH_POS_3_ - 1);
|
||||
|
||||
/**
|
||||
* The lead byte start values.
|
||||
*/
|
||||
private static final int SLOPE_START_POS_2_ = SLOPE_MIDDLE_
|
||||
+ SLOPE_SINGLE_ + 1;
|
||||
private static final int SLOPE_START_POS_3_ = SLOPE_START_POS_2_
|
||||
+ SLOPE_LEAD_2_;
|
||||
private static final int SLOPE_START_NEG_2_ = SLOPE_MIDDLE_ +
|
||||
SLOPE_REACH_NEG_1_;
|
||||
private static final int SLOPE_START_NEG_3_ = SLOPE_START_NEG_2_
|
||||
- SLOPE_LEAD_2_;
|
||||
|
||||
// private constructor ---------------------------------------------------
|
||||
|
||||
/**
|
||||
* Constructor private to prevent instantiation
|
||||
*/
|
||||
private BOSCU()
|
||||
{
|
||||
}
|
||||
/**
|
||||
* The lead byte start values.
|
||||
*/
|
||||
private static final int SLOPE_START_POS_2_ = SLOPE_MIDDLE_
|
||||
+ SLOPE_SINGLE_ + 1;
|
||||
private static final int SLOPE_START_POS_3_ = SLOPE_START_POS_2_
|
||||
+ SLOPE_LEAD_2_;
|
||||
private static final int SLOPE_START_NEG_2_ = SLOPE_MIDDLE_ +
|
||||
SLOPE_REACH_NEG_1_;
|
||||
private static final int SLOPE_START_NEG_3_ = SLOPE_START_NEG_2_
|
||||
- SLOPE_LEAD_2_;
|
||||
|
||||
// private constructor ---------------------------------------------------
|
||||
|
||||
/**
|
||||
* Constructor private to prevent instantiation
|
||||
*/
|
||||
private BOSCU()
|
||||
{
|
||||
}
|
||||
|
||||
// private methods -------------------------------------------------------
|
||||
|
||||
/**
|
||||
* Integer division and modulo with negative numerators
|
||||
* yield negative modulo results and quotients that are one more than
|
||||
* what we need here.
|
||||
* @param number which operations are to be performed on
|
||||
* @param factor the factor to use for division
|
||||
* @return (result of division) << 32 | modulo
|
||||
*/
|
||||
private static final long getNegDivMod(int number, int factor)
|
||||
{
|
||||
int modulo = number % factor;
|
||||
long result = number / factor;
|
||||
if (modulo < 0) {
|
||||
-- result;
|
||||
modulo += factor;
|
||||
}
|
||||
return (result << 32) | modulo;
|
||||
}
|
||||
|
||||
/**
|
||||
* Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes,
|
||||
* preserving lexical order
|
||||
* @param diff
|
||||
* @param buffer byte buffer to append to
|
||||
* @param offset to the byte buffer to start appending
|
||||
* @return end offset where the appending stops
|
||||
*/
|
||||
private static final int writeDiff(int diff, byte buffer[], int offset)
|
||||
{
|
||||
if (diff >= SLOPE_REACH_NEG_1_) {
|
||||
if (diff <= SLOPE_REACH_POS_1_) {
|
||||
buffer[offset ++] = (byte)(SLOPE_MIDDLE_ + diff);
|
||||
}
|
||||
else if (diff <= SLOPE_REACH_POS_2_) {
|
||||
buffer[offset ++] = (byte)(SLOPE_START_POS_2_
|
||||
+ (diff / SLOPE_TAIL_COUNT_));
|
||||
buffer[offset ++] = (byte)(SLOPE_MIN_ +
|
||||
(diff % SLOPE_TAIL_COUNT_));
|
||||
}
|
||||
else if (diff <= SLOPE_REACH_POS_3_) {
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_
|
||||
+ (diff % SLOPE_TAIL_COUNT_));
|
||||
diff /= SLOPE_TAIL_COUNT_;
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_
|
||||
+ (diff % SLOPE_TAIL_COUNT_));
|
||||
buffer[offset] = (byte)(SLOPE_START_POS_3_
|
||||
+ (diff / SLOPE_TAIL_COUNT_));
|
||||
offset += 3;
|
||||
}
|
||||
else {
|
||||
buffer[offset + 3] = (byte)(SLOPE_MIN_
|
||||
+ diff % SLOPE_TAIL_COUNT_);
|
||||
diff /= SLOPE_TAIL_COUNT_;
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_
|
||||
+ diff % SLOPE_TAIL_COUNT_);
|
||||
diff /= SLOPE_TAIL_COUNT_;
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_
|
||||
+ diff % SLOPE_TAIL_COUNT_);
|
||||
buffer[offset] = (byte)SLOPE_MAX_;
|
||||
offset += 4;
|
||||
}
|
||||
}
|
||||
else {
|
||||
long division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
int modulo = (int)division;
|
||||
if (diff >= SLOPE_REACH_NEG_2_) {
|
||||
diff = (int)(division >> 32);
|
||||
buffer[offset ++] = (byte)(SLOPE_START_NEG_2_ + diff);
|
||||
buffer[offset ++] = (byte)(SLOPE_MIN_ + modulo);
|
||||
}
|
||||
else if (diff >= SLOPE_REACH_NEG_3_) {
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_ + modulo);
|
||||
diff = (int)(division >> 32);
|
||||
division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
modulo = (int)division;
|
||||
diff = (int)(division >> 32);
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_ + modulo);
|
||||
buffer[offset] = (byte)(SLOPE_START_NEG_3_ + diff);
|
||||
offset += 3;
|
||||
}
|
||||
else {
|
||||
buffer[offset + 3] = (byte)(SLOPE_MIN_ + modulo);
|
||||
diff = (int)(division >> 32);
|
||||
division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
modulo = (int)division;
|
||||
diff = (int)(division >> 32);
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_ + modulo);
|
||||
division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
modulo = (int)division;
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_ + modulo);
|
||||
buffer[offset] = SLOPE_MIN_;
|
||||
offset += 4;
|
||||
}
|
||||
}
|
||||
return offset;
|
||||
}
|
||||
|
||||
/**
|
||||
* How many bytes would writeDiff() write?
|
||||
* @param diff
|
||||
*/
|
||||
private static final int lengthOfDiff(int diff)
|
||||
{
|
||||
if (diff >= SLOPE_REACH_NEG_1_) {
|
||||
if (diff <= SLOPE_REACH_POS_1_) {
|
||||
return 1;
|
||||
}
|
||||
else if (diff <= SLOPE_REACH_POS_2_) {
|
||||
return 2;
|
||||
}
|
||||
else if(diff <= SLOPE_REACH_POS_3_) {
|
||||
return 3;
|
||||
}
|
||||
else {
|
||||
return 4;
|
||||
}
|
||||
}
|
||||
else {
|
||||
if (diff >= SLOPE_REACH_NEG_2_) {
|
||||
return 2;
|
||||
}
|
||||
else if (diff >= SLOPE_REACH_NEG_3_) {
|
||||
return 3;
|
||||
}
|
||||
else {
|
||||
return 4;
|
||||
}
|
||||
}
|
||||
}
|
||||
* Integer division and modulo with negative numerators
|
||||
* yield negative modulo results and quotients that are one more than
|
||||
* what we need here.
|
||||
* @param number which operations are to be performed on
|
||||
* @param factor the factor to use for division
|
||||
* @return (result of division) << 32 | modulo
|
||||
*/
|
||||
private static final long getNegDivMod(int number, int factor)
|
||||
{
|
||||
int modulo = number % factor;
|
||||
long result = number / factor;
|
||||
if (modulo < 0) {
|
||||
-- result;
|
||||
modulo += factor;
|
||||
}
|
||||
return (result << 32) | modulo;
|
||||
}
|
||||
|
||||
/**
|
||||
* Encode one difference value -0x10ffff..+0x10ffff in 1..4 bytes,
|
||||
* preserving lexical order
|
||||
* @param diff
|
||||
* @param buffer byte buffer to append to
|
||||
* @param offset to the byte buffer to start appending
|
||||
* @return end offset where the appending stops
|
||||
*/
|
||||
private static final int writeDiff(int diff, byte buffer[], int offset)
|
||||
{
|
||||
if (diff >= SLOPE_REACH_NEG_1_) {
|
||||
if (diff <= SLOPE_REACH_POS_1_) {
|
||||
buffer[offset ++] = (byte)(SLOPE_MIDDLE_ + diff);
|
||||
}
|
||||
else if (diff <= SLOPE_REACH_POS_2_) {
|
||||
buffer[offset ++] = (byte)(SLOPE_START_POS_2_
|
||||
+ (diff / SLOPE_TAIL_COUNT_));
|
||||
buffer[offset ++] = (byte)(SLOPE_MIN_ +
|
||||
(diff % SLOPE_TAIL_COUNT_));
|
||||
}
|
||||
else if (diff <= SLOPE_REACH_POS_3_) {
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_
|
||||
+ (diff % SLOPE_TAIL_COUNT_));
|
||||
diff /= SLOPE_TAIL_COUNT_;
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_
|
||||
+ (diff % SLOPE_TAIL_COUNT_));
|
||||
buffer[offset] = (byte)(SLOPE_START_POS_3_
|
||||
+ (diff / SLOPE_TAIL_COUNT_));
|
||||
offset += 3;
|
||||
}
|
||||
else {
|
||||
buffer[offset + 3] = (byte)(SLOPE_MIN_
|
||||
+ diff % SLOPE_TAIL_COUNT_);
|
||||
diff /= SLOPE_TAIL_COUNT_;
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_
|
||||
+ diff % SLOPE_TAIL_COUNT_);
|
||||
diff /= SLOPE_TAIL_COUNT_;
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_
|
||||
+ diff % SLOPE_TAIL_COUNT_);
|
||||
buffer[offset] = (byte)SLOPE_MAX_;
|
||||
offset += 4;
|
||||
}
|
||||
}
|
||||
else {
|
||||
long division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
int modulo = (int)division;
|
||||
if (diff >= SLOPE_REACH_NEG_2_) {
|
||||
diff = (int)(division >> 32);
|
||||
buffer[offset ++] = (byte)(SLOPE_START_NEG_2_ + diff);
|
||||
buffer[offset ++] = (byte)(SLOPE_MIN_ + modulo);
|
||||
}
|
||||
else if (diff >= SLOPE_REACH_NEG_3_) {
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_ + modulo);
|
||||
diff = (int)(division >> 32);
|
||||
division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
modulo = (int)division;
|
||||
diff = (int)(division >> 32);
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_ + modulo);
|
||||
buffer[offset] = (byte)(SLOPE_START_NEG_3_ + diff);
|
||||
offset += 3;
|
||||
}
|
||||
else {
|
||||
buffer[offset + 3] = (byte)(SLOPE_MIN_ + modulo);
|
||||
diff = (int)(division >> 32);
|
||||
division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
modulo = (int)division;
|
||||
diff = (int)(division >> 32);
|
||||
buffer[offset + 2] = (byte)(SLOPE_MIN_ + modulo);
|
||||
division = getNegDivMod(diff, SLOPE_TAIL_COUNT_);
|
||||
modulo = (int)division;
|
||||
buffer[offset + 1] = (byte)(SLOPE_MIN_ + modulo);
|
||||
buffer[offset] = SLOPE_MIN_;
|
||||
offset += 4;
|
||||
}
|
||||
}
|
||||
return offset;
|
||||
}
|
||||
|
||||
/**
|
||||
* How many bytes would writeDiff() write?
|
||||
* @param diff
|
||||
*/
|
||||
private static final int lengthOfDiff(int diff)
|
||||
{
|
||||
if (diff >= SLOPE_REACH_NEG_1_) {
|
||||
if (diff <= SLOPE_REACH_POS_1_) {
|
||||
return 1;
|
||||
}
|
||||
else if (diff <= SLOPE_REACH_POS_2_) {
|
||||
return 2;
|
||||
}
|
||||
else if(diff <= SLOPE_REACH_POS_3_) {
|
||||
return 3;
|
||||
}
|
||||
else {
|
||||
return 4;
|
||||
}
|
||||
}
|
||||
else {
|
||||
if (diff >= SLOPE_REACH_NEG_2_) {
|
||||
return 2;
|
||||
}
|
||||
else if (diff >= SLOPE_REACH_NEG_3_) {
|
||||
return 3;
|
||||
}
|
||||
else {
|
||||
return 4;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
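The class comment above asks for examples of the difference encoding. The following is a minimal sketch, not the ICU4J source: it copies the `SLOPE_*` values from the diff into a hypothetical class `SlopeSketch`, replaces `getNegDivMod` with `Math.floorDiv`/`Math.floorMod`, and returns a fresh array instead of writing into a caller's buffer. The helper names `encodeDiff`, `compareUnsigned` and `hex` are assumptions made for illustration.

```java
/** Sketch of the BOSCU difference encoder; illustrative only. */
final class SlopeSketch {
    static final int MIN = 3, MAX = 0xff, MIDDLE = 0x81;
    static final int TAIL_COUNT = MAX - MIN + 1;                      // 253
    static final int SINGLE = 80, LEAD_2 = 42, LEAD_3 = 3;
    static final int REACH_POS_1 = SINGLE, REACH_NEG_1 = -SINGLE;
    static final int REACH_POS_2 = LEAD_2 * TAIL_COUNT + LEAD_2 - 1;  // 10667
    static final int REACH_NEG_2 = -REACH_POS_2 - 1;
    static final int REACH_POS_3 = LEAD_3 * TAIL_COUNT * TAIL_COUNT
            + (LEAD_3 - 1) * TAIL_COUNT + (TAIL_COUNT - 1);           // 192785
    static final int REACH_NEG_3 = -REACH_POS_3 - 1;
    static final int START_POS_2 = MIDDLE + SINGLE + 1;               // 210
    static final int START_POS_3 = START_POS_2 + LEAD_2;              // 252
    static final int START_NEG_2 = MIDDLE + REACH_NEG_1;              // 49
    static final int START_NEG_3 = START_NEG_2 - LEAD_2;              // 7

    /** Encodes one code point difference in 1 to 4 bytes. */
    static byte[] encodeDiff(int diff) {
        if (diff >= REACH_NEG_1 && diff <= REACH_POS_1) {
            return new byte[] {(byte) (MIDDLE + diff)};
        }
        int length;
        int leadStart;
        if (diff > 0) {
            if (diff <= REACH_POS_2) { length = 2; leadStart = START_POS_2; }
            else if (diff <= REACH_POS_3) { length = 3; leadStart = START_POS_3; }
            else { length = 4; leadStart = MAX; }
        } else {
            if (diff >= REACH_NEG_2) { length = 2; leadStart = START_NEG_2; }
            else if (diff >= REACH_NEG_3) { length = 3; leadStart = START_NEG_3; }
            else { length = 4; leadStart = MIN; }
        }
        byte[] out = new byte[length];
        int rest = diff;
        // Trail bytes are base-253 digits, filled from the last byte backwards.
        for (int i = length - 1; i >= 1; i--) {
            out[i] = (byte) (MIN + Math.floorMod(rest, TAIL_COUNT));
            rest = Math.floorDiv(rest, TAIL_COUNT);
        }
        // Four-byte sequences use a fixed lead byte; shorter ones add the quotient.
        out[0] = (byte) (length == 4 ? leadStart : leadStart + rest);
        return out;
    }

    /** Compares two byte arrays as unsigned values, the way sort keys compare. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int diff : new int[] {0, 80, -80, 81, -81, 10668}) {
            System.out.println(diff + " -> " + hex(encodeDiff(diff)));
        }
    }
}
```

With these constants, a difference of 0 becomes the single byte `81`, +80 and -80 become `D1` and `31`, +81 and -81 become the two-byte sequences `D2 54` and `30 AF`, and +10668 becomes `FC 2D 2D`. The largest two-byte value, +10667, also starts with `FC`, which illustrates the overlapping lead byte ranges mentioned in the class comment: the trail byte `2C` still sorts it below `FC 2D 2D`.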
|
||||
|
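The class comment also asks for an example of moving the previous code point to the middle of its block. The sketch below is an illustration under assumptions: the class `RunLengthSketch` and its method names are hypothetical, the numeric limits are the `SLOPE_REACH_*` values from the diff written out as literals, and the JDK's `String.codePointAt` stands in for `UnicodeCharacterIterator`.

```java
/** Sketch of the "prev" adjustment and the resulting run length; illustrative only. */
final class RunLengthSketch {
    static final int SINGLE = 80;                                   // SLOPE_SINGLE_
    static final int REACH_POS_2 = 42 * 253 + 42 - 1;               // 10667
    static final int REACH_POS_3 = 3 * 253 * 253 + 2 * 253 + 252;   // 192785

    /** Moves prev into its aligned 128-block, or to a fixed point inside Unihan. */
    static int adjustPrev(int prev) {
        if (prev < 0x4e00 || prev >= 0xa000) {
            // Same value as (prev & ~0x7f) - SLOPE_REACH_NEG_1_ in the diff.
            return (prev & ~0x7f) + SINGLE;
        }
        // Unihan: every code point in the block is then within two-byte reach.
        return 0x9fff - REACH_POS_2;
    }

    /** Number of bytes needed for one difference. */
    static int lengthOfDiff(int diff) {
        if (diff >= -SINGLE && diff <= SINGLE) return 1;
        if (diff >= -REACH_POS_2 - 1 && diff <= REACH_POS_2) return 2;
        if (diff >= -REACH_POS_3 - 1 && diff <= REACH_POS_3) return 3;
        return 4;
    }

    /** Number of bytes needed for a whole string. */
    static int lengthOfRun(String source) {
        int prev = 0;
        int result = 0;
        for (int i = 0; i < source.length(); ) {
            int cp = source.codePointAt(i);
            i += Character.charCount(cp);
            result += lengthOfDiff(cp - adjustPrev(prev));
            prev = cp;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(adjustPrev(0x3b1)));   // Greek alpha
        System.out.println(lengthOfRun("abc"));
        System.out.println(lengthOfRun("\u03b1\u03b2\u03b3"));        // alpha beta gamma
        System.out.println(lengthOfRun("\u4e00\u9fa5"));              // two Unihan characters
    }
}
```

For Greek alpha (U+03B1) the adjusted value is U+03D0, so any following code point in U+0380..U+03FF lies between -80 and +47 of it and fits in one byte. Under these assumptions "abc" takes 3 bytes, the three Greek letters take 4 bytes (2 + 1 + 1, against 6 in UTF-8), and the two Unihan characters take 5 bytes (3 + 2).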
|
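The `getNegDivMod` helper in the diff above returns two values in one `long`: the floored quotient in the upper 32 bits and the non-negative remainder in the lower 32 bits. The sketch below repeats that logic in a hypothetical class `NegDivModSketch` to show how a caller unpacks the result, and compares it with the JDK's `Math.floorDiv` and `Math.floorMod` (Java 8 and later), which compute the same two numbers.

```java
/** Sketch of floored division packed into a long; illustrative only. */
final class NegDivModSketch {
    /** Returns (floored quotient) << 32 | (non-negative remainder). */
    static long getNegDivMod(int number, int factor) {
        int modulo = number % factor;   // Java: the sign follows the dividend
        long result = number / factor;  // Java: truncates toward zero
        if (modulo < 0) {
            --result;                   // step down to the floored quotient
            modulo += factor;           // shift the remainder into 0..factor-1
        }
        return (result << 32) | modulo;
    }

    public static void main(String[] args) {
        long packed = getNegDivMod(-81, 253);
        int quotient = (int) (packed >> 32);  // arithmetic shift keeps the sign
        int modulo = (int) packed;            // low 32 bits
        System.out.println(quotient + " " + modulo);
        System.out.println(Math.floorDiv(-81, 253) + " " + Math.floorMod(-81, 253));
    }
}
```

For -81 and a factor of 253, plain Java division gives quotient 0 and remainder -81; the adjusted pair is -1 and 172, since -1 * 253 + 172 = -81. Packing works because the adjusted remainder is never negative, so widening it to `long` cannot disturb the upper 32 bits.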
File diff suppressed because it is too large
|
@ -5,8 +5,8 @@
|
|||
*******************************************************************************
|
||||
*
|
||||
* $Source: /xsrl/Nsvn/icu/icu4j/src/com/ibm/icu/text/CollationKey.java,v $
|
||||
* $Date: 2002/06/21 23:56:44 $
|
||||
* $Revision: 1.6 $
|
||||
* $Date: 2002/06/22 07:23:45 $
|
||||
* $Revision: 1.7 $
|
||||
*
|
||||
*******************************************************************************
|
||||
*/
|
||||
|
@ -15,43 +15,49 @@ package com.ibm.icu.text;
|
|||
import java.util.Arrays;
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* A <code>CollationKey</code> represents a <code>String</code> under the
|
||||
* rules of a specific <code>Collator</code> object. Comparing two
|
||||
* <code>CollationKey</code>s returns the relative order of the
|
||||
* <code>String</code>s they represent.
|
||||
* </p>
|
||||
* <p>
|
||||
* <code>CollationKey</code> instances cannot be created directly. Rather,
|
||||
* they are generated by calling <code>Collator.getCollationKey(String)</code>.
|
||||
* Since the rule set of each <code>Collator</code> differs, the sort orders of
|
||||
* the same string under two unique <code>Collator</code> may not be the same.
|
||||
* Hence comparing <code>CollationKey</code>s generated from different
|
||||
* <code>Collator</code> objects may not give the right results.
|
||||
* </p>
|
||||
* <p>
|
||||
* Similar to <code>CollationKey.compareTo(CollationKey)</code>,
|
||||
* the method <code>RuleBasedCollator.compare(String, String)</code> compares
|
||||
* two strings and returns the relative order. During the construction
|
||||
* of a <code>CollationKey</code> object, the entire source string is examined
|
||||
* and processed into a series of bits that are stored in the
|
||||
* <code>CollationKey</code> object. Bitwise comparison on the bit sequences
|
||||
* are then performed during <code>CollationKey.compareTo(CollationKey)</code>.
|
||||
* This comparison could incur expensive startup costs while creating
|
||||
* the <code>CollationKey</code> object, but once the objects are created,
|
||||
* binary comparisons are fast; this approach is recommended when the same strings are
|
||||
* to be compared over and over again.
|
||||
* On the other hand <code>Collator.compare(String, String)</code> examines
|
||||
* and processes the string only until the first characters differing in order,
|
||||
* and is recommended for use if the <code>String</code>s are to be compared only
|
||||
* once.
|
||||
* </p>
|
||||
* <p>
|
||||
* Details of the composition of the bit sequence are located at
|
||||
* <a href=http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html>
|
||||
* user guide</a>.
|
||||
* </p>
|
||||
* <p>The following example shows how <code>CollationKey</code>s might be used
|
||||
* <p>A <code>CollationKey</code> represents a <code>String</code>
|
||||
* under the rules of a specific <code>Collator</code>
|
||||
* object. Comparing two <code>CollationKey</code>s returns the
|
||||
* relative order of the <code>String</code>s they represent.</p>
|
||||
*
|
||||
* <p><code>CollationKey</code> instances are not created
|
||||
* directly. Rather, they are generated by calling
|
||||
* <code>Collator.getCollationKey(String)</code>.</p>
|
||||
*
|
||||
* <p>Since the rule set of <code>Collator</code>s can differ, the
|
||||
* sort orders of the same string under two different
|
||||
* <code>Collator</code>s might differ. Hence comparing
|
||||
* <code>CollationKey</code>s generated from different
|
||||
* <code>Collator</code>s can give incorrect results.</p>
|
||||
*
|
||||
* <p>Both the method
|
||||
* <code>CollationKey.compareTo(CollationKey)</code> and the method
|
||||
* <code>Collator.compare(String, String)</code> compare two strings
|
||||
* and return their relative order. The performance characteristics
|
||||
* of these two approaches can differ.</p>
|
||||
*
|
||||
* <p>During the construction of a <code>CollationKey</code>, the
|
||||
* entire source string is examined and processed into a series of
|
||||
* bits that are stored in the <code>CollationKey</code>. When
|
||||
* <code>CollationKey.compareTo(CollationKey)</code> executes, it
|
||||
* performs bitwise comparison on the bit sequences. This can incur
|
||||
* startup cost when creating the <code>CollationKey</code>, but once
|
||||
* the key is created, binary comparisons are fast. This approach is
|
||||
* recommended when the same strings are to be compared over and over
|
||||
* again.</p>
|
||||
*
|
||||
* <p>On the other hand, implementations of
|
||||
* <code>Collator.compare(String, String)</code> can examine and
|
||||
* process the strings only until the first characters differing in
|
||||
* order. This approach is recommended if the strings are to be
|
||||
* compared only once.</p>
|
||||
*
|
||||
* <p>More information about the composition of the bit sequence can
|
||||
* be found in the
|
||||
* <a href="http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html">
|
||||
* user guide</a>.</p>
|
||||
*
|
||||
* <p>The following example shows how <code>CollationKey</code>s can be used
|
||||
* to sort a list of <code>String</code>s.</p>
|
||||
* <blockquote>
|
||||
* <pre>
|
||||
|
@ -82,16 +88,16 @@ import java.util.Arrays;
|
|||
* @see RuleBasedCollator
|
||||
* @author Syn Wee Quek
|
||||
* @since release 2.2, April 18 2002
|
||||
* @draft 2.2
|
||||
* @draft 2.2
|
||||
*/
|
||||
public final class CollationKey implements Comparable
|
||||
{
|
||||
// public methods -------------------------------------------------------
|
||||
// public methods -------------------------------------------------------
|
||||
|
||||
// public getters -------------------------------------------------------
|
||||
// public getters -------------------------------------------------------
|
||||
|
||||
/**
|
||||
* Returns the source string that this CollationKey represents.
|
||||
* Return the source string that this CollationKey represents.
|
||||
* @return source string that this CollationKey represents
|
||||
* @draft 2.2
|
||||
*/
|
||||
|
@ -101,20 +107,19 @@ public final class CollationKey implements Comparable
|
|||
}
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Duplicates and returns the value of this CollationKey as a sequence
|
||||
* of big-endian bytes terminated by a null.
|
||||
* </p>
|
||||
* <p>
|
||||
* If two CollationKeys could be legitimately compared, then one could
|
||||
* compare the byte arrays of each to obtain the same result.
|
||||
* <p>Duplicates and returns the value of this CollationKey as a sequence
|
||||
* of big-endian bytes terminated by a null.</p>
|
||||
*
|
||||
* <p>If two CollationKeys can be legitimately compared, then one can
|
||||
* compare the byte arrays of each to obtain the same result, e.g.
|
||||
* <pre>
|
||||
* byte key1[] = collationkey1.toByteArray();
|
||||
* byte key2[] = collationkey2.toByteArray();
|
||||
* int key, targetkey;
|
||||
* int i = 0;
|
||||
* while (key1[i] != 0 && key2[i] != 0) {
|
||||
* int key = key1[i] & 0xFF;
|
||||
* int targetkey = key2[i] & 0xFF;
|
||||
* do {
|
||||
* key = key1[i] & 0xFF;
|
||||
* targetkey = key2[i] & 0xFF;
|
||||
* if (key < targetkey) {
|
||||
* System.out.println("String 1 is less than string 2");
|
||||
* return;
|
||||
|
@ -123,18 +128,9 @@ public final class CollationKey implements Comparable
|
|||
* System.out.println("String 1 is more than string 2");
|
||||
* }
|
||||
* i ++;
|
||||
* }
|
||||
* int key = key1[i] & 0xFF;
|
||||
* int targetkey = key2[i] & 0xFF;
|
||||
* if (key < targetkey) {
|
||||
* System.out.println("String 1 is less than string 2");
|
||||
* return;
|
||||
* }
|
||||
* if (targetkey < key) {
|
||||
* System.out.println("String 1 is more than string 2");
|
||||
* return;
|
||||
* }
|
||||
* System.out.println("String 1 is equals to string 2");;
|
||||
* } while (key != 0 && targetkey != 0);
|
||||
*
|
||||
* System.out.println("Strings are equal.");
|
||||
* </pre>
|
||||
* </p>
|
||||
* @return CollationKey value in a sequence of big-endian bytes
|
||||
|
@ -145,10 +141,10 @@ public final class CollationKey implements Comparable
|
|||
{
|
||||
int length = 0;
|
||||
while (true) {
|
||||
if (m_key_[length] == 0) {
|
||||
break;
|
||||
}
|
||||
length ++;
|
||||
if (m_key_[length] == 0) {
|
||||
break;
|
||||
}
|
||||
length ++;
|
||||
}
|
||||
length ++;
|
||||
byte result[] = new byte[length];
|
||||
|
@ -156,94 +152,88 @@ public final class CollationKey implements Comparable
|
|||
return result;
|
||||
}
|
||||
|
||||
// public other methods -------------------------------------------------
|
||||
// public other methods -------------------------------------------------
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Compare this CollationKey to the argument target CollationKey.
|
||||
* The collation
|
||||
* rules of the Collator object which created these keys are applied.
|
||||
* </p>
|
||||
* <p>
|
||||
* <strong>Note:</strong> Comparison between CollationKeys created by
|
||||
* different Collators may not return the correct result. See class
|
||||
* documentation.
|
||||
* </p>
|
||||
* <p>Compare this CollationKey to another CollationKey. The
|
||||
* collation rules of the Collator that created this key are
|
||||
* applied.</p>
|
||||
*
|
||||
* <p><strong>Note:</strong> Comparison between CollationKeys
|
||||
* created by different Collators might return incorrect
|
||||
* results. See class documentation.</p>
|
||||
*
|
||||
* @param target target CollationKey
|
||||
* @return an integer value, if value is less than zero this CollationKey
|
||||
* is less than than target, if value is zero if they are equal
|
||||
* and value is greater than zero if this CollationKey is greater
|
||||
* @return an integer value. If the value is less than zero this CollationKey
|
||||
* is less than target, if the value is zero they are equal, and
|
||||
* if the value is greater than zero this CollationKey is greater
|
||||
* than target.
|
||||
* @exception NullPointerException thrown when argument is null.
|
||||
* @exception NullPointerException is thrown if argument is null.
|
||||
* @see Collator#compare(String, String)
|
||||
* @draft 2.2
|
||||
*/
|
||||
* @draft 2.2 */
|
||||
public int compareTo(CollationKey target)
|
||||
{
|
||||
int i = 0;
|
||||
while (m_key_[i] != 0 && target.m_key_[i] != 0) {
|
||||
int key = m_key_[i] & 0xFF;
|
||||
int targetkey = target.m_key_[i] & 0xFF;
|
||||
if (key < targetkey) {
|
||||
return -1;
|
||||
}
|
||||
if (targetkey < key) {
|
||||
return 1;
|
||||
}
|
||||
i ++;
|
||||
int key = m_key_[i] & 0xFF;
|
||||
int targetkey = target.m_key_[i] & 0xFF;
|
||||
if (key < targetkey) {
|
||||
return -1;
|
||||
}
|
||||
if (targetkey < key) {
|
||||
return 1;
|
||||
}
|
||||
i ++;
|
||||
}
|
||||
// last comparison if we encounter a 0
|
||||
int key = m_key_[i] & 0xFF;
|
||||
int targetkey = target.m_key_[i] & 0xFF;
|
||||
if (key < targetkey) {
|
||||
return -1;
|
||||
return -1;
|
||||
}
|
||||
if (targetkey < key) {
|
||||
return 1;
|
||||
return 1;
|
||||
}
|
||||
return 0;
|
||||
}
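The comparison loop above can be illustrated in isolation. The following is a minimal sketch, not the ICU4J implementation: the class name `KeyCompareSketch` and the method `compareKeys` are hypothetical, and both arrays are assumed to be terminated by a zero byte, as the `toByteArray` documentation describes. It shows why each byte is masked with `0xFF` before comparing: Java bytes are signed, so without the mask `0x80` would sort before `0x7F`.

```java
public class KeyCompareSketch {

    /**
     * Compares two zero-terminated sort keys byte by byte, treating each
     * byte as unsigned. Hypothetical helper; assumes both arrays end in 0.
     */
    static int compareKeys(byte[] a, byte[] b) {
        int i = 0;
        while (true) {
            int x = a[i] & 0xFF; // mask so that 0x80..0xFF compare as 128..255
            int y = b[i] & 0xFF;
            if (x != y) {
                return x < y ? -1 : 1;
            }
            if (x == 0) {
                return 0; // both keys reached their terminator together
            }
            i++;
        }
    }

    public static void main(String[] args) {
        byte[] low = {0x7F, 0};
        byte[] high = {(byte) 0x80, 0};
        System.out.println(compareKeys(low, high)); // prints -1
    }
}
```

Because a shorter key hits its terminator first and zero is the smallest unsigned value, a key that is a prefix of another compares as less without any extra length check.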
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Compares this CollationKey with the specified Object.
|
||||
* The collation
|
||||
* rules of the Collator object which created these objects are applied.
|
||||
* </p>
|
||||
* <p>
|
||||
* See note in compareTo(CollationKey) for warnings of incorrect results
|
||||
* </p>
|
||||
* @param obj the Object to be compared.
|
||||
* <p>Compare this CollationKey with the specified Object. The
|
||||
* collation rules of the Collator that created this key are
|
||||
* applied.</p>
|
||||
*
|
||||
* <p>See note in compareTo(CollationKey) for warnings about possible
|
||||
* incorrect results.</p>
|
||||
*
|
||||
* @param obj the Object to be compared to.
|
||||
* @return Returns a negative integer, zero, or a positive integer
|
||||
* respectively if this CollationKey is less than, equal to, or
|
||||
* greater than the given Object.
|
||||
* @exception ClassCastException thrown when the specified argument is not
|
||||
* a CollationKey. NullPointerException thrown when argument
|
||||
* @exception ClassCastException is thrown when the argument is not
|
||||
* a CollationKey. NullPointerException is thrown when the argument
|
||||
* is null.
|
||||
* @see #compareTo(CollationKey)
|
||||
* @draft 2.2
|
||||
*/
|
||||
* @draft 2.2 */
|
||||
public int compareTo(Object obj)
|
||||
{
|
||||
return compareTo((CollationKey)obj);
|
||||
return compareTo((CollationKey)obj);
|
||||
}
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Compare this CollationKey and the argument target object for equality.
|
||||
* The collation
|
||||
* rules of the Collator object which created these objects are applied.
|
||||
* </p>
|
||||
* <p>
|
||||
* See note in compareTo(CollationKey) for warnings of incorrect results
|
||||
* </p>
|
||||
* <p>Compare this CollationKey and the specified Object for
|
||||
* equality. The collation rules of the Collator that created
|
||||
* this key are applied.</p>
|
||||
*
|
||||
* <p>See note in compareTo(CollationKey) for warnings about
|
||||
* possible incorrect results.</p>
|
||||
*
|
||||
* @param target the object to compare to.
|
||||
* @return true if two objects are equal, false otherwise.
|
||||
* @return true if the two keys compare as equal, false otherwise.
|
||||
* @see #compareTo(CollationKey)
|
||||
* @exception ClassCastException thrown when the specified argument is not
|
||||
* a CollationKey. NullPointerException thrown when argument
|
||||
* @exception ClassCastException is thrown when the argument is not
|
||||
* a CollationKey. NullPointerException is thrown when the argument
|
||||
* is null.
|
||||
* @draft 2.2
|
||||
* @draft 2.2
|
||||
*/
|
||||
public boolean equals(Object target)
|
||||
{
|
||||
|
@ -266,13 +256,13 @@ public final class CollationKey implements Comparable
|
|||
* </p>
|
||||
* @param target the CollationKey to compare to.
|
||||
* @return true if two objects are equal, false otherwise.
|
||||
* @exception NullPointerException thrown when argument is null.
|
||||
* @exception NullPointerException is thrown when the argument is null.
|
||||
* @draft 2.2
|
||||
*/
|
||||
public boolean equals(CollationKey target)
|
||||
{
|
||||
if (this == target) {
|
||||
return true;
|
||||
return true;
|
||||
}
|
||||
if (target == null) {
|
||||
return false;
|
||||
|
@ -280,20 +270,19 @@ public final class CollationKey implements Comparable
|
|||
CollationKey other = (CollationKey)target;
|
||||
int i = 0;
|
||||
while (true) {
|
||||
if (m_key_[i] != other.m_key_[i]) {
|
||||
return false;
|
||||
}
|
||||
if (m_key_[i] == 0) {
|
||||
break;
|
||||
}
|
||||
i ++;
|
||||
if (m_key_[i] != other.m_key_[i]) {
|
||||
return false;
|
||||
}
|
||||
if (m_key_[i] == 0) {
|
||||
break;
|
||||
}
|
||||
i ++;
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Creates a hash code for this CollationKey. The hash value is calculated
|
||||
* <p>Returns a hash code for this CollationKey. The hash value is calculated
|
||||
* on the key itself, not the String from which the key was created. Thus
|
||||
* if x and y are CollationKeys, then x.hashCode() == y.hashCode()
|
||||
* if x.equals(y) is true. This allows language-sensitive comparison in a
|
||||
|
@ -305,25 +294,25 @@ public final class CollationKey implements Comparable
|
|||
public int hashCode()
|
||||
{
|
||||
if (m_hashCode_ == 0) {
|
||||
int size = m_key_.length >> 1;
|
||||
StringBuffer key = new StringBuffer(size);
|
||||
int i = 0;
|
||||
while (m_key_[i] != 0 && m_key_[i + 1] != 0) {
|
||||
key.append((char)((m_key_[i] << 8) | m_key_[i + 1]));
|
||||
i += 2;
|
||||
}
|
||||
if (m_key_[i] != 0) {
|
||||
key.append((char)(m_key_[i] << 8));
|
||||
}
|
||||
m_hashCode_ = key.toString().hashCode();
|
||||
int size = m_key_.length >> 1;
|
||||
StringBuffer key = new StringBuffer(size);
|
||||
int i = 0;
|
||||
while (m_key_[i] != 0 && m_key_[i + 1] != 0) {
|
||||
key.append((char)((m_key_[i] << 8) | m_key_[i + 1]));
|
||||
i += 2;
|
||||
}
|
||||
if (m_key_[i] != 0) {
|
||||
key.append((char)(m_key_[i] << 8));
|
||||
}
|
||||
m_hashCode_ = key.toString().hashCode();
|
||||
}
|
||||
return m_hashCode_;
|
||||
}
|
||||
|
||||
// protected constructor ------------------------------------------------
|
||||
// protected constructor ------------------------------------------------
|
||||
|
||||
/**
|
||||
* Protected CollationKey can only be generated by Collator objects
|
||||
* CollationKey can only be generated by Collator objects
|
||||
* @param source string the CollationKey represents
|
||||
* @param key sort key array of bytes
|
||||
* @param size of sort key
|
||||
|
@ -336,18 +325,20 @@ public final class CollationKey implements Comparable
|
|||
m_hashCode_ = 0;
|
||||
}
|
||||
|
||||
// private data members -------------------------------------------------
|
||||
// private data members -------------------------------------------------
|
||||
|
||||
/**
|
||||
* Source string this CollationKey represents
|
||||
*/
|
||||
/**
|
||||
* Source string this CollationKey represents
|
||||
*/
|
||||
private String m_source_;
|
||||
|
||||
/**
|
||||
* Sequence of bytes that represents the sort key
|
||||
*/
|
||||
private byte m_key_[];
|
||||
|
||||
/**
|
||||
* Hash code for the key
|
||||
*/
|
||||
private int m_hashCode_;
|
||||
}
|
||||
}
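The class documentation in this file recommends collation keys when the same strings are compared many times, as happens during a sort. The sketch below illustrates that pattern under one stated assumption: it uses the JDK classes `java.text.Collator` and `java.text.CollationKey` as a stand-in for the `com.ibm.icu.text` classes in this commit, so that it runs without ICU4J on the classpath. The class name `KeySortSketch` and the method `sortWithKeys` are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class KeySortSketch {

    /** Sorts words by building one CollationKey per word, then sorting the keys. */
    static String[] sortWithKeys(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // Pay the key construction cost once per string.
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }

        // Each comparison during the sort is now a comparison of key bits.
        Arrays.sort(keys);

        // Recover the source strings in sorted order.
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "Apple", "cherry"};
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.US)));
    }
}
```

All keys here come from a single Collator, which matters: as the documentation notes, comparing keys produced by different Collators can give incorrect results.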
|
||||
|
|
|
@ -5,8 +5,8 @@
|
|||
*******************************************************************************
|
||||
*
|
||||
* $Source: /xsrl/Nsvn/icu/icu4j/src/com/ibm/icu/text/Collator.java,v $
|
||||
* $Date: 2002/06/21 23:56:44 $
|
||||
* $Revision: 1.7 $
|
||||
* $Date: 2002/06/22 07:23:45 $
|
||||
* $Revision: 1.8 $
|
||||
*
|
||||
*******************************************************************************
|
||||
*/
|
||||
|
@ -15,18 +15,16 @@ package com.ibm.icu.text;
|
|||
import java.util.Locale;
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Collator is an abstract base class, its subclasses performs
|
||||
* locale-sensitive String comparison. A concrete subclass, RuleBasedCollator,
|
||||
* is provided and it allows customization of the collation ordering by the use
|
||||
* of rule sets.
|
||||
* </p>
|
||||
* <p>
|
||||
* Following the
|
||||
* <a href=http://www.unicode.org>Unicode Consortium</a>'s specifications for
|
||||
* the <a href=http://www.unicode.org/unicode/reports/tr10/>
|
||||
* Unicode Collation Algorithm (UCA)</a>, there are
|
||||
* 5 different levels of strength used in comparisons.
|
||||
* <p>Collator performs locale-sensitive string comparison. A concrete
|
||||
* subclass, RuleBasedCollator, allows customization of the collation
|
||||
* ordering by the use of rule sets.</p>
|
||||
*
|
||||
* <p>Following the <a href="http://www.unicode.org">Unicode
|
||||
* Consortium</a>'s specifications for the
|
||||
* <a href="http://www.unicode.org/unicode/reports/tr10/"> Unicode Collation
|
||||
* Algorithm (UCA)</a>, there are 5 different levels of strength used
|
||||
* in comparisons:
|
||||
*
|
||||
* <ul>
|
||||
* <li>PRIMARY strength: Typically, this is used to denote differences between
|
||||
* base characters (for example, "a" < "b").
|
||||
|
@ -60,11 +58,12 @@ import java.util.Locale;
|
|||
* are compared, just in case there is no difference.
|
||||
* For example, Hebrew cantillation marks are only distinguished at this
|
||||
* strength. This strength should be used sparingly, as only code point
|
||||
* values differences between two strings is an extremely rare occurrence.
|
||||
* value differences between two strings is an extremely rare occurrence.
|
||||
* Using this strength substantially decreases the performance for both
|
||||
* comparison and collation key generation APIs. This strength also
|
||||
* increases the size of the collation key.
|
||||
* </ul>
|
||||
*
|
||||
* Unlike the JDK, ICU4J's Collator deals only with 2 decomposition modes,
|
||||
* the canonical decomposition mode and one that does not use any decomposition.
|
||||
* The compatibility decomposition mode, java.text.Collator.FULL_DECOMPOSITION
|
||||
|
@ -73,15 +72,13 @@ import java.util.Locale;
|
|||
* producing the same results as if the text were normalized in NFD. If
|
||||
* canonical decomposition is turned off, it is the user's responsibility to
|
||||
* ensure that all text is already in the appropriate form before performing
|
||||
* a comparison or before getting a CollationKey.
|
||||
* </p>
|
||||
* <p>
|
||||
* For more information about the collation service see the
|
||||
* a comparison or before getting a CollationKey.</p>
|
||||
*
|
||||
* <p>For more information about the collation service see the
|
||||
* <a href="http://oss.software.ibm.com/icu/userguide/Collate_Intro.html">users
|
||||
* guide</a>.
|
||||
* </p>
|
||||
* <p>
|
||||
* Examples of use
|
||||
* guide</a>.</p>
|
||||
*
|
||||
* <p>Examples of use
|
||||
* <pre>
|
||||
* // Get the Collator for US English and set its strength to PRIMARY
|
||||
* Collator usCollator = Collator.getInstance(Locale.US);
|
||||
|
@ -90,8 +87,9 @@ import java.util.Locale;
|
|||
* System.out.println("Strings are equivalent");
|
||||
* }
|
||||
*
|
||||
* The following example shows how to compare two strings using the Collator
|
||||
* for the default locale.
|
||||
* The following example shows how to compare two strings using the
|
||||
* Collator for the default locale.
|
||||
*
|
||||
* // Compare two strings in the default locale
|
||||
* Collator myCollator = Collator.getInstance();
|
||||
* myCollator.setDecomposition(NO_DECOMPOSITION);
|
||||
|
@ -114,22 +112,21 @@ import java.util.Locale;
|
|||
* @see CollationKey
|
||||
* @author Syn Wee Quek
|
||||
* @since release 2.2, April 18 2002
|
||||
* @draft 2.2
|
||||
* @draft 2.2
|
||||
*/
|
||||
|
||||
public abstract class Collator
|
||||
{
|
||||
// public data members ---------------------------------------------------
|
||||
|
||||
/**
|
||||
* Strongest collator strength value. Typically, used to denote differences
|
||||
* between base characters.
|
||||
* See class documentation for more explanation.
|
||||
// public data members ---------------------------------------------------
|
||||
|
||||
/**
|
||||
* Strongest collator strength value. Typically used to denote differences
|
||||
* between base characters. See class documentation for more explanation.
|
||||
* @see #setStrength
|
||||
* @see #getStrength
|
||||
* @draft 2.2
|
||||
*/
|
||||
public final static int PRIMARY = 0;
|
||||
|
||||
/**
|
||||
* Second level collator strength value.
|
||||
* Accents in the characters are considered secondary differences.
|
||||
|
@ -141,6 +138,7 @@ public abstract class Collator
|
|||
* @draft 2.2
|
||||
*/
|
||||
public final static int SECONDARY = 1;
|
||||
|
||||
/**
|
||||
* Third level collator strength value.
|
||||
* Upper and lower case differences in characters are distinguished at this
|
||||
|
@ -152,19 +150,21 @@ public abstract class Collator
|
|||
* @draft 2.2
|
||||
*/
|
||||
public final static int TERTIARY = 2;
|
||||
|
||||
/**
|
||||
* Fourth level collator strength value.
|
||||
* When punctuation is ignored
|
||||
* <a href=http://www-124.ibm.com/icu/userguide/Collate_Concepts.html#Ignoring_Punctuation>
|
||||
* <a href="http://www-124.ibm.com/icu/userguide/Collate_Concepts.html#Ignoring_Punctuation">
|
||||
* (see Ignoring Punctuations in the user guide)</a> at PRIMARY to TERTIARY
|
||||
* strength, an additional strength level can
|
||||
* be used to distinguish words with and without punctuation
|
||||
* be used to distinguish words with and without punctuation.
|
||||
* See class documentation for more explanation.
|
||||
* @see #setStrength
|
||||
* @see #getStrength
|
||||
* @draft 2.2
|
||||
*/
|
||||
public final static int QUATERNARY = 3;
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Smallest Collator strength value. When all other strengths are equal,
|
||||
|
@ -181,36 +181,32 @@ public abstract class Collator
|
|||
public final static int IDENTICAL = 15;
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Decomposition mode value. With NO_DECOMPOSITION set, Strings will not be
|
||||
* decomposed for collation. This is the default
|
||||
* decomposition setting unless otherwise specified by the locale used
|
||||
* to create the Collator.
|
||||
* </p>
|
||||
* <p>
|
||||
* Note this value is different from JDK's
|
||||
* </p>
|
||||
* <p>Decomposition mode value. With NO_DECOMPOSITION set, Strings
|
||||
* will not be decomposed for collation. This is the default
|
||||
* decomposition setting unless otherwise specified by the locale
|
||||
* used to create the Collator.</p>
|
||||
*
|
||||
* <p><strong>Note</strong> this value is different from the JDK's.</p>
|
||||
* @see #CANONICAL_DECOMPOSITION
|
||||
* @see #getDecomposition
|
||||
* @see #setDecomposition
|
||||
* @draft 2.2
|
||||
* @draft 2.2
|
||||
*/
|
||||
public final static int NO_DECOMPOSITION = 16;
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Decomposition mode value. With CANONICAL_DECOMPOSITION set,
|
||||
* characters that are canonical variants according to Unicode 2.0 will be
|
||||
* decomposed for collation.
|
||||
* </p>
|
||||
* <p>
|
||||
* CANONICAL_DECOMPOSITION corresponds to Normalization Form D as
|
||||
* <p>Decomposition mode value. With CANONICAL_DECOMPOSITION set,
|
||||
* characters that are canonical variants according to Unicode 2.0
|
||||
* will be decomposed for collation.</p>
|
||||
*
|
||||
* <p>CANONICAL_DECOMPOSITION corresponds to Normalization Form D as
|
||||
* described in <a href="http://www.unicode.org/unicode/reports/tr15/">
|
||||
* Unicode Technical Report #15</a>.
|
||||
* </p>
|
||||
* @see #NO_DECOMPOSITION
|
||||
* @see #getDecomposition
|
||||
* @see #setDecomposition
|
||||
* @draft 2.2
|
||||
* @draft 2.2
|
||||
*/
|
||||
public final static int CANONICAL_DECOMPOSITION = 1;
|
||||
|
||||
|
@ -219,25 +215,23 @@ public abstract class Collator
|
|||
// public setters --------------------------------------------------------
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Sets this Collator's strength property. The strength property
|
||||
* <p>Sets this Collator's strength property. The strength property
|
||||
* determines the minimum level of difference considered significant
|
||||
* during comparison.
|
||||
* </p>
|
||||
* <p>
|
||||
* The default strength for the Collator is TERTIARY, unless specified
|
||||
* otherwise by the locale used to create the Collator.
|
||||
* </p>
|
||||
* during comparison.</p>
|
||||
*
|
||||
* <p>The default strength for the Collator is TERTIARY, unless specified
|
||||
* otherwise by the locale used to create the Collator.</p>
|
||||
*
|
||||
* <p>See the Collator class description for an example of use.</p>
|
||||
* @param the new strength value.
|
||||
* @param newStrength the new strength value.
|
||||
* @see #getStrength
|
||||
* @see #PRIMARY
|
||||
* @see #SECONDARY
|
||||
* @see #TERTIARY
|
||||
* @see #QUATERNARY
|
||||
* @see #IDENTICAL
|
||||
* @exception IllegalArgumentException If the new strength value is not one
|
||||
* of PRIMARY, SECONDARY, TERTIARY, QUATERNARY or IDENTICAL.
|
||||
* @exception IllegalArgumentException if the new strength value is not one
|
||||
* of PRIMARY, SECONDARY, TERTIARY, QUATERNARY or IDENTICAL.
|
||||
* @draft 2.2
|
||||
*/
|
||||
public void setStrength(int newStrength)
|
||||
|
@ -253,35 +247,34 @@ public abstract class Collator
|
|||
}
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Set the decomposition mode of this Collator.
|
||||
* Setting this decomposition property with CANONICAL_DECOMPOSITION allows
|
||||
* the Collator to handle
|
||||
* un-normalized text properly, producing the same results as if the text
|
||||
* were normalized. If NO_DECOMPOSITION is set, it is the user's
|
||||
* responsibility to insure that all text is already in the appropriate
|
||||
* form before a comparison or before getting a CollationKey. Adjusting
|
||||
* decomposition mode allows the user to select between faster and more
|
||||
* complete collation behavior.
|
||||
* </p>
|
||||
* <p>
|
||||
* Since a great majority of the world languages does not require text
|
||||
* normalization, most locales has NO_DECOMPOSITION has the default
|
||||
* decomposition mode.
|
||||
* <p>
|
||||
* The default decompositon mode for the Collator is NO_DECOMPOSITON,
|
||||
* unless specified otherwise by the locale used to create the Collator.
|
||||
* </p>
|
||||
* <p>
|
||||
* See getDecomposition for a description of decomposition mode.
|
||||
* </p>
|
||||
* <p>Set the decomposition mode of this Collator. Setting this
|
||||
* decomposition property with CANONICAL_DECOMPOSITION allows the
|
||||
* Collator to handle un-normalized text properly, producing the
|
||||
* same results as if the text were normalized. If
|
||||
* NO_DECOMPOSITION is set, it is the user's responsibility to
|
||||
* ensure that all text is already in the appropriate form before
|
||||
* a comparison or before getting a CollationKey. Adjusting
|
||||
* decomposition mode allows the user to select between faster and
|
||||
* more complete collation behavior.</p>
|
||||
*
|
||||
* <p>Since a great many of the world's languages do not require
|
||||
* text normalization, most locales set NO_DECOMPOSITION as the
|
||||
* default decomposition mode.</p>
|
||||
*
|
||||
* <p>The default decomposition mode for the Collator is
|
||||
* NO_DECOMPOSITION, unless specified otherwise by the locale used
|
||||
* to create the Collator.</p>
|
||||
*
|
||||
* <p>See getDecomposition for a description of decomposition
|
||||
* mode.</p>
|
||||
*
|
||||
* @param decomposition the new decomposition mode
|
||||
* @see #getDecomposition
|
||||
* @see #NO_DECOMPOSITION
|
||||
* @see #CANONICAL_DECOMPOSITION
|
||||
* @exception IllegalArgumentException If the given value is not a valid
|
||||
* decomposition mode.
|
||||
* @draft 2.2
|
||||
* @draft 2.2
|
||||
*/
|
||||
public void setDecomposition(int decomposition)
|
||||
{
|
||||
|
@ -324,17 +317,16 @@ public abstract class Collator
|
|||
*/
|
||||
public static final Collator getInstance(Locale locale)
|
||||
{
|
||||
try {
|
||||
return new RuleBasedCollator(locale);
|
||||
}
|
||||
catch(Exception e) {
|
||||
return RuleBasedCollator.UCA_;
|
||||
}
|
||||
try {
|
||||
return new RuleBasedCollator(locale);
|
||||
}
|
||||
catch(Exception e) {
|
||||
return RuleBasedCollator.UCA_;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* <p>
|
||||
* Returns this Collator's strength property. The strength property
|
||||
* <p>Returns this Collator's strength property. The strength property
|
||||
* determines the minimum level of difference considered significant.
|
||||
* </p>
|
||||
* <p>
|
||||
|
@ -376,12 +368,12 @@ public abstract class Collator
|
|||
// public other methods -------------------------------------------------
|
||||
|
||||
/**
|
||||
* Convenience method for comparing the equality of two text Strings based
|
||||
* on this Collator's collation rules, strength and decomposition mode.
|
||||
* @param source the source string to be compared with.
|
||||
* @param target the target string to be compared with.
|
||||
* Convenience method for comparing the equality of two text Strings using
|
||||
* this Collator's rules, strength and decomposition mode.
|
||||
* @param source the source string to be compared.
|
||||
* @param target the target string to be compared.
|
||||
* @return true if the strings are equal according to the collation
|
||||
* rules. false, otherwise.
|
||||
* rules, otherwise false.
|
||||
* @see #compare
|
||||
* @exception NullPointerException thrown if either argument is null.
|
||||
* @draft 2.2
|
||||
|
@ -412,7 +404,7 @@ public abstract class Collator
|
|||
/**
|
||||
* <p>
|
||||
* Compares the source text String to the target text String according to
|
||||
* the collation rules, strength and decomposition mode for this Collator.
|
||||
* this Collator's rules, strength and decomposition mode.
|
||||
* Returns an integer less than,
|
||||
* equal to or greater than zero depending on whether the source String is
|
||||
* less than, equal to or greater than the target String. See the Collator
|
||||
|
@ -432,8 +424,8 @@ public abstract class Collator
|
|||
|
||||
/**
|
||||
* <p>
|
||||
* Transforms the String into a series of bits that can be compared
|
||||
* bitwise to other CollationKeys. Bits generated depends on the collation
|
||||
* Transforms the String into a CollationKey suitable for efficient
|
||||
* repeated comparison. The resulting key depends on the collator's
|
||||
* rules, strength and decomposition mode.
|
||||
* </p>
|
||||
* <p>See the CollationKey class documentation for more information.</p>
|
||||
|
@ -448,7 +440,6 @@ public abstract class Collator
|
|||
public abstract CollationKey getCollationKey(String source);
|
||||
|
||||
// protected constructor -------------------------------------------------
|
||||
|
||||
|
||||
// private data members --------------------------------------------------
|
||||
|
||||
|
@ -456,6 +447,7 @@ public abstract class Collator
|
|||
* Collation strength
|
||||
*/
|
||||
private int m_strength_ = TERTIARY;
|
||||
|
||||
/**
|
||||
* Decomposition mode
|
||||
*/
|
||||
|
|
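The strength levels documented in Collator.java decide which differences count during comparison. As a minimal sketch, assuming the JDK's `java.text.Collator` as a stand-in for `com.ibm.icu.text.Collator` (the JDK class exposes the same `PRIMARY` and `TERTIARY` constants and `setStrength` method), the following shows a case difference being ignored at PRIMARY strength and distinguished at TERTIARY strength. The class name `StrengthSketch` and the helper `compareAt` are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthSketch {

    /** Compares two strings with a US English collator at the given strength. */
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY looks only at base characters, so case is not significant.
        System.out.println(compareAt(Collator.PRIMARY, "abc", "ABC") == 0);

        // TERTIARY also distinguishes upper and lower case.
        System.out.println(compareAt(Collator.TERTIARY, "abc", "ABC") != 0);
    }
}
```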