From d5baa1fe28a31aa36accb9f9cc22bb2630341993 Mon Sep 17 00:00:00 2001
From: Doug Felt
- * The RuleBasedCollator class is a concrete subclass of Collator. It allows
- * customization of the Collator via user specified rule sets.
- * RuleBasedCollator is designed to be fully compliant to the
- *
- * Unicode Collation Algorithm (UCA) and conforms to ISO 14651.
- *
- * Users are strongly encouraged to read
- *
- * the users guide for more information about the collation service before
- * using this class.
- *
- * Create a RuleBasedCollator from a locale by calling the getInstance(Locale)
- * factory method in the base class Collator.
- * Collator.getInstance(Locale) creates a RuleBasedCollator object based on the
- * collation rules defined by the argument locale.
- * If a customized collation ordering ar attributes is required, use the
- * RuleBasedCollator(String) constructor with the appropriate rules. The
- * customized RuleBasedCollator will base its ordering on UCA, while
- * re-adjusting the attributes and orders of the characters in the specified
- * rule accordingly.
- *
- * RuleBasedCollator provides correct collation orders for most locales
- * supported in ICU. If specific data for a locale is not available, the orders
- * eventually falls back to the
- * UCA collation order
- * .
- *
- * For information about the collation rule syntax to use and details about
- * customization, please refer to the
+ * RuleBasedCollator is a concrete subclass of Collator. It allows
+ * customization of the Collator via user-specified rule sets.
+ * RuleBasedCollator is designed to be fully compliant with the Unicode
+ * Collation Algorithm (UCA) and conforms to ISO 14651. Users are strongly encouraged to read
+ * the user's guide for more information about the collation
+ * service before using this class. Create a RuleBasedCollator from a locale by calling the
+ * getInstance(Locale) factory method in the base class Collator.
+ * Collator.getInstance(Locale) creates a RuleBasedCollator object
+ * based on the collation rules defined by the argument locale. If a
+ * customized collation ordering or attributes is required, use the
+ * RuleBasedCollator(String) constructor with the appropriate
+ * rules. The customized RuleBasedCollator will base its ordering on
+ * UCA, while re-adjusting the attributes and orders of the characters
+ * in the specified rule accordingly. RuleBasedCollator provides correct collation orders for most
+ * locales supported in ICU. If specific data for a locale is not
+ * available, the order eventually falls back to the UCA collation
+ * order. For information about the collation rule syntax and details
+ * about customization, please refer to the
*
- * Collation customization section of the users guide.
- *
- * Note that there are some differences between the Collation rule syntax
- * used in Java and ICU4J
+ * Collation customization section of the user's guide. Note that there are some differences between
+ * the Collation rule syntax used in Java and ICU4J:
+ *
*
*
*
* Examples *
*- * Creating Customized RuleBasedCollators + * Creating Customized RuleBasedCollators: *
*- * Concatenating rules to combining- * String Simple = "& a < b < c < d"; - * RuleBasedCollator mySimple = new RuleBasedCollator(Simple); + * String simple = "& a < b < c < d"; + * RuleBasedCollator simpleCollator = new RuleBasedCollator(simple); * - * String Norwegian = "& a , A < b , B < c , C < d , D < e , E " + * String norwegian = "& a , A < b , B < c , C < d , D < e , E " * + "< f , F < g , G < h , H < i , I < j , " * + "J < k , K < l , L < m , M < n , N < " * + "o , O < p , P < q , Q < r , R < s , S < " @@ -131,10 +129,11 @@ import com.ibm.icu.impl.ICULocaleData; * + "< y , Y < z , Z < \u00E5 = a\u030A " * + ", \u00C5 = A\u030A ; aa , AA < \u00E6 " * + ", \u00C6 < \u00F8 , \u00D8"; - * RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian); + * RuleBasedCollator norwegianCollator = new RuleBasedCollator(norwegian); **
Collator
s
+ *
+ * Concatenating rules to combine Collator
s:
* *- * Making changes on an existing RuleBasedCollator to create a new - ** // Create an en_US Collator object @@ -153,9 +152,9 @@ import com.ibm.icu.impl.ICULocaleData; * // newCollator has the combined rules **
Collator
object, by appending the existing rule with the
- * changes.
+ *
+ * Making changes to an existing RuleBasedCollator to create a new
+ * Collator
object, by appending changes to the existing rule:
* *- * The following example demonstrates how to change the order of - * non-spacing accents, + * + * How to change the order of non-spacing accents: ** // Create a new Collator object with additional rules @@ -165,8 +164,8 @@ import com.ibm.icu.impl.ICULocaleData; * // myCollator contains the new rules **
*- * Putting new primary ordering in before the default setting, - * e.g. Sort English characters before or after Japanese characters in Japanese - ** // old rule with main accents @@ -182,15 +181,16 @@ import com.ibm.icu.impl.ICULocaleData; * RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn); **
Collator
.
+ *
+ * Putting in a new primary ordering before the default setting,
+ * e.g. sort English characters before or after Japanese characters in the Japanese
+ * Collator
:
* ** // get en_US Collator rules * RuleBasedCollator en_USCollator * = (RuleBasedCollator)Collator.getInstance(Locale.US); - * // add a few Japanese character to sort before English characters + * // add a few Japanese characters to sort before English characters * // suppose the last character before the first base letter 'a' in * // the English collation rule is \u2212 * String jaString = "& \u2212 < \u3041, \u3042 < \u3043, " @@ -202,23 +202,23 @@ import com.ibm.icu.impl.ICULocaleData; * * @author Syn Wee Quek * @since release 2.2, April 18 2002 - * @draft 2.2 + * @draft 2.2 */ public final class RuleBasedCollator extends Collator { - // public data members --------------------------------------------------- - - // public constructors --------------------------------------------------- - - /** + // public data members --------------------------------------------------- + + // public constructors --------------------------------------------------- + + /** *- * RuleBasedCollator constructor that takes the argument rules for - * customization. RuleBasedCollator constructed will be based on UCA, + * Constructor that takes the argument rules for + * customization. The collator will be based on UCA, * with the attributes and re-ordering of the characters specified in the * argument rules. *
*See the user guide's section on - * + * * Collation Customization for details on the rule syntax. *
* @param rules the collation rules to build the collation table from. @@ -230,19 +230,18 @@ public final class RuleBasedCollator extends Collator public RuleBasedCollator(String rules) throws Exception { if (rules == null) { - throw new IllegalArgumentException( - "Collation rules can not be null"); + throw new IllegalArgumentException("Collation rules can not be null"); } - setWithUCAData(); + setWithUCAData(); CollationParsedRuleBuilder builder - = new CollationParsedRuleBuilder(rules); - - builder.setRules(this); + = new CollationParsedRuleBuilder(rules); + + builder.setRules(this); m_rules_ = rules; init(); } - // public methods -------------------------------------------------------- + // public methods -------------------------------------------------------- /** * Return a CollationElementIterator for the given String. @@ -256,136 +255,134 @@ public final class RuleBasedCollator extends Collator /** * Return a CollationElementIterator for the given CharacterIterator. - * Argument source's integrity will be preserved since a new copy of source - * will be created for use instead. + * The source iterator's integrity will be preserved since a new copy + * will be created for use. * @see CollationElementIterator * @draft 2.2 */ - public CollationElementIterator getCollationElementIterator( - CharacterIterator source) - { - CharacterIterator newsource = (CharacterIterator)source.clone(); + public CollationElementIterator getCollationElementIterator(CharacterIterator source) + { + CharacterIterator newsource = (CharacterIterator)source.clone(); return new CollationElementIterator(source, this); } // public setters -------------------------------------------------------- /** - * Sets the Hiragana Quaternary mode to be on or off. - * When the Hiragana Quaternary mode turned on, the RuleBasedCollator - * positions Hiragana characters before all non-ignorable characters in - * QUATERNARY strength. 
This is to produce a correct JIS collation order, - * distinguishing between Katakana and Hiragana characters. - * @param flag true if Hiragana Quaternary mode is to be on, false - * otherwise - * @see #setHiraganaQuaternaryDefault - * @see #isHiraganaQuaternary - * @draft 2.2 - */ - public void setHiraganaQuaternary(boolean flag) - { - m_isHiragana4_ = flag; - } - - /** - * Sets the Hiragana Quaternary mode to the initial mode set during - * construction of the RuleBasedCollator. - * See setHiraganaQuaternary(boolean) for more details. - * @see #setHiraganaQuaternary(boolean) - * @see #isHiraganaQuaternary - * @draft 2.2 - */ - public void setHiraganaQuaternaryDefault() - { - m_isHiragana4_ = m_defaultIsHiragana4_; - } - - /** - * Sets the orders of upper cased characters to sort before lower cased - * characters or vice versa, in strength TERTIARY. The default - * mode is false, and that sorts lower cased characters before upper cased - * characters. - * If true is set, the RuleBasedCollator will sort upper cased characters - * before the lower cased ones. - * @param upperfirst true for sorting upper cased characters before - * lower cased characters, false for sorting lower cased - * characters before upper cased characters - * @see #setCaseFirstOff - * @see #isCaseFirstOff - * @see #isLowerCaseFirst - * @see #isUpperCaseFirst - * @see #setCaseFirstDefault - * @draft 2.2 - */ - public void setCaseFirst(boolean upperfirst) - { - if (upperfirst) { - m_caseFirst_ = AttributeValue.UPPER_FIRST_; - } - else { - m_caseFirst_ = AttributeValue.LOWER_FIRST_; - } - updateInternalState(); - } - - /** - * Sets the Collator to ignore any previous setCaseFirst(boolean) calls. - * Ignores case preferences. 
- * @draft 2.2 - * @see #setCaseFirst(boolean) - * @see #isCaseFirstOff - * @see #isLowerCaseFirst - * @see #isUpperCaseFirst - * @see #setCaseFirstDefault - */ - public void setCaseFirstOff() - { - m_caseFirst_ = AttributeValue.OFF_; - updateInternalState(); - } - - /** - * Sets the case first mode to the initial mode set during - * construction of the RuleBasedCollator. - * See setCaseFirst(boolean) for more details. - * @see #setCaseFirstOff - * @see #isCaseFirstOff - * @see #isUpperCaseFirst - * @see #setCaseFirst - * @draft 2.2 - */ - public final void setCaseFirstDefault() - { - m_caseFirst_ = m_defaultCaseFirst_; - updateInternalState(); - } + * Sets the Hiragana Quaternary mode to be on or off. + * When the Hiragana Quaternary mode is turned on, the collator + * positions Hiragana characters before all non-ignorable characters in + * QUATERNARY strength. This is to produce a correct JIS collation order, + * distinguishing between Katakana and Hiragana characters. + * @param flag true if Hiragana Quaternary mode is to be on, false + * otherwise + * @see #setHiraganaQuaternaryDefault + * @see #isHiraganaQuaternary + * @draft 2.2 + */ + public void setHiraganaQuaternary(boolean flag) + { + m_isHiragana4_ = flag; + } + + /** + * Sets the Hiragana Quaternary mode to the initial mode set during + * construction of the RuleBasedCollator. + * See setHiraganaQuaternary(boolean) for more details. + * @see #setHiraganaQuaternary(boolean) + * @see #isHiraganaQuaternary + * @draft 2.2 + */ + public void setHiraganaQuaternaryDefault() + { + m_isHiragana4_ = m_defaultIsHiragana4_; + } + + /** + * Sets whether uppercase characters sort before lowercase + * characters or vice versa, in strength TERTIARY. The default + * mode is false, and so lowercase characters sort before uppercase + * characters. + * If true, sort upper case characters first. 
+ * @param upperfirst true to sort uppercase characters before + * lowercase characters, false to sort lowercase + * characters before uppercase characters + * @see #setCaseFirstOff + * @see #isCaseFirstOff + * @see #isLowerCaseFirst + * @see #isUpperCaseFirst + * @see #setCaseFirstDefault + * @draft 2.2 + */ + public void setCaseFirst(boolean upperfirst) + { + if (upperfirst) { + m_caseFirst_ = AttributeValue.UPPER_FIRST_; + } + else { + m_caseFirst_ = AttributeValue.LOWER_FIRST_; + } + updateInternalState(); + } + + /** + * Sets the Collator to ignore any previous setCaseFirst(boolean) calls. + * Ignores case preferences. + * @draft 2.2 + * @see #setCaseFirst(boolean) + * @see #isCaseFirstOff + * @see #isLowerCaseFirst + * @see #isUpperCaseFirst + * @see #setCaseFirstDefault + */ + public void setCaseFirstOff() + { + m_caseFirst_ = AttributeValue.OFF_; + updateInternalState(); + } + + /** + * Sets the case first mode to the initial mode set during + * construction of the RuleBasedCollator. + * See setCaseFirst(boolean) for more details. + * @see #setCaseFirstOff + * @see #isCaseFirstOff + * @see #isUpperCaseFirst + * @see #setCaseFirst + * @draft 2.2 + */ + public final void setCaseFirstDefault() + { + m_caseFirst_ = m_defaultCaseFirst_; + updateInternalState(); + } /** * Sets the alternate handling mode to the initial mode set during - * construction of the RuleBasedCollator. - * See setAlternateHandling(boolean) for more details. - * @see #setAlternateHandling(boolean) - * @see #isAlternateHandling(boolean) + * construction of the RuleBasedCollator. + * See setAlternateHandling(boolean) for more details. 
+ * @see #setAlternateHandling(boolean) + * @see #isAlternateHandling(boolean) * @draft 2.2 */ public void setAlternateHandlingDefault() { - m_isAlternateHandlingShifted_ = m_defaultIsAlternateHandlingShifted_; - updateInternalState(); + m_isAlternateHandlingShifted_ = m_defaultIsAlternateHandlingShifted_; + updateInternalState(); } /** * Sets the case level mode to the initial mode set during - * construction of the RuleBasedCollator. - * See setCaseLevel(boolean) for more details. - * @see #setCaseLevel(boolean) - * @see #isCaseLevel + * construction of the RuleBasedCollator. + * See setCaseLevel(boolean) for more details. + * @see #setCaseLevel(boolean) + * @see #isCaseLevel * @draft 2.2 */ public void setCaseLevelDefault() { - m_isCaseLevel_ = m_defaultIsCaseLevel_; - updateInternalState(); + m_isCaseLevel_ = m_defaultIsCaseLevel_; + updateInternalState(); } /** @@ -398,7 +395,7 @@ public final class RuleBasedCollator extends Collator */ public void setDecompositionDefault() { - setDecomposition(m_defaultDecomposition_); + setDecomposition(m_defaultDecomposition_); } /** @@ -411,8 +408,8 @@ public final class RuleBasedCollator extends Collator */ public void setFrenchCollationDefault() { - m_isFrenchCollation_ = m_defaultIsFrenchCollation_; - updateInternalState(); + m_isFrenchCollation_ = m_defaultIsFrenchCollation_; + updateInternalState(); } /** @@ -425,17 +422,17 @@ public final class RuleBasedCollator extends Collator */ public void setStrengthDefault() { - setStrength(m_defaultStrength_); + setStrength(m_defaultStrength_); } /** * Sets the mode for the direction of SECONDARY weights to be used in * French collation. - * The default value is false which treats SECONDARY weights in the order + * The default value is false, which treats SECONDARY weights in the order * they appear. - * If true is set, the SECONDARY weights will be sorted backwards. + * If set to true, the SECONDARY weights will be sorted backwards. 
* See the section on - * + * * French collation for more information. * @param flag true to set the French collation on, false to set it off * @draft 2.2 @@ -444,15 +441,15 @@ public final class RuleBasedCollator extends Collator */ public void setFrenchCollation(boolean flag) { - m_isFrenchCollation_ = flag; - updateInternalState(); + m_isFrenchCollation_ = flag; + updateInternalState(); } /** - * Sets the alternate handling for Quaternary strength to be either + * Sets the alternate handling for QUATERNARY strength to be either * shifted or non-ignorable. * See the UCA definition on - * + * * Alternate Weighting. * This attribute will only be effective when QUATERNARY strength is set. * The default value for this mode is false, corresponding to the @@ -471,8 +468,8 @@ public final class RuleBasedCollator extends Collator */ public void setAlternateHandling(boolean shifted) { - m_isAlternateHandlingShifted_ = shifted; - updateInternalState(); + m_isAlternateHandlingShifted_ = shifted; + updateInternalState(); } /** @@ -482,30 +479,29 @@ public final class RuleBasedCollator extends Collator * The case level is used to distinguish large and small Japanese Kana * characters. Case level could also be used in other situations. * For example to distinguish certain Pinyin characters. - * The default value is false, where the case level is not generated. - * If the case level is set to true, which causes the case level to be - * generated. Contents of the case level are affected by the case first + * The default value is false, which means the case level is not generated. + * The contents of the case level are affected by the case first * mode. A simple way to ignore accent differences in a string is to set * the strength to PRIMARY and enable case level. * ** See the section on - * + * * case level for more information. *
* @param flag true if case level sorting is required, false otherwise * @draft 2.2 * @see #setCaseLevelDefault - * @see #isCaseLevel - * @see #setCaseFirst(boolean) + * @see #isCaseLevel + * @see #setCaseFirst(boolean) */ public void setCaseLevel(boolean flag) { - m_isCaseLevel_ = flag; - updateInternalState(); + m_isCaseLevel_ = flag; + updateInternalState(); } - /** + /** ** Sets this Collator's strength property. The strength property * determines the minimum level of difference considered significant @@ -521,7 +517,7 @@ public final class RuleBasedCollator extends Collator * @see #QUATERNARY * @see #IDENTICAL * @exception IllegalArgumentException If the new strength value is not one - * of PRIMARY, SECONDARY, TERTIARY, QUATERNARY or IDENTICAL. + * of PRIMARY, SECONDARY, TERTIARY, QUATERNARY or IDENTICAL. * @draft 2.2 */ public void setStrength(int newStrength) { @@ -538,584 +534,539 @@ public final class RuleBasedCollator extends Collator */ public String getRules() { - return m_rules_; + return m_rules_; } - /** - *
- * Get a Collation key for the argument String source from this - * RuleBasedCollator. - *
- *- * General recommendation:
- *
- * If comparison are to be done to the same String multiple times, it would - * be more efficient to generate CollationKeys for the Strings and use - * CollationKey.compareTo(CollationKey) for the comparisons. - * If the each Strings are compared to only once, using the method - * RuleBasedCollator.compare(String, String) will have a better performance. - *- * See the class documentation for an explanation about CollationKeys. - *
- * @param source the text String to be transformed into a collation key. - * @return the CollationKey for the given String based on this - * RuleBasedCollator's collation rules. If the source String is - * null, a null CollationKey is returned. - * @see CollationKey - * @see #compare(String, String) - * @draft 2.2 - */ public CollationKey getCollationKey(String source) { - if (source == null) { - return null; - } - int strength = getStrength(); - boolean compare[] = {m_isCaseLevel_, - true, - strength >= SECONDARY, - strength >= TERTIARY, - strength >= QUATERNARY, - strength == IDENTICAL - }; + if (source == null) { + return null; + } + int strength = getStrength(); + boolean compare[] = {m_isCaseLevel_, + true, + strength >= SECONDARY, + strength >= TERTIARY, + strength >= QUATERNARY, + strength == IDENTICAL + }; - byte bytes[][] = {new byte[SORT_BUFFER_INIT_SIZE_CASE_], // case - new byte[SORT_BUFFER_INIT_SIZE_1_], // primary - new byte[SORT_BUFFER_INIT_SIZE_2_], // secondary - new byte[SORT_BUFFER_INIT_SIZE_3_], // tertiary - new byte[SORT_BUFFER_INIT_SIZE_4_] // Quaternary - }; - int bytescount[] = {0, 0, 0, 0, 0}; - int count[] = {0, 0, 0, 0, 0}; - boolean doFrench = m_isFrenchCollation_ && compare[2]; - // TODO: UCOL_COMMON_BOT4 should be a function of qShifted. - // If we have no qShifted, we don't need to set UCOL_COMMON_BOT4 so - // high. - int commonBottom4 = ((m_variableTopValue_ >> 8) & LAST_BYTE_MASK_) + 1; - byte hiragana4 = 0; - if (m_isHiragana4_ && compare[4]) { - // allocate one more space for hiragana, value for hiragana - hiragana4 = (byte)commonBottom4; - commonBottom4 ++; - } - - int bottomCount4 = 0xFF - commonBottom4; - // If we need to normalize, we'll do it all at once at the beginning! 
- if ((compare[5] || getDecomposition() != NO_DECOMPOSITION) - && Normalizer.quickCheck(source, Normalizer.NFD) - != Normalizer.YES) { - source = Normalizer.decompose(source, false); - } - getSortKeyBytes(source, compare, bytes, bytescount, count, doFrench, - hiragana4, commonBottom4, bottomCount4); - byte sortkey[] = getSortKey(source, compare, bytes, bytescount, count, - doFrench, commonBottom4, bottomCount4); - return new CollationKey(source, sortkey); + byte bytes[][] = {new byte[SORT_BUFFER_INIT_SIZE_CASE_], // case + new byte[SORT_BUFFER_INIT_SIZE_1_], // primary + new byte[SORT_BUFFER_INIT_SIZE_2_], // secondary + new byte[SORT_BUFFER_INIT_SIZE_3_], // tertiary + new byte[SORT_BUFFER_INIT_SIZE_4_] // Quaternary + }; + int bytescount[] = {0, 0, 0, 0, 0}; + int count[] = {0, 0, 0, 0, 0}; + boolean doFrench = m_isFrenchCollation_ && compare[2]; + // TODO: UCOL_COMMON_BOT4 should be a function of qShifted. + // If we have no qShifted, we don't need to set UCOL_COMMON_BOT4 so + // high. + int commonBottom4 = ((m_variableTopValue_ >> 8) & LAST_BYTE_MASK_) + 1; + byte hiragana4 = 0; + if (m_isHiragana4_ && compare[4]) { + // allocate one more space for hiragana, value for hiragana + hiragana4 = (byte)commonBottom4; + commonBottom4 ++; + } + + int bottomCount4 = 0xFF - commonBottom4; + // If we need to normalize, we'll do it all at once at the beginning! + if ((compare[5] || getDecomposition() != NO_DECOMPOSITION) + && Normalizer.quickCheck(source, Normalizer.NFD) + != Normalizer.YES) { + source = Normalizer.decompose(source, false); + } + getSortKeyBytes(source, compare, bytes, bytescount, count, doFrench, + hiragana4, commonBottom4, bottomCount4); + byte sortkey[] = getSortKey(source, compare, bytes, bytescount, count, + doFrench, commonBottom4, bottomCount4); + return new CollationKey(source, sortkey); } - + /** - * Checks if upper cased character is sorted before lower cased character. - * See setCaseFirst(boolean) for details. 
- * @see #setCaseFirstOff - * @see #setCaseFirst(boolean) - * @see #isLowerCaseFirst - * @see #setCaseFirstDefault - * @return true if upper cased characters are sorted before lower cased - * characters, false otherwise - * @draft 2.2 - */ - public boolean isUpperCaseFirst() - { - return (m_caseFirst_ == AttributeValue.UPPER_FIRST_); - } - - /** - * Checks if lower cased character is sorted before upper cased character. - * See setCaseFirst(boolean) for details. - * @see #setCaseFirstOff - * @see #setCaseFirst(boolean) - * @see #isUpperCaseFirst - * @see #setCaseFirstDefault - * @return true lower cased characters are sorted before upper cased - * characters, false otherwise - * @draft 2.2 - */ - public boolean isLowerCaseFirst() - { - return (m_caseFirst_ == AttributeValue.LOWER_FIRST_); - } - - /** - * Checks if a previous call to setCaseFirst(boolean) is turned off - * by setCaseFirstOff(). - * See setCaseFirst(boolean) for details. - * @return true if the customized case sorting is turned off, false - * otherwise - * @see #setCaseFirstOff - * @see #setCaseFirst(boolean) - * @see #isUpperCaseFirst - * @see #isLowerCaseFirst - * @see #setCaseFirstDefault - * @draft 2.2 - */ - public boolean isCaseFirstOff() - { - return (m_caseFirst_ == AttributeValue.OFF_); - } - - /** - * Checks if the alternate handling behaviour is the UCA defined SHIFTED or - * NON_IGNORABLE. - *- *
- * See setAlternateHandling(boolean) for more details. - * @param shifted true if checks are to be done to see if the SHIFTED - * behaviour is on, false if checks are to be done to see if the - * NON_IGNORABLE behaviour is on. - * @return true or false - * @see #setAlternateHandling(boolean) - * @see #setAlternateHandlingDefault + * Return true if an uppercase character is sorted before the corresponding lowercase character. + * See setCaseFirst(boolean) for details. + * @see #setCaseFirstOff + * @see #setCaseFirst(boolean) + * @see #isLowerCaseFirst + * @see #setCaseFirstDefault + * @return true if upper cased characters are sorted before lower cased + * characters, false otherwise * @draft 2.2 */ - public boolean isAlternateHandling(boolean shifted) - { - if (shifted) { - return m_isAlternateHandlingShifted_; - } - return !m_isAlternateHandlingShifted_; - } - - /** - * Checks if case level is set to true. - * See setCaseLevel(boolean) for details. - * @return the case level mode - * @see #setCaseLevelDefault - * @see #isCaseLevel - * @see #setCaseLevel(boolean) - * @draft 2.2 - */ - public boolean isCaseLevel() - { - return m_isCaseLevel_; - } - - /** - * Checks if French Collation is set to true. - * See setFrenchCollation(boolean) for details. - * @return true if French Collation is set to true, false otherwise - * @see #setFrenchCollation(boolean) - * @see #setFrenchCollationDefault - * @draft 2.2 - */ - public boolean isFrenchCollation() - { - return m_isFrenchCollation_; - } - - /** - * Checks if the Hiragana Quaternary mode is set on. - * See setHiraganaQuaternary(boolean) for more details. 
- * @return flag true if Hiragana Quaternary mode is on, false otherwise - * @see #setHiraganaQuaternaryDefault - * @see #setHiraganaQuaternary(boolean) - * @draft 2.2 - */ - public boolean isHiraganaQuaternary() - { - return m_isHiragana4_; - } - - // public other methods ------------------------------------------------- + public boolean isUpperCaseFirst() + { + return (m_caseFirst_ == AttributeValue.UPPER_FIRST_); + } + + /** + * Return true if a lowercase character is sorted before the corresponding uppercase character. + * See setCaseFirst(boolean) for details. + * @see #setCaseFirstOff + * @see #setCaseFirst(boolean) + * @see #isUpperCaseFirst + * @see #setCaseFirstDefault + * @return true lower cased characters are sorted before upper cased + * characters, false otherwise + * @draft 2.2 + */ + public boolean isLowerCaseFirst() + { + return (m_caseFirst_ == AttributeValue.LOWER_FIRST_); + } + + /** + * Checks if a previous call to setCaseFirst(boolean) is turned off + * by setCaseFirstOff(). + * See setCaseFirst(boolean) for details. + * @return true if the customized case sorting is turned off, false + * otherwise + * @see #setCaseFirstOff + * @see #setCaseFirst(boolean) + * @see #isUpperCaseFirst + * @see #isLowerCaseFirst + * @see #setCaseFirstDefault + * @draft 2.2 + */ + public boolean isCaseFirstOff() + { + return (m_caseFirst_ == AttributeValue.OFF_); + } + + /** + * + * (Syn Wee: this makes no sense to me. It seems a function like + * isAlternateHandlingShifted() is much easier to explain! You + * can tell from the implementation that there isn't _this_ much + * to document about this function... :-) + * + * Checks if the alternate handling behaviour is the UCA defined SHIFTED or + * NON_IGNORABLE. + *- If argument shifted is true and - *
- *
- *- return value is true, then the alternate handling attribute for - * the Collator is SHIFTED. Or - *
- return value is false, then the alternate handling attribute for - * the Collator is NON_IGNORABLE - *
- If argument shifted is false and - *
- *
- *- return value is true, then the alternate handling attribute for - * the Collator is NON_IGNORABLE. Or - *
- return value is false, then the alternate handling attribute for - * the Collator is SHIFTED. - *
+ *
+ * See setAlternateHandling(boolean) for more details. + * @param shifted true if checks are to be done to see if the SHIFTED + * behaviour is on, false if checks are to be done to see if the + * NON_IGNORABLE behaviour is on. + * @return true or false + * @see #setAlternateHandling(boolean) + * @see #setAlternateHandlingDefault + * @draft 2.2 */ + public boolean isAlternateHandling(boolean shifted) + { + if (shifted) { + return m_isAlternateHandlingShifted_; + } + return !m_isAlternateHandlingShifted_; + } + + /** + * Checks if case level is set to true. + * See setCaseLevel(boolean) for details. + * @return the case level mode + * @see #setCaseLevelDefault + * @see #isCaseLevel + * @see #setCaseLevel(boolean) + * @draft 2.2 + */ + public boolean isCaseLevel() + { + return m_isCaseLevel_; + } + + /** + * Checks if French Collation is set to true. + * See setFrenchCollation(boolean) for details. + * @return true if French Collation is set to true, false otherwise + * @see #setFrenchCollation(boolean) + * @see #setFrenchCollationDefault + * @draft 2.2 + */ + public boolean isFrenchCollation() + { + return m_isFrenchCollation_; + } + + /** + * Checks if the Hiragana Quaternary mode is set on. + * See setHiraganaQuaternary(boolean) for more details. + * @return flag true if Hiragana Quaternary mode is on, false otherwise + * @see #setHiraganaQuaternaryDefault + * @see #setHiraganaQuaternary(boolean) + * @draft 2.2 + */ + public boolean isHiraganaQuaternary() + { + return m_isHiragana4_; + } + + // public other methods ------------------------------------------------- /** * Compares the equality of two RuleBasedCollator objects. - * RuleBasedCollator objects are equivalent if they have the same collation + * RuleBasedCollator objects are equal if they have the same collation * rules and the same attributes. - * @param obj the RuleBasedCollator to be compared with. + * @param obj the RuleBasedCollator to be compared to. 
* @return true if this RuleBasedCollator has exactly the same * collation behaviour as obj, false otherwise. * @draft 2.2 */ public boolean equals(Object obj) { if (obj == null) { - return false; // super does class check + return false; // super does class check } if (this == obj) { - return true; + return true; } if (getClass() != obj.getClass()) { - return false; + return false; } RuleBasedCollator other = (RuleBasedCollator)obj; // all other non-transient information is also contained in rules. return getStrength() == other.getStrength() - && getDecomposition() == other.getDecomposition() - && other.m_caseFirst_ == m_caseFirst_ - && other.m_caseSwitch_ == m_caseSwitch_ - && other.m_isAlternateHandlingShifted_ - == m_isAlternateHandlingShifted_ - && other.m_isCaseLevel_ == m_isCaseLevel_ - && other.m_isFrenchCollation_ == m_isFrenchCollation_ - && other.m_isHiragana4_ == m_isHiragana4_ - && m_rules_.equals(other.m_rules_); + && getDecomposition() == other.getDecomposition() + && other.m_caseFirst_ == m_caseFirst_ + && other.m_caseSwitch_ == m_caseSwitch_ + && other.m_isAlternateHandlingShifted_ + == m_isAlternateHandlingShifted_ + && other.m_isCaseLevel_ == m_isCaseLevel_ + && other.m_isFrenchCollation_ == m_isFrenchCollation_ + && other.m_isHiragana4_ == m_isHiragana4_ + && m_rules_.equals(other.m_rules_); } - /** + /** * Generates a unique hash code for this RuleBasedCollator. * @return the unique hash code for this Collator * @draft 2.2 */ public int hashCode() { - return getRules().hashCode(); + return getRules().hashCode(); } - /** - * Compares the source text String to the target text String according to - * the collation rules, strength and decomposition mode for this - * RuleBasedCollator. - * Returns an integer less than, - * equal to or greater than zero depending on whether the source String is - * less than, equal to or greater than the target String. See the Collator - * class description for an example of use. 
- *
- * - If argument shifted is true and
+ *
+ *
+ * - return value is true, then the alternate handling attribute for
+ * the Collator is SHIFTED. Or
+ * - return value is false, then the alternate handling attribute for
+ * the Collator is NON_IGNORABLE
+ * - If argument shifted is false and
+ *
+ *
+ * - return value is true, then the alternate handling attribute for
+ * the Collator is NON_IGNORABLE. Or
+ * - return value is false, then the alternate handling attribute for
+ * the Collator is SHIFTED.
+ *
- * General recommendation:
- * @param source the source text String. - * @param target the target text String. - * @return Returns an integer value. Value is less than zero if source is - * less than target, value is zero if source and target are equal, - * value is greater than zero if source is greater than target. - * @see CollationKey - * @see #getCollationKey - * @draft 2.2 - */ public int compare(String source, String target) { - if (source == target) { - return 0; - } - - // Find the length of any leading portion that is equal - int offset = getFirstUnmatchedOffset(source, target); - if (offset == source.length()) { - if (offset == target.length() || checkIgnorable(target, offset)) { - return 0; - } - return -1; - } - else if (target.length() == offset) { - if (checkIgnorable(source, offset)) { - return 0; - } - return 1; - } + if (source == target) { + return 0; + } + + // Find the length of any leading portion that is equal + int offset = getFirstUnmatchedOffset(source, target); + if (offset == source.length()) { + if (offset == target.length() || checkIgnorable(target, offset)) { + return 0; + } + return -1; + } + else if (target.length() == offset) { + if (checkIgnorable(source, offset)) { + return 0; + } + return 1; + } int strength = getStrength(); - // setting up the collator parameters - boolean compare[] = {m_isCaseLevel_, - true, - strength >= SECONDARY, - strength >= TERTIARY, - strength >= QUATERNARY, - strength == IDENTICAL - }; - boolean doFrench = m_isFrenchCollation_ && compare[2]; - boolean doShift4 = m_isAlternateHandlingShifted_ && compare[4]; - boolean doHiragana4 = m_isHiragana4_ && compare[4]; - - if (doHiragana4 && doShift4) { - String sourcesub = source.substring(offset); - String targetsub = target.substring(offset); - return compareBySortKeys(sourcesub, targetsub); - } - - // Preparing the CE buffers. 
will be filled during the primary phase - int cebuffer[][] = {new int[CE_BUFFER_SIZE_], new int[CE_BUFFER_SIZE_]}; - int cebuffersize[] = {0, 0}; - // This is the lowest primary value that will not be ignored if shifted - int lowestpvalue = m_isAlternateHandlingShifted_ - ? m_variableTopValue_ << 16 : 0; - int result = doPrimaryCompare(doHiragana4, lowestpvalue, source, - target, offset, cebuffer, cebuffersize); - if (cebuffer[0] == null && cebuffer[1] == null) { - // since the cebuffer is cleared when we have determined that - // either source is greater than target or vice versa, the return - // result is the comparison result and not the hiragana result - return result; - } - - int hiraganaresult = result; - - if (compare[2]) { - result = doSecondaryCompare(cebuffer, cebuffersize, doFrench); - if (result != 0) { - return result; - } - } - // doing the case bit - if (compare[0]) { - result = doCaseCompare(cebuffer); - if (result != 0) { - return result; - } - } - // Tertiary level - if (compare[3]) { - result = doTertiaryCompare(cebuffer); - if (result != 0) { - return result; - } - } - - if (doShift4) { // checkQuad - result = doQuaternaryCompare(cebuffer, lowestpvalue); - if (result != 0) { - return result; - } - } - else if (doHiragana4 && hiraganaresult != 0) { - // If we're fine on quaternaries, we might be different - // on Hiragana. This, however, might fail us in shifted. - return hiraganaresult; - } - - // For IDENTICAL comparisons, we use a bitwise character comparison - // as a tiebreaker if all else is equal. - // Getting here should be quite rare - strings are not identical - - // that is checked first, but compared == through all other checks. 
- if (compare[5]) { - return doIdenticalCompare(source, target, offset, true); - } - return 0; + // setting up the collator parameters + boolean compare[] = {m_isCaseLevel_, + true, + strength >= SECONDARY, + strength >= TERTIARY, + strength >= QUATERNARY, + strength == IDENTICAL + }; + boolean doFrench = m_isFrenchCollation_ && compare[2]; + boolean doShift4 = m_isAlternateHandlingShifted_ && compare[4]; + boolean doHiragana4 = m_isHiragana4_ && compare[4]; + + if (doHiragana4 && doShift4) { + String sourcesub = source.substring(offset); + String targetsub = target.substring(offset); + return compareBySortKeys(sourcesub, targetsub); + } + + // Preparing the CE buffers. will be filled during the primary phase + int cebuffer[][] = {new int[CE_BUFFER_SIZE_], new int[CE_BUFFER_SIZE_]}; + int cebuffersize[] = {0, 0}; + // This is the lowest primary value that will not be ignored if shifted + int lowestpvalue = m_isAlternateHandlingShifted_ + ? m_variableTopValue_ << 16 : 0; + int result = doPrimaryCompare(doHiragana4, lowestpvalue, source, + target, offset, cebuffer, cebuffersize); + if (cebuffer[0] == null && cebuffer[1] == null) { + // since the cebuffer is cleared when we have determined that + // either source is greater than target or vice versa, the return + // result is the comparison result and not the hiragana result + return result; + } + + int hiraganaresult = result; + + if (compare[2]) { + result = doSecondaryCompare(cebuffer, cebuffersize, doFrench); + if (result != 0) { + return result; + } + } + // doing the case bit + if (compare[0]) { + result = doCaseCompare(cebuffer); + if (result != 0) { + return result; + } + } + // Tertiary level + if (compare[3]) { + result = doTertiaryCompare(cebuffer); + if (result != 0) { + return result; + } + } + + if (doShift4) { // checkQuad + result = doQuaternaryCompare(cebuffer, lowestpvalue); + if (result != 0) { + return result; + } + } + else if (doHiragana4 && hiraganaresult != 0) { + // If we're fine on 
quaternaries, we might be different + // on Hiragana. This, however, might fail us in shifted. + return hiraganaresult; + } + + // For IDENTICAL comparisons, we use a bitwise character comparison + // as a tiebreaker if all else is equal. + // Getting here should be quite rare - strings are not identical - + // that is checked first, but compared == through all other checks. + if (compare[5]) { + return doIdenticalCompare(source, target, offset, true); + } + return 0; } // package private inner interfaces -------------------------------------- /** - * Attribute values to be used when setting the Collator options - */ - static interface AttributeValue - { - /** - * Indicates that the default attribute value will be used. - * See individual attribute for details on its default value. - */ - static final int DEFAULT_ = -1; - /** - * Primary collation strength - */ - static final int PRIMARY_ = Collator.PRIMARY; - /** - * Secondary collation strength - */ - static final int SECONDARY_ = Collator.SECONDARY; - /** - * Tertiary collation strength - */ - static final int TERTIARY_ = Collator.TERTIARY; - /** - * Default collation strength - */ - static final int DEFAULT_STRENGTH_ = Collator.TERTIARY; - /** - * Internal use for strength checks in Collation elements - */ - static final int CE_STRENGTH_LIMIT_ = Collator.TERTIARY + 1; - /** - * Quaternary collation strength - */ - static final int QUATERNARY_ = 3; - /** - * Identical collation strength - */ - static final int IDENTICAL_ = Collator.IDENTICAL; - /** - * Internal use for strength checks - */ - static final int STRENGTH_LIMIT_ = Collator.IDENTICAL + 1; - /** - * Turn the feature off - works for FRENCH_COLLATION, CASE_LEVEL, - * HIRAGANA_QUATERNARY_MODE and DECOMPOSITION_MODE - */ - static final int OFF_ = 16; - /** - * Turn the feature on - works for FRENCH_COLLATION, CASE_LEVEL, - * HIRAGANA_QUATERNARY_MODE and DECOMPOSITION_MODE - */ - static final int ON_ = 17; - /** - * Valid for ALTERNATE_HANDLING. 
Alternate handling will be shifted - */ - static final int SHIFTED_ = 20; - /** - * Valid for ALTERNATE_HANDLING. Alternate handling will be non - * ignorable - */ - static final int NON_IGNORABLE_ = 21; - /** - * Valid for CASE_FIRST - lower case sorts before upper case - */ - static final int LOWER_FIRST_ = 24; - /** - * Upper case sorts before lower case - */ - static final int UPPER_FIRST_ = 25; - /** - * Valid for NORMALIZATION_MODE ON and OFF are also allowed for this - * attribute - */ - static final int ON_WITHOUT_HANGUL_ = 28; - /** - * Number of attribute values - */ - static final int LIMIT_ = 29; - }; - - /** - * Attributes that collation service understands. All the attributes can - * take DEFAULT value, as well as the values specific to each one. - */ - static interface Attribute - { - /** - * Attribute for direction of secondary weights - used in French. - * Acceptable values are ON, which results in secondary weights being - * considered backwards and OFF which treats secondary weights in the - * order they appear. - */ - static final int FRENCH_COLLATION_ = 0; - /** - * Attribute for handling variable elements. Acceptable values are - * NON_IGNORABLE (default) which treats all the codepoints with - * non-ignorable primary weights in the same way, and SHIFTED which - * causes codepoints with primary weights that are equal or below the - * variable top value to be ignored on primary level and moved to the - * quaternary level. - */ - static final int ALTERNATE_HANDLING_ = 1; - /** - * Controls the ordering of upper and lower case letters. Acceptable - * values are OFF (default), which orders upper and lower case letters - * in accordance to their tertiary weights, UPPER_FIRST which forces - * upper case letters to sort before lower case letters, and - * LOWER_FIRST which does the opposite. - */ - static final int CASE_FIRST_ = 2; - /** - * Controls whether an extra case level (positioned before the third - * level) is generated or not. 
Acceptable values are OFF (default), - * when case level is not generated, and ON which causes the case - * level to be generated. Contents of the case level are affected by - * the value of CASE_FIRST attribute. A simple way to ignore accent - * differences in a string is to set the strength to PRIMARY and - * enable case level. - */ - static final int CASE_LEVEL_ = 3; - /** - * Controls whether the normalization check and necessary - * normalizations are performed. When set to OFF (default) no - * normalization check is performed. The correctness of the result is - * guaranteed only if the input data is in so-called FCD form (see - * users manual for more info). When set to ON, an incremental check - * is performed to see whether the input data is in the FCD form. If - * the data is not in the FCD form, incremental NFD normalization is - * performed. - */ - static final int NORMALIZATION_MODE_ = 4; - /** - * The strength attribute. Can be either PRIMARY, SECONDARY, TERTIARY, - * QUATERNARY or IDENTICAL. The usual strength for most locales - * (except Japanese) is tertiary. Quaternary strength is useful when - * combined with shifted setting for alternate handling attribute and - * for JIS x 4061 collation, when it is used to distinguish between - * Katakana and Hiragana (this is achieved by setting the - * HIRAGANA_QUATERNARY mode to on. Otherwise, quaternary level is - * affected only by the number of non ignorable code points in the - * string. Identical strength is rarely useful, as it amounts to - * codepoints of the NFD form of the string. - */ - static final int STRENGTH_ = 5; - /** - * When turned on, this attribute positions Hiragana before all - * non-ignorables on quaternary level. This is a sneaky way to produce - * JIS sort order. 
- */ - static final int HIRAGANA_QUATERNARY_MODE_ = 6; + * Attribute values to be used when setting the Collator options + */ + static interface AttributeValue + { /** - * Attribute count - */ - static final int LIMIT_ = 7; - }; - - /** + * Indicates that the default attribute value will be used. + * See individual attribute for details on its default value. + */ + static final int DEFAULT_ = -1; + /** + * Primary collation strength + */ + static final int PRIMARY_ = Collator.PRIMARY; + /** + * Secondary collation strength + */ + static final int SECONDARY_ = Collator.SECONDARY; + /** + * Tertiary collation strength + */ + static final int TERTIARY_ = Collator.TERTIARY; + /** + * Default collation strength + */ + static final int DEFAULT_STRENGTH_ = Collator.TERTIARY; + /** + * Internal use for strength checks in Collation elements + */ + static final int CE_STRENGTH_LIMIT_ = Collator.TERTIARY + 1; + /** + * Quaternary collation strength + */ + static final int QUATERNARY_ = 3; + /** + * Identical collation strength + */ + static final int IDENTICAL_ = Collator.IDENTICAL; + /** + * Internal use for strength checks + */ + static final int STRENGTH_LIMIT_ = Collator.IDENTICAL + 1; + /** + * Turn the feature off - works for FRENCH_COLLATION, CASE_LEVEL, + * HIRAGANA_QUATERNARY_MODE and DECOMPOSITION_MODE + */ + static final int OFF_ = 16; + /** + * Turn the feature on - works for FRENCH_COLLATION, CASE_LEVEL, + * HIRAGANA_QUATERNARY_MODE and DECOMPOSITION_MODE + */ + static final int ON_ = 17; + /** + * Valid for ALTERNATE_HANDLING. Alternate handling will be shifted + */ + static final int SHIFTED_ = 20; + /** + * Valid for ALTERNATE_HANDLING. 
Alternate handling will be non + * ignorable + */ + static final int NON_IGNORABLE_ = 21; + /** + * Valid for CASE_FIRST - lower case sorts before upper case + */ + static final int LOWER_FIRST_ = 24; + /** + * Upper case sorts before lower case + */ + static final int UPPER_FIRST_ = 25; + /** + * Valid for NORMALIZATION_MODE ON and OFF are also allowed for this + * attribute + */ + static final int ON_WITHOUT_HANGUL_ = 28; + /** + * Number of attribute values + */ + static final int LIMIT_ = 29; + }; + + /** + * Attributes that collation service understands. All the attributes can + * take DEFAULT value, as well as the values specific to each one. + */ + static interface Attribute + { + /** + * Attribute for direction of secondary weights - used in French. + * Acceptable values are ON, which results in secondary weights being + * considered backwards and OFF which treats secondary weights in the + * order they appear. + */ + static final int FRENCH_COLLATION_ = 0; + /** + * Attribute for handling variable elements. Acceptable values are + * NON_IGNORABLE (default) which treats all the codepoints with + * non-ignorable primary weights in the same way, and SHIFTED which + * causes codepoints with primary weights that are equal or below the + * variable top value to be ignored on primary level and moved to the + * quaternary level. + */ + static final int ALTERNATE_HANDLING_ = 1; + /** + * Controls the ordering of upper and lower case letters. Acceptable + * values are OFF (default), which orders upper and lower case letters + * in accordance to their tertiary weights, UPPER_FIRST which forces + * upper case letters to sort before lower case letters, and + * LOWER_FIRST which does the opposite. + */ + static final int CASE_FIRST_ = 2; + /** + * Controls whether an extra case level (positioned before the third + * level) is generated or not. 
Acceptable values are OFF (default), + * when case level is not generated, and ON which causes the case + * level to be generated. Contents of the case level are affected by + * the value of CASE_FIRST attribute. A simple way to ignore accent + * differences in a string is to set the strength to PRIMARY and + * enable case level. + */ + static final int CASE_LEVEL_ = 3; + /** + * Controls whether the normalization check and necessary + * normalizations are performed. When set to OFF (default) no + * normalization check is performed. The correctness of the result is + * guaranteed only if the input data is in so-called FCD form (see + * users manual for more info). When set to ON, an incremental check + * is performed to see whether the input data is in the FCD form. If + * the data is not in the FCD form, incremental NFD normalization is + * performed. + */ + static final int NORMALIZATION_MODE_ = 4; + /** + * The strength attribute. Can be either PRIMARY, SECONDARY, TERTIARY, + * QUATERNARY or IDENTICAL. The usual strength for most locales + * (except Japanese) is tertiary. Quaternary strength is useful when + * combined with shifted setting for alternate handling attribute and + * for JIS x 4061 collation, when it is used to distinguish between + * Katakana and Hiragana (this is achieved by setting the + * HIRAGANA_QUATERNARY mode to on. Otherwise, quaternary level is + * affected only by the number of non ignorable code points in the + * string. Identical strength is rarely useful, as it amounts to + * codepoints of the NFD form of the string. + */ + static final int STRENGTH_ = 5; + /** + * When turned on, this attribute positions Hiragana before all + * non-ignorables on quaternary level. This is a sneaky way to produce + * JIS sort order. 
+ */ + static final int HIRAGANA_QUATERNARY_MODE_ = 6; + /** + * Attribute count + */ + static final int LIMIT_ = 7; + }; + + /** * DataManipulate singleton */ static class DataManipulate implements Trie.DataManipulate { - // public methods ---------------------------------------------------- - - /** - * Internal method called to parse a lead surrogate's ce for the offset - * to the next trail surrogate data. - * @param ce collation element of the lead surrogate - * @return data offset or 0 for the next trail surrogate - * @draft 2.2 - */ - public final int getFoldingOffset(int ce) - { - if (isSpecial(ce) && getTag(ce) == CE_SURROGATE_TAG_) { - return (ce & 0xFFFFFF); - } - return 0; - } - - /** - * Get singleton object - */ - public static final DataManipulate getInstance() - { - if (m_instance_ == null) { - m_instance_ = new DataManipulate(); - } - return m_instance_; - } - - // private data member ---------------------------------------------- - - /** - * Singleton instance - */ - private static DataManipulate m_instance_; - - // private constructor ---------------------------------------------- - - /** - * private to prevent initialization - */ - private DataManipulate() - { - } + // public methods ---------------------------------------------------- + + /** + * Internal method called to parse a lead surrogate's ce for the offset + * to the next trail surrogate data. 
+ * @param ce collation element of the lead surrogate + * @return data offset or 0 for the next trail surrogate + * @draft 2.2 + */ + public final int getFoldingOffset(int ce) + { + if (isSpecial(ce) && getTag(ce) == CE_SURROGATE_TAG_) { + return (ce & 0xFFFFFF); + } + return 0; + } + + /** + * Get singleton object + */ + public static final DataManipulate getInstance() + { + if (m_instance_ == null) { + m_instance_ = new DataManipulate(); + } + return m_instance_; + } + + // private data member ---------------------------------------------- + + /** + * Singleton instance + */ + private static DataManipulate m_instance_; + + // private constructor ---------------------------------------------- + + /** + * private to prevent initialization + */ + private DataManipulate() + { + } }; // package private data member ------------------------------------------- @@ -1125,81 +1076,81 @@ public final class RuleBasedCollator extends Collator static final int COMMON_TOP_2_ = 0x86; // int for unsigness static final int COMMON_BOTTOM_2_ = BYTE_COMMON_; /** - * Case strength mask - */ - static final int CE_CASE_BIT_MASK_ = 0xC0; - static final int CE_TAG_SHIFT_ = 24; - static final int CE_TAG_MASK_ = 0x0F000000; - - static final int CE_SPECIAL_FLAG_ = 0xF0000000; + * Case strength mask + */ + static final int CE_CASE_BIT_MASK_ = 0xC0; + static final int CE_TAG_SHIFT_ = 24; + static final int CE_TAG_MASK_ = 0x0F000000; + + static final int CE_SPECIAL_FLAG_ = 0xF0000000; /** * Lead surrogate that is tailored and doesn't start a contraction */ static final int CE_SURROGATE_TAG_ = 5; - /** - * Mask to get the primary strength of the collation element - */ - static final int CE_PRIMARY_MASK_ = 0xFFFF0000; - /** - * Mask to get the secondary strength of the collation element - */ - static final int CE_SECONDARY_MASK_ = 0xFF00; - /** - * Mask to get the tertiary strength of the collation element - */ - static final int CE_TERTIARY_MASK_ = 0xFF; - /** - * Primary strength shift - */ - 
static final int CE_PRIMARY_SHIFT_ = 16; - /** - * Secondary strength shift - */ - static final int CE_SECONDARY_SHIFT_ = 8; - /** - * Continuation marker - */ - static final int CE_CONTINUATION_MARKER_ = 0xC0; - - /** - * Size of collator raw data headers and options before the expansion - * data. This is used when expansion ces are to be retrieved. ICU4C uses - * the expansion offset starting from UCollator.UColHeader, hence ICU4J - * will have to minus that off to get the right expansion ce offset. In - * number of ints. - */ - int m_expansionOffset_; - /** - * Size of collator raw data headers, options and expansions before - * contraction data. This is used when contraction ces are to be retrieved. - * ICU4C uses contraction offset starting from UCollator.UColHeader, hence - * ICU4J will have to minus that off to get the right contraction ce - * offset. In number of chars. - */ - int m_contractionOffset_; + /** + * Mask to get the primary strength of the collation element + */ + static final int CE_PRIMARY_MASK_ = 0xFFFF0000; + /** + * Mask to get the secondary strength of the collation element + */ + static final int CE_SECONDARY_MASK_ = 0xFF00; + /** + * Mask to get the tertiary strength of the collation element + */ + static final int CE_TERTIARY_MASK_ = 0xFF; + /** + * Primary strength shift + */ + static final int CE_PRIMARY_SHIFT_ = 16; + /** + * Secondary strength shift + */ + static final int CE_SECONDARY_SHIFT_ = 8; + /** + * Continuation marker + */ + static final int CE_CONTINUATION_MARKER_ = 0xC0; + + /** + * Size of collator raw data headers and options before the expansion + * data. This is used when expansion ces are to be retrieved. ICU4C uses + * the expansion offset starting from UCollator.UColHeader, hence ICU4J + * will have to minus that off to get the right expansion ce offset. In + * number of ints. + */ + int m_expansionOffset_; + /** + * Size of collator raw data headers, options and expansions before + * contraction data. 
This is used when contraction ces are to be retrieved. + * ICU4C uses contraction offset starting from UCollator.UColHeader, hence + * ICU4J will have to minus that off to get the right contraction ce + * offset. In number of chars. + */ + int m_contractionOffset_; /** * Flag indicator if Jamo is special */ boolean m_isJamoSpecial_; - // Collator options ------------------------------------------------------ - int m_defaultVariableTopValue_; - boolean m_defaultIsFrenchCollation_; - boolean m_defaultIsAlternateHandlingShifted_; + // Collator options ------------------------------------------------------ + int m_defaultVariableTopValue_; + boolean m_defaultIsFrenchCollation_; + boolean m_defaultIsAlternateHandlingShifted_; int m_defaultCaseFirst_; boolean m_defaultIsCaseLevel_; int m_defaultDecomposition_; int m_defaultStrength_; boolean m_defaultIsHiragana4_; - /** - * Value of the variable top - */ + /** + * Value of the variable top + */ int m_variableTopValue_; /** * Attribute for special Hiragana */ boolean m_isHiragana4_; - /** + /** * Case sorting customization */ int m_caseFirst_; @@ -1251,20 +1202,20 @@ public final class RuleBasedCollator extends Collator * Table for UCA and builder use */ char m_UCAContraction_[]; - /** - * Original collation rules - */ - String m_rules_; - /** + /** + * Original collation rules + */ + String m_rules_; + /** * The smallest "unsafe" codepoint */ char m_minUnsafe_; /** - * The smallest codepoint that could be the end of a contraction - */ - char m_minContractionEnd_; - - /** + * The smallest codepoint that could be the end of a contraction + */ + char m_minContractionEnd_; + + /** * UnicodeData.txt property object */ static final RuleBasedCollator UCA_; @@ -1273,40 +1224,40 @@ public final class RuleBasedCollator extends Collator static { try - { - UCA_ = new RuleBasedCollator(); - InputStream i = UCA_.getClass().getResourceAsStream( - "/com/ibm/icu/impl/data/ucadata.dat"); - - BufferedInputStream b = new 
BufferedInputStream(i, 90000); - CollatorReader reader = new CollatorReader(b); - reader.read(UCA_); - b.close(); - i.close(); - ResourceBundle rb = - ICULocaleData.getLocaleElements(Locale.ENGLISH); - UCA_.m_rules_ = rb.getString("%%UCARULES"); - UCA_.init(); - } + { + UCA_ = new RuleBasedCollator(); + InputStream i = UCA_.getClass().getResourceAsStream( + "/com/ibm/icu/impl/data/ucadata.dat"); + + BufferedInputStream b = new BufferedInputStream(i, 90000); + CollatorReader reader = new CollatorReader(b); + reader.read(UCA_); + b.close(); + i.close(); + ResourceBundle rb = + ICULocaleData.getLocaleElements(Locale.ENGLISH); + UCA_.m_rules_ = rb.getString("%%UCARULES"); + UCA_.init(); + } catch (Exception e) - { - e.printStackTrace(); - throw new RuntimeException(e.getMessage()); - } + { + e.printStackTrace(); + throw new RuntimeException(e.getMessage()); + } } // package private constructors ------------------------------------------ /** - *
- * If comparison are to be done to the same String multiple times, it would
- * be more efficient to generate CollationKeys for the Strings and use
- * CollationKey.compareTo(CollationKey) for the comparisons.
- * If the each Strings are compared to only once, using the method
- * RuleBasedCollator.compare(String, String) will have a better performance.
- * Private contructor for use by subclasses.
- * Public access to creating Collators is handled by the API
- * Collator.getInstance() or RuleBasedCollator(String rules).
- *
- *
- * This constructor constructs the UCA collator internally
- *
- * @draft 2.2
- */
+ * Private contructor for use by subclasses.
+ * Public access to creating Collators is handled by the API
+ * Collator.getInstance() or RuleBasedCollator(String rules).
+ *
+ *
+ * This constructor constructs the UCA collator internally
+ *
+ * @draft 2.2 + */ RuleBasedCollator() { } @@ -1324,10 +1275,10 @@ public final class RuleBasedCollator extends Collator m_contractionCE_ = UCA_.m_contractionCE_; m_trie_ = UCA_.m_trie_; m_expansionEndCE_ = UCA_.m_expansionEndCE_; - m_expansionEndCEMaxSize_ = UCA_.m_expansionEndCEMaxSize_; - m_unsafe_ = UCA_.m_unsafe_; - m_contractionEnd_ = UCA_.m_contractionEnd_; - m_minUnsafe_ = UCA_.m_minUnsafe_; + m_expansionEndCEMaxSize_ = UCA_.m_expansionEndCEMaxSize_; + m_unsafe_ = UCA_.m_unsafe_; + m_contractionEnd_ = UCA_.m_contractionEnd_; + m_minUnsafe_ = UCA_.m_minUnsafe_; m_minContractionEnd_ = UCA_.m_minContractionEnd_; } @@ -1336,39 +1287,39 @@ public final class RuleBasedCollator extends Collator */ final void setWithUCAData() { - m_addition3_ = UCA_.m_addition3_; - m_bottom3_ = UCA_.m_bottom3_; - m_bottomCount3_ = UCA_.m_bottomCount3_; - m_caseFirst_ = UCA_.m_caseFirst_; - m_caseSwitch_ = UCA_.m_caseSwitch_; - m_common3_ = UCA_.m_common3_; - m_contractionOffset_ = UCA_.m_contractionOffset_; - setDecomposition(UCA_.getDecomposition()); - m_defaultCaseFirst_ = UCA_.m_defaultCaseFirst_; - m_defaultDecomposition_ = UCA_.m_defaultDecomposition_; - m_defaultIsAlternateHandlingShifted_ - = UCA_.m_defaultIsAlternateHandlingShifted_; - m_defaultIsCaseLevel_ = UCA_.m_defaultIsCaseLevel_; - m_defaultIsFrenchCollation_ = UCA_.m_defaultIsFrenchCollation_; - m_defaultIsHiragana4_ = UCA_.m_defaultIsHiragana4_; - m_defaultStrength_ = UCA_.m_defaultStrength_; - m_defaultVariableTopValue_ = UCA_.m_defaultVariableTopValue_; - m_expansionOffset_ = UCA_.m_expansionOffset_; - m_isAlternateHandlingShifted_ = UCA_.m_isAlternateHandlingShifted_; - m_isCaseLevel_ = UCA_.m_isCaseLevel_; - m_isFrenchCollation_ = UCA_.m_isFrenchCollation_; - m_isHiragana4_ = UCA_.m_isHiragana4_; - m_isJamoSpecial_ = UCA_.m_isJamoSpecial_; - m_isSimple3_ = UCA_.m_isSimple3_; - m_mask3_ = UCA_.m_mask3_; - m_minContractionEnd_ = UCA_.m_minContractionEnd_; - m_minUnsafe_ = UCA_.m_minUnsafe_; - m_rules_ = 
UCA_.m_rules_; - setStrength(UCA_.getStrength()); - m_top3_ = UCA_.m_top3_; - m_topCount3_ = UCA_.m_topCount3_; - m_variableTopValue_ = UCA_.m_variableTopValue_; - setWithUCATables(); + m_addition3_ = UCA_.m_addition3_; + m_bottom3_ = UCA_.m_bottom3_; + m_bottomCount3_ = UCA_.m_bottomCount3_; + m_caseFirst_ = UCA_.m_caseFirst_; + m_caseSwitch_ = UCA_.m_caseSwitch_; + m_common3_ = UCA_.m_common3_; + m_contractionOffset_ = UCA_.m_contractionOffset_; + setDecomposition(UCA_.getDecomposition()); + m_defaultCaseFirst_ = UCA_.m_defaultCaseFirst_; + m_defaultDecomposition_ = UCA_.m_defaultDecomposition_; + m_defaultIsAlternateHandlingShifted_ + = UCA_.m_defaultIsAlternateHandlingShifted_; + m_defaultIsCaseLevel_ = UCA_.m_defaultIsCaseLevel_; + m_defaultIsFrenchCollation_ = UCA_.m_defaultIsFrenchCollation_; + m_defaultIsHiragana4_ = UCA_.m_defaultIsHiragana4_; + m_defaultStrength_ = UCA_.m_defaultStrength_; + m_defaultVariableTopValue_ = UCA_.m_defaultVariableTopValue_; + m_expansionOffset_ = UCA_.m_expansionOffset_; + m_isAlternateHandlingShifted_ = UCA_.m_isAlternateHandlingShifted_; + m_isCaseLevel_ = UCA_.m_isCaseLevel_; + m_isFrenchCollation_ = UCA_.m_isFrenchCollation_; + m_isHiragana4_ = UCA_.m_isHiragana4_; + m_isJamoSpecial_ = UCA_.m_isJamoSpecial_; + m_isSimple3_ = UCA_.m_isSimple3_; + m_mask3_ = UCA_.m_mask3_; + m_minContractionEnd_ = UCA_.m_minContractionEnd_; + m_minUnsafe_ = UCA_.m_minUnsafe_; + m_rules_ = UCA_.m_rules_; + setStrength(UCA_.getStrength()); + m_top3_ = UCA_.m_top3_; + m_topCount3_ = UCA_.m_topCount3_; + m_variableTopValue_ = UCA_.m_variableTopValue_; + setWithUCATables(); } /** @@ -1381,67 +1332,67 @@ public final class RuleBasedCollator extends Collator * @param ch character to determin * @return true if ch is unsafe, false otherwise */ - final boolean isUnsafe(char ch) - { - if (ch < m_minUnsafe_) { - return false; - } - - if (ch >= (HEURISTIC_SIZE_ << HEURISTIC_SHIFT_)) { - if (UTF16.isTrailSurrogate(ch)) { - // Trail surrogate are always 
considered unsafe. - return true; - } - ch &= HEURISTIC_OVERFLOW_MASK_; - ch += HEURISTIC_OVERFLOW_OFFSET_; - } - int value = m_unsafe_[ch >> HEURISTIC_SHIFT_]; - return ((value >> (ch & HEURISTIC_MASK_)) & 1) != 0; - } - - /** - * Approximate determination if a char character is at a contraction end. - * Guaranteed to be true if a character is at the end of a contraction, - * otherwise it is not deterministic. - * @param ch character to be determined - */ - final boolean isContractionEnd(char ch) - { - if (UTF16.isTrailSurrogate(ch)) { - return true; - } + final boolean isUnsafe(char ch) + { + if (ch < m_minUnsafe_) { + return false; + } + + if (ch >= (HEURISTIC_SIZE_ << HEURISTIC_SHIFT_)) { + if (UTF16.isTrailSurrogate(ch)) { + // Trail surrogate are always considered unsafe. + return true; + } + ch &= HEURISTIC_OVERFLOW_MASK_; + ch += HEURISTIC_OVERFLOW_OFFSET_; + } + int value = m_unsafe_[ch >> HEURISTIC_SHIFT_]; + return ((value >> (ch & HEURISTIC_MASK_)) & 1) != 0; + } + + /** + * Approximate determination if a char character is at a contraction end. + * Guaranteed to be true if a character is at the end of a contraction, + * otherwise it is not deterministic. 
+ * @param ch character to be determined + */ + final boolean isContractionEnd(char ch) + { + if (UTF16.isTrailSurrogate(ch)) { + return true; + } - if (ch < m_minContractionEnd_) { - return false; - } + if (ch < m_minContractionEnd_) { + return false; + } - if (ch >= (HEURISTIC_SIZE_ << HEURISTIC_SHIFT_)) { - ch &= HEURISTIC_OVERFLOW_MASK_; - ch += HEURISTIC_OVERFLOW_OFFSET_; - } - int value = m_contractionEnd_[ch >> HEURISTIC_SHIFT_]; - return ((value >> (ch & HEURISTIC_MASK_)) & 1) != 0; - } - - /** - * Retrieve the tag of a special ce - * @param ce ce to test - * @return tag of ce - */ - static int getTag(int ce) - { - return (ce & CE_TAG_MASK_) >> CE_TAG_SHIFT_; - } + if (ch >= (HEURISTIC_SIZE_ << HEURISTIC_SHIFT_)) { + ch &= HEURISTIC_OVERFLOW_MASK_; + ch += HEURISTIC_OVERFLOW_OFFSET_; + } + int value = m_contractionEnd_[ch >> HEURISTIC_SHIFT_]; + return ((value >> (ch & HEURISTIC_MASK_)) & 1) != 0; + } + + /** + * Retrieve the tag of a special ce + * @param ce ce to test + * @return tag of ce + */ + static int getTag(int ce) + { + return (ce & CE_TAG_MASK_) >> CE_TAG_SHIFT_; + } /** - * Checking if ce is special - * @param ce to check - * @return true if ce is special - */ - static boolean isSpecial(int ce) - { - return (ce & CE_SPECIAL_FLAG_) == CE_SPECIAL_FLAG_; - } + * Checking if ce is special + * @param ce to check + * @return true if ce is special + */ + static boolean isSpecial(int ce) + { + return (ce & CE_SPECIAL_FLAG_) == CE_SPECIAL_FLAG_; + } /** * Checks if the argument ce is a continuation @@ -1451,7 +1402,7 @@ public final class RuleBasedCollator extends Collator static final boolean isContinuation(int ce) { return ce != CollationElementIterator.NULLORDER - && (ce & CE_CONTINUATION_TAG_) == CE_CONTINUATION_TAG_; + && (ce & CE_CONTINUATION_TAG_) == CE_CONTINUATION_TAG_; } // protected constructor ------------------------------------------------- @@ -1465,31 +1416,31 @@ public final class RuleBasedCollator extends Collator */ 
 RuleBasedCollator(Locale locale) throws Exception
 {
- ResourceBundle rb = ICULocaleData.getLocaleElements(locale);
-
- if (rb != null) {
- byte map[] = (byte [])rb.getObject("%%CollationBin");
- BufferedInputStream input =
- new BufferedInputStream(new ByteArrayInputStream(map));
- CollatorReader reader = new CollatorReader(input, false);
- if (map.length > MIN_BINARY_DATA_SIZE_) {
- reader.read(this);
- }
- else {
- reader.readHeader(this);
- reader.readOptions(this);
- // duplicating UCA_'s data
- setWithUCATables();
- }
- Object rules = rb.getObject("CollationElements");
- if (rules != null) {
- m_rules_ = (String)((Object[][])rules)[0][1];
- }
- init();
- }
- else {
- setWithUCAData();
- }
+ ResourceBundle rb = ICULocaleData.getLocaleElements(locale);
+
+ if (rb != null) {
+ byte map[] = (byte [])rb.getObject("%%CollationBin");
+ BufferedInputStream input =
+ new BufferedInputStream(new ByteArrayInputStream(map));
+ CollatorReader reader = new CollatorReader(input, false);
+ if (map.length > MIN_BINARY_DATA_SIZE_) {
+ reader.read(this);
+ }
+ else {
+ reader.readHeader(this);
+ reader.readOptions(this);
+ // duplicating UCA_'s data
+ setWithUCATables();
+ }
+ Object rules = rb.getObject("CollationElements");
+ if (rules != null) {
+ m_rules_ = (String)((Object[][])rules)[0][1];
+ }
+ init();
+ }
+ else {
+ setWithUCAData();
+ }
 }
 // private inner classes ------------------------------------------------
@@ -1507,27 +1458,27 @@ public final class RuleBasedCollator extends Collator
 * latin 1 char, and some power of two for hashing the rest of the chars.
 * Size in bytes.
 */
- private static final char HEURISTIC_SIZE_ = 1056;
+ private static final char HEURISTIC_SIZE_ = 1056;
 /**
 * Mask value down to "some power of two" - 1,
 * number of bits, not num of bytes.
 */
- private static final char HEURISTIC_OVERFLOW_MASK_ = 0x1fff;
- /**
- * Unsafe character shift
- */
- private static final int HEURISTIC_SHIFT_ = 3;
- /**
- * Unsafe character addition for character too large, it has to be folded
- * then incremented.
- */
- private static final char HEURISTIC_OVERFLOW_OFFSET_ = 256;
- /**
+ private static final char HEURISTIC_OVERFLOW_MASK_ = 0x1fff;
+ /**
+ * Unsafe character shift
+ */
+ private static final int HEURISTIC_SHIFT_ = 3;
+ /**
+ * Unsafe character addition for character too large, it has to be folded
+ * then incremented.
+ */
+ private static final char HEURISTIC_OVERFLOW_OFFSET_ = 256;
+ /**
 * Mask value to get offset in heuristic table.
 */
- private static final char HEURISTIC_MASK_ = 7;
-
- private byte m_caseSwitch_;
+ private static final char HEURISTIC_MASK_ = 7;
+
+ private byte m_caseSwitch_;
 private int m_common3_;
 private byte m_mask3_;
 /**
@@ -1543,34 +1494,34 @@ public final class RuleBasedCollator extends Collator
 */
 private int m_bottom3_;
 private int m_topCount3_;
- private int m_bottomCount3_;
- /**
- * Case first constants
- */
- private static final int CASE_SWITCH_ = 0xC0;
- private static final int NO_CASE_SWITCH_ = 0;
- /**
- * Case level constants
- */
- private static final int CE_REMOVE_CASE_ = 0x3F;
- private static final int CE_KEEP_CASE_ = 0xFF;
- /**
- * Case strength mask
- */
- private static final int CE_CASE_MASK_3_ = 0xFF;
- /**
- * Sortkey size factor. Values can be changed.
- */
- private static final double PROPORTION_2_ = 0.5;
- private static final double PROPORTION_3_ = 0.667;
+ private int m_bottomCount3_;
+ /**
+ * Case first constants
+ */
+ private static final int CASE_SWITCH_ = 0xC0;
+ private static final int NO_CASE_SWITCH_ = 0;
+ /**
+ * Case level constants
+ */
+ private static final int CE_REMOVE_CASE_ = 0x3F;
+ private static final int CE_KEEP_CASE_ = 0xFF;
+ /**
+ * Case strength mask
+ */
+ private static final int CE_CASE_MASK_3_ = 0xFF;
+ /**
+ * Sortkey size factor. Values can be changed.
+ */
+ private static final double PROPORTION_2_ = 0.5;
+ private static final double PROPORTION_3_ = 0.667;
- // These values come from the UCA ----------------------------------------
-
- /**
- * This is an enum that lists magic special byte values from the
- * fractional UCA
- */
- private static final byte BYTE_ZERO_ = 0x0;
+ // These values come from the UCA ----------------------------------------
+
+ /**
+ * This is an enum that lists magic special byte values from the
+ * fractional UCA
+ */
+ private static final byte BYTE_ZERO_ = 0x0;
 private static final byte BYTE_LEVEL_SEPARATOR_ = (byte)0x01;
 private static final byte BYTE_SORTKEY_GLUE_ = (byte)0x02;
 private static final byte BYTE_SHIFT_PREFIX_ = (byte)0x03;
@@ -1579,35 +1530,35 @@ public final class RuleBasedCollator extends Collator
 private static final byte BYTE_LAST_LATIN_PRIMARY_ = (byte)0x4C;
 private static final byte BYTE_FIRST_NON_LATIN_PRIMARY_ = (byte)0x4D;
 private static final byte BYTE_UNSHIFTED_MAX_ = (byte)0xFF;
- private static final int TOTAL_2_ = COMMON_TOP_2_ - COMMON_BOTTOM_2_ - 1;
- private static final int FLAG_BIT_MASK_CASE_SWITCH_OFF_ = 0x80;
- private static final int FLAG_BIT_MASK_CASE_SWITCH_ON_ = 0x40;
- private static final int COMMON_TOP_CASE_SWITCH_OFF_3_ = 0x85;
- private static final int COMMON_TOP_CASE_SWITCH_LOWER_3_ = 0x45;
- private static final int COMMON_TOP_CASE_SWITCH_UPPER_3_ = 0xC5;
- private static final int COMMON_BOTTOM_3_ = 0x05;
- private static final int COMMON_BOTTOM_CASE_SWITCH_UPPER_3_ = 0x86;
- private static final int COMMON_BOTTOM_CASE_SWITCH_LOWER_3_ =
- COMMON_BOTTOM_3_;
- private static final int TOP_COUNT_2_ = (int)(PROPORTION_2_ * TOTAL_2_);
- private static final int BOTTOM_COUNT_2_ = TOTAL_2_ - TOP_COUNT_2_;
- private static final int COMMON_2_ = COMMON_BOTTOM_2_;
- private static final int COMMON_UPPER_FIRST_3_ = 0xC5;
- private static final int COMMON_NORMAL_3_ = COMMON_BOTTOM_3_;
- private static final int COMMON_4_ = (byte)0xFF;
-
- /**
- * Minimum size required for the binary collation data in bytes.
- * Size of UCA header + size of options to 4 bytes
- */
- private static final int MIN_BINARY_DATA_SIZE_ = (41 + 8) << 2;
-
- /**
- * If this collator is to generate only simple tertiaries for fast path
- */
- private boolean m_isSimple3_;
-
- /**
+ private static final int TOTAL_2_ = COMMON_TOP_2_ - COMMON_BOTTOM_2_ - 1;
+ private static final int FLAG_BIT_MASK_CASE_SWITCH_OFF_ = 0x80;
+ private static final int FLAG_BIT_MASK_CASE_SWITCH_ON_ = 0x40;
+ private static final int COMMON_TOP_CASE_SWITCH_OFF_3_ = 0x85;
+ private static final int COMMON_TOP_CASE_SWITCH_LOWER_3_ = 0x45;
+ private static final int COMMON_TOP_CASE_SWITCH_UPPER_3_ = 0xC5;
+ private static final int COMMON_BOTTOM_3_ = 0x05;
+ private static final int COMMON_BOTTOM_CASE_SWITCH_UPPER_3_ = 0x86;
+ private static final int COMMON_BOTTOM_CASE_SWITCH_LOWER_3_ =
+ COMMON_BOTTOM_3_;
+ private static final int TOP_COUNT_2_ = (int)(PROPORTION_2_ * TOTAL_2_);
+ private static final int BOTTOM_COUNT_2_ = TOTAL_2_ - TOP_COUNT_2_;
+ private static final int COMMON_2_ = COMMON_BOTTOM_2_;
+ private static final int COMMON_UPPER_FIRST_3_ = 0xC5;
+ private static final int COMMON_NORMAL_3_ = COMMON_BOTTOM_3_;
+ private static final int COMMON_4_ = (byte)0xFF;
+
+ /**
+ * Minimum size required for the binary collation data in bytes.
+ * Size of UCA header + size of options to 4 bytes
+ */
+ private static final int MIN_BINARY_DATA_SIZE_ = (41 + 8) << 2;
+
+ /**
+ * If this collator is to generate only simple tertiaries for fast path
+ */
+ private boolean m_isSimple3_;
+
+ /**
 * French collation sorting flag
 */
 private boolean m_isFrenchCollation_;
@@ -1621,33 +1572,33 @@ public final class RuleBasedCollator extends Collator
 * Extra case level for sorting
 */
 private boolean m_isCaseLevel_;
-
- private static final int SORT_BUFFER_INIT_SIZE_ = 128;
- private static final int SORT_BUFFER_INIT_SIZE_1_ =
- SORT_BUFFER_INIT_SIZE_ << 3;
- private static final int SORT_BUFFER_INIT_SIZE_2_ = SORT_BUFFER_INIT_SIZE_;
- private static final int SORT_BUFFER_INIT_SIZE_3_ = SORT_BUFFER_INIT_SIZE_;
- private static final int SORT_BUFFER_INIT_SIZE_CASE_ =
- SORT_BUFFER_INIT_SIZE_ >> 2;
- private static final int SORT_BUFFER_INIT_SIZE_4_ = SORT_BUFFER_INIT_SIZE_;
+
+ private static final int SORT_BUFFER_INIT_SIZE_ = 128;
+ private static final int SORT_BUFFER_INIT_SIZE_1_ =
+ SORT_BUFFER_INIT_SIZE_ << 3;
+ private static final int SORT_BUFFER_INIT_SIZE_2_ = SORT_BUFFER_INIT_SIZE_;
+ private static final int SORT_BUFFER_INIT_SIZE_3_ = SORT_BUFFER_INIT_SIZE_;
+ private static final int SORT_BUFFER_INIT_SIZE_CASE_ =
+ SORT_BUFFER_INIT_SIZE_ >> 2;
+ private static final int SORT_BUFFER_INIT_SIZE_4_ = SORT_BUFFER_INIT_SIZE_;
 private static final int CE_CONTINUATION_TAG_ = 0xC0;
- private static final int CE_REMOVE_CONTINUATION_MASK_ = 0xFFFFFF3F;
+ private static final int CE_REMOVE_CONTINUATION_MASK_ = 0xFFFFFF3F;
- private static final int LAST_BYTE_MASK_ = 0xFF;
-
- private static final int CE_RESET_TOP_VALUE_ = 0x9F000303;
- private static final int CE_NEXT_TOP_VALUE_ = 0xE8960303;
+ private static final int LAST_BYTE_MASK_ = 0xFF;
+
+ private static final int CE_RESET_TOP_VALUE_ = 0x9F000303;
+ private static final int CE_NEXT_TOP_VALUE_ = 0xE8960303;
- private static final byte SORT_CASE_BYTE_START_ = (byte)0x80;
- private static final byte SORT_CASE_SHIFT_START_ = (byte)7;
-
- private static final byte SORT_LEVEL_TERMINATOR_ = 1;
-
- /**
- * CE buffer size
- */
- private static final int CE_BUFFER_SIZE_ = 512;
+ private static final byte SORT_CASE_BYTE_START_ = (byte)0x80;
+ private static final byte SORT_CASE_SHIFT_START_ = (byte)7;
+
+ private static final byte SORT_LEVEL_TERMINATOR_ = 1;
+
+ /**
+ * CE buffer size
+ */
+ private static final int CE_BUFFER_SIZE_ = 512;
 // private methods -------------------------------------------------------
@@ -1658,7 +1609,7 @@ public final class RuleBasedCollator extends Collator
 * @param bytescount array of the size of each strength byte arrays
 * @param count array of counters for each of the strength
 * @param notIsContinuation flag indicating if the current bytes belong to
- * a continuation ce
+ * a continuation ce
 * @param doShift flag indicating if ce is to be shifted
 * @param leadPrimary lead primary used for compression
 * @param commonBottom4 common byte value for Quaternary
@@ -1666,82 +1617,82 @@ public final class RuleBasedCollator extends Collator
 * @return the new lead primary for compression
 */
 private final int doPrimaryBytes(int ce, byte bytes[][], int bytescount[],
- int count[], boolean notIsContinuation,
- boolean doShift, int leadPrimary,
- int commonBottom4, int bottomCount4)
+ int count[], boolean notIsContinuation,
+ boolean doShift, int leadPrimary,
+ int commonBottom4, int bottomCount4)
 {
-
+
 int p2 = (ce >>= 16) & LAST_BYTE_MASK_; // in ints for unsigned
 int p1 = ce >>> 8; // comparison
- if (doShift) {
- if (count[4] > 0) {
- while (count[4] > bottomCount4) {
- append(bytes, bytescount, 4,
- (byte)(commonBottom4 + bottomCount4));
- count[4] -= bottomCount4;
- }
- append(bytes, bytescount, 4,
- (byte)(commonBottom4 + (count[4] - 1)));
- count[4] = 0;
- }
- // dealing with a variable and we're treating them as shifted
+ if (doShift) {
+ if (count[4] > 0) {
+ while (count[4] > bottomCount4) {
+ append(bytes, bytescount, 4,
+ (byte)(commonBottom4 + bottomCount4));
+ count[4] -= bottomCount4;
+ }
+ append(bytes, bytescount, 4,
+ (byte)(commonBottom4 + (count[4] - 1)));
+ count[4] = 0;
+ }
+ // dealing with a variable and we're treating them as shifted
 // This is a shifted ignorable
 if (p1 != 0) {
- // we need to check this since we could be in continuation
- append(bytes, bytescount, 4, (byte)p1);
+ // we need to check this since we could be in continuation
+ append(bytes, bytescount, 4, (byte)p1);
 }
 if (p2 != 0) {
- append(bytes, bytescount, 4, (byte)p2);
+ append(bytes, bytescount, 4, (byte)p2);
 }
 }
 else {
- // Note: This code assumes that the table is well built
- // i.e. not having 0 bytes where they are not supposed to be.
- // Usually, we'll have non-zero primary1 & primary2, except
- // in cases of LatinOne and friends, when primary2 will be
- // regular and simple sortkey calc
- if (p1 != CollationElementIterator.IGNORABLE) {
- if (notIsContinuation) {
- if (leadPrimary == p1) {
- append(bytes, bytescount, 1, (byte)p2);
- }
- else {
- if (leadPrimary != 0) {
- append(bytes, bytescount, 1,
- (byte)((p1 > leadPrimary)
- ? BYTE_UNSHIFTED_MAX_
- : BYTE_UNSHIFTED_MIN_));
- }
- if (p2 == CollationElementIterator.IGNORABLE) {
- // one byter, not compressed
- append(bytes, bytescount, 1, (byte)p1);
- leadPrimary = 0;
- }
- else if (p1 < BYTE_FIRST_NON_LATIN_PRIMARY_
- || (p1 > ((CE_RESET_TOP_VALUE_ >> 24) & 0xFF)
- && p1 < ((CE_NEXT_TOP_VALUE_ >> 24) & 0xFF))) {
- // not compressible
- leadPrimary = 0;
- append(bytes, bytescount, 1, (byte)p1);
- append(bytes, bytescount, 1, (byte)p2);
- }
- else { // compress
- leadPrimary = p1;
- append(bytes, bytescount, 1, (byte)p1);
- append(bytes, bytescount, 1, (byte)p2);
- }
- }
+ // Note: This code assumes that the table is well built
+ // i.e. not having 0 bytes where they are not supposed to be.
+ // Usually, we'll have non-zero primary1 & primary2, except
+ // in cases of LatinOne and friends, when primary2 will be
+ // regular and simple sortkey calc
+ if (p1 != CollationElementIterator.IGNORABLE) {
+ if (notIsContinuation) {
+ if (leadPrimary == p1) {
+ append(bytes, bytescount, 1, (byte)p2);
+ }
+ else {
+ if (leadPrimary != 0) {
+ append(bytes, bytescount, 1,
+ (byte)((p1 > leadPrimary)
+ ? BYTE_UNSHIFTED_MAX_
+ : BYTE_UNSHIFTED_MIN_));
+ }
+ if (p2 == CollationElementIterator.IGNORABLE) {
+ // one byter, not compressed
+ append(bytes, bytescount, 1, (byte)p1);
+ leadPrimary = 0;
+ }
+ else if (p1 < BYTE_FIRST_NON_LATIN_PRIMARY_
+ || (p1 > ((CE_RESET_TOP_VALUE_ >> 24) & 0xFF)
+ && p1 < ((CE_NEXT_TOP_VALUE_ >> 24) & 0xFF))) {
+ // not compressible
+ leadPrimary = 0;
+ append(bytes, bytescount, 1, (byte)p1);
+ append(bytes, bytescount, 1, (byte)p2);
+ }
+ else { // compress
+ leadPrimary = p1;
+ append(bytes, bytescount, 1, (byte)p1);
+ append(bytes, bytescount, 1, (byte)p2);
+ }
+ }
 }
 else {
- // continuation, add primary to the key, no compression
- append(bytes, bytescount, 1, (byte)p1);
- if (p2 != CollationElementIterator.IGNORABLE) {
- append(bytes, bytescount, 1, (byte)p2); // second part
- }
+ // continuation, add primary to the key, no compression
+ append(bytes, bytescount, 1, (byte)p1);
+ if (p2 != CollationElementIterator.IGNORABLE) {
+ append(bytes, bytescount, 1, (byte)p2); // second part
+ }
 }
 }
- }
- return leadPrimary;
+ }
+ return leadPrimary;
 }
 /**
@@ -1751,70 +1702,70 @@ public final class RuleBasedCollator extends Collator
 * @param bytescount array of the size of each strength byte arrays
 * @param count array of counters for each of the strength
 * @param notIsContinuation flag indicating if the current bytes belong to
- * a continuation ce
+ * a continuation ce
 * @param doFrench flag indicator if french sort is to be performed
 * @param frenchOffset start and end offsets to source string for reversing
 */
 private final void doSecondaryBytes(int ce, byte bytes[][],
- int bytescount[], int count[],
- boolean notIsContinuation,
- boolean doFrench, int frenchOffset[])
+ int bytescount[], int count[],
+ boolean notIsContinuation,
+ boolean doFrench, int frenchOffset[])
 {
- int s = (ce >>= 8) & LAST_BYTE_MASK_; // int for comparison
- if (s != 0) {
- if (!doFrench) {
+ int s = (ce >>= 8) & LAST_BYTE_MASK_; // int for comparison
+ if (s != 0) {
+ if (!doFrench) {
 // This is compression code.
 if (s == COMMON_2_ && notIsContinuation) {
- count[2] ++;
+ count[2] ++;
 }
 else {
- if (count[2] > 0) {
- if (s > COMMON_2_) { // not necessary for 4th level.
- while (count[2] > TOP_COUNT_2_) {
- append(bytes, bytescount, 2,
- (byte)(COMMON_TOP_2_ - TOP_COUNT_2_));
- count[2] -= TOP_COUNT_2_;
- }
- append(bytes, bytescount, 2,
- (byte)(COMMON_TOP_2_ - (count[2] - 1)));
- }
- else {
- while (count[2] > BOTTOM_COUNT_2_) {
- append(bytes, bytescount, 2,
- (byte)(COMMON_BOTTOM_2_ + BOTTOM_COUNT_2_));
- count[2] -= BOTTOM_COUNT_2_;
- }
- append(bytes, bytescount, 2,
- (byte)(COMMON_BOTTOM_2_ + (count[2] - 1)));
- }
- count[2] = 0;
- }
- append(bytes, bytescount, 2, (byte)s);
+ if (count[2] > 0) {
+ if (s > COMMON_2_) { // not necessary for 4th level.
+ while (count[2] > TOP_COUNT_2_) {
+ append(bytes, bytescount, 2,
+ (byte)(COMMON_TOP_2_ - TOP_COUNT_2_));
+ count[2] -= TOP_COUNT_2_;
+ }
+ append(bytes, bytescount, 2,
+ (byte)(COMMON_TOP_2_ - (count[2] - 1)));
+ }
+ else {
+ while (count[2] > BOTTOM_COUNT_2_) {
+ append(bytes, bytescount, 2,
+ (byte)(COMMON_BOTTOM_2_ + BOTTOM_COUNT_2_));
+ count[2] -= BOTTOM_COUNT_2_;
+ }
+ append(bytes, bytescount, 2,
+ (byte)(COMMON_BOTTOM_2_ + (count[2] - 1)));
+ }
+ count[2] = 0;
+ }
+ append(bytes, bytescount, 2, (byte)s);
 }
 }
 else {
- append(bytes, bytescount, 2, (byte)s);
- // Do the special handling for French secondaries
- // We need to get continuation elements and do intermediate
- // restore
- // abc1c2c3de with french secondaries need to be edc1c2c3ba
- // NOT edc3c2c1ba
- if (notIsContinuation) {
- if (frenchOffset[0] != -1) {
- // reverse secondaries from frenchStartPtr up to
- // frenchEndPtr
- reverseBuffer(bytes[2], frenchOffset);
- frenchOffset[0] = -1;
- }
- }
- else {
- if (frenchOffset[0] == -1) {
- frenchOffset[0] = bytescount[2] - 2;
- }
- frenchOffset[1] = bytescount[2] - 1;
- }
- }
- }
+ append(bytes, bytescount, 2, (byte)s);
+ // Do the special handling for French secondaries
+ // We need to get continuation elements and do intermediate
+ // restore
+ // abc1c2c3de with french secondaries need to be edc1c2c3ba
+ // NOT edc3c2c1ba
+ if (notIsContinuation) {
+ if (frenchOffset[0] != -1) {
+ // reverse secondaries from frenchStartPtr up to
+ // frenchEndPtr
+ reverseBuffer(bytes[2], frenchOffset);
+ frenchOffset[0] = -1;
+ }
+ }
+ else {
+ if (frenchOffset[0] == -1) {
+ frenchOffset[0] = bytescount[2] - 2;
+ }
+ frenchOffset[1] = bytescount[2] - 1;
+ }
+ }
+ }
 }
 /**
@@ -1824,59 +1775,59 @@ public final class RuleBasedCollator extends Collator
 */
 private void reverseBuffer(byte buffer[], int offset[])
 {
- int start = offset[0];
- int end = offset[1];
- while (start < end) {
- byte b = buffer[start];
- buffer[start ++] = buffer[end];
- buffer[end --] = b;
- }
- }
+ int start = offset[0];
+ int end = offset[1];
+ while (start < end) {
+ byte b = buffer[start];
+ buffer[start ++] = buffer[end];
+ buffer[end --] = b;
+ }
+ }
- /**
- * Insert the case shifting byte if required
- * @param bytes array of byte arrays corresponding to each strength
- * @param bytescount array of the size of the byte arrays
- * @param caseshift value
- * @return new caseshift value
- */
- private static final int doCaseShift(byte bytes[][], int bytescount[],
- int caseshift)
- {
- if (caseshift == 0) {
- append(bytes, bytescount, 0, SORT_CASE_BYTE_START_);
- caseshift = SORT_CASE_SHIFT_START_;
- }
- return caseshift;
- }
+ /**
+ * Insert the case shifting byte if required
+ * @param bytes array of byte arrays corresponding to each strength
+ * @param bytescount array of the size of the byte arrays
+ * @param caseshift value
+ * @return new caseshift value
+ */
+ private static final int doCaseShift(byte bytes[][], int bytescount[],
+ int caseshift)
+ {
+ if (caseshift == 0) {
+ append(bytes, bytescount, 0, SORT_CASE_BYTE_START_);
+ caseshift = SORT_CASE_SHIFT_START_;
+ }
+ return caseshift;
+ }
- /**
- * Performs the casing sort
- * @param tertiary byte in ints for easy comparison
- * @param bytes of byte arrays for each strength
+ /**
+ * Performs the casing sort
+ * @param tertiary byte in ints for easy comparison
+ * @param bytes of byte arrays for each strength
 * @param bytescount array of the size of each strength byte arrays
 * @param notIsContinuation flag indicating if the current bytes belong to
- * a continuation ce
- * @param caseshift
- * @return the new value of case shift
- */
- private final int doCaseBytes(int tertiary, byte bytes[][],
- int bytescount[], boolean notIsContinuation,
- int caseshift)
- {
- caseshift = doCaseShift(bytes, bytescount, caseshift);
-
+ * a continuation ce
+ * @param caseshift
+ * @return the new value of case shift
+ */
+ private final int doCaseBytes(int tertiary, byte bytes[][],
+ int bytescount[], boolean notIsContinuation,
+ int caseshift)
+ {
+ caseshift = doCaseShift(bytes, bytescount, caseshift);
+
 if (notIsContinuation && tertiary != 0) {
- byte casebits = (byte)(tertiary & 0xC0);
+ byte casebits = (byte)(tertiary & 0xC0);
 if (m_caseFirst_ == AttributeValue.UPPER_FIRST_) {
 if (casebits == 0) {
 bytes[0][bytescount[0] - 1] |= (1 << (-- caseshift));
 }
 else {
- // second bit
- caseshift = doCaseShift(bytes, bytescount, caseshift);
- bytes[0][bytescount[0] - 1] |= ((casebits >> 6) & 1)
- << (-- caseshift);
+ // second bit
+ caseshift = doCaseShift(bytes, bytescount, caseshift);
+ bytes[0][bytescount[0] - 1] |= ((casebits >> 6) & 1)
+ << (-- caseshift);
 }
 }
 else {
@@ -1885,93 +1836,93 @@ public final class RuleBasedCollator extends Collator
 // second bit
 caseshift = doCaseShift(bytes, bytescount, caseshift);
 bytes[0][bytescount[0] - 1] |= ((casebits >> 7) & 1)
- << (-- caseshift);
+ << (-- caseshift);
 }
 }
 }
- return caseshift;
- }
-
- /**
- * Gets the tertiary byte and adds it to the tertiary byte array
+ return caseshift;
+ }
+
+ /**
+ * Gets the tertiary byte and adds it to the tertiary byte array
 * @param tertiary byte in int for easy comparison
 * @param bytes array of byte arrays for each strength
 * @param bytescount array of the size of each strength byte arrays
 * @param count array of counters for each of the strength
 * @param notIsContinuation flag indicating if the current bytes belong to
- * a continuation ce
- */
- private final void doTertiaryBytes(int tertiary, byte bytes[][],
- int bytescount[], int count[],
- boolean notIsContinuation)
- {
- if (tertiary != 0) {
- // This is compression code.
+ * a continuation ce
+ */
+ private final void doTertiaryBytes(int tertiary, byte bytes[][],
+ int bytescount[], int count[],
+ boolean notIsContinuation)
+ {
+ if (tertiary != 0) {
+ // This is compression code.
 // sequence size check is included in the if clause
 if (tertiary == m_common3_ && notIsContinuation) {
- count[3] ++;
+ count[3] ++;
 }
 else {
- int common3 = m_common3_ & LAST_BYTE_MASK_;
+ int common3 = m_common3_ & LAST_BYTE_MASK_;
 if ((tertiary > common3
- && m_common3_ == COMMON_NORMAL_3_)
+ && m_common3_ == COMMON_NORMAL_3_)
 || (tertiary <= common3
- && m_common3_ == COMMON_UPPER_FIRST_3_)) {
+ && m_common3_ == COMMON_UPPER_FIRST_3_)) {
 tertiary += m_addition3_;
 }
 if (count[3] > 0) {
- if (tertiary > common3) {
- while (count[3] > m_topCount3_) {
- append(bytes, bytescount, 3,
- (byte)(m_top3_ - m_topCount3_));
- count[3] -= m_topCount3_;
- }
- append(bytes, bytescount, 3,
- (byte)(m_top3_ - (count[3] - 1)));
- }
- else {
- while (count[3] > m_bottomCount3_) {
- append(bytes, bytescount, 3,
- (byte)(m_bottom3_ + m_bottomCount3_));
- count[3] -= m_bottomCount3_;
- }
- append(bytes, bytescount, 3,
- (byte)(m_bottom3_ + (count[3] - 1)));
+ if (tertiary > common3) {
+ while (count[3] > m_topCount3_) {
+ append(bytes, bytescount, 3,
+ (byte)(m_top3_ - m_topCount3_));
+ count[3] -= m_topCount3_;
+ }
+ append(bytes, bytescount, 3,
+ (byte)(m_top3_ - (count[3] - 1)));
+ }
+ else {
+ while (count[3] > m_bottomCount3_) {
+ append(bytes, bytescount, 3,
+ (byte)(m_bottom3_ + m_bottomCount3_));
+ count[3] -= m_bottomCount3_;
+ }
+ append(bytes, bytescount, 3,
+ (byte)(m_bottom3_ + (count[3] - 1)));
 }
 count[3] = 0;
 }
 append(bytes, bytescount, 3, (byte)tertiary);
 }
 }
- }
-
- /**
- * Gets the Quaternary byte and adds it to the Quaternary byte array
+ }
+
+ /**
+ * Gets the Quaternary byte and adds it to the Quaternary byte array
 * @param bytes array of byte arrays for each strength
 * @param bytescount array of the size of each strength byte arrays
 * @param count array of counters for each of the strength
 * @param isCodePointHiragana flag indicator if the previous codepoint
- * we dealt with was Hiragana
+ * we dealt with was Hiragana
 * @param commonBottom4 smallest common Quaternary byte
 * @param bottomCount4 smallest Quaternary byte
 * @param hiragana4 hiragana Quaternary byte
- */
- private final void doQuaternaryBytes(byte bytes[][], int bytescount[],
- int count[],
- boolean isCodePointHiragana,
- int commonBottom4, int bottomCount4,
- byte hiragana4)
- {
- if (isCodePointHiragana) { // This was Hiragana, need to note it
- if (count[4] > 0) { // Close this part
- while (count[4] > bottomCount4) {
+ */
+ private final void doQuaternaryBytes(byte bytes[][], int bytescount[],
+ int count[],
+ boolean isCodePointHiragana,
+ int commonBottom4, int bottomCount4,
+ byte hiragana4)
+ {
+ if (isCodePointHiragana) { // This was Hiragana, need to note it
+ if (count[4] > 0) { // Close this part
+ while (count[4] > bottomCount4) {
 append(bytes, bytescount, 4, (byte)(commonBottom4
- + bottomCount4));
+ + bottomCount4));
 count[4] -= bottomCount4;
 }
 append(bytes, bytescount, 4, (byte)(commonBottom4
- + (count[4] - 1)));
+ + (count[4] - 1)));
 count[4] = 0;
 }
 append(bytes, bytescount, 4, hiragana4); // Add the Hiragana
@@ -1979,1211 +1930,1211 @@ public final class RuleBasedCollator extends Collator
 else { // This wasn't Hiragana, so we can continue adding stuff
 count[4] ++;
 }
- }
-
- /**
- * Iterates through the argument string for all ces.
- * Split the ces into their relevant primaries, secondaries etc.
- * @param source normalized string
- * @param compare array of flags indicating if a particular strength is
- * to be processed
- * @param bytes an array of byte arrays corresponding to the strengths
- * @param bytescount an array of the size of the byte arrays
- * @param count array of compression counters for each strength
- * @param doFrench flag indicator if special handling of French has to be
- * done
- * @param hiragana4 offset for Hiragana quaternary
- * @param commonBottom4 smallest common quaternary byte
- * @param bottomCount4 smallest quaternary byte
- */
- private final void getSortKeyBytes(String source, boolean compare[],
- byte bytes[][], int bytescount[],
- int count[], boolean doFrench,
- byte hiragana4, int commonBottom4,
- int bottomCount4)
-
- {
- int backupDecomposition = getDecomposition();
- setDecomposition(NO_DECOMPOSITION); // have to revert to backup later
- CollationElementIterator coleiter =
- new CollationElementIterator(source, this);
-
- int frenchOffset[] = {-1, -1};
-
- // scriptorder not implemented yet
- // const uint8_t *scriptOrder = coll->scriptOrder;
+ }
+
+ /**
+ * Iterates through the argument string for all ces.
+ * Split the ces into their relevant primaries, secondaries etc.
+ * @param source normalized string
+ * @param compare array of flags indicating if a particular strength is
+ * to be processed
+ * @param bytes an array of byte arrays corresponding to the strengths
+ * @param bytescount an array of the size of the byte arrays
+ * @param count array of compression counters for each strength
+ * @param doFrench flag indicator if special handling of French has to be
+ * done
+ * @param hiragana4 offset for Hiragana quaternary
+ * @param commonBottom4 smallest common quaternary byte
+ * @param bottomCount4 smallest quaternary byte
+ */
+ private final void getSortKeyBytes(String source, boolean compare[],
+ byte bytes[][], int bytescount[],
+ int count[], boolean doFrench,
+ byte hiragana4, int commonBottom4,
+ int bottomCount4)
+
+ {
+ int backupDecomposition = getDecomposition();
+ setDecomposition(NO_DECOMPOSITION); // have to revert to backup later
+ CollationElementIterator coleiter =
+ new CollationElementIterator(source, this);
+
+ int frenchOffset[] = {-1, -1};
+
+ // scriptorder not implemented yet
+ // const uint8_t *scriptOrder = coll->scriptOrder;
- boolean doShift = false;
- boolean notIsContinuation = false;
+ boolean doShift = false;
+ boolean notIsContinuation = false;
- int leadPrimary = 0; // int for easier comparison
- int caseShift = 0;
-
- while (true) {
- int ce = coleiter.next();
+ int leadPrimary = 0; // int for easier comparison
+ int caseShift = 0;
+
+ while (true) {
+ int ce = coleiter.next();
 if (ce == CollationElementIterator.NULLORDER) {
- break;
+ break;
 }
 if (ce == CollationElementIterator.IGNORABLE) {
- continue;
+ continue;
 }
 notIsContinuation = !isContinuation(ce);
 /*
 * if (notIsContinuation) {
- if (scriptOrder != NULL) {
- primary1 = scriptOrder[primary1];
- }
- }*/
+ if (scriptOrder != NULL) {
+ primary1 = scriptOrder[primary1];
+ }
+ }*/
 doShift = (m_isAlternateHandlingShifted_
- && ((notIsContinuation && ce <= m_variableTopValue_
- && (ce >> 24) != 0)) // primary byte not 0
- || (!notIsContinuation && doShift));
- leadPrimary = doPrimaryBytes(ce, bytes, bytescount, count,
- notIsContinuation, doShift, leadPrimary,
- commonBottom4, bottomCount4);
- if (compare[2]) {
- doSecondaryBytes(ce, bytes, bytescount, count,
- notIsContinuation, doFrench, frenchOffset);
- }
-
- int t = ce & LAST_BYTE_MASK_;
- if (!notIsContinuation) {
- t = ce & CE_REMOVE_CONTINUATION_MASK_;
+ && ((notIsContinuation && ce <= m_variableTopValue_
+ && (ce >> 24) != 0)) // primary byte not 0
+ || (!notIsContinuation && doShift));
+ leadPrimary = doPrimaryBytes(ce, bytes, bytescount, count,
+ notIsContinuation, doShift, leadPrimary,
+ commonBottom4, bottomCount4);
+ if (compare[2]) {
+ doSecondaryBytes(ce, bytes, bytescount, count,
+ notIsContinuation, doFrench, frenchOffset);
 }
-
+
+ int t = ce & LAST_BYTE_MASK_;
+ if (!notIsContinuation) {
+ t = ce & CE_REMOVE_CONTINUATION_MASK_;
+ }
+
 if (compare[0]) {
- caseShift = doCaseBytes(t, bytes, bytescount,
- notIsContinuation, caseShift);
+ caseShift = doCaseBytes(t, bytes, bytescount,
+ notIsContinuation, caseShift);
 }
 else if (notIsContinuation) {
- t ^= m_caseSwitch_;
+ t ^= m_caseSwitch_;
 }
 t &= m_mask3_;
-
+
 if (compare[3]) {
- doTertiaryBytes(t, bytes, bytescount, count,
- notIsContinuation);
+ doTertiaryBytes(t, bytes, bytescount, count,
+ notIsContinuation);
 }
 if (compare[4] && notIsContinuation) { // compare quad
 doQuaternaryBytes(bytes, bytescount, count,
- coleiter.m_isCodePointHiragana_,
- commonBottom4, bottomCount4, hiragana4);
+ coleiter.m_isCodePointHiragana_,
+ commonBottom4, bottomCount4, hiragana4);
 }
 }
- setDecomposition(backupDecomposition); // reverts to original
+ setDecomposition(backupDecomposition); // reverts to original
 if (frenchOffset[0] != -1) {
- // one last round of checks
- reverseBuffer(bytes[2], frenchOffset);
- }
- }
-
- /**
- * From the individual strength byte results the final compact sortkey
- * will be calculated.
- * @param source text string
- * @param compare array of flags indicating if a particular strength is
- * to be processed
- * @param bytes an array of byte arrays corresponding to the strengths
- * @param bytescount an array of the size of the byte arrays
- * @param count array of compression counters for each strength
- * @param doFrench flag indicating that special handling of French has to
- * be done
- * @param commonBottom4 smallest common quaternary byte
- * @param bottomCount4 smallest quaternary byte
- * @return the compact sortkey
- */
- private final byte[] getSortKey(String source, boolean compare[],
- byte bytes[][], int bytescount[],
- int count[], boolean doFrench,
- int commonBottom4, int bottomCount4)
- {
- // we have done all the CE's, now let's put them together to form
- // a key
- if (compare[2]) {
- doSecondary(bytes, bytescount, count, doFrench);
- if (compare[0]) {
- doCase(bytes, bytescount);
- }
- if (compare[3]) {
- doTertiary(bytes, bytescount, count);
- if (compare[4]) {
- doQuaternary(bytes, bytescount, count, commonBottom4,
- bottomCount4);
- if (compare[5]) {
- doIdentical(source, bytes, bytescount);
- }
+ // one last round of checks
+ reverseBuffer(bytes[2], frenchOffset);
+ }
+ }
+
+ /**
+ * From the individual strength byte results the final compact sortkey
+ * will be calculated.
+ * @param source text string + * @param compare array of flags indicating if a particular strength is + * to be processed + * @param bytes an array of byte arrays corresponding to the strengths + * @param bytescount an array of the size of the byte arrays + * @param count array of compression counters for each strength + * @param doFrench flag indicating that special handling of French has to + * be done + * @param commonBottom4 smallest common quaternary byte + * @param bottomCount4 smallest quaternary byte + * @return the compact sortkey + */ + private final byte[] getSortKey(String source, boolean compare[], + byte bytes[][], int bytescount[], + int count[], boolean doFrench, + int commonBottom4, int bottomCount4) + { + // we have done all the CE's, now let's put them together to form + // a key + if (compare[2]) { + doSecondary(bytes, bytescount, count, doFrench); + if (compare[0]) { + doCase(bytes, bytescount); + } + if (compare[3]) { + doTertiary(bytes, bytescount, count); + if (compare[4]) { + doQuaternary(bytes, bytescount, count, commonBottom4, + bottomCount4); + if (compare[5]) { + doIdentical(source, bytes, bytescount); + } - } - } - } - append(bytes, bytescount, 1, (byte)0); - return bytes[1]; - } - - /** - * Packs the French bytes - * @param bytes array of byte arrays corresponding to strenghts - * @param bytescount array of the size of byte arrays - * @param count array of compression counts - */ - private final void doFrench(byte bytes[][], int bytescount[], int count[]) - { - for (int i = 0; i < bytescount[2]; i ++) { - byte s = bytes[2][bytescount[2] - i - 1]; - // This is compression code. - if (s == COMMON_2_) { - ++ count[2]; - } - else { - if (count[2] > 0) { - // getting the unsigned value - if ((s & LAST_BYTE_MASK_) > COMMON_2_) { - // not necessary for 4th level. 
- while (count[2] > TOP_COUNT_2_) { - append(bytes, bytescount, 1, - (byte)(COMMON_TOP_2_ - TOP_COUNT_2_)); - count[2] -= TOP_COUNT_2_; - } - append(bytes, bytescount, 1, (byte)(COMMON_TOP_2_ - - (count[2] - 1))); - } - else { - while (count[2] > BOTTOM_COUNT_2_) { - append(bytes, bytescount, 1, - (byte)(COMMON_BOTTOM_2_ + BOTTOM_COUNT_2_)); - count[2] -= BOTTOM_COUNT_2_; - } - append(bytes, bytescount, 1, (byte)(COMMON_BOTTOM_2_ - + (count[2] - 1))); - } - count[2] = 0; - } - append(bytes, bytescount, 1, s); - } - } - if (count[2] > 0) { - while (count[2] > BOTTOM_COUNT_2_) { - append(bytes, bytescount, 1, (byte)(COMMON_BOTTOM_2_ - + BOTTOM_COUNT_2_)); - count[2] -= BOTTOM_COUNT_2_; - } - append(bytes, bytescount, 1, (byte)(COMMON_BOTTOM_2_ - + (count[2] - 1))); - } - } + } + } + } + append(bytes, bytescount, 1, (byte)0); + return bytes[1]; + } + + /** + * Packs the French bytes + * @param bytes array of byte arrays corresponding to strengths + * @param bytescount array of the size of byte arrays + * @param count array of compression counts + */ + private final void doFrench(byte bytes[][], int bytescount[], int count[]) + { + for (int i = 0; i < bytescount[2]; i ++) { + byte s = bytes[2][bytescount[2] - i - 1]; + // This is compression code. + if (s == COMMON_2_) { + ++ count[2]; + } + else { + if (count[2] > 0) { + // getting the unsigned value + if ((s & LAST_BYTE_MASK_) > COMMON_2_) { + // not necessary for 4th level. 
+ while (count[2] > TOP_COUNT_2_) { + append(bytes, bytescount, 1, + (byte)(COMMON_TOP_2_ - TOP_COUNT_2_)); + count[2] -= TOP_COUNT_2_; + } + append(bytes, bytescount, 1, (byte)(COMMON_TOP_2_ + - (count[2] - 1))); + } + else { + while (count[2] > BOTTOM_COUNT_2_) { + append(bytes, bytescount, 1, + (byte)(COMMON_BOTTOM_2_ + BOTTOM_COUNT_2_)); + count[2] -= BOTTOM_COUNT_2_; + } + append(bytes, bytescount, 1, (byte)(COMMON_BOTTOM_2_ + + (count[2] - 1))); + } + count[2] = 0; + } + append(bytes, bytescount, 1, s); + } + } + if (count[2] > 0) { + while (count[2] > BOTTOM_COUNT_2_) { + append(bytes, bytescount, 1, (byte)(COMMON_BOTTOM_2_ + + BOTTOM_COUNT_2_)); + count[2] -= BOTTOM_COUNT_2_; + } + append(bytes, bytescount, 1, (byte)(COMMON_BOTTOM_2_ + + (count[2] - 1))); + } + } - /** - * Compacts the secondary bytes and stores them into the primary array - * @param bytes array of byte arrays corresponding to the strengths - * @param bytecount array of the size of the byte arrays - * @param count array of the number of compression counts - * @param doFrench flag indicator that French has to be handled specially - */ - private final void doSecondary(byte bytes[][], int bytescount[], - int count[], boolean doFrench) - { - if (count[2] > 0) { - while (count[2] > BOTTOM_COUNT_2_) { - append(bytes, bytescount, 2, (byte)(COMMON_BOTTOM_2_ - + BOTTOM_COUNT_2_)); - count[2] -= BOTTOM_COUNT_2_; - } - append(bytes, bytescount, 2, (byte)(COMMON_BOTTOM_2_ + - (count[2] - 1))); + /** + * Compacts the secondary bytes and stores them into the primary array + * @param bytes array of byte arrays corresponding to the strengths + * @param bytecount array of the size of the byte arrays + * @param count array of the number of compression counts + * @param doFrench flag indicator that French has to be handled specially + */ + private final void doSecondary(byte bytes[][], int bytescount[], + int count[], boolean doFrench) + { + if (count[2] > 0) { + while (count[2] > BOTTOM_COUNT_2_) { + 
append(bytes, bytescount, 2, (byte)(COMMON_BOTTOM_2_ + + BOTTOM_COUNT_2_)); + count[2] -= BOTTOM_COUNT_2_; + } + append(bytes, bytescount, 2, (byte)(COMMON_BOTTOM_2_ + + (count[2] - 1))); } append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); if (doFrench) { // do the reverse copy - doFrench(bytes, bytescount, count); + doFrench(bytes, bytescount, count); } else { - if (bytes[1].length <= bytescount[1] + bytescount[2]) { - bytes[1] = increase(bytes[1], bytescount[1], bytescount[2]); - } - System.arraycopy(bytes[2], 0, bytes[1], bytescount[1], - bytescount[2]); + if (bytes[1].length <= bytescount[1] + bytescount[2]) { + bytes[1] = increase(bytes[1], bytescount[1], bytescount[2]); + } + System.arraycopy(bytes[2], 0, bytes[1], bytescount[1], + bytescount[2]); bytescount[1] += bytescount[2]; } - } - - /** - * Increase buffer size - * @param array array of bytes - * @param size of the byte array - * @param incrementsize size to increase - * @return the new buffer - */ - private static final byte[] increase(byte buffer[], int size, - int incrementsize) - { - byte result[] = new byte[buffer.length + incrementsize]; - System.arraycopy(buffer, 0, result, 0, size); - return result; - } - - /** - * Increase buffer size - * @param array array of bytes - * @param size of the byte array - * @param incrementsize size to increase - * @return the new buffer - */ - private static final int[] increase(int buffer[], int size, - int incrementsize) - { - int result[] = new int[buffer.length + incrementsize]; - System.arraycopy(buffer, 0, result, 0, size); - return result; - } - - /** - * Compacts the case bytes and stores them into the primary array - * @param bytes array of byte arrays corresponding to the strengths - * @param bytecount array of the size of the byte arrays - */ - private final void doCase(byte bytes[][], int bytescount[]) - { - append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); - if (bytes[1].length <= bytescount[1] + bytescount[0]) { - bytes[1] = 
increase(bytes[1], bytescount[1], bytescount[0]); - } - if (bytes[1].length <= bytescount[1] + bytescount[0]) { - bytes[1] = increase(bytes[1], bytescount[1], bytescount[0]); + } + + /** + * Increase buffer size + * @param array array of bytes + * @param size of the byte array + * @param incrementsize size to increase + * @return the new buffer + */ + private static final byte[] increase(byte buffer[], int size, + int incrementsize) + { + byte result[] = new byte[buffer.length + incrementsize]; + System.arraycopy(buffer, 0, result, 0, size); + return result; + } + + /** + * Increase buffer size + * @param array array of bytes + * @param size of the byte array + * @param incrementsize size to increase + * @return the new buffer + */ + private static final int[] increase(int buffer[], int size, + int incrementsize) + { + int result[] = new int[buffer.length + incrementsize]; + System.arraycopy(buffer, 0, result, 0, size); + return result; + } + + /** + * Compacts the case bytes and stores them into the primary array + * @param bytes array of byte arrays corresponding to the strengths + * @param bytecount array of the size of the byte arrays + */ + private final void doCase(byte bytes[][], int bytescount[]) + { + append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); + if (bytes[1].length <= bytescount[1] + bytescount[0]) { + bytes[1] = increase(bytes[1], bytescount[1], bytescount[0]); } - System.arraycopy(bytes[0], 0, bytes[1], bytescount[1], bytescount[0]); + if (bytes[1].length <= bytescount[1] + bytescount[0]) { + bytes[1] = increase(bytes[1], bytescount[1], bytescount[0]); + } + System.arraycopy(bytes[0], 0, bytes[1], bytescount[1], bytescount[0]); bytescount[1] += bytescount[0]; - } - - /** - * Compacts the tertiary bytes and stores them into the primary array - * @param bytes array of byte arrays corresponding to the strengths - * @param bytecount array of the size of the byte arrays - * @param count array of the number of compression counts - */ - private final 
void doTertiary(byte bytes[][], int bytescount[], - int count[]) - { - if (count[3] > 0) { - if (m_common3_ != COMMON_BOTTOM_3_) { - while (count[3] >= m_topCount3_) { - append(bytes, bytescount, 3, (byte)(m_top3_ - - m_topCount3_)); - count[3] -= m_topCount3_; - } - append(bytes, bytescount, 3, (byte)(m_top3_ - count[3])); - } - else { - while (count[3] > m_bottomCount3_) { - append(bytes, bytescount, 3, (byte)(m_bottom3_ - + m_bottomCount3_)); - count[3] -= m_bottomCount3_; - } - append(bytes, bytescount, 3, (byte)(m_bottom3_ - + (count[3] - 1))); - } + } + + /** + * Compacts the tertiary bytes and stores them into the primary array + * @param bytes array of byte arrays corresponding to the strengths + * @param bytecount array of the size of the byte arrays + * @param count array of the number of compression counts + */ + private final void doTertiary(byte bytes[][], int bytescount[], + int count[]) + { + if (count[3] > 0) { + if (m_common3_ != COMMON_BOTTOM_3_) { + while (count[3] >= m_topCount3_) { + append(bytes, bytescount, 3, (byte)(m_top3_ + - m_topCount3_)); + count[3] -= m_topCount3_; + } + append(bytes, bytescount, 3, (byte)(m_top3_ - count[3])); + } + else { + while (count[3] > m_bottomCount3_) { + append(bytes, bytescount, 3, (byte)(m_bottom3_ + + m_bottomCount3_)); + count[3] -= m_bottomCount3_; + } + append(bytes, bytescount, 3, (byte)(m_bottom3_ + + (count[3] - 1))); + } } append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); if (bytes[1].length <= bytescount[1] + bytescount[3]) { - bytes[1] = increase(bytes[1], bytescount[1], bytescount[3]); + bytes[1] = increase(bytes[1], bytescount[1], bytescount[3]); } System.arraycopy(bytes[3], 0, bytes[1], bytescount[1], bytescount[3]); bytescount[1] += bytescount[3]; - } - - /** - * Compacts the quaternary bytes and stores them into the primary array - * @param bytes array of byte arrays corresponding to the strengths - * @param bytecount array of the size of the byte arrays - * @param count array of 
compression counts - */ - private final void doQuaternary(byte bytes[][], int bytescount[], - int count[], int commonbottom4, - int bottomcount4) - { - if (count[4] > 0) { + } + + /** + * Compacts the quaternary bytes and stores them into the primary array + * @param bytes array of byte arrays corresponding to the strengths + * @param bytescount array of the size of the byte arrays + * @param count array of compression counts + */ + private final void doQuaternary(byte bytes[][], int bytescount[], + int count[], int commonbottom4, + int bottomcount4) + { + if (count[4] > 0) { while (count[4] > bottomcount4) { append(bytes, bytescount, 4, (byte)(commonbottom4 - + bottomcount4)); + + bottomcount4)); count[4] -= bottomcount4; } append(bytes, bytescount, 4, (byte)(commonbottom4 - + (count[4] - 1))); + + (count[4] - 1))); } append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); if (bytes[1].length <= bytescount[1] + bytescount[4]) { - bytes[1] = increase(bytes[1], bytescount[1], bytescount[4]); + bytes[1] = increase(bytes[1], bytescount[1], bytescount[4]); } System.arraycopy(bytes[4], 0, bytes[1], bytescount[1], bytescount[4]); bytescount[1] += bytescount[4]; - } - - /** - * Deals with the identical sort. - * Appends the BOCSU version of the source string to the ends of the - * byte buffer. - * @param source text string - * @param bytes array of a byte array corresponding to the strengths - * @param bytescount array of the byte array size - */ - private final void doIdentical(String source, byte bytes[][], - int bytescount[]) - { - int isize = BOSCU.lengthOfIdenticalLevelRun(source); - append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); - if (bytes[1].length <= bytescount[1] + isize) { - bytes[1] = increase(bytes[1], bytescount[1], 1 + isize); + } + + /** + * Deals with the identical sort. + * Appends the BOCSU version of the source string to the end of the + * byte buffer. 
+ * @param source text string + * @param bytes array of a byte array corresponding to the strengths + * @param bytescount array of the byte array size + */ + private final void doIdentical(String source, byte bytes[][], + int bytescount[]) + { + int isize = BOSCU.lengthOfIdenticalLevelRun(source); + append(bytes, bytescount, 1, SORT_LEVEL_TERMINATOR_); + if (bytes[1].length <= bytescount[1] + isize) { + bytes[1] = increase(bytes[1], bytescount[1], 1 + isize); } bytescount[1] = BOSCU.writeIdenticalLevelRun(source, bytes[1], - bytescount[1]); - } - - /** - * Gets the offset of the first unmatched characters in source and target. - * This method returns the offset of the start of a contraction or a - * combining sequence, if the first difference is in the middle of such a - * sequence. - * @param source string - * @param target string - * @return offset of the first unmatched characters in source and target. - */ - private final int getFirstUnmatchedOffset(String source, String target) - { - int result = 0; - int slength = source.length(); - int tlength = target.length(); - int minlength = slength; - if (minlength > tlength) { - minlength = tlength; - } - while (result < minlength - && source.charAt(result) == target.charAt(result)) { - result ++; - } - if (result > 0) { - // There is an identical portion at the beginning of the two - // strings. If the identical portion ends within a contraction or a - // combining character sequence, back up to the start of that - // sequence. - char schar = 0; - char tchar = 0; - if (result < minlength) { - schar = source.charAt(result); // first differing chars - tchar = target.charAt(result); - } - else { - if (slength == tlength) { - return result; - } - else if (slength < tlength) { - tchar = target.charAt(result); - } - else { - schar = source.charAt(result); - } - } - if (isUnsafe(schar) || isUnsafe(tchar)) - { - // We are stopped in the middle of a contraction or combining - // sequence. 
- // Look backwards for the part of the string for the start of - // the sequence - // It doesn't matter which string we scan, since they are the - // same in this region. - do { - result --; - } - while (result > 0 && isUnsafe(source.charAt(result))); - } - } - return result; - } - - /** - * Appending an byte to an array of bytes and increases it if we run out of - * space - * @param array of byte arrays - * @param array of the end offsets corresponding to array - * @param appendarrayindex of the int array to append - * @param value to append - */ - private static final void append(byte array[][], int arrayoffset[], - int appendarrayindex, byte value) - { - if (arrayoffset[appendarrayindex] + 1 - >= array[appendarrayindex].length) { - array[appendarrayindex] = increase(array[appendarrayindex], - arrayoffset[appendarrayindex], - SORT_BUFFER_INIT_SIZE_); - } - array[appendarrayindex][arrayoffset[appendarrayindex]] = value; - arrayoffset[appendarrayindex] ++; - } - - /** - * This is a trick string compare function that goes in and uses sortkeys - * to compare. It is used when compare gets in trouble and needs to bail - * out. - * @param source text string - * @param target text string - */ - private final int compareBySortKeys(String source, String target) - - { - CollationKey sourcekey = getCollationKey(source); - CollationKey targetkey = getCollationKey(target); - return sourcekey.compareTo(targetkey); - } - - /** - * Performs the primary comparisons, and fills up the CE buffer at the - * same time. - * The return value toggles between the comparison result and the hiragana - * result. If either the source is greater than target or vice versa, the - * return result is the comparison result, ie 1 or -1, furthermore the - * cebuffers will be cleared when that happens. If the primary comparisons - * are equal, we'll have to continue with secondary comparison. In this case - * the cebuffer will not be cleared and the return result will be the - * hiragana result. 
- * @param doHiragana4 flag indicator that Hiragana Quaternary has to be - * observed - * @param lowestpvalue the lowest primary value that will not be ignored if - * alternate handling is shifted - * @param source text string - * @param target text string - * @param textoffset offset in text to start the comparison - * @param cebuffer array of CE buffers to populate, offset 0 for source, - * 1 for target, cleared when a primary difference is - * found. - * @param cebuffersize array of CE buffer size corresponding to the - * cebuffer, 0 when a primary difference is found. - * @return comparion result if a primary difference is found, otherwise - * hiragana result - */ - private final int doPrimaryCompare(boolean doHiragana4, int lowestpvalue, - String source, String target, - int textoffset, int cebuffer[][], - int cebuffersize[]) - - { - // Preparing the context objects for iterating over strings - StringCharacterIterator siter = new StringCharacterIterator(source, - textoffset, - source.length(), - textoffset); - CollationElementIterator scoleiter = new CollationElementIterator( - siter, this); - StringCharacterIterator titer = new StringCharacterIterator(target, - textoffset, - target.length(), - textoffset); - CollationElementIterator tcoleiter = new CollationElementIterator( - titer, this); - - // Non shifted primary processing is quite simple - if (!m_isAlternateHandlingShifted_) { - int hiraganaresult = 0; - while (true) { - int sorder = 0; - // We fetch CEs until we hit a non ignorable primary or end. 
- do { - sorder = scoleiter.next(); - append(cebuffer, cebuffersize, 0, sorder); - sorder &= CE_PRIMARY_MASK_; - } while (sorder == CollationElementIterator.IGNORABLE); - - int torder = 0; - do { - torder = tcoleiter.next(); - append(cebuffer, cebuffersize, 1, torder); - torder &= CE_PRIMARY_MASK_; - } while (torder == CollationElementIterator.IGNORABLE); - - // if both primaries are the same - if (sorder == torder) { - // and there are no more CEs, we advance to the next level - if (cebuffer[0][cebuffersize[0] - 1] - == CollationElementIterator.NULLORDER) { - break; - } - if (doHiragana4 && hiraganaresult == 0 - && scoleiter.m_isCodePointHiragana_ != - tcoleiter.m_isCodePointHiragana_) { - if (scoleiter.m_isCodePointHiragana_) { - hiraganaresult = -1; - } - else { - hiraganaresult = 1; - } - } - } - else { - // if two primaries are different, we are done - return endPrimaryCompare(sorder, torder, cebuffer, - cebuffersize); - } - } - // no primary difference... do the rest from the buffers - return hiraganaresult; - } - else { // shifted - do a slightly more complicated processing :) - while (true) { - int sorder = getPrimaryShiftedCompareCE(scoleiter, lowestpvalue, - cebuffer, cebuffersize, 0); - int torder = getPrimaryShiftedCompareCE(tcoleiter, lowestpvalue, - cebuffer, cebuffersize, 1); - if (sorder == torder) { - if (cebuffer[0][cebuffersize[0] - 1] - == CollationElementIterator.NULLORDER) { - break; - } - else { - continue; - } - } - else { - return endPrimaryCompare(sorder, torder, cebuffer, - cebuffersize); - } - } // no primary difference... do the rest from the buffers - } - return 0; - } - - /** - * This is used only for primary strength when we know that sorder is - * already different from torder. - * Compares sorder and torder, returns -1 if sorder is less than torder. - * Clears the cebuffer at the same time. 
- * @param sorder source strength order - * @param torder target strength order - * @param cebuffer array of buffers containing the ce values - * @param cebuffersize array of cebuffer offsets - * @return the comparison result of sorder and torder - */ - private static final int endPrimaryCompare(int sorder, int torder, - int cebuffer[][], - int cebuffersize[]) - { - // if we reach here, the ce offset accessed is the last ce - // appended to the buffer - boolean isSourceNullOrder = (cebuffer[0][cebuffersize[0] - 1] - == CollationElementIterator.NULLORDER); - boolean isTargetNullOrder = (cebuffer[1][cebuffersize[1] - 1] - == CollationElementIterator.NULLORDER); - cebuffer[0] = null; - cebuffer[1] = null; - cebuffersize[0] = 0; - cebuffersize[1] = 0; - if (isSourceNullOrder) { - return -1; - } - if (isTargetNullOrder) { - return 1; - } - // getting rid of the sign - sorder >>>= CE_PRIMARY_SHIFT_; - torder >>>= CE_PRIMARY_SHIFT_; - if (sorder < torder) { - return -1; - } - return 1; - } - - /** - * Calculates the next primary shifted value and fills up cebuffer with the - * next non-ignorable ce. 
- * @param coleiter collation element iterator - * @param doHiragana4 flag indicator if hiragana quaternary is to be - * handled - * @param lowestpvalue lowest primary shifted value that will not be - * ignored - * @param cebuffer array of buffers to append with the next ce - * @param cebuffersize array of offsets corresponding to the cebuffer - * @param cebufferindex index of the buffer to append to - * @return result next modified ce - */ - private final static int getPrimaryShiftedCompareCE( - CollationElementIterator coleiter, - int lowestpvalue, int cebuffer[][], - int cebuffersize[], int cebufferindex) - - { - boolean shifted = false; - int result = CollationElementIterator.IGNORABLE; - while (true) { - result = coleiter.next(); - if (result == CollationElementIterator.NULLORDER) { - append(cebuffer, cebuffersize, cebufferindex, result); - break; - } - else if (result == CollationElementIterator.IGNORABLE) { - continue; - } - else if (isContinuation(result)) { - if ((result & CE_PRIMARY_MASK_) - != CollationElementIterator.IGNORABLE) { - // There is primary value - if (shifted) { - result = (result & CE_PRIMARY_MASK_) - | CE_CONTINUATION_MARKER_; - // preserve interesting continuation - append(cebuffer, cebuffersize, cebufferindex, result); - continue; - } - else { - append(cebuffer, cebuffersize, cebufferindex, result); - break; - } - } - else { // Just lower level values - if (!shifted) { - append(cebuffer, cebuffersize, cebufferindex, result); - } - } - } - else { // regular - if ((result & CE_PRIMARY_MASK_) > lowestpvalue) { - append(cebuffer, cebuffersize, cebufferindex, result); - break; - } - else { - if ((result & CE_PRIMARY_MASK_) > 0) { - shifted = true; - result &= CE_PRIMARY_MASK_; - append(cebuffer, cebuffersize, cebufferindex, result); - continue; - } - else { - append(cebuffer, cebuffersize, cebufferindex, result); - shifted = false; - continue; - } - } - } - } - result &= CE_PRIMARY_MASK_; - return result; - } - - /** - * Appending an int to 
an array of ints and increases it if we run out of - * space - * @param array of int arrays - * @param array of the end offsets corresponding to array - * @param appendarrayindex of the int array to append - * @param value to append - */ - private static final void append(int array[][], int arrayoffset[], - int appendarrayindex, int value) - { - if (arrayoffset[appendarrayindex] + 1 - >= array[appendarrayindex].length) { - array[appendarrayindex] = increase(array[appendarrayindex], - arrayoffset[appendarrayindex], - CE_BUFFER_SIZE_); - } - array[appendarrayindex][arrayoffset[appendarrayindex]] = value; - arrayoffset[appendarrayindex] ++; - } - - /** - * Does secondary strength comparison based on the collected ces. - * @param cebuffer array of int arrays that contains the collected ces - * @param cebuffersize array of offsets corresponding to the cebuffer, - * indicates the offset of the last ce in buffer - * @param doFrench flag indicates if French ordering is to be done - * @return the secondary strength comparison result - */ - private static final int doSecondaryCompare(int cebuffer[][], - int cebuffersize[], - boolean doFrench) - { - // now, we're gonna reexamine collected CEs - if (!doFrench) { // normal - int soffset = 0; - int toffset = 0; - while (true) { - int sorder = CollationElementIterator.IGNORABLE; - while (sorder == CollationElementIterator.IGNORABLE) { - sorder = cebuffer[0][soffset ++] & CE_SECONDARY_MASK_; - } - int torder = CollationElementIterator.IGNORABLE; - while (torder == CollationElementIterator.IGNORABLE) { - torder = cebuffer[1][toffset ++] & CE_SECONDARY_MASK_; - } - - if (sorder == torder) { - if (cebuffer[0][soffset - 1] - == CollationElementIterator.NULLORDER) { - break; - } - } - else { - if (cebuffer[0][soffset - 1] == - CollationElementIterator.NULLORDER) { - return -1; - } - if (cebuffer[1][toffset - 1] == - CollationElementIterator.NULLORDER) { - return 1; - } - return (sorder < torder) ? 
-1 : 1; - } - } - } - else { // do the French - int continuationoffset[] = {0, 0}; - int offset[] = {cebuffersize[0] - 2, cebuffersize[1] - 2} ; - while (true) { - int sorder = getSecondaryFrenchCE(cebuffer, offset, - continuationoffset, 0); - int torder = getSecondaryFrenchCE(cebuffer, offset, - continuationoffset, 1); - if (sorder == torder) { - if ((offset[0] < 0 && offset[1] < 0) - || cebuffer[0][offset[0]] - == CollationElementIterator.NULLORDER) { - break; - } - } - else { - return (sorder < torder) ? -1 : 1; - } - } - } - return 0; - } - - /** - * Calculates the next secondary french CE. - * @param cebuffer array of buffers to append with the next ce - * @param offset array of offsets corresponding to the cebuffer - * @param continuationoffset index of the start of a continuation - * @param index of cebuffer to use - * @return result next modified ce - */ - private static final int getSecondaryFrenchCE(int cebuffer[][], - int offset[], - int continuationoffset[], - int index) - { - int result = CollationElementIterator.IGNORABLE; - while (result == CollationElementIterator.IGNORABLE - && offset[index] >= 0) { - if (continuationoffset[index] == 0) { - result = cebuffer[index][offset[index]]; - while (isContinuation(cebuffer[index][offset[index] --])); - // after this, sorder is at the start of continuation, - // and offset points before that - if (isContinuation(cebuffer[index][offset[index] + 1])) { - // save offset for later - continuationoffset[index] = offset[index]; - offset[index] += 2; - } - //} - } - else { - result = cebuffer[index][offset[index] ++]; - if (!isContinuation(result)) { - // we have finished with this continuation - offset[index] = continuationoffset[index]; - // reset the pointer to before continuation - continuationoffset[index] = 0; - continue; - } - } - result &= CE_SECONDARY_MASK_; // remove continuation bit - } - return result; - } - - /** - * Does case strength comparison based on the collected ces. 
- * @param cebuffer array of int arrays that contains the collected ces - * @return the case strength comparison result - */ - private final int doCaseCompare(int cebuffer[][]) - { - int soffset = 0; - int toffset = 0; - while (true) { - int sorder = CollationElementIterator.IGNORABLE; - int torder = CollationElementIterator.IGNORABLE; - while ((sorder & CE_REMOVE_CASE_) - == CollationElementIterator.IGNORABLE) { - sorder = cebuffer[0][soffset ++]; - if (!isContinuation(sorder)) { - sorder &= CE_CASE_MASK_3_; - sorder ^= m_caseSwitch_; - } - else { - sorder = CollationElementIterator.IGNORABLE; - } - } - - while ((torder & CE_REMOVE_CASE_) - == CollationElementIterator.IGNORABLE) { - torder = cebuffer[1][toffset ++]; - if (!isContinuation(torder)) { - torder &= CE_CASE_MASK_3_; - torder ^= m_caseSwitch_; - } - else { - torder = CollationElementIterator.IGNORABLE; - } - } - - if (sorder == torder) { - if (cebuffer[0][soffset - 1] - == CollationElementIterator.NULLORDER) { - break; - } - } - else { - if (cebuffer[0][soffset - 1] - == CollationElementIterator.NULLORDER) { - return -1; - } - - return ((sorder & CE_CASE_BIT_MASK_) - < (torder & CE_CASE_BIT_MASK_)) ? -1 : 1; - } - } - return 0; - } - - /** - * Does tertiary strength comparison based on the collected ces. 
- * @param cebuffer array of int arrays that contains the collected ces - * @return the tertiary strength comparison result - */ - private final int doTertiaryCompare(int cebuffer[][]) - { - int soffset = 0; - int toffset = 0; - while (true) { - int sorder = CollationElementIterator.IGNORABLE; - int torder = CollationElementIterator.IGNORABLE; - while ((sorder & CE_REMOVE_CASE_) - == CollationElementIterator.IGNORABLE) { - sorder = cebuffer[0][soffset ++] & m_mask3_; - if (!isContinuation(sorder)) { - sorder ^= m_caseSwitch_; - } - else { - sorder &= CE_REMOVE_CASE_; - } - } - - while ((torder & CE_REMOVE_CASE_) - == CollationElementIterator.IGNORABLE) { - torder = cebuffer[1][toffset ++] & m_mask3_; - if (!isContinuation(torder)) { - torder ^= m_caseSwitch_; - } - else { - torder &= CE_REMOVE_CASE_; - } - } - - if (sorder == torder) { - if (cebuffer[0][soffset - 1] - == CollationElementIterator.NULLORDER) { - break; - } - } - else { - if (cebuffer[0][soffset - 1] == - CollationElementIterator.NULLORDER) { - return -1; - } - if (cebuffer[1][toffset - 1] == - CollationElementIterator.NULLORDER) { - return 1; - } - return (sorder < torder) ? -1 : 1; - } - } - return 0; - } - - /** - * Does quaternary strength comparison based on the collected ces. 
- * @param cebuffer array of int arrays that contains the collected ces - * @param lowestpvalue the lowest primary value that will not be ignored if - * alternate handling is shifted - * @return the quaternary strength comparison result - */ - private final int doQuaternaryCompare(int cebuffer[][], int lowestpvalue) - { - boolean sShifted = true; - boolean tShifted = true; - int soffset = 0; - int toffset = 0; - while (true) { - int sorder = CollationElementIterator.IGNORABLE; - int torder = CollationElementIterator.IGNORABLE; - while (sorder == CollationElementIterator.IGNORABLE - || (isContinuation(sorder) && !sShifted)) { - sorder = cebuffer[0][soffset ++]; - if (isContinuation(sorder)) { - if (!sShifted) { - continue; - } - } - else if (sorder > lowestpvalue - || (sorder & CE_PRIMARY_MASK_) - == CollationElementIterator.IGNORABLE) { - // non continuation - sorder = CE_PRIMARY_MASK_; - sShifted = false; - } - else { - sShifted = true; - } - } - sorder &= CE_PRIMARY_MASK_; - while (torder == CollationElementIterator.IGNORABLE - || (isContinuation(torder) && !tShifted)) { - torder = cebuffer[0][toffset ++]; - if (isContinuation(torder)) { - if (!tShifted) { - continue; - } - } - else if (torder > lowestpvalue - || (torder & CE_PRIMARY_MASK_) - == CollationElementIterator.IGNORABLE) { - // non continuation - torder = CE_PRIMARY_MASK_; - tShifted = false; - } - else { - tShifted = true; - } - } - torder &= CE_PRIMARY_MASK_; - - if (sorder == torder) { - if (cebuffer[0][soffset - 1] - == CollationElementIterator.NULLORDER) { - break; - } - } - else { - if (cebuffer[0][soffset - 1] == - CollationElementIterator.NULLORDER) { - return -1; - } - if (cebuffer[1][toffset - 1] == - CollationElementIterator.NULLORDER) { - return 1; - } - return (sorder < torder) ? -1 : 1; - } - } - return 0; - } - - /** - * Internal function. Does byte level string compare. Used by strcoll if - * strength == identical and strings are otherwise equal. This is a rare - * case. 
Comparison must be done on NFD normalized strings. FCD is not good - * enough. - * @param source text - * @param target text - * @param offset of the first difference in the text strings - * @param normalize flag indicating if we are to normalize the text before - * comparison - * @return 1 if source is greater than target, -1 less than and 0 if equals - */ - private static final int doIdenticalCompare(String source, String target, - int offset, boolean normalize) - - { - if (normalize) { - if (Normalizer.quickCheck(source, Normalizer.NFD) - != Normalizer.YES) { - source = Normalizer.decompose(source, false); - } - - if (Normalizer.quickCheck(target, Normalizer.NFD) - != Normalizer.YES) { - target = Normalizer.decompose(target, false); - } - offset = 0; - } - - return doStringCompare(source, target, offset); - } - - /** - * Compares string for their codepoint order. - * This comparison handles surrogate characters and place them after the - * all non surrogate characters. - * @param source text - * @param target text - * @param offset start offset for comparison - * @return 1 if source is greater than target, -1 less than and 0 if equals - */ - private static final int doStringCompare(String source, - String target, - int offset) - { - // compare identical prefixes - they do not need to be fixed up - char schar = 0; - char tchar = 0; - int slength = source.length(); - int tlength = target.length(); - int minlength = Math.min(slength, tlength); - while (offset < minlength) { - schar = source.charAt(offset); - tchar = target.charAt(offset ++); - if (schar != tchar) { - break; - } - } - - if (schar == tchar && offset == minlength) { - if (slength > minlength) { - return 1; - } - if (tlength > minlength) { - return -1; - } - return 0; - } + bytescount[1]); + } + + /** + * Gets the offset of the first unmatched characters in source and target. 
+ * This method returns the offset of the start of a contraction or a + * combining sequence, if the first difference is in the middle of such a + * sequence. + * @param source string + * @param target string + * @return offset of the first unmatched characters in source and target. + */ + private final int getFirstUnmatchedOffset(String source, String target) + { + int result = 0; + int slength = source.length(); + int tlength = target.length(); + int minlength = slength; + if (minlength > tlength) { + minlength = tlength; + } + while (result < minlength + && source.charAt(result) == target.charAt(result)) { + result ++; + } + if (result > 0) { + // There is an identical portion at the beginning of the two + // strings. If the identical portion ends within a contraction or a + // combining character sequence, back up to the start of that + // sequence. + char schar = 0; + char tchar = 0; + if (result < minlength) { + schar = source.charAt(result); // first differing chars + tchar = target.charAt(result); + } + else { + if (slength == tlength) { + return result; + } + else if (slength < tlength) { + tchar = target.charAt(result); + } + else { + schar = source.charAt(result); + } + } + if (isUnsafe(schar) || isUnsafe(tchar)) + { + // We are stopped in the middle of a contraction or combining + // sequence. + // Look backwards for the part of the string for the start of + // the sequence + // It doesn't matter which string we scan, since they are the + // same in this region. 
+ do { + result --; + } + while (result > 0 && isUnsafe(source.charAt(result))); + } + } + return result; + } + + /** + * Appending an byte to an array of bytes and increases it if we run out of + * space + * @param array of byte arrays + * @param array of the end offsets corresponding to array + * @param appendarrayindex of the int array to append + * @param value to append + */ + private static final void append(byte array[][], int arrayoffset[], + int appendarrayindex, byte value) + { + if (arrayoffset[appendarrayindex] + 1 + >= array[appendarrayindex].length) { + array[appendarrayindex] = increase(array[appendarrayindex], + arrayoffset[appendarrayindex], + SORT_BUFFER_INIT_SIZE_); + } + array[appendarrayindex][arrayoffset[appendarrayindex]] = value; + arrayoffset[appendarrayindex] ++; + } + + /** + * This is a trick string compare function that goes in and uses sortkeys + * to compare. It is used when compare gets in trouble and needs to bail + * out. + * @param source text string + * @param target text string + */ + private final int compareBySortKeys(String source, String target) + + { + CollationKey sourcekey = getCollationKey(source); + CollationKey targetkey = getCollationKey(target); + return sourcekey.compareTo(targetkey); + } + + /** + * Performs the primary comparisons, and fills up the CE buffer at the + * same time. + * The return value toggles between the comparison result and the hiragana + * result. If either the source is greater than target or vice versa, the + * return result is the comparison result, ie 1 or -1, furthermore the + * cebuffers will be cleared when that happens. If the primary comparisons + * are equal, we'll have to continue with secondary comparison. In this case + * the cebuffer will not be cleared and the return result will be the + * hiragana result. 
+ * @param doHiragana4 flag indicator that Hiragana Quaternary has to be + * observed + * @param lowestpvalue the lowest primary value that will not be ignored if + * alternate handling is shifted + * @param source text string + * @param target text string + * @param textoffset offset in text to start the comparison + * @param cebuffer array of CE buffers to populate, offset 0 for source, + * 1 for target, cleared when a primary difference is + * found. + * @param cebuffersize array of CE buffer size corresponding to the + * cebuffer, 0 when a primary difference is found. + * @return comparion result if a primary difference is found, otherwise + * hiragana result + */ + private final int doPrimaryCompare(boolean doHiragana4, int lowestpvalue, + String source, String target, + int textoffset, int cebuffer[][], + int cebuffersize[]) + + { + // Preparing the context objects for iterating over strings + StringCharacterIterator siter = new StringCharacterIterator(source, + textoffset, + source.length(), + textoffset); + CollationElementIterator scoleiter = new CollationElementIterator( + siter, this); + StringCharacterIterator titer = new StringCharacterIterator(target, + textoffset, + target.length(), + textoffset); + CollationElementIterator tcoleiter = new CollationElementIterator( + titer, this); + + // Non shifted primary processing is quite simple + if (!m_isAlternateHandlingShifted_) { + int hiraganaresult = 0; + while (true) { + int sorder = 0; + // We fetch CEs until we hit a non ignorable primary or end. 
+ do { + sorder = scoleiter.next(); + append(cebuffer, cebuffersize, 0, sorder); + sorder &= CE_PRIMARY_MASK_; + } while (sorder == CollationElementIterator.IGNORABLE); + + int torder = 0; + do { + torder = tcoleiter.next(); + append(cebuffer, cebuffersize, 1, torder); + torder &= CE_PRIMARY_MASK_; + } while (torder == CollationElementIterator.IGNORABLE); + + // if both primaries are the same + if (sorder == torder) { + // and there are no more CEs, we advance to the next level + if (cebuffer[0][cebuffersize[0] - 1] + == CollationElementIterator.NULLORDER) { + break; + } + if (doHiragana4 && hiraganaresult == 0 + && scoleiter.m_isCodePointHiragana_ != + tcoleiter.m_isCodePointHiragana_) { + if (scoleiter.m_isCodePointHiragana_) { + hiraganaresult = -1; + } + else { + hiraganaresult = 1; + } + } + } + else { + // if two primaries are different, we are done + return endPrimaryCompare(sorder, torder, cebuffer, + cebuffersize); + } + } + // no primary difference... do the rest from the buffers + return hiraganaresult; + } + else { // shifted - do a slightly more complicated processing :) + while (true) { + int sorder = getPrimaryShiftedCompareCE(scoleiter, lowestpvalue, + cebuffer, cebuffersize, 0); + int torder = getPrimaryShiftedCompareCE(tcoleiter, lowestpvalue, + cebuffer, cebuffersize, 1); + if (sorder == torder) { + if (cebuffer[0][cebuffersize[0] - 1] + == CollationElementIterator.NULLORDER) { + break; + } + else { + continue; + } + } + else { + return endPrimaryCompare(sorder, torder, cebuffer, + cebuffersize); + } + } // no primary difference... do the rest from the buffers + } + return 0; + } + + /** + * This is used only for primary strength when we know that sorder is + * already different from torder. + * Compares sorder and torder, returns -1 if sorder is less than torder. + * Clears the cebuffer at the same time. 
+ * @param sorder source strength order + * @param torder target strength order + * @param cebuffer array of buffers containing the ce values + * @param cebuffersize array of cebuffer offsets + * @return the comparison result of sorder and torder + */ + private static final int endPrimaryCompare(int sorder, int torder, + int cebuffer[][], + int cebuffersize[]) + { + // if we reach here, the ce offset accessed is the last ce + // appended to the buffer + boolean isSourceNullOrder = (cebuffer[0][cebuffersize[0] - 1] + == CollationElementIterator.NULLORDER); + boolean isTargetNullOrder = (cebuffer[1][cebuffersize[1] - 1] + == CollationElementIterator.NULLORDER); + cebuffer[0] = null; + cebuffer[1] = null; + cebuffersize[0] = 0; + cebuffersize[1] = 0; + if (isSourceNullOrder) { + return -1; + } + if (isTargetNullOrder) { + return 1; + } + // getting rid of the sign + sorder >>>= CE_PRIMARY_SHIFT_; + torder >>>= CE_PRIMARY_SHIFT_; + if (sorder < torder) { + return -1; + } + return 1; + } + + /** + * Calculates the next primary shifted value and fills up cebuffer with the + * next non-ignorable ce. 
+ * @param coleiter collation element iterator + * @param doHiragana4 flag indicator if hiragana quaternary is to be + * handled + * @param lowestpvalue lowest primary shifted value that will not be + * ignored + * @param cebuffer array of buffers to append with the next ce + * @param cebuffersize array of offsets corresponding to the cebuffer + * @param cebufferindex index of the buffer to append to + * @return result next modified ce + */ + private final static int getPrimaryShiftedCompareCE( + CollationElementIterator coleiter, + int lowestpvalue, int cebuffer[][], + int cebuffersize[], int cebufferindex) + + { + boolean shifted = false; + int result = CollationElementIterator.IGNORABLE; + while (true) { + result = coleiter.next(); + if (result == CollationElementIterator.NULLORDER) { + append(cebuffer, cebuffersize, cebufferindex, result); + break; + } + else if (result == CollationElementIterator.IGNORABLE) { + continue; + } + else if (isContinuation(result)) { + if ((result & CE_PRIMARY_MASK_) + != CollationElementIterator.IGNORABLE) { + // There is primary value + if (shifted) { + result = (result & CE_PRIMARY_MASK_) + | CE_CONTINUATION_MARKER_; + // preserve interesting continuation + append(cebuffer, cebuffersize, cebufferindex, result); + continue; + } + else { + append(cebuffer, cebuffersize, cebufferindex, result); + break; + } + } + else { // Just lower level values + if (!shifted) { + append(cebuffer, cebuffersize, cebufferindex, result); + } + } + } + else { // regular + if ((result & CE_PRIMARY_MASK_) > lowestpvalue) { + append(cebuffer, cebuffersize, cebufferindex, result); + break; + } + else { + if ((result & CE_PRIMARY_MASK_) > 0) { + shifted = true; + result &= CE_PRIMARY_MASK_; + append(cebuffer, cebuffersize, cebufferindex, result); + continue; + } + else { + append(cebuffer, cebuffersize, cebufferindex, result); + shifted = false; + continue; + } + } + } + } + result &= CE_PRIMARY_MASK_; + return result; + } + + /** + * Appending an int to 
an array of ints and increases it if we run out of + * space + * @param array of int arrays + * @param array of the end offsets corresponding to array + * @param appendarrayindex of the int array to append + * @param value to append + */ + private static final void append(int array[][], int arrayoffset[], + int appendarrayindex, int value) + { + if (arrayoffset[appendarrayindex] + 1 + >= array[appendarrayindex].length) { + array[appendarrayindex] = increase(array[appendarrayindex], + arrayoffset[appendarrayindex], + CE_BUFFER_SIZE_); + } + array[appendarrayindex][arrayoffset[appendarrayindex]] = value; + arrayoffset[appendarrayindex] ++; + } + + /** + * Does secondary strength comparison based on the collected ces. + * @param cebuffer array of int arrays that contains the collected ces + * @param cebuffersize array of offsets corresponding to the cebuffer, + * indicates the offset of the last ce in buffer + * @param doFrench flag indicates if French ordering is to be done + * @return the secondary strength comparison result + */ + private static final int doSecondaryCompare(int cebuffer[][], + int cebuffersize[], + boolean doFrench) + { + // now, we're gonna reexamine collected CEs + if (!doFrench) { // normal + int soffset = 0; + int toffset = 0; + while (true) { + int sorder = CollationElementIterator.IGNORABLE; + while (sorder == CollationElementIterator.IGNORABLE) { + sorder = cebuffer[0][soffset ++] & CE_SECONDARY_MASK_; + } + int torder = CollationElementIterator.IGNORABLE; + while (torder == CollationElementIterator.IGNORABLE) { + torder = cebuffer[1][toffset ++] & CE_SECONDARY_MASK_; + } + + if (sorder == torder) { + if (cebuffer[0][soffset - 1] + == CollationElementIterator.NULLORDER) { + break; + } + } + else { + if (cebuffer[0][soffset - 1] == + CollationElementIterator.NULLORDER) { + return -1; + } + if (cebuffer[1][toffset - 1] == + CollationElementIterator.NULLORDER) { + return 1; + } + return (sorder < torder) ? 
-1 : 1; + } + } + } + else { // do the French + int continuationoffset[] = {0, 0}; + int offset[] = {cebuffersize[0] - 2, cebuffersize[1] - 2} ; + while (true) { + int sorder = getSecondaryFrenchCE(cebuffer, offset, + continuationoffset, 0); + int torder = getSecondaryFrenchCE(cebuffer, offset, + continuationoffset, 1); + if (sorder == torder) { + if ((offset[0] < 0 && offset[1] < 0) + || cebuffer[0][offset[0]] + == CollationElementIterator.NULLORDER) { + break; + } + } + else { + return (sorder < torder) ? -1 : 1; + } + } + } + return 0; + } + + /** + * Calculates the next secondary french CE. + * @param cebuffer array of buffers to append with the next ce + * @param offset array of offsets corresponding to the cebuffer + * @param continuationoffset index of the start of a continuation + * @param index of cebuffer to use + * @return result next modified ce + */ + private static final int getSecondaryFrenchCE(int cebuffer[][], + int offset[], + int continuationoffset[], + int index) + { + int result = CollationElementIterator.IGNORABLE; + while (result == CollationElementIterator.IGNORABLE + && offset[index] >= 0) { + if (continuationoffset[index] == 0) { + result = cebuffer[index][offset[index]]; + while (isContinuation(cebuffer[index][offset[index] --])); + // after this, sorder is at the start of continuation, + // and offset points before that + if (isContinuation(cebuffer[index][offset[index] + 1])) { + // save offset for later + continuationoffset[index] = offset[index]; + offset[index] += 2; + } + //} + } + else { + result = cebuffer[index][offset[index] ++]; + if (!isContinuation(result)) { + // we have finished with this continuation + offset[index] = continuationoffset[index]; + // reset the pointer to before continuation + continuationoffset[index] = 0; + continue; + } + } + result &= CE_SECONDARY_MASK_; // remove continuation bit + } + return result; + } + + /** + * Does case strength comparison based on the collected ces. 
+ * @param cebuffer array of int arrays that contains the collected ces + * @return the case strength comparison result + */ + private final int doCaseCompare(int cebuffer[][]) + { + int soffset = 0; + int toffset = 0; + while (true) { + int sorder = CollationElementIterator.IGNORABLE; + int torder = CollationElementIterator.IGNORABLE; + while ((sorder & CE_REMOVE_CASE_) + == CollationElementIterator.IGNORABLE) { + sorder = cebuffer[0][soffset ++]; + if (!isContinuation(sorder)) { + sorder &= CE_CASE_MASK_3_; + sorder ^= m_caseSwitch_; + } + else { + sorder = CollationElementIterator.IGNORABLE; + } + } + + while ((torder & CE_REMOVE_CASE_) + == CollationElementIterator.IGNORABLE) { + torder = cebuffer[1][toffset ++]; + if (!isContinuation(torder)) { + torder &= CE_CASE_MASK_3_; + torder ^= m_caseSwitch_; + } + else { + torder = CollationElementIterator.IGNORABLE; + } + } + + if (sorder == torder) { + if (cebuffer[0][soffset - 1] + == CollationElementIterator.NULLORDER) { + break; + } + } + else { + if (cebuffer[0][soffset - 1] + == CollationElementIterator.NULLORDER) { + return -1; + } + + return ((sorder & CE_CASE_BIT_MASK_) + < (torder & CE_CASE_BIT_MASK_)) ? -1 : 1; + } + } + return 0; + } + + /** + * Does tertiary strength comparison based on the collected ces. 
+ * @param cebuffer array of int arrays that contains the collected ces + * @return the tertiary strength comparison result + */ + private final int doTertiaryCompare(int cebuffer[][]) + { + int soffset = 0; + int toffset = 0; + while (true) { + int sorder = CollationElementIterator.IGNORABLE; + int torder = CollationElementIterator.IGNORABLE; + while ((sorder & CE_REMOVE_CASE_) + == CollationElementIterator.IGNORABLE) { + sorder = cebuffer[0][soffset ++] & m_mask3_; + if (!isContinuation(sorder)) { + sorder ^= m_caseSwitch_; + } + else { + sorder &= CE_REMOVE_CASE_; + } + } + + while ((torder & CE_REMOVE_CASE_) + == CollationElementIterator.IGNORABLE) { + torder = cebuffer[1][toffset ++] & m_mask3_; + if (!isContinuation(torder)) { + torder ^= m_caseSwitch_; + } + else { + torder &= CE_REMOVE_CASE_; + } + } + + if (sorder == torder) { + if (cebuffer[0][soffset - 1] + == CollationElementIterator.NULLORDER) { + break; + } + } + else { + if (cebuffer[0][soffset - 1] == + CollationElementIterator.NULLORDER) { + return -1; + } + if (cebuffer[1][toffset - 1] == + CollationElementIterator.NULLORDER) { + return 1; + } + return (sorder < torder) ? -1 : 1; + } + } + return 0; + } + + /** + * Does quaternary strength comparison based on the collected ces. 
+ * @param cebuffer array of int arrays that contains the collected ces + * @param lowestpvalue the lowest primary value that will not be ignored if + * alternate handling is shifted + * @return the quaternary strength comparison result + */ + private final int doQuaternaryCompare(int cebuffer[][], int lowestpvalue) + { + boolean sShifted = true; + boolean tShifted = true; + int soffset = 0; + int toffset = 0; + while (true) { + int sorder = CollationElementIterator.IGNORABLE; + int torder = CollationElementIterator.IGNORABLE; + while (sorder == CollationElementIterator.IGNORABLE + || (isContinuation(sorder) && !sShifted)) { + sorder = cebuffer[0][soffset ++]; + if (isContinuation(sorder)) { + if (!sShifted) { + continue; + } + } + else if (sorder > lowestpvalue + || (sorder & CE_PRIMARY_MASK_) + == CollationElementIterator.IGNORABLE) { + // non continuation + sorder = CE_PRIMARY_MASK_; + sShifted = false; + } + else { + sShifted = true; + } + } + sorder &= CE_PRIMARY_MASK_; + while (torder == CollationElementIterator.IGNORABLE + || (isContinuation(torder) && !tShifted)) { + torder = cebuffer[0][toffset ++]; + if (isContinuation(torder)) { + if (!tShifted) { + continue; + } + } + else if (torder > lowestpvalue + || (torder & CE_PRIMARY_MASK_) + == CollationElementIterator.IGNORABLE) { + // non continuation + torder = CE_PRIMARY_MASK_; + tShifted = false; + } + else { + tShifted = true; + } + } + torder &= CE_PRIMARY_MASK_; + + if (sorder == torder) { + if (cebuffer[0][soffset - 1] + == CollationElementIterator.NULLORDER) { + break; + } + } + else { + if (cebuffer[0][soffset - 1] == + CollationElementIterator.NULLORDER) { + return -1; + } + if (cebuffer[1][toffset - 1] == + CollationElementIterator.NULLORDER) { + return 1; + } + return (sorder < torder) ? -1 : 1; + } + } + return 0; + } + + /** + * Internal function. Does byte level string compare. Used by strcoll if + * strength == identical and strings are otherwise equal. This is a rare + * case. 
Comparison must be done on NFD normalized strings. FCD is not good + * enough. + * @param source text + * @param target text + * @param offset of the first difference in the text strings + * @param normalize flag indicating if we are to normalize the text before + * comparison + * @return 1 if source is greater than target, -1 less than and 0 if equals + */ + private static final int doIdenticalCompare(String source, String target, + int offset, boolean normalize) + + { + if (normalize) { + if (Normalizer.quickCheck(source, Normalizer.NFD) + != Normalizer.YES) { + source = Normalizer.decompose(source, false); + } + + if (Normalizer.quickCheck(target, Normalizer.NFD) + != Normalizer.YES) { + target = Normalizer.decompose(target, false); + } + offset = 0; + } + + return doStringCompare(source, target, offset); + } + + /** + * Compares string for their codepoint order. + * This comparison handles surrogate characters and place them after the + * all non surrogate characters. + * @param source text + * @param target text + * @param offset start offset for comparison + * @return 1 if source is greater than target, -1 less than and 0 if equals + */ + private static final int doStringCompare(String source, + String target, + int offset) + { + // compare identical prefixes - they do not need to be fixed up + char schar = 0; + char tchar = 0; + int slength = source.length(); + int tlength = target.length(); + int minlength = Math.min(slength, tlength); + while (offset < minlength) { + schar = source.charAt(offset); + tchar = target.charAt(offset ++); + if (schar != tchar) { + break; + } + } + + if (schar == tchar && offset == minlength) { + if (slength > minlength) { + return 1; + } + if (tlength > minlength) { + return -1; + } + return 0; + } - // if both values are in or above the surrogate range, Fix them up. 
- if (schar >= UTF16.LEAD_SURROGATE_MIN_VALUE - && tchar >= UTF16.LEAD_SURROGATE_MIN_VALUE) { - schar = fixupUTF16(schar); - tchar = fixupUTF16(tchar); - } + // if both values are in or above the surrogate range, Fix them up. + if (schar >= UTF16.LEAD_SURROGATE_MIN_VALUE + && tchar >= UTF16.LEAD_SURROGATE_MIN_VALUE) { + schar = fixupUTF16(schar); + tchar = fixupUTF16(tchar); + } - // now c1 and c2 are in UTF-32-compatible order - return (schar < tchar) ? -1 : 1; // schar and tchar has to be different - } - - /** - * Rotate surrogates to the top to get code point order - */ - private static final char fixupUTF16(char ch) - { - if (ch >= 0xe000) { - ch -= 0x800; - } - else { - ch += 0x2000; - } - return ch; - } - - /** - * Checks that the source after offset is ignorable - * @param source text string to check - * @param offset - * @return true if source after offset is ignorable. false otherwise - */ - private final boolean checkIgnorable(String source, int offset) - - { - StringCharacterIterator siter = new StringCharacterIterator(source, - offset, source.length(), offset); - CollationElementIterator coleiter = new CollationElementIterator( - siter, this); - int ce = coleiter.next(); - while (ce != CollationElementIterator.NULLORDER) { - if (ce != CollationElementIterator.IGNORABLE) { - return false; - } - ce = coleiter.next(); - } - return true; - } - - /** - * Resets the internal case data members and compression values. - */ - private void updateInternalState() - { - if (m_caseFirst_ == AttributeValue.UPPER_FIRST_) { - m_caseSwitch_ = (byte)CASE_SWITCH_; - } - else { - m_caseSwitch_ = NO_CASE_SWITCH_; - } + // now c1 and c2 are in UTF-32-compatible order + return (schar < tchar) ? 
-1 : 1; // schar and tchar has to be different + } + + /** + * Rotate surrogates to the top to get code point order + */ + private static final char fixupUTF16(char ch) + { + if (ch >= 0xe000) { + ch -= 0x800; + } + else { + ch += 0x2000; + } + return ch; + } + + /** + * Checks that the source after offset is ignorable + * @param source text string to check + * @param offset + * @return true if source after offset is ignorable. false otherwise + */ + private final boolean checkIgnorable(String source, int offset) + + { + StringCharacterIterator siter = new StringCharacterIterator(source, + offset, source.length(), offset); + CollationElementIterator coleiter = new CollationElementIterator( + siter, this); + int ce = coleiter.next(); + while (ce != CollationElementIterator.NULLORDER) { + if (ce != CollationElementIterator.IGNORABLE) { + return false; + } + ce = coleiter.next(); + } + return true; + } + + /** + * Resets the internal case data members and compression values. + */ + private void updateInternalState() + { + if (m_caseFirst_ == AttributeValue.UPPER_FIRST_) { + m_caseSwitch_ = (byte)CASE_SWITCH_; + } + else { + m_caseSwitch_ = NO_CASE_SWITCH_; + } - if (m_isCaseLevel_ || m_caseFirst_ == AttributeValue.OFF_) { - m_mask3_ = CE_REMOVE_CASE_; - m_common3_ = COMMON_NORMAL_3_; - m_addition3_ = FLAG_BIT_MASK_CASE_SWITCH_OFF_; - m_top3_ = COMMON_TOP_CASE_SWITCH_OFF_3_; - m_bottom3_ = COMMON_BOTTOM_3_; - } - else { - m_mask3_ = (byte)CE_KEEP_CASE_; - m_addition3_ = FLAG_BIT_MASK_CASE_SWITCH_ON_; - if (m_caseFirst_ == AttributeValue.UPPER_FIRST_) { - m_common3_ = COMMON_UPPER_FIRST_3_; - m_top3_ = COMMON_TOP_CASE_SWITCH_UPPER_3_; - m_bottom3_ = COMMON_BOTTOM_CASE_SWITCH_UPPER_3_; - } else { - m_common3_ = COMMON_NORMAL_3_; - m_top3_ = COMMON_TOP_CASE_SWITCH_LOWER_3_; - m_bottom3_ = COMMON_BOTTOM_CASE_SWITCH_LOWER_3_; - } - } + if (m_isCaseLevel_ || m_caseFirst_ == AttributeValue.OFF_) { + m_mask3_ = CE_REMOVE_CASE_; + m_common3_ = COMMON_NORMAL_3_; + m_addition3_ = 
FLAG_BIT_MASK_CASE_SWITCH_OFF_; + m_top3_ = COMMON_TOP_CASE_SWITCH_OFF_3_; + m_bottom3_ = COMMON_BOTTOM_3_; + } + else { + m_mask3_ = (byte)CE_KEEP_CASE_; + m_addition3_ = FLAG_BIT_MASK_CASE_SWITCH_ON_; + if (m_caseFirst_ == AttributeValue.UPPER_FIRST_) { + m_common3_ = COMMON_UPPER_FIRST_3_; + m_top3_ = COMMON_TOP_CASE_SWITCH_UPPER_3_; + m_bottom3_ = COMMON_BOTTOM_CASE_SWITCH_UPPER_3_; + } else { + m_common3_ = COMMON_NORMAL_3_; + m_top3_ = COMMON_TOP_CASE_SWITCH_LOWER_3_; + m_bottom3_ = COMMON_BOTTOM_CASE_SWITCH_LOWER_3_; + } + } - // Set the compression values - int total3 = m_top3_ - COMMON_BOTTOM_3_ - 1; - // we multilply double with int, but need only int - m_topCount3_ = (int)(PROPORTION_3_ * total3); - m_bottomCount3_ = total3 - m_topCount3_; + // Set the compression values + int total3 = m_top3_ - COMMON_BOTTOM_3_ - 1; + // we multilply double with int, but need only int + m_topCount3_ = (int)(PROPORTION_3_ * total3); + m_bottomCount3_ = total3 - m_topCount3_; - if (!m_isCaseLevel_ && getStrength() == AttributeValue.TERTIARY_ - && !m_isFrenchCollation_ && !m_isAlternateHandlingShifted_) { - m_isSimple3_ = true; - } - else { - m_isSimple3_ = false; - } - } - - /** + if (!m_isCaseLevel_ && getStrength() == AttributeValue.TERTIARY_ + && !m_isFrenchCollation_ && !m_isAlternateHandlingShifted_) { + m_isSimple3_ = true; + } + else { + m_isSimple3_ = false; + } + } + + /** * Initializes the RuleBasedCollator */ private final void init() { - for (m_minUnsafe_ = 0; m_minUnsafe_ < DEFAULT_MIN_HEURISTIC_; - m_minUnsafe_ ++) { - // Find the smallest unsafe char. - if (isUnsafe(m_minUnsafe_)) { - break; - } - } - - for (m_minContractionEnd_ = 0; - m_minContractionEnd_ < DEFAULT_MIN_HEURISTIC_; - m_minContractionEnd_ ++) { - // Find the smallest contraction-ending char. 
-            if (isContractionEnd(m_minContractionEnd_)) {
-                break;
-            }
-        }
-        setStrength(m_defaultStrength_);
-        setDecomposition(m_defaultDecomposition_);
-        m_isFrenchCollation_ = m_defaultIsFrenchCollation_;
-        m_isAlternateHandlingShifted_ = m_defaultIsAlternateHandlingShifted_;
-        m_isCaseLevel_ = m_defaultIsCaseLevel_;
-        m_caseFirst_ = m_defaultCaseFirst_;
-        m_isHiragana4_ = m_defaultIsHiragana4_;
-        updateInternalState();
+        for (m_minUnsafe_ = 0; m_minUnsafe_ < DEFAULT_MIN_HEURISTIC_;
+             m_minUnsafe_ ++) {
+            // Find the smallest unsafe char.
+            if (isUnsafe(m_minUnsafe_)) {
+                break;
+            }
+        }
+
+        for (m_minContractionEnd_ = 0;
+             m_minContractionEnd_ < DEFAULT_MIN_HEURISTIC_;
+             m_minContractionEnd_ ++) {
+            // Find the smallest contraction-ending char.
+            if (isContractionEnd(m_minContractionEnd_)) {
+                break;
+            }
+        }
+        setStrength(m_defaultStrength_);
+        setDecomposition(m_defaultDecomposition_);
+        m_isFrenchCollation_ = m_defaultIsFrenchCollation_;
+        m_isAlternateHandlingShifted_ = m_defaultIsAlternateHandlingShifted_;
+        m_isCaseLevel_ = m_defaultIsCaseLevel_;
+        m_caseFirst_ = m_defaultCaseFirst_;
+        m_isHiragana4_ = m_defaultIsHiragana4_;
+        updateInternalState();
     }
 }
diff --git a/icu4j/src/com/ibm/icu/text/SearchIterator.java b/icu4j/src/com/ibm/icu/text/SearchIterator.java
index dce19193b37..9d1ae5ade3f 100755
--- a/icu4j/src/com/ibm/icu/text/SearchIterator.java
+++ b/icu4j/src/com/ibm/icu/text/SearchIterator.java
@@ -5,8 +5,8 @@
  *******************************************************************************
  *
  * $Source: /xsrl/Nsvn/icu/icu4j/src/com/ibm/icu/text/SearchIterator.java,v $
- * $Date: 2002/06/22 07:46:58 $
- * $Revision: 1.8 $
+ * $Date: 2002/06/22 08:37:04 $
+ * $Revision: 1.9 $
  *
  *****************************************************************************************
  */
@@ -16,44 +16,46 @@ package com.ibm.icu.text;
 import java.text.CharacterIterator;
 
 /**
- *
- * SearchIterator is an abstract base class that defines a protocol for text
- * searching. Subclasses provide concrete implementations of various search
- * algorithms. The concrete subclass, StringSearch, is provided and implements
- * language-sensitive pattern matching based on the comparison rules defined in
- * a RuleBasedCollator object. Instances of SearchIterator maintain a current
- * position and scan over the target text, returning the indices where a
- * matched is found and the length of each match. Generally, the sequence of
- * forward matches will be equivalent to the sequence of backward matches.
- *
- *
- * Internally, SearchIterator scans text using a CharacterIterator, and is thus
- * able to scan text held by any object implementing that protocol.
- *
- *
- * If logical matches are required, BreakIterators can be used to define the
- * boundaries of a logical match. For instance the pattern "e" will
- * not be found in the string "\u00e9" if a CharacterBreakIterator is used.
- * By default, the SearchIterator does not impose any logic matches, it will
- * return any result that matches the pattern. Illustrating with the above
- * example, "e" will be found in the string "\u00e9" if no BreakIterator is
- * specified.
- *
- *
- * SearchIterator also provides means to handle overlapping matches via the
- * API setOverlapping(boolean). For example, if the overlapping mode is set,
- * searching for the pattern "abab" in the text "ababab" will yield the results
- * 0 and 2, where else if overlapping is not set, SearchIterator will only
- * produce the result of 0. By default the overlapping mode is not set.
- *
- *
- * The APIs in SearchIterator is similar to that of other text iteration
- * classes such as the BreakIterator. Using this class, it is easy to
- * scan through text looking for all occurances of a match. The
- * following example uses a StringSearch object to find all instances of
- * "fox" in the target string. Any other subclass of SearchIterator can be
- * used in an identical manner.
- *
+ *SearchIterator is an abstract base class that defines a protocol + * for text searching. Subclasses provide concrete implementations of + * various search algorithms. A concrete subclass, StringSearch, is + * provided that implements language-sensitive pattern matching based + * on the comparison rules defined in a RuleBasedCollator + * object. Instances of SearchIterator maintain a current position and + * scan over the target text, returning the indices where a match is + * found and the length of each match. Generally, the sequence of + * forward matches will be equivalent to the sequence of backward + * matches. (Syn Wee: so what's an example where they are _not_ + * equivalent?)
+ * + + *If logical matches are required, BreakIterators can be used to + * define the boundaries of a logical match. For instance, the pattern + * "e" will not be found in the string "\u00e9" if a + * CharacterBreakIterator is used. By default, the SearchIterator + * does not impose any logical matches; it will return any result that + * matches the pattern. Illustrating with the above example, "e" will + * be found in the string "\u00e9" if no BreakIterator is + * specified. (Syn Wee: I don't get the term 'logical match.' Are + * you searching over the decomposed form of the text by default? How + * does BreakIterator affect this?)
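One plausible reading of the break-iterator restriction described above is that a candidate match is kept only if both of its ends fall on boundaries reported by the break iterator. The sketch below illustrates that reading only; it is not taken from the ICU implementation. It assumes the JDK's `java.text.BreakIterator` rather than ICU's class, uses a decomposed string ("e" followed by U+0301) so that a plain code-unit search can find "e" at all, and `isLogicalMatch` is a hypothetical helper name.

```java
import java.text.BreakIterator;

final class BoundaryCheckSketch {
    // Hypothetical helper (not an ICU API): accept the range [start, end)
    // only if both ends are boundaries according to the break iterator.
    static boolean isLogicalMatch(BreakIterator breaker, String text,
                                  int start, int end) {
        breaker.setText(text);
        return breaker.isBoundary(start) && breaker.isBoundary(end);
    }

    public static void main(String[] args) {
        // Assumption: decomposed form of e-acute, i.e. 'e' + combining acute.
        String text = "e\u0301";
        int start = text.indexOf("e");   // plain search finds "e" at 0
        int end = start + 1;             // candidate match covers only 'e'

        BreakIterator chars = BreakIterator.getCharacterInstance();
        // The candidate ends inside a combining sequence, so it is rejected.
        System.out.println(isLogicalMatch(chars, text, start, end));
        // The whole combining sequence starts and ends on boundaries.
        System.out.println(isLogicalMatch(chars, text, 0, text.length()));
    }
}
```

With no break iterator, the candidate at offset 0 would simply be returned, which corresponds to the default behaviour the comment describes.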
+ * + *SearchIterator also provides a means to handle overlapping + * matches via the API setOverlapping(boolean). For example, if + * overlapping mode is set, searching for the pattern "abab" in the + * text "ababab" will match at positions 0 and 2, whereas if + * overlapping is not set, SearchIterator will only match at position + * 0. By default, overlapping mode is not set.
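The "abab" in "ababab" example above can be reproduced with a small stand-alone sketch. This is only an illustration of the overlapping and non-overlapping semantics using `String.indexOf`; it does not use a collator and is not how SearchIterator is implemented. The class and method names (`OverlapSketch`, `findAll`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapSketch {
    // Collect the start offsets of all matches of pattern in text.
    // allowOverlap mirrors the idea behind setOverlapping(boolean):
    // when true, the next search resumes one position after the start of
    // the last match; when false, it resumes after the end of that match.
    static List<Integer> findAll(String text, String pattern,
                                 boolean allowOverlap) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        while (true) {
            int hit = text.indexOf(pattern, from);
            if (hit < 0) {
                break;                      // no further matches
            }
            starts.add(hit);
            from = allowOverlap ? hit + 1 : hit + pattern.length();
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ababab", "abab", true));   // [0, 2]
        System.out.println(findAll("ababab", "abab", false));  // [0]
    }
}
```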
+ * + *The APIs in SearchIterator are similar to those of other text + * iteration classes such as BreakIterator. Using this class, it is + * easy to scan through text looking for all occurrences of a + * match. The following example uses a StringSearch object to find all + * instances of "fox" in the target string.
+ * + * (Syn Wee: what we really need are examples of how the overlapping + * mode and setIndex interact with next and previous. I don't understand + * exactly what happens myself.) ** Example of use:
*@@ -70,13 +72,11 @@ import java.text.CharacterIterator; * @author Laura Werner, synwee * @since release 1.0 * @draft release 2.2 - * @see BreakIterator - */ + * @see BreakIterator */ public abstract class SearchIterator { - - // public data members ------------------------------------------------- - + // public data members ------------------------------------------------- + /** * DONE is returned by previous() and next() after all valid matches have * been returned, and by first() and last() if there are no matches at all. @@ -91,10 +91,10 @@ public abstract class SearchIterator /** *- * Sets the position in the target text which the next search will start - * from to the argument. This method clears all previous states. + * Sets the position in the target text at which the next search will start. + * This method clears any previous match. *
- * @param position index to start next search from. + * @param position position from which to start the next search * @exception IndexOutOfBoundsException thrown if argument position is out * of the target text range. * @see #getIndex @@ -104,39 +104,39 @@ public abstract class SearchIterator if (position < targetText.getBeginIndex() || position > targetText.getEndIndex()) { throw new IndexOutOfBoundsException( - "setIndex(int) expected position to be between " + - targetText.getBeginIndex() + " and " + targetText.getEndIndex()); + "setIndex(int) expected position to be between " + + targetText.getBeginIndex() + " and " + targetText.getEndIndex()); } m_setOffset_ = position; m_reset_ = false; matchLength = 0; } - - /** - *+ + /** + *
* Determines whether overlapping matches are returned. See the class * documentation for more information about overlapping matches. *
- *+ *
* The default setting of this property is false *
- * @param allowOverlap flag indicator if overlapping matches are allowed + * @param allowOverlap flag indicator if overlapping matches are allowed * @see #isOverlapping - * @draft release 2.2 - */ - public void setOverlapping(boolean allowOverlap) - { - m_isOverlap_ = allowOverlap; - } - - /** + * @draft release 2.2 + */ + public void setOverlapping(boolean allowOverlap) + { + m_isOverlap_ = allowOverlap; + } + + /** * Set the BreakIterator that is used to restrict the points at which * matches are detected. * Using null as the parameter is legal; it means that break * detection should not be attempted. * See class documentation for more information. * @param breakiter A BreakIterator that will be used to restrict the - * points at which matches are detected. + * points at which matches are detected. * @see #getBreakIterator * @see BreakIterator */ @@ -144,23 +144,23 @@ public abstract class SearchIterator { breakIterator = breakiter; if (breakIterator != null) { - breakIterator.setText(targetText); + breakIterator.setText(targetText); } } /** - * Set the target text to be searched. Text iteration will hence begin at - * the start of the text string. This method is useful if you want to - * re-use an iterator to search within a different body of text. - * @param text new text iterator to look for match, - * @exception IllegalArgumentException thrown when text is null or has - * 0 length - * @see #getTarget - * @draft ICU 2.0 - */ - public void setTarget(CharacterIterator text) - { - if (text == null || text.getEndIndex() == text.getIndex()) { + * Set the target text to be searched. Text iteration will then begin at + * the start of the text string. This method is useful if you want to + * reuse an iterator to search within a different body of text. 
+ * @param text new text iterator to look for match, + * @exception IllegalArgumentException thrown when text is null or has + * 0 length + * @see #getTarget + * @draft ICU 2.0 + */ + public void setTarget(CharacterIterator text) + { + if (text == null || text.getEndIndex() == text.getIndex()) { throw new IllegalArgumentException("Illegal null or empty text"); } @@ -170,28 +170,27 @@ public abstract class SearchIterator m_reset_ = true; m_isForwardSearching_ = true; if (breakIterator != null) { - breakIterator.setText(targetText); + breakIterator.setText(targetText); } - } + } - // public getters ---------------------------------------------------- - - /** + // public getters ---------------------------------------------------- + + /** *- * Returns the index to the most recent match in the target text that was - * searched. - * This call returns a valid result only after a successful call to - * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}. - * Just after construction, or after a searching method returns - * DONE, this method will return DONE. + * Returns the index of the most recent match in the target text. + * This call returns a valid result only after a successful call to + * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}. + * Just after construction, or after a searching method returns + * DONE, this method will return DONE. *
- *- * Use getMatchLength to get the matched text length. + *
+ * Use getMatchLength to get the length of the matched text. * getMatchedText will return the subtext in the searched * target text from index getMatchStart() with length getMatchLength(). *
- * @return index to a substring within the text string that is being - * searched. + * @return index to a substring within the text string that is being + * searched. * @see #getMatchLength * @see #getMatchedText * @see #first @@ -199,20 +198,20 @@ public abstract class SearchIterator * @see #previous * @see #last * @see #DONE - * @draft release 2.2 - */ - public int getMatchStart() - { + * @draft release 2.2 + */ + public int getMatchStart() + { return m_lastMatchStart_; - } + } - /** - * Return the index in the target text where the iterator is currently - * positioned at. - * If the iteration has gone past the end of the target text or past + /** + * Return the index in the target text at which the iterator is currently + * positioned. + * If the iteration has gone past the end of the target text, or past * the beginning for a backwards search, {@link #DONE} is returned. - * @return index in the target text where the iterator is currently - * positioned at. + * @return index in the target text at which the iterator is currently + * positioned. * @draft release 2.2 * @see #first * @see #next @@ -224,7 +223,7 @@ public abstract class SearchIterator /** *- * Returns the subtext length of the most recent match in the target text. + * Returns the length of the most recent match in the target text. * This call returns a valid result only after a successful * call to {@link #first}, {@link #next}, {@link #previous}, or * {@link #last}. @@ -263,7 +262,7 @@ public abstract class SearchIterator } /** - * Return the target text which is being searched. + * Return the target text that is being searched. * @return target text being searched. 
* @see #setTarget */ @@ -285,38 +284,39 @@ public abstract class SearchIterator * @see #previous * @see #last * @see #DONE - * @return the subtext in target text of the most recent match + * @return the substring in the target text of the most recent match */ public String getMatchedText() { if (matchLength > 0) { int limit = m_lastMatchStart_ + matchLength; - StringBuffer result = new StringBuffer(matchLength); - result.append(targetText.current()); - targetText.next(); - while (targetText.getIndex() < limit) { - result.append(targetText.current()); - targetText.next(); - } + StringBuffer result = new StringBuffer(matchLength); + result.append(targetText.current()); + targetText.next(); + while (targetText.getIndex() < limit) { + result.append(targetText.current()); + targetText.next(); + } targetText.setIndex(m_lastMatchStart_); - return result.toString(); - } + return result.toString(); + } return null; } - // miscellaneous public methods ----------------------------------------- - - /** - * Returns the index of the next forwards valid match in the target - * text, + // miscellaneous public methods ----------------------------------------- + + /** + * Search forwards in the target text for the next valid match, * starting the search from the current iterator position. The iterator is - * adjusted so that its current index, as returned by {@link #getIndex}, - * is the starting position of the match if one was found. If a match is - * not found, DONE will be returned. - * @return The starting index of the next forward match after the current + * adjusted so that its current index, as returned by {@link #getIndex}, + * is the starting position of the match if one was found. If a match is + * found, the index of the match is returned, otherwise DONE is + * returned. If overlapping mode is set, the beginning of the found match + * can be before the end of the current match, if any. 
+ * @return The starting index of the next forward match after the current * iterator position, or - * DONE if there are no more matches. - * @see #getMatchStart + * DONE if there are no more matches. + * @see #getMatchStart * @see #getMatchLength * @see #getMatchedText * @see #following @@ -328,50 +328,53 @@ public abstract class SearchIterator */ public int next() { - int start = targetText.getIndex(); - if (m_setOffset_ != DONE) { - start = m_setOffset_; - m_setOffset_ = DONE; - } - if (m_isForwardSearching_) { - if (!m_reset_ && - start + matchLength >= targetText.getEndIndex()) { - // not enough characters to match + int start = targetText.getIndex(); + if (m_setOffset_ != DONE) { + start = m_setOffset_; + m_setOffset_ = DONE; + } + if (m_isForwardSearching_) { + if (!m_reset_ && + start + matchLength >= targetText.getEndIndex()) { + // not enough characters to match matchLength = 0; targetText.setIndex(targetText.getEndIndex()); m_lastMatchStart_ = DONE; - return DONE; - } - m_reset_ = false; - } - else { - // switching direction. - // if matchedIndex == USEARCH_DONE, it means that either a - // setIndex has been called or that previous ran off the text - // string. the iterator would have been set to offset 0 if a - // match is not found. - m_isForwardSearching_ = true; - if (start != DONE) { - // there's no need to set the collation element iterator - // the next call to next will set the offset. - return start; - } - } - - if (start == DONE) { - start = targetText.getBeginIndex(); + return DONE; + } + m_reset_ = false; } - m_lastMatchStart_ = handleNext(start); - return m_lastMatchStart_; + m_reset_ = false; } + else { + // switching direction. + // if matchedIndex == USEARCH_DONE, it means that either a + // setIndex has been called or that previous ran off the text + // string. the iterator would have been set to offset 0 if a + // match is not found. 
+ m_isForwardSearching_ = true; + if (start != DONE) { + // there's no need to set the collation element iterator + // the next call to next will set the offset. + return start; + } + } + + if (start == DONE) { + start = targetText.getBeginIndex(); + } + m_lastMatchStart_ = handleNext(start); + return m_lastMatchStart_; +} /** - * Returns the index of the next backwards valid match in the target - * text, + * Search backwards in the target text for the next valid match, * starting the search from the current iterator position. The iterator is * adjusted so that its current index, as returned by {@link #getIndex}, - * is the starting position of the match if one was found. If a match is - * not found, DONE will be returned. + * is the starting position of the match if one was found. If a match is + * found, the index is returned, otherwise DONE is returned. If + * overlapping mode is set, the end of the found match can be after the + * beginning of the previous match, if any. * @return The starting index of the next backwards match after the current * iterator position, or * DONE if there are no more matches. @@ -387,12 +390,12 @@ public abstract class SearchIterator */ public int previous() { - int start = targetText.getIndex(); - if (m_setOffset_ != DONE) { - start = m_setOffset_; - m_setOffset_ = DONE; - } - if (m_reset_) { + int start = targetText.getIndex(); + if (m_setOffset_ != DONE) { + start = m_setOffset_; + m_setOffset_ = DONE; + } + if (m_reset_) { m_isForwardSearching_ = false; m_reset_ = false; start = targetText.getEndIndex();; @@ -410,7 +413,7 @@ public abstract class SearchIterator } } else { - if (start == targetText.getBeginIndex()) { + if (start == targetText.getBeginIndex()) { // not enough characters to match matchLength = 0; targetText.setIndex(targetText.getBeginIndex()); @@ -424,7 +427,7 @@ public abstract class SearchIterator } /** - * Checks if the overlapping property has been set. + * Return true if the overlapping property has been set. 
* See setOverlapping(boolean) for more information. * @see #setOverlapping * @return true if the overlapping property has been set, false otherwise @@ -436,39 +439,34 @@ public abstract class SearchIterator } /** - *
- * Resets the search iteration. All properties will be reset to the - * default value. + *
+ * Resets the search iteration. All properties will be reset to their + * default values. *
- *- * Search will begin at the start of the target text if a forward iteration - * is initiated before a backwards iteration. Otherwise if a - * backwards iteration is initiated before a forwards iteration, the search - * will begin at the end of the target text. + *
+ * If a forward iteration is initiated, the next search will begin at the + * start of the target text. Otherwise, if a backwards iteration is initiated, + * the next search will begin at the end of the target text. *
- * @draft release 2.2 - */ - public void reset() - { - // reset is setting the attributes that are already in string search + * @draft release 2.2 + */ + public void reset() + { + // reset is setting the attributes that are already in string search matchLength = 0; setIndex(targetText.getBeginIndex()); m_isOverlap_ = false; m_isForwardSearching_ = true; m_reset_ = true; m_setOffset_ = DONE; - } - - /** + } + + /** * Return the index of the first forward match in the target text. - * This method effectively sets the iteration to begin at the start of the - * target text and searches forwards from there. - * The iterator is - * adjusted so that its current index, as returned by {@link #getIndex}, - * is the starting position of the match if one was found. If a match is - * not found, DONE will be returned. + * This method sets the iteration to begin at the start of the + * target text and searches forward from there. * @return The index of the first forward match, orDONE
- * if there are no matches. + * if there are no matches. * @see #getMatchStart * @see #getMatchLength * @see #getMatchedText @@ -488,13 +486,13 @@ public abstract class SearchIterator /** * Return the index of the first forward match in target text that - * is greater than argument position. - * This method effectively sets the iteration to begin at the argument - * position index of the target text and searches forwards from there. - * The iterator is - * adjusted so that its current index, as returned by {@link #getIndex}, - * is the starting position of the match if one was found. If a match is - * not found, DONE will be returned. + * is greater than argument position. + * (Syn Wee: what if the match is at position? It seems like this has to + * return a match there, since 'first' does the same thing and it must + * return a match at the start of the text if there is one. So instead + * of 'greater than' this should read 'at or after'). + * This method sets the iteration to begin at the specified + * position in the the target text and searches forward from there. * @return The index of the first forward match, orDONE
* if there are no matches. * @see #getMatchStart @@ -509,21 +507,17 @@ public abstract class SearchIterator */ public final int following(int position) { - m_isForwardSearching_ = true; - // position checked in usearch_setOffset + m_isForwardSearching_ = true; + // position checked in usearch_setOffset setIndex(position); return next(); } /** - * Return the index of the last forward match in target text. - * This method effectively sets the iteration to begin at the end of the + * Return the index of the first backward match in target text. + * This method sets the iteration to begin at the end of the * target text and searches backwards from there. - * The iterator is - * adjusted so that its current index, as returned by {@link #getIndex}, - * is the starting position of the match if one was found. If a match is - * not found, DONE will be returned. - * @return The starting index of the last forward match, or + * @return The starting index of the first backward match, or *DONE
if there are no matches. * @see #getMatchStart * @see #getMatchLength @@ -545,12 +539,10 @@ public abstract class SearchIterator /** * Return the index of the first backwards match in target * text that is less than argument position. - * This method effectively sets the iteration to begin at the argument + * (Syn Wee, instead of 'less than' shouldn't this read 'ends + * at or before'?) + * This method sets the iteration to begin at the argument * position index of the target text and searches backwards from there. - * The iterator is - * adjusted so that its current index, as returned by {@link #getIndex}, - * is the starting position of the match if one was found. If a match is - * not found, DONE will be returned. * @return The starting index of the first backwards match, or *DONE
* if there are no matches. @@ -583,12 +575,14 @@ public abstract class SearchIterator * @see BreakIterator */ protected BreakIterator breakIterator; + /** * Target text for searching. * @see #setTarget(CharacterIterator) * @see #getTarget */ protected CharacterIterator targetText; + /** * Length of the most current match in target text. * Value 0 is the default value. @@ -599,7 +593,7 @@ public abstract class SearchIterator // protected constructor ---------------------------------------------- - /** + /** * Protected constructor for use by subclasses. * Initializes the iterator with the argument target text for searching * and sets the BreakIterator. @@ -616,16 +610,16 @@ public abstract class SearchIterator { if (target == null || (target.getEndIndex() - target.getBeginIndex()) == 0) { - throw new IllegalArgumentException( - "Illegal argument target. " + - " Argument can not be null or of length 0"); + throw new IllegalArgumentException( + "Illegal argument target. " + + " Argument can not be null or of length 0"); } - targetText = target; - breakIterator = breaker; - if (breakIterator != null) { - breakIterator.setText(target); - } - matchLength = 0; + targetText = target; + breakIterator = breaker; + if (breakIterator != null) { + breakIterator.setText(target); + } + matchLength = 0; m_lastMatchStart_ = DONE; m_isOverlap_ = false; m_isForwardSearching_ = true; @@ -634,66 +628,65 @@ public abstract class SearchIterator } // protected methods -------------------------------------------------- - /** - * Sets the length of the most recent match in the target text. - * Subclasses' handleNext() and handlePrevious() methods should call this + * Sets the length of the most recent match in the target text. + * Subclasses' handleNext() and handlePrevious() methods should call this * after they find a match in the target text. 
- * @param length new length to set + * @param length new length to set * @see #handleNext * @see #handlePrevious - */ + */ protected void setMatchLength(int length) { - matchLength = length; + matchLength = length; } - /** - *- * Abstract method which subclasses override to provide the mechanism - * for finding the next forwards match in the target text. This + /** + *
+ * Abstract method that subclasses override to provide the mechanism + * for finding the next forwards match in the target text. This * allows different subclasses to provide different search algorithms. *
- *- * If a match is found, setMatchLength(int) would have to be called to + *
+ * If a match is found, this function must call setMatchLength(int) to * set the length of the result match. * The iterator is adjusted so that its current index, as returned by * {@link #getIndex}, is the starting position of the match if one was * found. If a match is not found, DONE will be returned. *
- * @param start index in the target text at which the forwards search + * @param start index in the target text at which the forwards search * should begin. - * @return the starting index of the next forwards match if found, DONE + * @return the starting index of the next forwards match if found, DONE * otherwise - * @see #setMatchLength(int) + * @see #setMatchLength(int) * @see #handlePrevious(int) * @see #DONE - */ + */ protected abstract int handleNext(int start); /** - *+ *
* Abstract method which subclasses override to provide the mechanism - * for finding the next backwards match in the target text. + * for finding the next backwards match in the target text. * This allows different - * subclasses to provide different search algorithms. + * subclasses to provide different search algorithms. *
- *- * If a match is found, setMatchLength(int) would have to be called to + *
+ * If a match is found, this function must call setMatchLength(int) to * set the length of the result match. * The iterator is adjusted so that its current index, as returned by * {@link #getIndex}, is the starting position of the match if one was * found. If a match is not found, DONE will be returned. *
- * @param start index in the target text at which the backwards search + * @param start index in the target text at which the backwards search * should begin. - * @return the starting index of the next backwards match if found, + * @return the starting index of the next backwards match if found, * DONE otherwise - * @see #setMatchLength(int) + * @see #setMatchLength(int) * @see #handleNext(int) * @see #DONE - */ + */ protected abstract int handlePrevious(int startAt); // private data members ------------------------------------------------ @@ -702,16 +695,19 @@ public abstract class SearchIterator * Flag indicates if we are doing a forwards search */ private boolean m_isForwardSearching_; + /** * Flag to indicate if overlapping search is to be done. * E.g. looking for "aa" in "aaa" will yield matches at offset 0 and 1. */ private boolean m_isOverlap_; + /** * Flag indicates if we are at the start of a string search. * This indicates that we are in forward search and at the start of m_text. */ private boolean m_reset_; + /** * Data member to store user defined position in setIndex(). * If setIndex() is not called, this value will be DONE.