ICU-9101 Updated API docs for SearchIterator and StringSearch. Tried to keep them synchronized with ICU4C API docs as much as possible.

X-SVN-Rev: 35353
Yoshito Umaoka 2014-03-06 01:25:31 +00:00
parent d92c13c285
commit 5799b71849
2 changed files with 353 additions and 618 deletions

@ -10,118 +10,43 @@ package com.ibm.icu.text;
import java.text.CharacterIterator;
/**
* <p>SearchIterator is an abstract base class that defines a protocol
* for text searching. Subclasses provide concrete implementations of
* various search algorithms. A concrete subclass, StringSearch, is
* provided that implements language-sensitive pattern matching based
* on the comparison rules defined in a RuleBasedCollator
* object. Instances of SearchIterator maintain a current position and
* scan over the target text, returning the indices where a match is
* found and the length of each match. Generally, the sequence of forward
* matches will be equivalent to the sequence of backward matches. One
* case where this statement may not hold is when non-overlapping mode
* is set on and there are continuous repetitive patterns in the text.
* Consider the case searching for pattern "aba" in the text
* "ababababa", setting overlapping mode off will produce forward matches
* at offsets 0, 4. However when a backwards search is done, the
* results will be at offsets 6 and 2.</p>
*
* <p>If matches searched for have boundary restrictions, BreakIterators
* can be used to define the valid boundaries of such a match. Once a
* BreakIterator is set, potential matches will be tested against the
* BreakIterator to determine if the boundaries are valid and that all
* characters in the potential match are equivalent to the pattern
* searched for. For example, looking for the pattern "fox" in the text
* "foxy fox" will produce match results at offset 0 and 5 with length 3
* if no BreakIterators were set. However if a WordBreakIterator is set,
* the only match that would be found will be at the offset 5. Since,
* the SearchIterator guarantees that if a BreakIterator is set, all its
* matches will match the given pattern exactly, a potential match that
* passes the BreakIterator might still not produce a valid match. For
* instance the pattern "e" will not be found in the string
* "&#92;u00e9" (latin small letter e with acute) if a
* CharacterBreakIterator is used. Even though "e" is
* a part of the character "&#92;u00e9" and the potential match at
* offset 0 length 1 passes the CharacterBreakIterator test, "&#92;u00e9"
* is not equivalent to "e", hence the SearchIterator rejects the potential
* match. By default, the SearchIterator
* does not impose any boundary restriction on the matches, it will
* return all results that match the pattern. Illustrating with the
* above example, "e" will
* be found in the string "&#92;u00e9" if no BreakIterator is
* specified.</p>
*
* <p>SearchIterator also provides a means to handle overlapping
* matches via the API setOverlapping(boolean). For example, if
* overlapping mode is set, searching for the pattern "abab" in the
* text "ababab" will match at positions 0 and 2, whereas if
* overlapping is not set, SearchIterator will only match at position
* 0. By default, overlapping mode is not set.</p>
*
* <p>The APIs in SearchIterator are similar to that of other text
* iteration classes such as BreakIterator. Using this class, it is
* easy to scan through text looking for all occurrences of a
* match.</p>
* <tt>SearchIterator</tt> is an abstract base class that provides
* methods to search for a pattern within a text string. Instances of
* <tt>SearchIterator</tt> maintain a current position and scan over the
* target text, returning the indices at which the pattern is matched and
* the length of each match.
* <p>
* Example of use:<br>
* <pre>
* <tt>SearchIterator</tt> defines a protocol for text searching.
* Subclasses provide concrete implementations of various search algorithms.
* For example, <tt>StringSearch</tt> implements language-sensitive pattern
* matching based on the comparison rules defined in a
* <tt>RuleBasedCollator</tt> object.
* <p>
* Other options for searching include using a BreakIterator to restrict
* the points at which matches are detected.
* <p>
* <tt>SearchIterator</tt> provides an API that is similar to that of
* other text iteration classes such as <tt>BreakIterator</tt>. Using
* this class, it is easy to scan through text looking for all occurrences of
* a given pattern. The following example uses a <tt>StringSearch</tt>
* object to find all instances of "fox" in the target string. Any other
* subclass of <tt>SearchIterator</tt> can be used in an identical
* manner.
* <pre><code>
* String target = "The quick brown fox jumped over the lazy fox";
* String pattern = "fox";
* SearchIterator iter = new StringSearch(pattern, target);
* for (int pos = iter.first(); pos != SearchIterator.DONE;
* pos = iter.next()) {
* // println matches at offset 16 and 41 with length 3
* System.out.println("Found match at " + pos + ", length is "
* + iter.getMatchLength());
* for (int pos = iter.first(); pos != SearchIterator.DONE;
* pos = iter.next()) {
* System.out.println("Found match at " + pos +
* ", length is " + iter.getMatchLength());
* }
* target = "ababababa";
* pattern = "aba";
* iter.setTarget(new StringCharacterIterator(target));
* iter.setOverlapping(false);
* System.out.println("Overlapping mode set to false");
* System.out.println("Forward matches of pattern " + pattern + " in text "
* + target + ": ");
* for (int pos = iter.first(); pos != SearchIterator.DONE;
* pos = iter.next()) {
* // println matches at offset 0 and 4 with length 3
* System.out.println("offset " + pos + ", length "
* + iter.getMatchLength());
* }
* System.out.println("Backward matches of pattern " + pattern + " in text "
* + target + ": ");
* for (int pos = iter.last(); pos != SearchIterator.DONE;
* pos = iter.previous()) {
* // println matches at offset 6 and 2 with length 3
* System.out.println("offset " + pos + ", length "
* + iter.getMatchLength());
* }
* System.out.println("Overlapping mode set to true");
* System.out.println("Index set to 2");
* iter.setIndex(2);
* iter.setOverlapping(true);
* System.out.println("Forward matches of pattern " + pattern + " in text "
* + target + ": ");
* for (int pos = iter.first(); pos != SearchIterator.DONE;
* pos = iter.next()) {
* // println matches at offset 2, 4 and 6 with length 3
* System.out.println("offset " + pos + ", length "
* + iter.getMatchLength());
* }
* System.out.println("Index set to 2");
* iter.setIndex(2);
* System.out.println("Backward matches of pattern " + pattern + " in text "
* + target + ": ");
* for (int pos = iter.last(); pos != SearchIterator.DONE;
* pos = iter.previous()) {
* // println matches at offset 0 with length 3
* System.out.println("offset " + pos + ", length "
* + iter.getMatchLength());
* }
* </pre>
* </p>
* </code></pre>
*
* @author Laura Werner, synwee
* @stable ICU 2.0
* @see BreakIterator
* @see RuleBasedCollator
*/
public abstract class SearchIterator
{
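The overlapping behaviour described in the class documentation can be sketched without ICU4J. The following is only an illustration under stated assumptions: `OverlapSketch` is a hypothetical helper, it uses exact `String.indexOf` matching instead of collation, and it is not how `StringSearch` is implemented. It reproduces the offsets quoted in the Javadoc for "aba" in "ababababa" and "abab" in "ababab".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of ICU4J: exact substring matching only. */
final class OverlapSketch {
    /** Start offsets of matches found while scanning left to right. */
    static List<Integer> forward(String text, String pattern, boolean overlap) {
        requireNonEmpty(pattern);
        List<Integer> hits = new ArrayList<>();
        int from = 0;
        while (true) {
            int pos = text.indexOf(pattern, from);
            if (pos < 0) {
                break;
            }
            hits.add(pos);
            // Overlapping: resume one position after the match start.
            // Non-overlapping: resume after the match end.
            from = overlap ? pos + 1 : pos + pattern.length();
        }
        return hits;
    }

    /** Start offsets of matches found while scanning right to left. */
    static List<Integer> backward(String text, String pattern, boolean overlap) {
        requireNonEmpty(pattern);
        List<Integer> hits = new ArrayList<>();
        int from = text.length() - pattern.length();
        while (from >= 0) {
            int pos = text.lastIndexOf(pattern, from);
            if (pos < 0) {
                break;
            }
            hits.add(pos);
            // Non-overlapping: the next match must end at or before this start.
            from = overlap ? pos - 1 : pos - pattern.length();
        }
        return hits;
    }

    private static void requireNonEmpty(String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
    }
}
```

With overlapping off, `forward("ababababa", "aba", false)` and `backward("ababababa", "aba", false)` yield different match sets, which is the asymmetry the class documentation warns about.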
@ -242,7 +167,7 @@ public abstract class SearchIterator
* @stable ICU 2.0
*/
public static final int DONE = -1;
// public methods -----------------------------------------------------
// public setters -----------------------------------------------------
@ -269,38 +194,36 @@ public abstract class SearchIterator
search_.setMatchedLength(0);
search_.matchedIndex_ = DONE;
}
/**
* <p>
* Determines whether overlapping matches are returned. See the class
* documentation for more information about overlapping matches.
* </p>
* <p>
* The default setting of this property is false
* </p>
*
* @param allowOverlap flag indicator if overlapping matches are allowed
* @see #isOverlapping
* @stable ICU 2.8
*/
public void setOverlapping(boolean allowOverlap)
{
public void setOverlapping(boolean allowOverlap) {
search_.isOverlap_ = allowOverlap;
}
/**
* Set the BreakIterator that is used to restrict the points at which
* matches are detected.
* Using <tt>null</tt> as the parameter is legal; it means that break
* detection should not be attempted.
* See class documentation for more information.
* Set the BreakIterator that will be used to restrict the points
* at which matches are detected.
*
* @param breakiter A BreakIterator that will be used to restrict the
* points at which matches are detected.
* @see #getBreakIterator
* points at which matches are detected. If a match is
* found, but the match's start or end index is not a
* boundary as determined by the {@link BreakIterator},
* the match will be rejected and another will be searched
* for. If this parameter is <tt>null</tt>, no break
* detection is attempted.
* @see BreakIterator
* @stable ICU 2.0
*/
public void setBreakIterator(BreakIterator breakiter)
{
public void setBreakIterator(BreakIterator breakiter) {
search_.setBreakIter(breakiter);
if (search_.breakIter() != null) {
// Create a clone of CharacterIterator, so it won't
@ -313,8 +236,9 @@ public abstract class SearchIterator
/**
* Set the target text to be searched. Text iteration will then begin at
* the start of the text string. This method is useful if you want to
* the start of the text string. This method is useful if you want to
* reuse an iterator to search within a different body of text.
*
* @param text new text iterator to look for match,
* @exception IllegalArgumentException thrown when text is null or has
* 0 length
@ -343,128 +267,103 @@ public abstract class SearchIterator
}
}
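The boundary rule documented for setBreakIterator (a match is rejected when its start or end index is not a boundary) can be sketched as follows. This is a minimal illustration, assuming the JDK's `java.text.BreakIterator` as a stand-in for `com.ibm.icu.text.BreakIterator` and exact substring matching instead of collation; `BoundarySketch` is a hypothetical name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of ICU4J. */
final class BoundarySketch {
    /**
     * Returns start offsets of pattern in text. When breaker is non-null,
     * a candidate is kept only if both its start and end are boundaries.
     */
    static List<Integer> find(String text, String pattern, BreakIterator breaker) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        List<Integer> hits = new ArrayList<>();
        if (breaker != null) {
            breaker.setText(text);
        }
        for (int pos = text.indexOf(pattern); pos >= 0; pos = text.indexOf(pattern, pos + 1)) {
            int end = pos + pattern.length();
            // null breaker: no break detection is attempted.
            if (breaker == null || (breaker.isBoundary(pos) && breaker.isBoundary(end))) {
                hits.add(pos);
            }
        }
        return hits;
    }

    /** Convenience for the "foxy fox" example from the class documentation. */
    static List<Integer> findWholeWords(String text, String pattern) {
        return find(text, pattern, BreakIterator.getWordInstance(Locale.ENGLISH));
    }
}
```

For the text "foxy fox" and the pattern "fox", the candidate at offset 0 ends inside the word "foxy", so the word-boundary variant keeps only the candidate at offset 5.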
//TODO: We should add APIs below to match ICU4C APIs
//TODO: We may add APIs below to match ICU4C APIs
// setCanonicalMatch
// setElementComparison
// public getters ----------------------------------------------------
/**
* <p>
* Returns the index of the most recent match in the target text.
* This call returns a valid result only after a successful call to
* {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
* Just after construction, or after a searching method returns
* <tt>DONE</tt>, this method will return <tt>DONE</tt>.
* </p>
* <p>
* Use <tt>getMatchLength</tt> to get the length of the matched text.
* <tt>getMatchedText</tt> will return the subtext in the searched
* target text from index getMatchStart() with length getMatchLength().
* </p>
* @return index to a substring within the text string that is being
* searched.
* @see #getMatchLength
* @see #getMatchedText
* @see #first
* @see #next
* @see #previous
* @see #last
* @see #DONE
* @stable ICU 2.8
*/
public int getMatchStart()
{
* Returns the index to the match in the text string that was searched.
* This call returns a valid result only after a successful call to
* {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
* Just after construction, or after a searching method returns
* {@link #DONE}, this method will return {@link #DONE}.
* <p>
* Use {@link #getMatchLength} to get the matched string length.
*
* @return index of a substring within the text string that is being
* searched.
* @see #first
* @see #next
* @see #previous
* @see #last
* @stable ICU 2.0
*/
public int getMatchStart() {
return search_.matchedIndex_;
}
/**
* Return the index in the target text at which the iterator is currently
* positioned.
* If the iteration has gone past the end of the target text, or past
* the beginning for a backwards search, {@link #DONE} is returned.
* @return index in the target text at which the iterator is currently
* positioned.
* Return the current index in the text being searched.
* If the iteration has gone past the end of the text
* (or past the beginning for a backwards search), {@link #DONE}
* is returned.
*
* @return current index in the text being searched.
* @stable ICU 2.8
* @see #first
* @see #next
* @see #previous
* @see #last
* @see #DONE
*/
public abstract int getIndex();
/**
* <p>
* Returns the length of the most recent match in the target text.
* This call returns a valid result only after a successful
* call to {@link #first}, {@link #next}, {@link #previous}, or
* {@link #last}.
* Just after construction, or after a searching method returns
* <tt>DONE</tt>, this method will return 0. See getMatchStart() for
* more details.
* </p>
* @return The length of the most recent match in the target text, or 0 if
* there is no match.
* @see #getMatchStart
* @see #getMatchedText
* Returns the length of text in the string which matches the search
* pattern. This call returns a valid result only after a successful call
* to {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
* Just after construction, or after a searching method returns
* {@link #DONE}, this method will return 0.
*
* @return The length of the match in the target text, or 0 if there
* is no match currently.
* @see #first
* @see #next
* @see #previous
* @see #last
* @see #DONE
* @stable ICU 2.0
*/
public int getMatchLength()
{
public int getMatchLength() {
return search_.matchedLength();
}
/**
* Returns the BreakIterator that is used to restrict the indexes at which
* matches are detected. This will be the same object that was passed to
* the constructor or to <code>setBreakIterator</code>.
* If the BreakIterator has not been set, <tt>null</tt> will be returned.
* See setBreakIterator for more information.
* the constructor or to {@link #setBreakIterator}.
* If the {@link BreakIterator} has not been set, <tt>null</tt> will be returned.
* See {@link #setBreakIterator} for more information.
*
* @return the BreakIterator set to restrict logic matches
* @see #setBreakIterator
* @see BreakIterator
* @stable ICU 2.0
*/
public BreakIterator getBreakIterator()
{
public BreakIterator getBreakIterator() {
return search_.breakIter();
}
/**
* Return the target text that is being searched.
* @return target text being searched.
* @see #setTarget
* Return the string text to be searched.
* @return text string to be searched.
* @stable ICU 2.0
*/
public CharacterIterator getTarget()
{
public CharacterIterator getTarget() {
return search_.text();
}
/**
* Returns the text that was matched by the most recent call to
* {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
* If the iterator is not pointing at a valid match, for instance just
* after construction or after <tt>DONE</tt> has been returned, an empty
* String will be returned. See getMatchStart for more information
* @see #getMatchStart
* @see #getMatchLength
* {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.
* If the iterator is not pointing at a valid match (e.g. just after
* construction or after {@link #DONE} has been returned),
* returns an empty string.
*
* @return the substring in the target text of the most recent match,
* or null if there is no match currently.
* @see #first
* @see #next
* @see #previous
* @see #last
* @see #DONE
* @return the substring in the target text of the most recent match
* @stable ICU 2.0
*/
public String getMatchedText()
{
public String getMatchedText() {
if (search_.matchedLength() > 0) {
int limit = search_.matchedIndex_ + search_.matchedLength();
StringBuilder result = new StringBuilder(search_.matchedLength());
@ -481,31 +380,22 @@ public abstract class SearchIterator
}
// miscellaneous public methods -----------------------------------------
/**
* Search <b>forwards</b> in the target text for the next valid match,
* starting the search from the current iterator position. The iterator is
* adjusted so that its current index, as returned by {@link #getIndex},
* is the starting position of the match if one was found. If a match is
* found, the index of the match is returned, otherwise <tt>DONE</tt> is
* returned. If overlapping mode is set, the beginning of the found match
* can be before the end of the current match, if any.
* @return The starting index of the next forward match after the current
* iterator position, or
* <tt>DONE</tt> if there are no more matches.
* @see #getMatchStart
* @see #getMatchLength
* @see #getMatchedText
* @see #following
* @see #preceding
* @see #previous
* @see #first
* @see #last
* @see #DONE
* Returns the index of the next point at which the text matches the
* search pattern, starting from the current position.
* The iterator is adjusted so that its current index (as returned by
* {@link #getIndex}) is the match position if one was found.
* If a match is not found, {@link #DONE} will be returned and
* the iterator will be adjusted to a position after the end of the text
* string.
*
* @return The index of the next match after the current position,
* or {@link #DONE} if there are no more matches.
* @see #getIndex
* @stable ICU 2.0
*/
public int next()
{
public int next() {
int index = getIndex(); // offset = getOffset() in ICU4C
int matchindex = search_.matchedIndex_;
int matchlength = search_.matchedLength();
@ -545,29 +435,19 @@ public abstract class SearchIterator
}
/**
* Search <b>backwards</b> in the target text for the next valid match,
* starting the search from the current iterator position. The iterator is
* adjusted so that its current index, as returned by {@link #getIndex},
* is the starting position of the match if one was found. If a match is
* found, the index is returned, otherwise <tt>DONE</tt> is returned. If
* overlapping mode is set, the end of the found match can be after the
* beginning of the previous match, if any.
* @return The starting index of the next backwards match after the current
* iterator position, or
* <tt>DONE</tt> if there are no more matches.
* @see #getMatchStart
* @see #getMatchLength
* @see #getMatchedText
* @see #following
* @see #preceding
* @see #next
* @see #first
* @see #last
* @see #DONE
* Returns the index of the previous point at which the string text
* matches the search pattern, starting at the current position.
* The iterator is adjusted so that its current index (as returned by
* {@link #getIndex}) is the match position if one was found.
* If a match is not found, {@link #DONE} will be returned and
* the iterator will be adjusted to the index {@link #DONE}.
*
* @return The index of the previous match before the current position,
* or {@link #DONE} if there are no more matches.
* @see #getIndex
* @stable ICU 2.0
*/
public int previous()
{
public int previous() {
int index; // offset in ICU4C
if (search_.reset_) {
index = search_.endIndex(); // m_search_->textLength in ICU4C
@ -611,34 +491,29 @@ public abstract class SearchIterator
/**
* Return true if the overlapping property has been set.
* See setOverlapping(boolean) for more information.
* See {@link #setOverlapping(boolean)} for more information.
*
* @see #setOverlapping
* @return true if the overlapping property has been set, false otherwise
* @stable ICU 2.8
*/
public boolean isOverlapping()
{
public boolean isOverlapping() {
return search_.isOverlap_;
}
//TODO: We should add APIs below to match ICU4C APIs
//TODO: We may add APIs below to match ICU4C APIs
// isCanonicalMatch
// getElementComparison
/**
* <p>
* Resets the search iteration. All properties will be reset to their
* default values.
* </p>
* <p>
* If a forward iteration is initiated, the next search will begin at the
* start of the target text. Otherwise, if a backwards iteration is initiated,
* the next search will begin at the end of the target text.
* </p>
* @stable ICU 2.8
*/
public void reset()
{
* Resets the iteration.
* Search will begin at the start of the text string if a forward
* iteration is initiated before a backwards iteration. Otherwise if a
* backwards iteration is initiated before a forwards iteration, the
* search will begin at the end of the text string.
*
* @stable ICU 2.0
*/
public void reset() {
setMatchNotFound();
setIndex(search_.beginIndex());
search_.isOverlap_ = false;
@ -647,112 +522,103 @@ public abstract class SearchIterator
search_.isForwardSearching_ = true;
search_.reset_ = true;
}
/**
* Return the index of the first <b>forward</b> match in the target text.
* This method sets the iteration to begin at the start of the
* target text and searches forward from there.
* @return The index of the first forward match, or <code>DONE</code>
* if there are no matches.
* @see #getMatchStart
* @see #getMatchLength
* @see #getMatchedText
* @see #following
* @see #preceding
* @see #next
* @see #previous
* @see #last
* @see #DONE
* Returns the first index at which the string text matches the search
* pattern. The iterator is adjusted so that its current index (as
* returned by {@link #getIndex()}) is the match position if one
* was found.
* If a match is not found, {@link #DONE} will be returned and
* the iterator will be adjusted to the index {@link #DONE}.
*
* @return The character index of the first match, or
* {@link #DONE} if there are no matches.
*
* @see #getIndex
* @stable ICU 2.0
*/
public final int first()
{
public final int first() {
int startIdx = search_.beginIndex();
setIndex(startIdx);
return handleNext(startIdx);
}
/**
* Return the index of the first <b>forward</b> match in target text that
* is at or after argument <tt>position</tt>.
* This method sets the iteration to begin at the specified
* position in the target text and searches forward from there.
* @return The index of the first forward match, or <code>DONE</code>
* if there are no matches.
* @see #getMatchStart
* @see #getMatchLength
* @see #getMatchedText
* @see #first
* @see #preceding
* @see #next
* @see #previous
* @see #last
* @see #DONE
* Returns the first index equal to or greater than <tt>position</tt> at which the
* string text matches the search pattern. The iterator is adjusted so
* that its current index (as returned by {@link #getIndex()}) is the
* match position if one was found.
* If a match is not found, {@link #DONE} will be returned and the
* iterator will be adjusted to the index {@link #DONE}.
*
* @param position where search is to start from.
* @return The character index of the first match following
* <tt>position</tt>, or {@link #DONE} if there are no matches.
* @throws IndexOutOfBoundsException If position is less than or greater
* than the text range for searching.
* @see #getIndex
* @stable ICU 2.0
*/
public final int following(int position)
{
public final int following(int position) {
setIndex(position);
return handleNext(position);
}
/**
* Return the index of the first <b>backward</b> match in target text.
* This method sets the iteration to begin at the end of the
* target text and searches backwards from there.
* @return The starting index of the first backward match, or
* <code>DONE</code> if there are no matches.
* @see #getMatchStart
* @see #getMatchLength
* @see #getMatchedText
* @see #first
* @see #preceding
* @see #next
* @see #previous
* @see #following
* @see #DONE
* Returns the last index in the target text at which it matches the
* search pattern. The iterator is adjusted so that its current index
* (as returned by {@link #getIndex}) is the match position if one was
* found.
* If a match is not found, {@link #DONE} will be returned and
* the iterator will be adjusted to the index {@link #DONE}.
*
* @return The index of the last match, or {@link #DONE} if
* there are no matches.
* @see #getIndex
* @stable ICU 2.0
*/
public final int last()
{
public final int last() {
int endIdx = search_.endIndex();
setIndex(endIdx);
return handlePrevious(endIdx);
}
/**
* Return the index of the first <b>backwards</b> match in target
* text that ends at or before argument <tt>position</tt>.
* This method sets the iteration to begin at the argument
* position index of the target text and searches backwards from there.
* @return The starting index of the first backwards match, or
* <code>DONE</code>
* if there are no matches.
* @see #getMatchStart
* @see #getMatchLength
* @see #getMatchedText
* @see #first
* @see #following
* @see #next
* @see #previous
* @see #last
* @see #DONE
* Returns the first index less than <tt>position</tt> at which the string
* text matches the search pattern. The iterator is adjusted so that its
* current index (as returned by {@link #getIndex}) is the match
* position if one was found. If a match is not found,
* {@link #DONE} will be returned and the iterator will be
* adjusted to the index {@link #DONE}.
* <p>
* When the overlapping option ({@link #isOverlapping}) is off, the last index of the
* result match is always less than <tt>position</tt>.
* When the overlapping option is on, the result match may span across
* <tt>position</tt>.
*
* @param position where search is to start from.
* @return The character index of the first match preceding
* <tt>position</tt>, or {@link #DONE} if there are
* no matches.
* @throws IndexOutOfBoundsException If position is less than or greater than
* the text range for searching
* @see #getIndex
* @stable ICU 2.0
*/
public final int preceding(int position)
{
public final int preceding(int position) {
setIndex(position);
return handlePrevious(position);
}
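The effect of the overlapping option on preceding(int), as worded in the Javadoc above, can be read as a constraint on where a match may start. The sketch below is one literal interpretation of that wording, not the ICU4J implementation: `PrecedingSketch` is a hypothetical helper and it uses exact substring matching.

```java
/** Hypothetical helper, not part of ICU4J. */
final class PrecedingSketch {
    static final int DONE = -1;

    /**
     * Returns the start of the nearest match whose start is less than position.
     * With overlap off, the whole match must lie before position; with overlap
     * on, the match may span across position.
     */
    static int preceding(String text, String pattern, int position, boolean overlap) {
        if (position < 0 || position > text.length()) {
            throw new IndexOutOfBoundsException("position out of range: " + position);
        }
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        // Largest start offset the match is allowed to have.
        int maxStart = overlap ? position - 1 : position - pattern.length();
        if (maxStart < 0) {
            return DONE;
        }
        int pos = text.lastIndexOf(pattern, maxStart);
        return pos < 0 ? DONE : pos;
    }
}
```

For "aba" in "ababababa" with position 4, the non-overlapping call returns 0 (the match covers offsets 0 to 2), while the overlapping call returns 2 (the match covers offsets 2 to 4 and so spans position 4).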
// protected constructor ----------------------------------------------
/**
* Protected constructor for use by subclasses.
* Initializes the iterator with the argument target text for searching
* and sets the BreakIterator.
* See class documentation for more details on the use of the target text
* and BreakIterator.
* and {@link BreakIterator}.
*
* @param target The target text to be searched.
* @param breaker A {@link BreakIterator} that is used to determine the
* boundaries of a logical match. This argument can be null.
@ -790,7 +656,8 @@ public abstract class SearchIterator
/**
* Sets the length of the most recent match in the target text.
* Subclasses' handleNext() and handlePrevious() methods should call this
* after they find a match in the target text.
* after they find a match in the target text.
*
* @param length new length to set
* @see #handleNext
* @see #handlePrevious
@ -802,50 +669,41 @@ public abstract class SearchIterator
}
/**
* Abstract method which subclasses override to provide the mechanism
* for finding the next match in the target text. This allows different
* subclasses to provide different search algorithms.
* <p>
* Abstract method that subclasses override to provide the mechanism
* for finding the next <b>forwards</b> match in the target text. This
* allows different subclasses to provide different search algorithms.
* </p>
* <p>
* If a match is found, this function must call setMatchLength(int) to
* set the length of the result match.
* The iterator is adjusted so that its current index, as returned by
* {@link #getIndex}, is the starting position of the match if one was
* found. If a match is not found, <tt>DONE</tt> will be returned.
* </p>
* @param start index in the target text at which the forwards search
* should begin.
* @return the starting index of the next forwards match if found, DONE
* otherwise
* @see #setMatchLength(int)
* @see #handlePrevious(int)
* @see #DONE
* If a match is found, the implementation should return the index at
* which the match starts and should call
* {@link #setMatchLength} with the number of characters
* in the target text that make up the match. If no match is found, the
* method should return {@link #DONE}.
*
* @param start The index in the target text at which the search
* should start.
* @return index at which the match starts, else if match is not found
* {@link #DONE} is returned
* @see #setMatchLength
* @stable ICU 2.0
*/
protected abstract int handleNext(int start);
/**
* Abstract method which subclasses override to provide the mechanism for
* finding the previous match in the target text. This allows different
* subclasses to provide different search algorithms.
* <p>
* Abstract method which subclasses override to provide the mechanism
* for finding the next <b>backwards</b> match in the target text.
* This allows different
* subclasses to provide different search algorithms.
* </p>
* <p>
* If a match is found, this function must call setMatchLength(int) to
* set the length of the result match.
* The iterator is adjusted so that its current index, as returned by
* {@link #getIndex}, is the starting position of the match if one was
* found. If a match is not found, <tt>DONE</tt> will be returned.
* </p>
* @param startAt index in the target text at which the backwards search
* should begin.
* @return the starting index of the next backwards match if found,
* DONE otherwise
* @see #setMatchLength(int)
* @see #handleNext(int)
* @see #DONE
* If a match is found, the implementation should return the index at
* which the match starts and should call
* {@link #setMatchLength} with the number of characters
* in the target text that make up the match. If no match is found, the
* method should return {@link #DONE}.
*
* @param startAt The index in the target text at which the search
* should start.
* @return index at which the match starts, else if match is not found
* {@link #DONE} is returned
* @see #setMatchLength
* @stable ICU 2.0
*/
protected abstract int handlePrevious(int startAt);
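The contract for handleNext and handlePrevious (return the match start and call setMatchLength, or return DONE) can be sketched with a reduced stand-in for the base class. Everything here is hypothetical: `MiniSearch` and `LiteralSearch` are illustrative names, the base class keeps only the pieces needed to show the protocol, and matching is exact rather than collator-based.

```java
/** Hypothetical, simplified stand-in for the template-method protocol. */
abstract class MiniSearch {
    static final int DONE = -1;

    protected final String target;
    private int matchStart = DONE;
    private int matchLength = 0;

    MiniSearch(String target) {
        this.target = target;
    }

    /** Subclasses return the match start and call setMatchLength, or return DONE. */
    protected abstract int handleNext(int start);

    protected abstract int handlePrevious(int startAt);

    protected void setMatchLength(int length) {
        matchLength = length;
    }

    final int first() {
        return record(handleNext(0));
    }

    final int last() {
        return record(handlePrevious(target.length()));
    }

    int getMatchStart() {
        return matchStart;
    }

    int getMatchLength() {
        return matchLength;
    }

    private int record(int index) {
        matchStart = index;
        if (index == DONE) {
            matchLength = 0; // no current match
        }
        return index;
    }
}

/** Exact-match subclass; a real subclass would compare collation elements. */
final class LiteralSearch extends MiniSearch {
    private final String pattern;

    LiteralSearch(String pattern, String target) {
        super(target);
        this.pattern = pattern;
    }

    @Override
    protected int handleNext(int start) {
        int pos = target.indexOf(pattern, start);
        if (pos < 0) {
            return DONE;
        }
        setMatchLength(pattern.length());
        return pos;
    }

    @Override
    protected int handlePrevious(int startAt) {
        // The match must end at or before startAt.
        int pos = target.lastIndexOf(pattern, startAt - pattern.length());
        if (pos < 0) {
            return DONE;
        }
        setMatchLength(pattern.length());
        return pos;
    }
}
```

Keeping the match bookkeeping in the base class means a subclass only has to locate a match and report its length, which is the division of labour the Javadoc describes.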
@ -878,16 +736,16 @@ public abstract class SearchIterator
*/
STANDARD_ELEMENT_COMPARISON,
/**
* <p>Collation element comparison is modified to effectively provide behavior
* between the specified strength and strength - 1.</p>
*
* <p>Collation elements in the pattern that have the base weight for the specified
* Collation element comparison is modified to effectively provide behavior
* between the specified strength and strength - 1.
* <p>
* Collation elements in the pattern that have the base weight for the specified
* strength are treated as "wildcards" that match an element with any other
* weight at that collation level in the searched text. For example, with a
* secondary-strength English collator, a plain 'e' in the pattern will match
* a plain e or an e with any diacritic in the searched text, but an e with
* diacritic in the pattern will only match an e with the same diacritic in
* the searched text.<p>
* the searched text.
*
* @draft ICU 53
* @provisional This API might change or be removed in a future release.
@ -895,16 +753,16 @@ public abstract class SearchIterator
PATTERN_BASE_WEIGHT_IS_WILDCARD,
/**
* <p>Collation element comparison is modified to effectively provide behavior
* between the specified strength and strength - 1.</p>
*
* <p>Collation elements in either the pattern or the searched text that have the
* Collation element comparison is modified to effectively provide behavior
* between the specified strength and strength - 1.
* <p>
* Collation elements in either the pattern or the searched text that have the
* base weight for the specified strength are treated as "wildcards" that match
* an element with any other weight at that collation level. For example, with
* a secondary-strength English collator, a plain 'e' in the pattern will match
* a plain e or an e with any diacritic in the searched text, but an e with
* diacritic in the pattern will only match an e with the same diacritic or a
* plain e in the searched text.</p>
* plain e in the searched text.
*
* @draft ICU 53
* @provisional This API might change or be removed in a future release.
@ -913,9 +771,9 @@ public abstract class SearchIterator
}
/**
* <p>Sets the collation element comparison type.</p>
*
* <p>The default comparison type is {@link ElementComparisonType#STANDARD_ELEMENT_COMPARISON}.</p>
* Sets the collation element comparison type.
* <p>
* The default comparison type is {@link ElementComparisonType#STANDARD_ELEMENT_COMPARISON}.
*
* @see ElementComparisonType
* @see #getElementComparisonType()
@ -927,7 +785,7 @@ public abstract class SearchIterator
}
/**
* <p>Returns the collation element comparison type.</p>
* Returns the collation element comparison type.
*
* @see ElementComparisonType
* @see #setElementComparisonType(ElementComparisonType)

@ -14,150 +14,111 @@ import com.ibm.icu.util.ICUException;
import com.ibm.icu.util.ULocale;
// Java porting note:
// ICU4C implementation contains dead code in many places.
//
// ICU4C implementation contains dead code in many places.
// While porting the ICU4C linear search implementation, this dead code
// was not fully ported. The code blocks tagged with "// *** Boyer-Moore ***"
// are that dead code, which is still present in ICU4C.
//TODO: ICU4C implementation does not seem to handle UCharacterIterator pointing
// ICU4C implementation does not seem to handle UCharacterIterator pointing
// a fragment of text properly. ICU4J uses CharacterIterator to navigate through
// the input text. We need to carefully review the code ported from ICU4C
// assuming the start index is 0.
//TODO: ICU4C implementation initializes pattern.CE and pattern.PCE. It looks
// ICU4C implementation initializes pattern.CE and pattern.PCE. It looks
// like CE is no longer used, except in a few places checking CELength. This looks like
// a leftover from the already disabled Boyer-Moore search code. This Java implementation
// preserves the code, but we should clean it up later.
//TODO: We need to update document to remove the term "Boyer-Moore search".
/**
/**
*
* <tt>StringSearch</tt> is a {@link SearchIterator} that provides
* language-sensitive text searching based on the comparison rules defined
* in a {@link RuleBasedCollator} object.
* StringSearch ensures that language eccentricity can be
* handled, e.g. for the German collator, characters &szlig; and SS will be matched
* if case is chosen to be ignored.
* See the <a href="http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm">
* "ICU Collation Design Document"</a> for more information.
* <p>
* <code>StringSearch</code> is the concrete subclass of
* <code>SearchIterator</code> that provides language-sensitive text searching
* based on the comparison rules defined in a {@link RuleBasedCollator} object.
* </p>
* <p>
* <code>StringSearch</code> uses a version of the fast Boyer-Moore search
* algorithm that has been adapted to work with the large character set of
* Unicode. Refer to
* <a href="http://www.icu-project.org/docs/papers/efficient_text_searching_in_java.html">
* "Efficient Text Searching in Java"</a>, published in the
* <i>Java Report</i> on February, 1999, for further information on the
* algorithm.
* </p>
* <p>
* Users are also strongly encouraged to read the section on
* <a href="http://www.icu-project.org/userguide/searchString.html">
* String Search</a> and
* <a href="http://www.icu-project.org/userguide/Collate_Intro.html">
* Collation</a> in the user guide before attempting to use this class.
* </p>
* <p>
* String searching becomes a little complicated when accents are encountered at
* match boundaries. If a match is found and it has preceding or trailing
* accents not part of the match, the result returned will include the
* preceding accents up to the first base character, if the pattern searched
* for starts with an accent. Likewise,
* if the pattern ends with an accent, all trailing accents up to the first
* base character will be included in the result.
* </p>
* <p>
* For example, if a match is found in target text "a&#92;u0325&#92;u0300" for
* the pattern
* "a&#92;u0325", the result returned by StringSearch will be the index 0 and
* length 3 &lt;0, 3&gt;. If a match is found in the target
* "a&#92;u0325&#92;u0300"
* for the pattern "&#92;u0300", then the result will be index 1 and length 2
* &lt;1, 2&gt;.
* </p>
* <p>
* In the case where the decomposition mode is on for the RuleBasedCollator,
* all matches that start or end with an accent will have their results include
* preceding or following accents respectively. For example, if pattern "a" is
* looked for in the target text "&aacute;&#92;u0325", the result will be
* index 0 and length 2 &lt;0, 2&gt;.
* </p>
* <p>
* The StringSearch class provides two options to handle accent matching
* described below:
* </p>
* <p>
* Let S' be the sub-string of a text string S between the offsets start and
* end &lt;start, end&gt;.
* <br>
* A pattern string P matches a text string S at the offsets &lt;start,
* length&gt;
* There are 2 match options for selection:<br>
* Let S' be the sub-string of a text string S between the offsets start and
* end [start, end].
* <br>
* A pattern string P matches a text string S at the offsets [start, end]
* if
* <pre>
* option 1. P matches some canonical equivalent string of S'. Suppose the
* RuleBasedCollator used for searching has a collation strength of
* TERTIARY, all accents are non-ignorable. If the pattern
* "a&#92;u0300" is searched in the target text
* "a&#92;u0325&#92;u0300",
* a match will be found, since the target text is canonically
* equivalent to "a&#92;u0300&#92;u0325"
* option 2. P matches S' and if P starts or ends with a combining mark,
* there exists no non-ignorable combining mark before or after S'
* in S respectively. Following the example above, the pattern
* "a&#92;u0300" will not find a match in "a&#92;u0325&#92;u0300",
* since
* there exists a non-ignorable accent '&#92;u0325' in the middle of
* 'a' and '&#92;u0300'. Even with a target text of
* "a&#92;u0300&#92;u0325" a match will not be found because of the
* non-ignorable trailing accent &#92;u0325.
* option 1. Some canonical equivalent of P matches some canonical equivalent
* of S'
* option 2. P matches S' and if P starts or ends with a combining mark,
* there exists no non-ignorable combining mark before or after S'
* in S respectively.
* </pre>
* Option 2. will be the default mode for dealing with boundary accents unless
* specified via the API setCanonical(boolean).
* One restriction is to be noted for option 1. Currently there are no
* composite characters that consist of a character with combining class > 0
* before a character with combining class == 0. However, if such a character
* exists in the future, the StringSearch may not work correctly with option 1
* when such characters are encountered.
* </p>
* Option 2. will be the default.
* <p>
* <tt>SearchIterator</tt> provides APIs to specify the starting position
* within the text string to be searched, e.g. <tt>setIndex</tt>,
* <tt>preceding</tt> and <tt>following</tt>. Since the starting position will
* be set as it is specified, please take note that there are some dangerous
* positions at which the search may return incorrect results:
* This search has APIs similar to that of other text iteration mechanisms
* such as the break iterators in {@link BreakIterator}. Using these
* APIs, it is easy to scan through text looking for all occurrences of
* a given pattern. This search iterator allows changing of direction by
* calling a {@link #reset} followed by a {@link #next} or {@link #previous}.
* Though a direction change can occur without calling {@link #reset} first,
* this operation comes with some speed penalty.
* Matches found in the forward direction are the same as the matches found in
* the backward direction, in reverse order.
* <p>
* {@link SearchIterator} provides APIs to specify the starting position
* within the text string to be searched, e.g. {@link SearchIterator#setIndex setIndex},
* {@link SearchIterator#preceding preceding} and {@link SearchIterator#following following}. Since the
* starting position will be set as it is specified, please take note that
* there are some danger points at which the search may return incorrect
* results:
* <ul>
* <li> The midst of a substring that requires decomposition.
* <li> The midst of a substring that requires normalization.
* <li> If the following match is to be found, the position should not be the
* second character which requires to be swapped with the preceding
* character. Vice versa, if the preceding match is to be found,
* position to search from should not be the first character which
* second character which requires to be swapped with the preceding
* character. Vice versa, if the preceding match is to be found,
* position to search from should not be the first character which
* requires to be swapped with the next character. E.g. certain Thai and
* Lao characters require swapping.
* <li> If a following pattern match is to be found, any position within a
* contracting sequence except the first will fail. Vice versa if a
* preceding pattern match is to be found, an invalid starting point
* <li> If a following pattern match is to be found, any position within a
* contracting sequence except the first will fail. Vice versa if a
* preceding pattern match is to be found, an invalid starting point
* would be any character within a contracting sequence except the last.
* </ul>
* </p>
* <p>
* Though collator attributes will be taken into consideration while
* performing matches, there are no APIs provided in StringSearch for setting
* and getting the attributes. These attributes can be set by getting the
* collator from <tt>getCollator</tt> and using the APIs in
* <tt>com.ibm.icu.text.Collator</tt>. To update StringSearch to the new
* collator attributes, <tt>reset()</tt> or
* <tt>setCollator(RuleBasedCollator)</tt> has to be called.
* </p>
* A {@link BreakIterator} can be used if only matches at logical breaks are desired.
* Using a {@link BreakIterator} will only give you results that exactly match the
* boundaries given by the {@link BreakIterator}. For instance, the pattern "e" will
* not be found in the string "\u00e9" if a character break iterator is used.
* <p>
* Consult the
* <a href="http://www.icu-project.org/userguide/searchString.html">
* String Search</a> user guide and the <code>SearchIterator</code>
* documentation for more information and examples of use.
* </p>
* Options are provided to handle overlapping matches.
* E.g. in English, overlapping matching produces the results 0 and 2
* for the pattern "abab" in the text "ababab", whereas mutually
* exclusive matching produces only the result 0.
* <p>
* This class is not subclassable
* Though collator attributes will be taken into consideration while
* performing matches, there are no APIs here for setting and getting the
* attributes. These attributes can be set by getting the collator
* from {@link #getCollator} and using the APIs in {@link RuleBasedCollator}.
* Lastly, to update <tt>StringSearch</tt> to the new collator attributes,
* {@link #reset} has to be called.
* <p>
* Restriction: <br>
* Currently there are no composite characters that consist of a
* character with combining class > 0 before a character with combining
* class == 0. However, if such a character exists in the future,
* <tt>StringSearch</tt> does not guarantee the results for option 1.
* <p>
* Consult the {@link SearchIterator} documentation for information on
* and examples of how to use instances of this class to implement text
* searching.
* <p>
* Note, <tt>StringSearch</tt> is not to be subclassed.
* </p>
* @see SearchIterator
* @see RuleBasedCollator
* @author Laura Werner, synwee
* @stable ICU 2.0
* @since ICU 2.0
*/
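
The two match options and the overlapping example in the class comment above can be illustrated with a small sketch. It deliberately uses only the JDK (`java.text.Normalizer` and `String.indexOf`), not `StringSearch` itself, so it shows the underlying ideas rather than the ICU4J implementation. The class name `MatchOptionSketch` and both helper methods are hypothetical, and the offset helper compares plain code units, with no collation involved.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Illustration only: not part of ICU4J and not how StringSearch is implemented.
final class MatchOptionSketch {

    // Hypothetical helper: two strings are canonically equivalent when their
    // NFD (canonical decomposition) forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Hypothetical helper: collects match offsets with plain code unit
    // comparison. In overlapping mode the next search starts one position
    // after the previous match start; otherwise it starts after the match end.
    static List<Integer> matchOffsets(String text, String pattern, boolean overlapping) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        List<Integer> offsets = new ArrayList<>();
        int from = 0;
        while (true) {
            int at = text.indexOf(pattern, from);
            if (at < 0) {
                break;
            }
            offsets.add(at);
            from = overlapping ? at + 1 : at + pattern.length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        // U+0325 (combining class 220) sorts before U+0300 (combining class 230)
        // under canonical ordering, so both spellings decompose to the same string.
        System.out.println(canonicallyEquivalent("a\u0325\u0300", "a\u0300\u0325")); // true

        System.out.println(matchOffsets("ababab", "abab", true));  // [0, 2]
        System.out.println(matchOffsets("ababab", "abab", false)); // [0]
    }
}
```

The first check mirrors why option 1 can treat "a&#92;u0325&#92;u0300" and "a&#92;u0300&#92;u0325" alike; the offset lists reproduce the "abab" in "ababab" example for overlapping and mutually exclusive matching.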
// internal notes: all methods do not guarantee the correct status of the
// characteriterator. the caller has to maintain the original index position
@ -165,8 +126,9 @@ import com.ibm.icu.util.ULocale;
public final class StringSearch extends SearchIterator {
/**
* DONE is returned by previous() and next() after all valid matches have
* been returned, and by first() and last() if there are no matches at all.
* DONE is returned by {@link #previous()} and {@link #next()} after all valid matches have
* been returned, and by {@link SearchIterator#first() first()} and
* {@link SearchIterator#last() last()} if there are no matches at all.
* @see #previous
* @see #next
* @stable ICU 2.0
@ -198,19 +160,18 @@ public final class StringSearch extends SearchIterator {
/**
* Initializes the iterator to use the language-specific rules defined in
* the argument collator to search for argument pattern in the argument
* target text. The argument breakiter is used to define logical matches.
* target text. The argument <code>breakiter</code> is used to define logical matches.
* See super class documentation for more details on the use of the target
* text and BreakIterator.
* text and {@link BreakIterator}.
* @param pattern text to look for.
* @param target target text to search for pattern.
* @param collator RuleBasedCollator that defines the language rules
* @param collator {@link RuleBasedCollator} that defines the language rules
* @param breakiter A {@link BreakIterator} that is used to determine the
* boundaries of a logical match. This argument can be null.
* @exception IllegalArgumentException thrown when argument target is null,
* @throws IllegalArgumentException thrown when argument target is null,
* or of length 0
* @see BreakIterator
* @see RuleBasedCollator
* @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator,
@ -259,14 +220,13 @@ public final class StringSearch extends SearchIterator {
/**
* Initializes the iterator to use the language-specific rules defined in
* the argument collator to search for argument pattern in the argument
* target text. No BreakIterators are set to test for logical matches.
* target text. No {@link BreakIterator}s are set to test for logical matches.
* @param pattern text to look for.
* @param target target text to search for pattern.
* @param collator RuleBasedCollator that defines the language rules
* @exception IllegalArgumentException thrown when argument target is null,
* @param collator {@link RuleBasedCollator} that defines the language rules
* @throws IllegalArgumentException thrown when argument target is null,
* or of length 0
* @see RuleBasedCollator
* @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, CharacterIterator target, RuleBasedCollator collator) {
@ -277,17 +237,12 @@ public final class StringSearch extends SearchIterator {
* Initializes the iterator to use the language-specific rules and
* break iterator rules defined in the argument locale to search for
* argument pattern in the argument target text.
* See super class documentation for more details on the use of the target
* text and BreakIterator.
* @param pattern text to look for.
* @param target target text to search for pattern.
* @param locale locale to use for language and break iterator rules
* @exception IllegalArgumentException thrown when argument target is null,
* @throws IllegalArgumentException thrown when argument target is null,
* or of length 0. ClassCastException thrown if the collator for
* the specified locale is not a RuleBasedCollator.
* @see BreakIterator
* @see RuleBasedCollator
* @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, CharacterIterator target, Locale locale) {
@ -299,11 +254,11 @@ public final class StringSearch extends SearchIterator {
* break iterator rules defined in the argument locale to search for
* argument pattern in the argument target text.
* See super class documentation for more details on the use of the target
* text and BreakIterator.
* text and {@link BreakIterator}.
* @param pattern text to look for.
* @param target target text to search for pattern.
* @param locale ulocale to use for language and break iterator rules
* @exception IllegalArgumentException thrown when argument target is null,
* @param locale locale to use for language and break iterator rules
* @throws IllegalArgumentException thrown when argument target is null,
* or of length 0. ClassCastException thrown if the collator for
* the specified locale is not a RuleBasedCollator.
* @see BreakIterator
@ -318,17 +273,12 @@ public final class StringSearch extends SearchIterator {
/**
* Initializes the iterator to use the language-specific rules and
* break iterator rules defined in the default locale to search for
* argument pattern in the argument target text.
* See super class documentation for more details on the use of the target
* text and BreakIterator.
* argument pattern in the argument target text.
* @param pattern text to look for.
* @param target target text to search for pattern.
* @exception IllegalArgumentException thrown when argument target is null,
* @throws IllegalArgumentException thrown when argument target is null,
* or of length 0. ClassCastException thrown if the collator for
* the default locale is not a RuleBasedCollator.
* @see BreakIterator
* @see RuleBasedCollator
* @see SearchIterator
* @stable ICU 2.0
*/
public StringSearch(String pattern, String target) {
@ -337,17 +287,14 @@ public final class StringSearch extends SearchIterator {
}
/**
* Gets the {@link RuleBasedCollator} used for the language rules.
* <p>
* Gets the RuleBasedCollator used for the language rules.
* Since <tt>StringSearch</tt> depends on the returned {@link RuleBasedCollator}, any
* changes to the returned {@link RuleBasedCollator} should be followed by a call to
* either {@link #reset()} or {@link #setCollator(RuleBasedCollator)} to ensure the correct
* search behavior.
* </p>
* <p>
* Since StringSearch depends on the returned RuleBasedCollator, any
* changes to the RuleBasedCollator result should follow with a call to
* either StringSearch.reset() or
* StringSearch.setCollator(RuleBasedCollator) to ensure the correct
* search behaviour.
* </p>
* @return RuleBasedCollator used by this StringSearch
* @return {@link RuleBasedCollator} used by this <tt>StringSearch</tt>
* @see RuleBasedCollator
* @see #setCollator
* @stable ICU 2.0
@ -357,15 +304,11 @@ public final class StringSearch extends SearchIterator {
}
/**
* Sets the {@link RuleBasedCollator} to be used for language-specific searching.
* <p>
* Sets the RuleBasedCollator to be used for language-specific searching.
* </p>
* <p>
* This method causes internal data such as Boyer-Moore shift tables
* to be recalculated, but the iterator's position is unchanged.
* </p>
* @param collator to use for this StringSearch
* @exception IllegalArgumentException thrown when collator is null
* The iterator's position will not be changed by this method.
* @param collator to use for this <tt>StringSearch</tt>
* @throws IllegalArgumentException thrown when collator is null
* @see #getCollator
* @stable ICU 2.0
*/
@ -390,7 +333,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* Returns the pattern for which StringSearch is searching.
* Returns the pattern for which <tt>StringSearch</tt> is searching.
* @return the pattern searched for
* @stable ICU 2.0
*/
@ -399,13 +342,8 @@ public final class StringSearch extends SearchIterator {
}
/**
* <p>
* Set the pattern to search for.
* </p>
* <p>
* This method causes internal data such as Boyer-Moore shift tables
* to be recalculated, but the iterator's position is unchanged.
* </p>
* The iterator's position will not be changed by this method.
* @param pattern for searching
* @see #getPattern
* @exception IllegalArgumentException thrown if pattern is null or of
@ -435,10 +373,8 @@ public final class StringSearch extends SearchIterator {
}
/**
* <p>
* Set the canonical match mode. See class documentation for details.
* The default setting for this property is false.
* </p>
* @param allowCanonical flag indicating whether canonical matches are allowed
* @see #isCanonical
* @stable ICU 2.8
@ -449,13 +385,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* Set the target text to be searched. Text iteration will hence begin at
* the start of the text string. This method is useful if you want to
* re-use an iterator to search within a different body of text.
* @param text new text iterator to look for match,
* @exception IllegalArgumentException thrown when text is null or has
* 0 length
* @see #getTarget
* {@inheritDoc}
* @stable ICU 2.8
*/
@Override
@ -465,12 +395,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* Return the index in the target text where the iterator is currently
* positioned at.
* If the iteration has gone past the end of the target text or past
* the beginning for a backwards search, {@link #DONE} is returned.
* @return index in the target text where the iterator is currently
* positioned at
* {@inheritDoc}
* @stable ICU 2.8
*/
@Override
@ -483,23 +408,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* <p>
* Sets the position in the target text which the next search will start
* from to the argument. This method clears all previous states.
* </p>
* <p>
* This method takes the argument position and sets the position in the
* target text accordingly, without checking if position is pointing to a
* valid starting point to begin searching.
* </p>
* <p>
* Search positions that may render incorrect results are highlighted in
* the class documentation.
* </p>
* @param position index to start next search from.
* @exception IndexOutOfBoundsException thrown if argument position is out
* of the target text range.
* @see #getIndex
* {@inheritDoc}
* @stable ICU 2.8
*/
@Override
@ -513,19 +422,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* <p>
* Resets the search iteration. All properties will be reset to the
* default value.
* </p>
* <p>
* Search will begin at the start of the target text if a forward iteration
* is initiated before a backwards iteration. Otherwise if a
* backwards iteration is initiated before a forwards iteration, the search
* will begin at the end of the target text.
* </p>
* <p>
* Canonical match option will be reset to false, ie an exact match.
* </p>
* {@inheritDoc}
* @stable ICU 2.8
*/
@Override
@ -581,17 +478,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* <p>
* Concrete method to provide the mechanism
* for finding the next <b>forwards</b> match in the target text.
* See super class documentation for its use.
* </p>
* @param position index in the target text at which the forwards search
* should begin.
* @return the starting index of the next forwards match if found, DONE
* otherwise
* @see #handlePrevious(int)
* @see #DONE
* {@inheritDoc}
* @stable ICU 2.8
*/
@Override
@ -641,17 +528,7 @@ public final class StringSearch extends SearchIterator {
}
/**
* <p>
* Concrete method to provide the mechanism
* for finding the next <b>backwards</b> match in the target text.
* See super class documentation for its use.
* </p>
* @param position index in the target text at which the backwards search
* should begin.
* @return the starting index of the next backwards match if found, DONE
* otherwise
* @see #handleNext(int)
* @see #DONE
* {@inheritDoc}
* @stable ICU 2.8
*/
@Override