Mirror of https://github.com/unicode-org/icu.git (synced 2025-04-06 14:05:32 +00:00)
ICU-21710 Remove BOYER_MOORE dead code from usearch.cpp
Parent: 9ddda243d7
Commit: 6d850be783
5 changed files with 25 additions and 2406 deletions
@@ -268,8 +268,8 @@ Werner's text searching article for more details
 (<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>).
 
 However, implementing collation-based search with the Boyer-Moore method
-while getting correct results is very tricky,
-and ICU no longer uses this method.
+while getting correct results is very tricky, and ICU no longer uses this method
+(as of ICU4C 4.0 and ICU4J 53).
 
 Please see the [String Search Service](./string-search) chapter.
 
@@ -270,20 +270,24 @@ the following `StringSearch` specific considerations:
 
 ### Search Algorithm
 
-ICU4C releases up to 3.8 used the Boyer-Moore search algorithm in the string
+ICU4C (C/C++) releases up to 3.8 used the Boyer-Moore search algorithm in the string
 search service. There were some known issues in these previous releases.
 (See ICU tickets [ICU-5024](https://unicode-org.atlassian.net/browse/ICU-5024),
 [ICU-5382](https://unicode-org.atlassian.net/browse/ICU-5382),
-[ICU-5420](https://unicode-org.atlassian.net/browse/ICU-5420))
+[ICU-5420](https://unicode-org.atlassian.net/browse/ICU-5420)).
 
-In ICU4C 4.0, the string
-search service was updated with the simple linear search algorithm, which
-locates a match by shifting a cursor in the target text one by one, and these
-issues were fixed. In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced
-as a separated API set as a technology preview. In a later release, this code was deleted.
+In ICU4C 4.0, the string search service was updated to use a simple linear search
+algorithm, which locates a match by shifting a cursor in the target text one by one,
+and these issues were fixed.
 
-The Boyer-Moore searching
-algorithm is based on automata or combinatorial properties of strings and
+In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced as a separate API with
+technology preview status. However, in ICU4C 51.1, this was removed.
+(See ICU ticket [ICU-9573](https://unicode-org.atlassian.net/browse/ICU-9573)).
+
+Similarly, in ICU4J 53 (Java) the Boyer-Moore search algorithm was replaced by the
+simple linear search algorithm, ported from ICU4C. (See ICU ticket [ICU-6288](https://unicode-org.atlassian.net/browse/ICU-6288)).
+
+The Boyer-Moore search algorithm is based on automata or combinatorial properties of strings and
 pre-processes the pattern and known to be much faster than the linear search
 when search pattern length is longer. According to performance evaluation
 between these two implementations, the Boyer-Moore search is faster than the
@@ -195,7 +195,9 @@ determine whether case and accents are ignored during a search.
 
 #### What algorithm are you using to perform the search?
 
-StringSearch uses a version of the Boyer-Moore search algorithm that has been
+As of ICU4J 53 / ICU4C 4.0, StringSearch uses a simple linear search algorithm which
+locates a match by shifting a cursor in the target text one by one. Previous
+versions of ICU used a version of the Boyer-Moore search algorithm which was
 modified for use with Unicode. Rather than using raw Unicode character values in
 its comparisons and shift tables, the algorithm uses collation elements that
 have been "hashed" down to a smaller range to make the tables a reasonable size.
@@ -35,8 +35,9 @@
 * See the <a href="http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm">
 * "ICU Collation Design Document"</a> for more information.
 * <p>
-* The implementation may use a linear search or a modified form of the Boyer-Moore
-* search; for more information on the latter see
+* As of ICU4C 4.0 / ICU4J 53, the implementation uses a linear search. In previous versions,
+* a modified form of the Boyer-Moore searching algorithm was used. For more information
+* on the modified Boyer-Moore algorithm see
 * <a href="http://icu-project.org/docs/papers/efficient_text_searching_in_java.html">
 * "Efficient Text Searching in Java"</a>, published in <i>Java Report</i>
 * in February, 1999.
@@ -595,8 +596,8 @@ U_CAPI UCollator * U_EXPORT2 usearch_getCollator(
 /**
 * Sets the collator used for the language rules. User retains the ownership
 * of this collator, thus the responsibility of deletion lies with the user.
-* This method causes internal data such as Boyer-Moore shift tables to
-* be recalculated, but the iterator's position is unchanged.
+* This method causes internal data such as the pattern collation elements
+* and shift tables to be recalculated, but the iterator's position is unchanged.
 * @param strsrch search iterator data struct
 * @param collator to be used
 * @param status for errors if it occurs
@@ -608,7 +609,7 @@ U_CAPI void U_EXPORT2 usearch_setCollator( UStringSearch *strsrch,
 
 /**
 * Sets the pattern used for matching.
-* Internal data like the Boyer Moore table will be recalculated, but the
+* Internal data like the pattern collation elements will be recalculated, but the
 * iterator's position is unchanged.
 *
 * The UStringSearch retains a pointer to the pattern string. The caller must not
File diff suppressed because it is too large