[search] Improve search ranking behavior #1661
Labels
No labels
Accessibility
Accessibility
Address
Address
Android
Android
Android Auto
Android Auto
Android Automotive (AAOS)
Android Automotive (AAOS)
API
API
AppGallery
AppGallery
AppStore
AppStore
Battery and Performance
Battery and Performance
Blocker
Blocker
Bookmarks and Tracks
Bookmarks and Tracks
Borders
Borders
Bug
Bug
Build
Build
CarPlay
CarPlay
Classificator
Classificator
Community
Community
Core
Core
CrashReports
CrashReports
Cycling
Cycling
Desktop
Desktop
DevEx
DevEx
DevOps
DevOps
dev_sandbox
dev_sandbox
Directions
Directions
Documentation
Documentation
Downloader
Downloader
Drape
Drape
Driving
Driving
Duplicate
Duplicate
Editor
Editor
Elevation
Elevation
Enhancement
Enhancement
Epic
Epic
External Map Datasets
External Map Datasets
F-Droid
F-Droid
Fonts
Fonts
Frequently User Reported
Frequently User Reported
Fund
Fund
Generator
Generator
Good first issue
Good first issue
Google Play
Google Play
GPS
GPS
GSoC
GSoC
iCloud
iCloud
Icons
Icons
iOS
iOS
Legal
Legal
Linux Desktop
Linux Desktop
Linux packaging
Linux packaging
Linux Phone
Linux Phone
Mac OS
Mac OS
Map Data
Map Data
Metro
Metro
Navigation
Navigation
Need Feedback
Need Feedback
Night Mode
Night Mode
NLnet 2024-06-281
NLnet 2024-06-281
No Feature Parity
No Feature Parity
Opening Hours
Opening Hours
Outdoors
Outdoors
POI Info
POI Info
Privacy
Privacy
Public Transport
Public Transport
Raw Idea
Raw Idea
Refactoring
Refactoring
Regional
Regional
Regression
Regression
Releases
Releases
RoboTest
RoboTest
Route Planning
Route Planning
Routing
Routing
Ruler
Ruler
Search
Search
Security
Security
Styles
Styles
Tests
Tests
Track Recording
Track Recording
Translations
Translations
TTS
TTS
UI
UI
UX
UX
Walk Navigation
Walk Navigation
Watches
Watches
Web
Web
Wikipedia
Wikipedia
Windows
Windows
Won't fix
Won't fix
World Map
World Map
No milestone
No project
No assignees
1 participant
Due date
No due date set.
Dependencies
No dependencies set.
Reference: organicmaps/organicmaps-tmp#1661
Loading…
Add table
Reference in a new issue
No description provided.
Delete branch "improved-search"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This PR fixes some bugs in search rankings, and should make search more usable, especially for partial matches and buildings whose numbers aren't yet in OSM. I think it'll help with some of the issues linked in #1560
Here's a summary of changes:
Added new synonyms
Search can now recognize
rt
=route,ct
=court, and a few other synonyms.Disabled m_prefix
(Reverted this to prior behavior with better understanding of its role)
The current version of search pops the last word of a query and saves it to a variable called m_prefix. The prefix token was used differently than the rest of the query when ranking results, so a search for "S Valley Way" would be more-or-less truncated to "S Valley". As a result, the search was unable to differentiate between Way/St/Circle/etc.Searching for an address as "1600 Pennsylvania Ave" is very common, and the prefix handling made that very inaccurate.That said... I'm American. I'm not sure if the prefix handling serves an important role for users of other nationalities. Let me know if so and I'll revisit it. (All code to handle prefixes is still there; I just disabled it insearch/processor.cpp
)Fixed matchedLength
The current search computes matchLength based on the name of matched feature. That meant a search for "S Xenia St" could match "South Valley Highway" - they share the "South". But the highway is 18 characters long. That produced a matchedFraction of 18/8, which incorrectly increased the relevance of that match.
The new implementation computes and stores matchedLength as part of the GetNameScores routine. It computes match length on a token-by-token basis and uses the query length instead of feature length. In the prior example, that means that only the letter S matches. That yields a 1/8 matchedFraction score, which is much more reasonable.
Migrating matchedLength into nameScores allows
GetNameScores
to return just aNameScores
instead of apair<NameScores, size_t>
. That simplified accessing the data.The change also allowed for removing a few other places that computed matchLength, such as
ranker.cpp:195
.Leave suggestions available as results
The current implementation removes typing suggestions from search results. While that prevents duplication, it also keeps users from accessing the highly-ranked result. They'd have to click the suggestion and wait for the search to rerun.
The new implementation leaves suggestions alone, but also presents the suggestions as normal results.
Rewrote
GetNameScores
Most of the ranking logic lies in the ranking_utils.hpp implementation of
GetNameScores
. I was struggling to understand it as-was due to nondescript variable names and missing comments. The new version is heavily commented and behaves very similarly in unit testing.Unit tests
There are a few new test cases to explicitly test synonyms (s/south, st/street, etc). All tests now check matchedLength in addition to error count.
One was removed:failed since the tokenizing process strips spaces. Once spaces are removed, "Fran" does match "San Francisco".(Reenabling m_prefix tokenization restored this test)
Let me know what you think! Between playing around in unit testing and the desktop app, I'm much happier with the search results I'm seeing.
@ -45,1 +45,4 @@
};
// @todo These are worth reevaluating. A few issues (i.e. 1376) say
// that distant cities outrank nearby buildings & SUBPOIs when searching.
// Adjusting kDistanceToPivot or the values below would help with that.
It might be worth changing more of these. A few issues in the tracker talk about getting city entries from far away. Reducing the city/state/country bonuses below, or increasing
kCategoriesDistanceToPivot
would help with that.Thanks for the PR! Let's wait for the CI tests first. Search ranking is very sensitive part and unit/integration tests are the best/only quality criteria here.
I have doubts here, because m_prefix is not only for ranking, but for fetching results/suggestions first.
m_tokens work like full match, m_prefix works like prefix match.
Try "penselv" search query and check the suggestions.
Got it. I'll try turning prefixing back on and looking around a little. The prefix token was definitely being ignored in
GetNameScores
, but I can probably fix that less invasively.I'll go back and take a look at the tests. I was focusing on
search_tests
, but it looks like I broke some of thesearch_quality_tests
.Is this log really necessary? It would make sense to log matched synonym.
Nope. I'd meant to take that out when I finished debugging why S/South type matches weren't working. I'll push a new round of changes later today.
Fixed in 1da320 - I just removed the logging statement.
Reverted in 1da320 - adding synonym matching to the prefix token fixed the issues I was seeing with st/rd/etc matching without breaking the PREFIX nameScore.
Use "const &" in both functions.
Not sure here, but probably sequenced equal calls check for no side effect .. But, most likely, they are useless.
Add "@todo" marker here
I think it's testing the trailing space. Adding a trailing space forces a normal match instead of an
m_prefixToken
prefix match.But
abcd
doesn't normal-match toabcdstr
since there are three errors and length-four tokens only allow one error. It works for prefix matching since those accept partial matches.@ -182,0 +208,4 @@
if (sliceCount == tokenCount)
nameScore = NAME_SCORE_FULL_MATCH;
else
nameScore = NAME_SCORE_PREFIX;
The correspondent .cpp functions should also be fixed.
nit comments
@ -29,3 +35,4 @@
{"rt", {"route"}},
{"св", {"святой", "святого", "святая", "святые", "святых", "свято"}},
{"б", {"большая", "большой"}},
{"бол", {"большая", "большой"}},
@vng in Germany and Switzerland "str" is used for "strasse" (street).
Should these synonyms check the language of the map region and use only corresponding translation? @TODO it in the issue?
Why is it not removed now?
@ -45,1 +45,4 @@
};
// @todo These are worth reevaluating. A few issues (i.e. 1376) say
// that distant cities outrank nearby buildings & SUBPOIs when searching.
// Adjusting kDistanceToPivot or the values below would help with that.
Please add it as a todo comment.
Space after // here and around.
Is
enum
necessary here?nit: omit {} for single block bodies.
Please align all comments here and below (add space after //)
Please rebase on top of the master branch, check and fix all GH action check errors.
Ah, I see! Well, but it should, because "straße" suffix is generated as a separate token. This test is valid and should work.
@ -29,3 +35,4 @@
{"rt", {"route"}},
{"св", {"святой", "святого", "святая", "святые", "святых", "свято"}},
{"б", {"большая", "большой"}},
{"бол", {"большая", "большой"}},
strasse (straße) is a special case and it's processed separately.
See AddStrasseNames in search_index_builder.cpp
@ -29,3 +35,4 @@
{"rt", {"route"}},
{"св", {"святой", "святого", "святая", "святые", "святых", "свято"}},
{"б", {"большая", "большой"}},
{"бол", {"большая", "большой"}},
@bnitkin it is worth being added as a comment here.
@ -29,3 +35,4 @@
{"rt", {"route"}},
{"св", {"святой", "святого", "святая", "святые", "святых", "свято"}},
{"б", {"большая", "большой"}},
{"бол", {"большая", "большой"}},
@todo is a good idea. I was thinking about matching languages, but don't know how to get the map region language from
QueryParams::AddSynonyms()
.This fixes the third bullet in #1376.
vec
is a shallow copy of the search results, so erasing the element means there's an autocomplete match for the feature but not a result the user can select.The attached image shows an example. In the PR version (top) "S Quebec Street" is present both as an autocomplete hint and as the second result. In the bottom (current), it's only a suggestion. That forces the user to click the suggestion and rerun the search.
Doesn't look it. I'll get rid of it.
This test should work, because "strasse" suffix is separately processed in generator. Is this test-case breaks your logic somehow?
This test should work, because "strasse" suffix is separately processed in generator. Is this test-case breaks your logic somehow?
Hm. Going back to the master branch, there are two flavors of prefix matches, and I only support one right now. I'll fix that.
One case is where the prefix token partially matches the result token, like below. That's supported in my code.
The other case is where a non-prefix token matches the start of a query:
I'm currently parsing that as a SUBSTRING match, but as a human, PREFIX makes more sense.
Let me think on how to write that without adding too many special cases. I'll have something soon.
Well, "фото " should not match "фотоателье". But "abcdstrasse" is an exception because of German language specific.
Generator does index for "abcdstrasse" as 2 separate tokens like "abcd" and "strasse", so "abcd " query will match the first token.
I think this is the same issue as in the "abcd " test - I'll fix my ranking code and put them both back in.
Great, waiting for checks. And the last commit is bad. You should make "git submodule update --recursive" after rebase, instead of making a new commit with old boost ref.
CI checks are ok. I approve, but remove boost submodule change and merge all changes as 1 commit!
I force-undid the submodule commit. Sorry about that - I'm not familiar with submodules. It's rebuilding & running tests locally now.
Should I squash the branch down to one commit, or do you make that happen on merge?
Thanks for all the feedback! I'm really excited to see this change in a release.
Squash by yourself, please.
Squished and retested locally. Let me know if you need anything else from me on this!
Comment this log, please.
Can do.
@bnitkin Congrats with your first merged PR!
Wait, but then the result (if it is NOT a street, like in #1376) should be removed from the suggestion: existing logic should be inverted. Why user needs the result (not a street) in the suggestion? The case for streets is to type the full address with fewer clicks. That's why they need a suggestion (and you're right, they shouldn't be displayed in results).
To clarify about german streets: they are often named like "Bahnhofstraße" (ß is normalized to "ss"). But people often type them as "Bahnhof strasse" or "Bahnhofstr" or "Bahnhofstr."
I think it already works that way - suggestions require the
RankerResult
to be a country, city, or street:But I'm not sure how that intersects with #1376. Maybe there's a street that's also called Carrefour that was converted to a suggestion, and the search never found the SUBPOI?
Well, this change is not about #1376.
The logic of "i = vec.erase(i);" - If some street (country, city) result goes to the suggestion part, it will NOT go to the results list. Actually, I can't say for sure is it good or bad, so I also agree to change this logic and test.
Now we have "South Quebec Street" as a suggestion AND as a result (see highway-secondary).
What about this approach?
At least for the US, I think it's important to present streets as results.
As mentioned in #1263, lots of people have trouble when the address they entered doesn't exist. In the US, there are many areas where roads are mapped fully but street numbers aren't. Returning the street as a result is an easy fallback to get a user close to the right place.
That's what I was thinking in including the suggestions as results. (It'd obviously be nicer if all buildings existed and had addresses, but that's a much larger project.)
Users will see their street as a suggestion (but not in results, saving potential space for another relevant result or suggestion). And they click on the suggested street. And it will appear in results after the click was done (and disappear from suggestions).
The main idea with suggestions is to save the number of clicks and avoid potential mistypes. The most popular pattern is to type several first street letters, then click a suggestion, and then type the house number.
In the US, however, people usually type the number first and then type the street name (and here we don't show suggestions).
If there is a house with a number, then it should be in results (and no street alone, of course). If there is no house, then only street should be displayed in results (but not in suggestion in both cases).
I'd like to check how it will go with street results. And I like it so far.