Add check for unicode normalization in config #44
Reference: organicmaps/wikiparser#44
This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case, but I don't think that's worth it yet.
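A minimal sketch of what such a check could look like, assuming the `unicode-normalization` crate and a hypothetical `sections` slice of configured headers (not necessarily the PR's actual code):

```rust
use unicode_normalization::is_nfc;

/// Reject config sections that are not already in NFC, since
/// Wikipedia serves section headers in NFC and the comparison
/// downstream is bytewise.
fn check_sections_normalized(sections: &[String]) -> Result<(), String> {
    for s in sections {
        if !is_nfc(s) {
            return Err(format!("config section {s:?} is not NFC-normalized"));
        }
    }
    Ok(())
}
```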
Can the config values always be normalized after loading, and each wiki section be normalized before comparison? Or is there a comparison function that automatically handles normalization?
The wiki section headers are already normalized to NFC by Wikipedia, as explained here, with this exception (which I think is unlikely in our use case):
We could normalize the config values after loading and either normalize each header (to double-check) or not.
In this case, since normalization behavior is sometimes unexpected and most keyboards/websites already produce normalized text, I think it's better to make the change explicit. If someone copies the header from Wikipedia or types it out, it should already be normalized.
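For contrast, the implicit alternative would look something like this sketch (a hypothetical helper, not an existing function): silently normalize each configured value once at load time.

```rust
use unicode_normalization::UnicodeNormalization;

/// Silently normalize each configured section to NFC after loading,
/// instead of rejecting unnormalized input.
fn normalize_sections(sections: Vec<String>) -> Vec<String> {
    sections.into_iter().map(|s| s.nfc().collect()).collect()
}
```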
I'm not aware of a library that handles case-sensitive comparison under Unicode normalization. We could make a small wrapper around the `unicode-normalization` iterators with `.zip()` and `.all()`, as in the sketch below.
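A sketch of such a wrapper (a hypothetical helper, nothing from an existing crate). Note that a bare `.zip().all()` would stop at the end of the shorter iterator and so treat one string that is a prefix of the other as equal; `Iterator::eq` is the same element-by-element idea with the length check built in:

```rust
use unicode_normalization::UnicodeNormalization;

/// Case-sensitive equality under NFC normalization. `Iterator::eq`
/// compares element by element and also requires both iterators to
/// end together, which a plain `.zip().all()` would not.
fn nfc_eq(a: &str, b: &str) -> bool {
    a.nfc().eq(b.nfc())
}
```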
We could also do case-insensitive comparisons; there are:
regex also supports case-insensitive Unicode comparison; building a single regex automaton from all of the fixed strings for a language would probably be the best-performing solution. It doesn't support normalization, though.
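A sketch of that approach, assuming the `regex` crate and a hypothetical list of fixed headers; the resulting matcher is case-insensitive but still compares the text as written, with no normalization:

```rust
use regex::{escape, Regex, RegexBuilder};

/// Build one case-insensitive matcher from all of the fixed header
/// strings for a language; the alternation compiles into a single
/// automaton internally.
fn build_header_matcher(headers: &[&str]) -> Regex {
    let alternation: Vec<String> = headers.iter().map(|h| escape(h)).collect();
    RegexBuilder::new(&format!("^(?:{})$", alternation.join("|")))
        .case_insensitive(true)
        .build()
        .expect("escaped literals always form a valid pattern")
}
```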
Regardless, before choosing any of those methods I think we should aggregate all of the article headers and see if any are:
Then we could compare a couple of different methods and see if they give different results; a quick survey pass might look like the sketch below.
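Such a survey could be as simple as this sketch (hypothetical; `headers` stands in for whatever iterator the dump processing exposes):

```rust
use unicode_normalization::is_nfc;

/// Collect article headers that are not already in NFC, to gauge
/// whether normalized comparison matters in practice.
fn non_nfc_headers<'a>(headers: impl Iterator<Item = &'a str>) -> Vec<&'a str> {
    headers.filter(|h| !is_nfc(h)).collect()
}
```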
This PR just makes sure we don't have any characters in the wrong form for the current bytewise comparison.
Thanks for the details!
Ok, let's apply the current approach and only fix/react in the future if bugs are discovered.