Add check for unicode normalization in config #44

Merged
newsch merged 2 commits from normalization into main 2024-04-24 14:34:11 +00:00
newsch commented 2024-04-21 16:42:58 +00:00 (Migrated from github.com)

This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case, but I don't think that's worth it yet.

biodranik (Migrated from github.com) reviewed 2024-04-21 18:49:27 +00:00
biodranik (Migrated from github.com) left a comment

Can config values always be normalized after loading, and each wiki section be normalized before comparison? Or is there a comparison function that automatically handles normalization?

newsch commented 2024-04-22 16:00:41 +00:00 (Migrated from github.com)

Can config values always be normalized after loading, and each wiki section be normalized before comparison?

The wiki section headers are already normalized to NFC by Wikipedia, [as explained here](https://mediawiki.org/wiki/Unicode_normalization_considerations), with this exception (which I think would be unlikely in our use case):

MediaWiki doesn't apply any normalization to its output, for example `cafe<nowiki/>́` becomes "café" (shows U+0065 U+0301 in a row, without precomposed characters like U+00E9 appearing).

We could normalize the config values after loading and either normalize each header (to double-check) or not.

In this case, since normalization behavior is sometimes unexpected and most keyboards/websites are already normalized, I think it's better to make the change explicit. If someone copies the header from Wikipedia or types it out, it should already be normalized.

Or is there a comparison function that automatically handles normalization?

I'm not aware of a library that handles case-sensitive comparison under Unicode normalization. We could make a small wrapper around the `unicode-normalization` iterators with `.zip()` and `.all()`.

We could also do case-insensitive comparisons; there are:

  • [unicase](https://lib.rs/crates/unicase)
  • [caseless](https://lib.rs/crates/caseless/)
[regex](https://lib.rs/crates/regex/) also supports case-insensitive Unicode comparison; building a single regex automaton from all the fixed strings for a language would probably be the best-performing solution. It [doesn't support normalization though](https://github.com/rust-lang/regex/blob/b12a2761f91320bc8bf8246f88d2884a90034b5a/UNICODE.md).


Regardless, before choosing any of those methods I think we should aggregate all the headers of the articles and see if any are:

  • not normalized
  • mixed case

Then we could compare a couple of different methods and see if they give different results.

This PR just makes sure we don't have any characters in the wrong form for the current bytewise comparison.

biodranik (Migrated from github.com) approved these changes 2024-04-22 20:55:05 +00:00
biodranik (Migrated from github.com) left a comment

Thanks for the details!
OK, let's apply the current approach and only fix it later if bugs are discovered.
