Inside of locimp_forLanguageTag() in _appendKeywords() in uloc_tag.cpp
there's a hardcoded special case for "-u-va-posix" which appends the
"_POSIX" variant but this was missed during the refactoring made for
ICU-22520 (there isn't any test case that covers this).
So the call to locimp_forLanguageTag() did more than previously
understood, but we still don't want to have to call that for every
language tag that has BCP-47 extensions just in order to get to this
special case. Instead, add a special case also to ulocimp_getSubtags().
For this to work nicely, the loop in _getVariant() that copies variants
needs to be refactored so that it easily can break when encountering the
start of any BCP-47 extension (which also has the welcome side-effect of
making it more efficient, being able to append an entire variant at once
to the output sink).
This was broken by commit 678d5c1273.
Document the fact
uloc_getDisplay(Language|Script|Country|Variant|Keyword|KeywordValue)
would fallback with the code, case canonicalied in same cases, and
set the status to U_USING_DEFAULT_WARNING.
No change to the implementation behavior. Only complete the missing
comments and tweak line wrap, remove double spaces and add test to
validate this pre-existing behavior that I added the documents now.
This eliminates the need for a scratch buffer in Locale::toLanguageTag()
and also the need for counting bytes required in uloc_toLanguageTag(),
something that ByteSink will now handle correctly and thereby
eliminating the bug where too few bytes required was returned.
uloc_forLanguageTag has a few mapping tables to map grandfathered
language tags and deprecated language subtags to their preferred or
modern values.
Update them based on the latest version of the IANA
language subtag registry. [1]
Five grandfathered tags without a preferred value are still mapped to
what ICU has mapped them to for backward compatibility until the
wisdom of continuing to do so is reviewed.
In addition, map redundant language tags to their preferred values
regardless of whether they're followed by other subtags or not. (e.g.
zh-yue vs zh-yue-u-co-pinyin) .
Similary, ja-latn-hepburn-heploc is mapped to ja-latn-alaic97 (the
variant subtag 'hepburn-helploc' with the prefix 'ja-latn' has the
preferred value, 'alaic97') .
Update the mapping for deprecated language subtags (e.g. 'jw' to
'jv' and a bunch of 3-letter language codes).
Add a new table for deprecated region subtags to map them to their
modern values. (e.g. 'DD' to 'DE').
Add a new test case for deprecated language and region mapping and
a few more cases for updated grandfathered and redundant tag mapping.
[1]
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
This gets rid of those fixed buffers that caused ICU-13417 to be filed
in the first place, those that prevent handling language tags with very
large amounts of keywords.
A number of fixed buffers will still remain in uloc_tag.cpp (and
elsewhere in the locale handling code) for the time being, but this
change is a necessary first step in cleaning up this code and will
alleviate the most pressing problem encountered by ICU4C users.
An off-by-one error in _getKeywords() caused uloc_canonicalize() to not
write out the final keyword value in case the result would fill up the
buffer exactly, resulting in U_STRING_NOT_TERMINATED_WARNING.
* ICU-20140 Allow duplicated keys in U-extension per RFC 6067
RFC 6067 [1] does allow duplicate keywords, but ICU4C's
uloc_forLanguageCode rejects it as invalid.
Change it to accept duplicate keywords and honor only the
1st one while ignoring subsequent ones per RFC 6067.
[1] Unicode extension to BCP 47:
https://tools.ietf.org/html/rfc6067
* ICU-20140 Add ICU4J test and tweak ICU4C test
ICU4J test diverges from ICU4C tests:
1. Handling of duplicate variants in ICU4J seem to be wrong:
https://unicode-org.atlassian.net/browse/ICU-20148
2. ULocale.forLanguageTag only throws NullPointException so
that ICU4C's test for duplicate attributes cannot be ported.