Inside of locimp_forLanguageTag() in _appendKeywords() in uloc_tag.cpp
there's a hardcoded special case for "-u-va-posix" which appends the
"_POSIX" variant but this was missed during the refactoring made for
ICU-22520 (there isn't any test case that covers this).
So the call to locimp_forLanguageTag() did more than previously
understood, but we still don't want to have to call that for every
language tag that has BCP-47 extensions just in order to get to this
special case. Instead, add a special case also to ulocimp_getSubtags().
For this to work nicely, the loop in _getVariant() that copies variants
needs to be refactored so that it easily can break when encountering the
start of any BCP-47 extension (which also has the welcome side-effect of
making it more efficient, being able to append an entire variant at once
to the output sink).
This was broken by commit 678d5c1273.
These optional output parameters weren't used when these function were
originally added so they were most likely included just in case someone
would want to use them in the future, but that was 10 years ago now and
they still haven't been used yet, so it's unlikely that they'll be used
in the foreseeable future and call sites as well as the implementation
can instead be simplified by removing them.
The repeated sequence of allocating a CharString and CharStringByteSink,
before calling some function that writes into this, can be moved into a
single shared helper function which then is used to give all ulocimp.h
functions that write to ByteSink an overload that instead returns a
CharString, to make call sites look like perfectly normal C++ code.
The repeated sequence of allocating a CheckedArrayByteSink, calling some
function that writes into this, then checking for overflow and returning
through u_terminateChars() can all be moved into a single shared helper
function.
There's a subtle difference between these two ways of accessing the
value of an optional and that is that the value() method can throw an
exception if there isn't any value, but operator* won't do that (it's
just undefined behavior if there isn't any value).
ICU4C code never tries to access any optional value without first
checking that it exists, but the ability of the value() method to throw
an exception in case there wasn't any such check first is the reason why
std::exception symbols previously could show up in debug builds.
This reverts the changes that were made to dependencies.txt by
commit dc70b5a056.
· No function should do anything if an error has already occurred.
· On error, a value of 0, nullptr, {}, etc., should be returned.
· Values shouldn't have overloaded meanings (eg. index or found).
· Values that are never used should not be returned at all.
Some of this code was originally written as C code and some of this code
was originally written as C++ code but made to resemble the then already
existing code that had once been C code. Changing it all to normal C++
now will make it easier and safer to work with going forward.
· Use unnamed namespace instead of static.
· Use reference instead of non-nullable pointer.
· Use bool instead of UBool.
· Use constexpr for static data.
· Use U_EXPORT instead of U_CAPI or U_CFUNC.
· Use the default calling convention instead of U_EXPORT2.
These functions that eventually write their output to a ByteSink need a
small temporary buffer for processing the subtag they're about to write
and currently use a local CharString object to provide this buffer,
which then gets written to the ByteSink and discarded.
This intermediate step is unnecessary as a ByteSink can provide an
append buffer which can be used instead, eliminating the need to
allocate a local temporary buffer and to copy the data around.
This approach also makes it natural to split the processing into two
steps, first calculating the length of the subtag, then processing it,
which makes it possible to return early when no output is requested.
These wrappers that call ulocimp_getSubtags() to get only one particular
subtag and then return that as icu::CharString will be convenient for
replacing code that currently calls the uloc_get*() functions writing
into a fixed size buffer.
These functions now no longer have any other callers so they can be made
internal to the compilation unit of ulocimp_getSubtags(), thus bringing
them back to how they originally were intended to be used (and making
the comment above them true once again).
This also makes it possible to remove the temporary icu::CharString
objects that previously were returned to callers and instead write
directly to icu::ByteSink, making the code both simpler and less
wasteful (also that how this was once intended).
The logic for parsing a localeID string into its constituent subtags is
currently repeated over and over again in each one of the uloc_get*()
functions, so that calling all these functions one after the other in
order to get all the subtags does the parsing all over again from the
beginning for each function call.
In order to avoid having to do this parsing over and over again, a lot
of code instead has its own copy of the parsing logic in order to call
the underlying ulocimp_get*() functions directly for lower runtime cost
at the price of increased code complexity and repetition.
This new ulocimp_getSubtags() function, which writes natively to
icu::ByteSink and has a convenience wrapper to write to icu::CharString,
removes the repeated code from the uloc_get*() functions and makes it
possible to update all code that calls the ulocimp_get*() functions.
Originally added by commit 24055f8585
for ICU-7882, converting any language tag with a BCP-47 extension into a
legacy Unicode locale ID was a simple way to make the existing code keep
working unchanged also with BCP-47 extensions.
But the only thing that uloc_getVariant() needs is being able to find
out where variants end and extensions begin, for which converting the
entire language tag is unnecessary, it's much more straightforward to
instead just check for the -t-, -u- or -x- marker that indicates the
start of a BCP-47 extension.
This is the normal standard way in C, C++ as well as Java and there's no
longer any reason for ICU to be different. The various internal macros
providing custom boolean constants can all be deleted and code as well
as documentation can be updated to use lowercase true/false everywhere.