From 50df916b5072400c9bde781e2fd9202f27bde9e4 Mon Sep 17 00:00:00 2001 From: Nemanja Trifunovic Date: Sat, 21 Oct 2023 16:56:49 -0400 Subject: [PATCH] Update README.md Restructure the reference, add installation instructions, toc, other minor changes --- README.md | 1125 +++++++++++++++++++++++++++++++---------------------- 1 file changed, 667 insertions(+), 458 deletions(-) diff --git a/README.md b/README.md index 4b0cad0..bf0c3bc 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,124 @@ + + # UTF8-CPP: UTF-8 with C++ in a Portable Way + ## Introduction -C++ developers miss an easy and portable way of handling Unicode encoded strings. The original C++ Standard (known as C++98 or C++03) is Unicode agnostic. C++11 provides some support for Unicode on core language and library level: u8, u, and U character and string literals, char16_t and char32_t character types, u16string and u32string library classes, and codecvt support for conversions between Unicode encoding forms. In the meantime, developers use third party libraries like ICU, OS specific capabilities, or simply roll out their own solutions. +C++ developers still miss an easy and portable way of handling Unicode encoded strings. The original C++ standard (known as C++98 or C++03) is Unicode agnostic. Some progress has been made in the later editions of the standard, but it is still hard to work with Unicode using only the standard facilities. -In order to easily handle UTF-8 encoded Unicode strings, I came up with a small, C++98 compatible generic library. For anybody used to work with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the [license](./LICENSE). The library has been used a lot in the past ten years both in commercial and open-source projects and is considered feature-complete now. If you run into bugs or performance issues, please let me know and I'll do my best to address them. 
+I came up with a small, C++98 compatible generic library in order to handle UTF-8 encoded strings. For anybody used to work with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the [license](./LICENSE). The library has been used a lot since the first release in 2006 both in commercial and open-source projects and proved to be stable and useful. -The purpose of this article is not to offer an introduction to Unicode in general, and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out [Unicode Home Page](http://www.unicode.org/) or some other source of information for Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it. +## Table of Contents +- [UTF8-CPP: UTF-8 with C++ in a Portable Way](#utf8-cpp-utf-8-with-c-in-a-portable-way) + * [Introduction](#introduction) + * [Installation](#installation) + * [Examples of use](#examples-of-use) + + [Introductory Sample](#introductory-sample) + + [Checking if a file contains valid UTF-8 text](#checking-if-a-file-contains-valid-utf-8-text) + + [Ensure that a string contains valid UTF-8 text](#ensure-that-a-string-contains-valid-utf-8-text) + * [Points of interest](#points-of-interest) + - [Design goals and decisions](#design-goals-and-decisions) + - [Alternatives](#alternatives) + * [Reference](#reference) + + [Functions From utf8 Namespace](#functions-from-utf8-namespace) + - [utf8::append](#utf8append) + * [octet_iterator append(utfchar32_t cp, octet_iterator result)](#octet_iterator-appendutfchar32_t-cp-octet_iterator-result) + * [void append(utfchar32_t cp, std::string& s);](#void-appendutfchar32_t-cp-stdstring-s) + - [utf8::append16](#utf8append16) + * [word_iterator append16(utfchar32_t cp, word_iterator result)](#word_iterator-append16utfchar32_t-cp-word_iterator-result) + * [void 
append(utfchar32_t cp, std::u16string& s)](#void-appendutfchar32_t-cp-stdu16string-s) + - [utf8::next](#utf8next) + - [utf8::next16](#utf8next16) + - [utf8::peek_next](#utf8peek_next) + - [utf8::prior](#utf8prior) + - [utf8::advance](#utf8advance) + - [utf8::distance](#utf8distance) + - [utf8::utf16to8](#utf8utf16to8) + * [octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result)](#octet_iterator-utf16to8-u16bit_iterator-start-u16bit_iterator-end-octet_iterator-result) + * [std::string utf16to8(const std::u16string& s)](#stdstring-utf16to8const-stdu16string-s) + * [std::string utf16to8(std::u16string_view s)](#stdstring-utf16to8stdu16string_view-s) + - [utf8::utf16tou8](#utf8utf16tou8) + * [std::u8string utf16tou8(const std::u16string& s)](#stdu8string-utf16tou8const-stdu16string-s) + * [std::u8string utf16tou8(const std::u16string_view& s)](#stdu8string-utf16tou8const-stdu16string_view-s) + - [utf8::utf8to16](#utf8utf8to16) + * [u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result)](#u16bit_iterator-utf8to16-octet_iterator-start-octet_iterator-end-u16bit_iterator-result) + * [std::u16string utf8to16(const std::string& s)](#stdu16string-utf8to16const-stdstring-s) + * [std::u16string utf8to16(std::string_view s)](#stdu16string-utf8to16stdstring_view-s) + * [std::u16string utf8to16(std::u8string& s)](#stdu16string-utf8to16stdu8string-s) + * [std::u16string utf8to16(std::u8string_view& s)](#stdu16string-utf8to16stdu8string_view-s) + - [utf8::utf32to8](#utf8utf32to8) + * [octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result)](#octet_iterator-utf32to8-u32bit_iterator-start-u32bit_iterator-end-octet_iterator-result) + * [std::string utf32to8(const std::u32string& s)](#stdstring-utf32to8const-stdu32string-s) + * [std::u8string utf32to8(const std::u32string& s)](#stdu8string-utf32to8const-stdu32string-s) + * [std::u8string utf32to8(const std::u32string_view& 
s)](#stdu8string-utf32to8const-stdu32string_view-s) + * [std::string utf32to8(const std::u32string& s)](#stdstring-utf32to8const-stdu32string-s-1) + * [std::string utf32to8(std::u32string_view s)](#stdstring-utf32to8stdu32string_view-s) + - [utf8::utf8to32](#utf8utf8to32) + * [u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result)](#u32bit_iterator-utf8to32-octet_iterator-start-octet_iterator-end-u32bit_iterator-result) + * [std::u32string utf8to32(const std::u8string& s)](#stdu32string-utf8to32const-stdu8string-s) + * [std::u32string utf8to32(const std::u8string_view& s)](#stdu32string-utf8to32const-stdu8string_view-s) + * [std::u32string utf8to32(const std::string& s)](#stdu32string-utf8to32const-stdstring-s) + * [std::u32string utf8to32(std::string_view s)](#stdu32string-utf8to32stdstring_view-s) + - [utf8::find_invalid](#utf8find_invalid) + * [octet_iterator find_invalid(octet_iterator start, octet_iterator end)](#octet_iterator-find_invalidoctet_iterator-start-octet_iterator-end) + * [const char* find_invalid(const char* str)](#const-char-find_invalidconst-char-str) + * [std::size_t find_invalid(const std::string& s)](#stdsize_t-find_invalidconst-stdstring-s) + * [std::size_t find_invalid(std::string_view s)](#stdsize_t-find_invalidstdstring_view-s) + - [utf8::is_valid](#utf8is_valid) + * [bool is_valid(octet_iterator start, octet_iterator end)](#bool-is_validoctet_iterator-start-octet_iterator-end) + * [bool is_valid(const char* str)](#bool-is_validconst-char-str) + * [bool is_valid(const std::string& s)](#bool-is_validconst-stdstring-s) + * [bool is_valid(std::string_view s)](#bool-is_validstdstring_view-s) + - [utf8::replace_invalid](#utf8replace_invalid) + * [output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, utfchar32_t replacement)](#output_iterator-replace_invalidoctet_iterator-start-octet_iterator-end-output_iterator-out-utfchar32_t-replacement) + * [std::string 
replace_invalid(const std::string& s, utfchar32_t replacement)](#stdstring-replace_invalidconst-stdstring-s-utfchar32_t-replacement) + * [std::string replace_invalid(std::string_view s, char32_t replacement)](#stdstring-replace_invalidstdstring_view-s-char32_t-replacement) + - [utf8::starts_with_bom](#utf8starts_with_bom) + * [bool starts_with_bom (octet_iterator it, octet_iterator end)](#bool-starts_with_bom-octet_iterator-it-octet_iterator-end) + * [bool starts_with_bom(const std::string& s)](#bool-starts_with_bomconst-stdstring-s) + * [bool starts_with_bom(std::string_view s)](#bool-starts_with_bomstdstring_view-s) + + [Types From utf8 Namespace](#types-from-utf8-namespace) + - [utf8::exception](#utf8exception) + - [utf8::invalid_code_point](#utf8invalid_code_point) + - [utf8::invalid_utf8](#utf8invalid_utf8) + - [utf8::invalid_utf16](#utf8invalid_utf16) + - [utf8::not_enough_room](#utf8not_enough_room) + - [utf8::iterator](#utf8iterator) + * [Member functions](#member-functions) + + [Functions From utf8::unchecked Namespace](#functions-from-utf8unchecked-namespace) + - [utf8::unchecked::append](#utf8uncheckedappend) + - [utf8::unchecked::append16](#utf8uncheckedappend16) + - [utf8::unchecked::next](#utf8uncheckednext) + - [utf8::next16](#utf8next16-1) + - [utf8::unchecked::peek_next](#utf8uncheckedpeek_next) + - [utf8::unchecked::prior](#utf8uncheckedprior) + - [utf8::unchecked::advance](#utf8uncheckedadvance) + - [utf8::unchecked::distance](#utf8uncheckeddistance) + - [utf8::unchecked::utf16to8](#utf8uncheckedutf16to8) + - [utf8::unchecked::utf8to16](#utf8uncheckedutf8to16) + - [utf8::unchecked::utf32to8](#utf8uncheckedutf32to8) + - [utf8::unchecked::utf8to32](#utf8uncheckedutf8to32) + - [utf8::unchecked::replace_invalid](#utf8uncheckedreplace_invalid) + + [Types From utf8::unchecked Namespace](#types-from-utf8unchecked-namespace) + - [utf8::iterator](#utf8iterator-1) + * [Member functions](#member-functions-1) + + + + + +## Installation + +The recommended way 
to use the library is to download an official release and copy the contents of the source directory into the location of your project's header files. +If you use CMake for your builds, I still recommend just copying the files into your project, but if you want you can use the CMakeLists.txt file included in the project. + + ## Examples of use + ### Introductory Sample To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, and converts it to UTF-16 encoding and back to UTF-8: @@ -100,6 +208,7 @@ In case you do not trust the `__cplusplus` macro or, for instance, do not want t the C++ 11 helper functions even with a modern compiler, define `UTF_CPP_CPLUSPLUS` macro before including `utf8.h` and assign it a value for the standard you want to use - the values are the same as for the `__cplusplus` macro. This can be also useful with compilers that are conservative in setting the `__cplusplus` macro even if they have a good support for a recent standard edition - Microsoft's Visual C++ is one example. + ### Checking if a file contains valid UTF-8 text Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into the memory: @@ -126,6 +235,7 @@ Note that other functions that take input iterator arguments can be used in a si utf8::utf8to16(it, eos, back_inserter(u16string)); ``` + ### Ensure that a string contains valid UTF-8 text If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used: @@ -142,8 +252,10 @@ void fix_utf8_string(std::string& str) The function will replace any invalid UTF-8 sequence with a Unicode replacement character. There is an overloaded function that enables the caller to supply their own replacement character. 
+ ## Points of interest + #### Design goals and decisions The library was designed to be: @@ -153,9 +265,10 @@ The library was designed to be: 3. Lightweight: follow the "pay only for what you use" guideline. 4. Unintrusive: avoid forcing any particular design or even programming style on the user. This is a library, not a framework. + #### Alternatives -Here is an article I was made aware of only recently: [The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust)](https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape), by JeanHeyd Meneide. In the article, this library is compared with: +For alternatives and comparisons, I recommend the following article: [The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust)](https://thephd.dev/the-c-c++-rust-string-text-encoding-api-landscape), by JeanHeyd Meneide. In the article, this library is compared with: - [simdutf](https://github.com/simdutf/simdutf) - [iconv](https://www.gnu.org/software/libiconv/) @@ -167,35 +280,17 @@ Here is an article I was made aware of only recently: [The Wonderfully Terrible The article presents author's view of the quality of the API design, but also some speed benchmarks. + ## Reference + ### Functions From utf8 Namespace + #### utf8::append -Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0. - -Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string. - -```cpp -void append(utfchar32_t cp, std::string& s); -``` - -`cp`: a code point to append to the string. -`s`: a utf-8 encoded string to append the code point to. - -Example of use: - -```cpp -std::string u; -append(0x0448, u); -assert (u[0] == char(0xd1) && u[1] == char(0x88) && u.length() == 2); -``` - -In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown. 
- - -#### utf8::append + +##### octet_iterator append(utfchar32_t cp, octet_iterator result) Available in version 1.0 and later. @@ -223,30 +318,35 @@ Note that `append` does not allocate any memory - it is the burden of the caller In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown. -#### utf8::append16 -Available in version 4.0 and later. Requires a C++11 compliant compiler. + +##### void append(utfchar32_t cp, std::string& s); -Encodes a 32 bit code point as a UTF-16 sequence of words and appends the sequence to a UTF-16 string. +Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0. + +Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string. ```cpp -void append(utfchar32_t cp, std::u16string& s); +void append(utfchar32_t cp, std::string& s); ``` `cp`: a code point to append to the string. -`s`: a utf-16 encoded string to append the code point to. +`s`: a utf-8 encoded string to append the code point to. Example of use: ```cpp -std::u16string u; +std::string u; append(0x0448, u); -assert (u[0] == 0x0448 && u.length() == 1); +assert (u[0] == char(0xd1) && u[1] == char(0x88) && u.length() == 2); ``` In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown. + #### utf8::append16 + +##### word_iterator append16(utfchar32_t cp, word_iterator result) Available in version 4.0 and later. @@ -275,6 +375,32 @@ Note that `append16` does not allocate any memory - it is the burden of the call In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown. + +##### void append(utfchar32_t cp, std::u16string& s) + +Available in version 4.0 and later. Requires a C++11 compliant compiler. + +Encodes a 32 bit code point as a UTF-16 sequence of words and appends the sequence to a UTF-16 string. 
+ +```cpp +void append(utfchar32_t cp, std::u16string& s); +``` + +`cp`: a code point to append to the string. +`s`: a utf-16 encoded string to append the code point to. + +Example of use: + +```cpp +std::u16string u; +append(0x0448, u); +assert (u[0] == 0x0448 && u.length() == 1); +``` + +In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown. + + + #### utf8::next Available in version 1.0 and later. @@ -305,6 +431,7 @@ This function is typically used to iterate through a UTF-8 encoded string. In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + #### utf8::next16 Available in version 4.0 and later. @@ -336,6 +463,7 @@ This function is typically used to iterate through a UTF-16 encoded string. In case of an invalid UTF-16 sequence, a `utf8::invalid_utf8` exception is thrown. + #### utf8::peek_next Available in version 2.1 and later. @@ -365,6 +493,7 @@ assert (w == twochars); In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + #### utf8::prior Available in version 1.02 and later. @@ -399,6 +528,7 @@ In case `start` is reached before a UTF-8 lead octet is hit, or if an invalid UT In case `start` equals `it`, a `not_enough_room` exception is thrown. + #### utf8::advance Available in version 1.0 and later. @@ -428,6 +558,7 @@ assert (w == twochars); In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown. + #### utf8::distance Available in version 1.0 and later. @@ -456,102 +587,10 @@ This function is used to find the length (in code points) of a UTF-8 encoded str In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. If `last` does not point to the past-of-end of a UTF-8 sequence, a `utf8::not_enough_room` exception is thrown. + #### utf8::utf16to8 - -Available in version 3.0 and later. Requires a C++ 11 compliant compiler. - -Converts a UTF-16 encoded string to UTF-8. 
- -```cpp -std::string utf16to8(const std::u16string& s); -``` - -`s`: a UTF-16 encoded string. -Return value: A UTF-8 encoded string. - -Example of use: - -```cpp - u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; - string u = utf16to8(utf16string); - assert (u.size() == 10); -``` - -In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. - -#### utf8::utf16to8 - -Available in version 3.2 and later. Requires a C++ 17 compliant compiler. - -Converts a UTF-16 encoded string to UTF-8. - -```cpp -std::string utf16to8(std::u16string_view s); -``` - -`s`: a UTF-16 encoded string. -Return value: A UTF-8 encoded string. - -Example of use: - -```cpp - u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; - u16string_view utf16stringview(u16string); - string u = utf16to8(utf16string); - assert (u.size() == 10); -``` - -In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. - -#### utf8::utf16tou8 - -Available in version 4.0 and later. Requires a C++ 20 compliant compiler. - -Converts a UTF-16 encoded string to UTF-8. - -```cpp -std::u8string utf16tou8(const std::u16string& s); -``` - -`s`: a UTF-16 encoded string. -Return value: A UTF-8 encoded string. - -Example of use: - -```cpp - u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; - u8string u = utf16to8(utf16string); - assert (u.size() == 10); -``` - -In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. - -#### utf8::utf16tou8 - -Available in version 4.0 and later. Requires a C++ 20 compliant compiler. - -Converts a UTF-16 encoded string to UTF-8. - -```cpp -std::u8string utf16tou8(const std::u16string_view& s); -``` - -`s`: a UTF-16 encoded string. -Return value: A UTF-8 encoded string. 
- -Example of use: - -```cpp - u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; - u16string_view utf16stringview(u16string); - u8string u = utf16to8(utf16string); - assert (u.size() == 10); -``` - -In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. - - -#### utf8::utf16to8 + +##### octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result) Available in version 1.0 and later. @@ -580,110 +619,111 @@ assert (utf8result.size() == 10); In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. -#### utf8::utf8to16 + + +##### std::string utf16to8(const std::u16string& s) Available in version 3.0 and later. Requires a C++ 11 compliant compiler. -Converts an UTF-8 encoded string to UTF-16. +Converts a UTF-16 encoded string to UTF-8. ```cpp -std::u16string utf8to16(const std::string& s); +std::string utf16to8(const std::u16string& s); ``` -`s`: an UTF-8 encoded string to convert. -Return value: A UTF-16 encoded string +`s`: a UTF-16 encoded string. +Return value: A UTF-8 encoded string. Example of use: ```cpp -string utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; -u16string utf16result = utf8to16(utf8_with_surrogates); -assert (utf16result.length() == 4); -assert (utf16result[2] == 0xd834); -assert (utf16result[3] == 0xdd1e); + u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; + string u = utf16to8(utf16string); + assert (u.size() == 10); ``` -In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. +In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. -#### utf8::utf8to16 + +##### std::string utf16to8(std::u16string_view s) Available in version 3.2 and later. Requires a C++ 17 compliant compiler. -Converts an UTF-8 encoded string to UTF-16. +Converts a UTF-16 encoded string to UTF-8. 
```cpp -std::u16string utf8to16(std::string_view s); +std::string utf16to8(std::u16string_view s); ``` -`s`: an UTF-8 encoded string to convert. -Return value: A UTF-16 encoded string +`s`: a UTF-16 encoded string. +Return value: A UTF-8 encoded string. Example of use: ```cpp -string_view utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; -u16string utf16result = utf8to16(utf8_with_surrogates); -assert (utf16result.length() == 4); -assert (utf16result[2] == 0xd834); -assert (utf16result[3] == 0xdd1e); + u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; + u16string_view utf16stringview(utf16string); + string u = utf16to8(utf16stringview); + assert (u.size() == 10); ``` -In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. +In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. -#### utf8::utf8to16 + +#### utf8::utf16tou8 + +##### std::u8string utf16tou8(const std::u16string& s) Available in version 4.0 and later. Requires a C++ 20 compliant compiler. -Converts an UTF-8 encoded string to UTF-16. +Converts a UTF-16 encoded string to UTF-8. ```cpp -std::u16string utf8to16(std::u8string& s); +std::u8string utf16tou8(const std::u16string& s); ``` -`s`: an UTF-8 encoded string to convert. -Return value: A UTF-16 encoded string +`s`: a UTF-16 encoded string. +Return value: A UTF-8 encoded string. Example of use: ```cpp -std::u8string utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; -std::u16string utf16result = utf8to16(utf8_with_surrogates); -assert (utf16result.length() == 4); -assert (utf16result[2] == 0xd834); -assert (utf16result[3] == 0xdd1e); + u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; + u8string u = utf16tou8(utf16string); + assert (u.size() == 10); ``` -In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. +In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. 
- -#### utf8::utf8to16 + +##### std::u8string utf16tou8(const std::u16string_view& s) Available in version 4.0 and later. Requires a C++ 20 compliant compiler. -Converts an UTF-8 encoded string to UTF-16. +Converts a UTF-16 encoded string to UTF-8. ```cpp -std::u16string utf8to16(std::u8string_view& s); +std::u8string utf16tou8(const std::u16string_view& s); ``` -`s`: an UTF-8 encoded string to convert. -Return value: A UTF-16 encoded string +`s`: a UTF-16 encoded string. +Return value: A UTF-8 encoded string. Example of use: ```cpp -std::u8string utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; -std::u8string_view utf8stringview {utf8_with_surrogates} -std::u16string utf16result = utf8to16(utf8stringview); -assert (utf16result.length() == 4); -assert (utf16result[2] == 0xd834); -assert (utf16result[3] == 0xdd1e); + u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; + u16string_view utf16stringview(utf16string); + u8string u = utf16tou8(utf16stringview); + assert (u.size() == 10); ``` -In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. - +In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown. + #### utf8::utf8to16 + +##### u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result) Available in version 1.0 and later. @@ -713,127 +753,120 @@ assert (utf16result[3] == 0xdd1e); In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. If `end` does not point to the past-of-end of a UTF-8 sequence, a `utf8::not_enough_room` exception is thrown. -#### utf8::utf32to8 + + + +##### std::u16string utf8to16(const std::string& s) Available in version 3.0 and later. Requires a C++ 11 compliant compiler. -Converts a UTF-32 encoded string to UTF-8. +Converts an UTF-8 encoded string to UTF-16. ```cpp -std::string utf32to8(const std::u32string& s); +std::u16string utf8to16(const std::string& s); ``` -`s`: a UTF-32 encoded string. 
-Return value: a UTF-8 encoded string. +`s`: an UTF-8 encoded string to convert. +Return value: A UTF-16 encoded string Example of use: ```cpp -u32string utf32string = {0x448, 0x65E5, 0x10346}; -string utf8result = utf32to8(utf32string); -assert (utf8result.size() == 9); +string utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; +u16string utf16result = utf8to16(utf8_with_surrogates); +assert (utf16result.length() == 4); +assert (utf16result[2] == 0xd834); +assert (utf16result[3] == 0xdd1e); ``` -In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. - -#### utf8::utf32tou8 - -Available in version 4.0 and later. Requires a C++ 20 compliant compiler. - -Converts a UTF-32 encoded string to UTF-8. - -```cpp -std::u8string utf32to8(const std::u32string& s); -``` - -`s`: a UTF-32 encoded string. -Return value: a UTF-8 encoded string. - -Example of use: - -```cpp -u32string utf32string = {0x448, 0x65E5, 0x10346}; -u8string utf8result = utf32to8(utf32string); -assert (utf8result.size() == 9); -``` - -In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. -#### utf8::utf32tou8 - -Available in version 4.0 and later. Requires a C++ 20 compliant compiler. - -Converts a UTF-32 encoded string to UTF-8. - -```cpp -std::u8string utf32to8(const std::u32string_view& s); -``` - -`s`: a UTF-32 encoded string. -Return value: a UTF-8 encoded string. - -Example of use: - -```cpp -u32string utf32string = {0x448, 0x65E5, 0x10346}; -u32string_view utf32stringview(utf32string); -u8string utf8result = utf32to8(utf32stringview); -assert (utf8result.size() == 9); -``` - -In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. - - -#### utf8::utf32to8 - -Available in version 3.0 and later. Requires a C++ 11 compliant compiler. - -Converts a UTF-32 encoded string to UTF-8. 
- -```cpp -std::string utf32to8(const std::u32string& s); -``` - -`s`: a UTF-32 encoded string. -Return value: a UTF-8 encoded string. - -Example of use: - -```cpp -u32string utf32string = {0x448, 0x65E5, 0x10346}; -string utf8result = utf32to8(utf32string); -assert (utf8result.size() == 9); -``` - -In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. - -#### utf8::utf32to8 + +##### std::u16string utf8to16(std::string_view s) Available in version 3.2 and later. Requires a C++ 17 compliant compiler. -Converts a UTF-32 encoded string to UTF-8. +Converts an UTF-8 encoded string to UTF-16. ```cpp -std::string utf32to8(std::u32string_view s); +std::u16string utf8to16(std::string_view s); ``` -`s`: a UTF-32 encoded string. -Return value: a UTF-8 encoded string. +`s`: an UTF-8 encoded string to convert. +Return value: A UTF-16 encoded string Example of use: ```cpp -u32string utf32string = {0x448, 0x65E5, 0x10346}; -u32string_view utf32stringview(utf32string); -string utf8result = utf32to8(utf32stringview); -assert (utf8result.size() == 9); +string_view utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; +u16string utf16result = utf8to16(utf8_with_surrogates); +assert (utf16result.length() == 4); +assert (utf16result[2] == 0xd834); +assert (utf16result[3] == 0xdd1e); ``` -In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + +##### std::u16string utf8to16(std::u8string& s) + +Available in version 4.0 and later. Requires a C++ 20 compliant compiler. + +Converts an UTF-8 encoded string to UTF-16. + +```cpp +std::u16string utf8to16(std::u8string& s); +``` + +`s`: an UTF-8 encoded string to convert. 
+Return value: A UTF-16 encoded string + +Example of use: + +```cpp +std::u8string utf8_with_surrogates = u8"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; +std::u16string utf16result = utf8to16(utf8_with_surrogates); +assert (utf16result.length() == 4); +assert (utf16result[2] == 0xd834); +assert (utf16result[3] == 0xdd1e); +``` + +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + + + +##### std::u16string utf8to16(std::u8string_view& s) + +Available in version 4.0 and later. Requires a C++ 20 compliant compiler. + +Converts an UTF-8 encoded string to UTF-16. + +```cpp +std::u16string utf8to16(std::u8string_view& s); +``` + +`s`: an UTF-8 encoded string to convert. +Return value: A UTF-16 encoded string + +Example of use: + +```cpp +std::u8string utf8_with_surrogates = u8"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; +std::u8string_view utf8stringview {utf8_with_surrogates}; +std::u16string utf16result = utf8to16(utf8stringview); +assert (utf16result.length() == 4); +assert (utf16result[2] == 0xd834); +assert (utf16result[3] == 0xdd1e); +``` + +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + + #### utf8::utf32to8 + +##### octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result) Available in version 1.0 and later. @@ -862,103 +895,136 @@ assert (utf8result.size() == 9); In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. -#### utf8::utf8to32 -Available in version 4.0 and later. Requires a C++ 20 compliant compiler. - -Converts a UTF-8 encoded string to UTF-32. - -```cpp -std::u32string utf8to32(const std::u8string& s); -``` - -`s`: a UTF-8 encoded string. -Return value: a UTF-32 encoded string. - -Example of use: - -```cpp -const std::u8string* twochars = u8"\xe6\x97\xa5\xd1\x88"; -u32string utf32result = utf8to32(twochars); -assert (utf32result.size() == 2); -``` - -In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. 
- - -#### utf8::utf8to32 - -Available in version 4.0 and later. Requires a C++ 20 compliant compiler. - -Converts a UTF-8 encoded string to UTF-32. - -```cpp -std::u32string utf8to32(const std::u8string_view& s); -``` - -`s`: a UTF-8 encoded string. -Return value: a UTF-32 encoded string. - -Example of use: - -```cpp -const u8string* twochars = u8"\xe6\x97\xa5\xd1\x88"; -const u8string_view stringview{twochars}; -u32string utf32result = utf8to32(stringview); -assert (utf32result.size() == 2); -``` - -In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. - - -#### utf8::utf8to32 + +##### std::string utf32to8(const std::u32string& s) Available in version 3.0 and later. Requires a C++ 11 compliant compiler. -Converts a UTF-8 encoded string to UTF-32. +Converts a UTF-32 encoded string to UTF-8. ```cpp -std::u32string utf8to32(const std::string& s); +std::string utf32to8(const std::u32string& s); ``` -`s`: a UTF-8 encoded string. -Return value: a UTF-32 encoded string. +`s`: a UTF-32 encoded string. +Return value: a UTF-8 encoded string. Example of use: ```cpp -const char* twochars = "\xe6\x97\xa5\xd1\x88"; -u32string utf32result = utf8to32(twochars); -assert (utf32result.size() == 2); +u32string utf32string = {0x448, 0x65E5, 0x10346}; +string utf8result = utf32to8(utf32string); +assert (utf8result.size() == 9); ``` -In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. +In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. -#### utf8::utf8to32 + +##### std::u8string utf32to8(const std::u32string& s) + +Available in version 4.0 and later. Requires a C++ 20 compliant compiler. + +Converts a UTF-32 encoded string to UTF-8. + +```cpp +std::u8string utf32to8(const std::u32string& s); +``` + +`s`: a UTF-32 encoded string. +Return value: a UTF-8 encoded string. 
+ +Example of use: + +```cpp +u32string utf32string = {0x448, 0x65E5, 0x10346}; +u8string utf8result = utf32to8(utf32string); +assert (utf8result.size() == 9); +``` + +In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. + + + +##### std::u8string utf32to8(const std::u32string_view& s) + +Available in version 4.0 and later. Requires a C++ 20 compliant compiler. + +Converts a UTF-32 encoded string to UTF-8. + +```cpp +std::u8string utf32to8(const std::u32string_view& s); +``` + +`s`: a UTF-32 encoded string. +Return value: a UTF-8 encoded string. + +Example of use: + +```cpp +u32string utf32string = {0x448, 0x65E5, 0x10346}; +u32string_view utf32stringview(utf32string); +u8string utf8result = utf32to8(utf32stringview); +assert (utf8result.size() == 9); +``` + +In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. + + + +##### std::string utf32to8(const std::u32string& s) + +Available in version 3.0 and later. Requires a C++ 11 compliant compiler. + +Converts a UTF-32 encoded string to UTF-8. + +```cpp +std::string utf32to8(const std::u32string& s); +``` + +`s`: a UTF-32 encoded string. +Return value: a UTF-8 encoded string. + +Example of use: + +```cpp +u32string utf32string = {0x448, 0x65E5, 0x10346}; +string utf8result = utf32to8(utf32string); +assert (utf8result.size() == 9); +``` + +In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. + + +##### std::string utf32to8(std::u32string_view s) Available in version 3.2 and later. Requires a C++ 17 compliant compiler. -Converts a UTF-8 encoded string to UTF-32. +Converts a UTF-32 encoded string to UTF-8. ```cpp -std::u32string utf8to32(std::string_view s); +std::string utf32to8(std::u32string_view s); ``` -`s`: a UTF-8 encoded string. -Return value: a UTF-32 encoded string. +`s`: a UTF-32 encoded string. +Return value: a UTF-8 encoded string. 
Example of use: ```cpp -string_view twochars = "\xe6\x97\xa5\xd1\x88"; -u32string utf32result = utf8to32(twochars); -assert (utf32result.size() == 2); +u32string utf32string = {0x448, 0x65E5, 0x10346}; +u32string_view utf32stringview(utf32string); +string utf8result = utf32to8(utf32stringview); +assert (utf8result.size() == 9); ``` -In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. +In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown. + #### utf8::utf8to32 + +##### u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result) Available in version 1.0 and later. @@ -987,77 +1053,111 @@ assert (utf32result.size() == 2); In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. If `end` does not point to the past-of-end of a UTF-8 sequence, a `utf8::not_enough_room` exception is thrown. -#### utf8::find_invalid -Available in version 4.0 and later. -Detects an invalid sequence within a C-style UTF-8 string. + +##### std::u32string utf8to32(const std::u8string& s) + +Available in version 4.0 and later. Requires a C++ 20 compliant compiler. + +Converts a UTF-8 encoded string to UTF-32. ```cpp -const char* find_invalid(const char* str); -``` - -`str`: a UTF-8 encoded string. -Return value: a pointer to the first invalid octet in the UTF-8 string. In case none were found, points to the trailing zero byte. - -Example of use: - -```cpp -const char* utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; -const char* invalid = find_invalid(utf_invalid); -assert ((invalid - utf_invalid) == 5); -``` - -This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. - -#### utf8::find_invalid - -Available in version 3.0 and later. 
Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0 - -Detects an invalid sequence within a UTF-8 string. - -```cpp -std::size_t find_invalid(const std::string& s); +std::u32string utf8to32(const std::u8string& s); ``` `s`: a UTF-8 encoded string. -Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals `std::string::npos`. +Return value: a UTF-32 encoded string. Example of use: ```cpp -string utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; -auto invalid = find_invalid(utf_invalid); -assert (invalid == 5); +const std::u8string twochars = u8"\xe6\x97\xa5\xd1\x88"; +u32string utf32result = utf8to32(twochars); +assert (utf32result.size() == 2); ``` -This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. -#### utf8::find_invalid + + +##### std::u32string utf8to32(const std::u8string_view& s) + +Available in version 4.0 and later. Requires a C++ 20 compliant compiler. + +Converts a UTF-8 encoded string to UTF-32. + +```cpp +std::u32string utf8to32(const std::u8string_view& s); +``` + +`s`: a UTF-8 encoded string. +Return value: a UTF-32 encoded string. + +Example of use: + +```cpp +const u8string twochars = u8"\xe6\x97\xa5\xd1\x88"; +const u8string_view stringview{twochars}; +u32string utf32result = utf8to32(stringview); +assert (utf32result.size() == 2); +``` + +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + + + +##### std::u32string utf8to32(const std::string& s) + +Available in version 3.0 and later. Requires a C++ 11 compliant compiler. + +Converts a UTF-8 encoded string to UTF-32. + +```cpp +std::u32string utf8to32(const std::string& s); +``` + +`s`: a UTF-8 encoded string. +Return value: a UTF-32 encoded string.
+ +Example of use: + +```cpp +const char* twochars = "\xe6\x97\xa5\xd1\x88"; +u32string utf32result = utf8to32(twochars); +assert (utf32result.size() == 2); +``` + +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + + +##### std::u32string utf8to32(std::string_view s) Available in version 3.2 and later. Requires a C++ 17 compliant compiler. -Detects an invalid sequence within a UTF-8 string. +Converts a UTF-8 encoded string to UTF-32. ```cpp -std::size_t find_invalid(std::string_view s); +std::u32string utf8to32(std::string_view s); ``` `s`: a UTF-8 encoded string. -Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals `std::string_view::npos`. +Return value: a UTF-32 encoded string. Example of use: ```cpp -string_view utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; -auto invalid = find_invalid(utf_invalid); -assert (invalid == 5); +string_view twochars = "\xe6\x97\xa5\xd1\x88"; +u32string utf32result = utf8to32(twochars); +assert (utf32result.size() == 2); ``` -This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. - +In case of an invalid UTF-8 sequence, a `utf8::invalid_utf8` exception is thrown. + #### utf8::find_invalid + +##### octet_iterator find_invalid(octet_iterator start, octet_iterator end) Available in version 1.0 and later. @@ -1083,78 +1183,83 @@ assert (invalid == utf_invalid + 5); This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. -#### utf8::is_valid + + +##### const char* find_invalid(const char* str) Available in version 4.0 and later. -Checks whether a C-style string contains valid UTF-8 encoded text. +Detects an invalid sequence within a C-style UTF-8 string. 
```cpp -bool is_valid(const char* str); +const char* find_invalid(const char* str); ``` -`str`: a UTF-8 encoded string. -Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not. +`str`: a UTF-8 encoded string. +Return value: a pointer to the first invalid octet in the UTF-8 string. In case none were found, points to the trailing zero byte. Example of use: ```cpp -char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa"; -bool bvalid = is_valid(utf_invalid); -assert (bvalid == false); +const char* utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; +const char* invalid = find_invalid(utf_invalid); +assert ((invalid - utf_invalid) == 5); ``` -You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid. +This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. - -#### utf8::is_valid + +##### std::size_t find_invalid(const std::string& s) Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0 -Checks whether a string object contains valid UTF-8 encoded text. +Detects an invalid sequence within a UTF-8 string. ```cpp -bool is_valid(const std::string& s); +std::size_t find_invalid(const std::string& s); ``` -`s`: a UTF-8 encoded string. -Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not. +`s`: a UTF-8 encoded string. +Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals `std::string::npos`. 
Example of use: ```cpp -char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa"; -bool bvalid = is_valid(utf_invalid); -assert (bvalid == false); +string utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; +auto invalid = find_invalid(utf_invalid); +assert (invalid == 5); ``` -You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid. +This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. -#### utf8::is_valid + +##### std::size_t find_invalid(std::string_view s) Available in version 3.2 and later. Requires a C++ 17 compliant compiler. -Checks whether a string object contains valid UTF-8 encoded text. +Detects an invalid sequence within a UTF-8 string. ```cpp -bool is_valid(std::string_view s); +std::size_t find_invalid(std::string_view s); ``` -`s`: a UTF-8 encoded string. -Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not. +`s`: a UTF-8 encoded string. +Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals `std::string_view::npos`. Example of use: ```cpp string_view utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; -bool bvalid = is_valid(utf_invalid); -assert (bvalid == false); +auto invalid = find_invalid(utf_invalid); +assert (invalid == 5); ``` -You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid. - +This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it. + #### utf8::is_valid + +##### bool is_valid(octet_iterator start, octet_iterator end) Available in version 1.0 and later. 
@@ -1180,60 +1285,84 @@ assert (bvalid == false); `is_valid` is a shorthand for `find_invalid(start, end) == end;`. You may want to use it to make sure that a byte sequence is a valid UTF-8 string without the need to know where it fails if it is not valid. -#### utf8::replace_invalid + + +##### bool is_valid(const char* str) + +Available in version 4.0 and later. + +Checks whether a C-style string contains valid UTF-8 encoded text. + +```cpp +bool is_valid(const char* str); +``` + +`str`: a UTF-8 encoded string. +Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not. + +Example of use: + +```cpp +char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa"; +bool bvalid = is_valid(utf_invalid); +assert (bvalid == false); +``` + +You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid. + + + +##### bool is_valid(const std::string& s) Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0 -Replaces all invalid UTF-8 sequences within a string with a replacement marker. +Checks whether a string object contains valid UTF-8 encoded text. ```cpp -std::string replace_invalid(const std::string& s, utfchar32_t replacement); -std::string replace_invalid(const std::string& s); +bool is_valid(const std::string& s); ``` `s`: a UTF-8 encoded string. -`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd` -Return value: A UTF-8 encoded string with replaced invalid sequences. +Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not. 
Example of use: ```cpp -string invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"; -string replace_invalid_result = replace_invalid(invalid_sequence, '?'); -bvalid = is_valid(replace_invalid_result); -assert (bvalid); -const string fixed_invalid_sequence = "a????z"; -assert (fixed_invalid_sequence == replace_invalid_result); +char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa"; +bool bvalid = is_valid(utf_invalid); +assert (bvalid == false); ``` -#### utf8::replace_invalid +You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid. + + +##### bool is_valid(std::string_view s) Available in version 3.2 and later. Requires a C++ 17 compliant compiler. -Replaces all invalid UTF-8 sequences within a string with a replacement marker. +Checks whether a string object contains valid UTF-8 encoded text. ```cpp -std::string replace_invalid(std::string_view s, char32_t replacement); -std::string replace_invalid(std::string_view s); +bool is_valid(std::string_view s); ``` `s`: a UTF-8 encoded string. -`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd` -Return value: A UTF-8 encoded string with replaced invalid sequences. +Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not. Example of use: ```cpp -string_view invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"; -string replace_invalid_result = replace_invalid(invalid_sequence, '?'); -bool bvalid = is_valid(replace_invalid_result); -assert (bvalid); -const string fixed_invalid_sequence = "a????z"; -assert(fixed_invalid_sequence, replace_invalid_result); +string_view utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa"; +bool bvalid = is_valid(utf_invalid); +assert (bvalid == false); ``` +You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid. 
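To make the relationship between `find_invalid` and `is_valid` concrete, here is a minimal standalone sketch of the same idea for `std::string` input. It is not the UTF8-CPP implementation: the names `sketch_find_invalid`, `sketch_is_valid` and `sketch_check_examples` are hypothetical, and the code assumes only the C++ standard library. The sketch reports the index where the first ill-formed sequence starts, or `std::string::npos` when the whole string is well formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of UTF8-CPP): index of the first octet of the
// first ill-formed UTF-8 sequence, or std::string::npos if there is none.
std::size_t sketch_find_invalid(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1f; }
        else if ((lead >> 4) == 0x0e) { len = 3; cp = lead & 0x0f; }
        else if ((lead >> 3) == 0x1e) { len = 4; cp = lead & 0x07; }
        else return i;                 // stray trail octet or invalid lead octet
        if (n - i < len) return i;     // sequence is cut off by the end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char trail = static_cast<unsigned char>(s[i + k]);
            if ((trail >> 6) != 0x02) return i;   // not a 10xxxxxx trail octet
            cp = (cp << 6) | (trail & 0x3f);
        }
        // Reject overlong encodings, surrogates and values above U+10FFFF.
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000)) return i;
        if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return i;
        i += len;
    }
    return std::string::npos;
}

// Same shorthand as described above: valid means "no invalid sequence found".
bool sketch_is_valid(const std::string& s) {
    return sketch_find_invalid(s) == std::string::npos;
}

// Usage example with the byte sequence from the reference examples.
void sketch_check_examples() {
    const std::string invalid = "\xe6\x97\xa5\xd1\x88\xfa";
    assert(sketch_find_invalid(invalid) == 5);
    assert(!sketch_is_valid(invalid));
    assert(sketch_is_valid("\xe6\x97\xa5\xd1\x88"));
}
```

When only a yes/no answer is needed, the boolean wrapper keeps the call site short; the index-returning form is the one to reach for when the position of the failure matters.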
+ #### utf8::replace_invalid + +##### output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, utfchar32_t replacement) Available in version 2.0 and later. @@ -1268,7 +1397,93 @@ assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), `replace_invalid` does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, `out` must not be in the `[start, end]` range. + + +##### std::string replace_invalid(const std::string& s, utfchar32_t replacement) + +Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0 + +Replaces all invalid UTF-8 sequences within a string with a replacement marker. + +```cpp +std::string replace_invalid(const std::string& s, utfchar32_t replacement); +std::string replace_invalid(const std::string& s); +``` + +`s`: a UTF-8 encoded string. +`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd` +Return value: A UTF-8 encoded string with replaced invalid sequences. + +Example of use: + +```cpp +string invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"; +string replace_invalid_result = replace_invalid(invalid_sequence, '?'); +bvalid = is_valid(replace_invalid_result); +assert (bvalid); +const string fixed_invalid_sequence = "a????z"; +assert (fixed_invalid_sequence == replace_invalid_result); +``` + + +##### std::string replace_invalid(std::string_view s, char32_t replacement) + +Available in version 3.2 and later. Requires a C++ 17 compliant compiler. + +Replaces all invalid UTF-8 sequences within a string with a replacement marker. + +```cpp +std::string replace_invalid(std::string_view s, char32_t replacement); +std::string replace_invalid(std::string_view s); +``` + +`s`: a UTF-8 encoded string. 
+`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd` +Return value: A UTF-8 encoded string with replaced invalid sequences. + +Example of use: + +```cpp +string_view invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z"; +string replace_invalid_result = replace_invalid(invalid_sequence, '?'); +bool bvalid = is_valid(replace_invalid_result); +assert (bvalid); +const string fixed_invalid_sequence = "a????z"; +assert (fixed_invalid_sequence == replace_invalid_result); +``` + + #### utf8::starts_with_bom + +##### bool starts_with_bom (octet_iterator it, octet_iterator end) + +Available in version 2.3 and later. + +Checks whether an octet sequence starts with a UTF-8 byte order mark (BOM) + +```cpp +template <typename octet_iterator> +bool starts_with_bom (octet_iterator it, octet_iterator end); +``` + +`octet_iterator`: an input iterator. +`it`: beginning of the octet sequence to check +`end`: past-the-end of the sequence to check +Return value: `true` if the sequence starts with a UTF-8 byte order mark; `false` if not. + +Example of use: + +```cpp +unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf}; +bool bbom = starts_with_bom(byte_order_mark, byte_order_mark + sizeof(byte_order_mark)); +assert (bbom == true); +``` + +The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text. + + + +##### bool starts_with_bom(const std::string& s) Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0 @@ -1295,7 +1510,8 @@ assert (no_bbom == false); The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text. -#### utf8::starts_with_bom + +##### bool starts_with_bom(std::string_view s) Available in version 3.2 and later.
Requires a C++ 17 compliant compiler. @@ -1323,34 +1539,10 @@ assert (!no_bbom); The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text. -#### utf8::starts_with_bom - -Available in version 2.3 and later. - -Checks whether an octet sequence starts with a UTF-8 byte order mark (BOM) - -```cpp -template -bool starts_with_bom (octet_iterator it, octet_iterator end); -``` - -`octet_iterator`: an input iterator. -`it`: beginning of the octet sequence to check -`end`: pass-end of the sequence to check -Return value: `true` if the sequence starts with a UTF-8 byte order mark; `false` if not. - -Example of use: - -```cpp -unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf}; -bool bbom = starts_with_bom(byte_order_mark, byte_order_mark + sizeof(byte_order_mark)); -assert (bbom == true); -``` - -The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text. - + ### Types From utf8 Namespace + #### utf8::exception Available in version 2.3 and later. @@ -1372,6 +1564,7 @@ catch(const utf8::exception& utfcpp_ex) { } ``` + #### utf8::invalid_code_point Available in version 1.0 and later. @@ -1387,6 +1580,7 @@ public: Member function `code_point()` can be used to determine the invalid code point that caused the exception to be thrown. + #### utf8::invalid_utf8 Available in version 1.0 and later. @@ -1402,6 +1596,7 @@ public: Member function `utf8_octet()` can be used to determine the beginning of the byte sequence that caused the exception to be thrown. + #### utf8::invalid_utf16 Available in version 1.0 and later. @@ -1417,6 +1612,7 @@ public: Member function `utf16_word()` can be used to determine the UTF-16 code unit that caused the exception to be thrown. + #### utf8::not_enough_room Available in version 1.0 and later. 
@@ -1427,6 +1623,7 @@ Thrown by UTF8 CPP functions such as `next` if the end of the decoded UTF-8 sequ class not_enough_room : public exception {}; ``` + #### utf8::iterator Available in version 2.0 and later. @@ -1438,6 +1635,7 @@ template class iterator; ``` + ##### Member functions `iterator();` the deafult constructor; the underlying octet_iterator is constructed with its default constructor. @@ -1490,8 +1688,10 @@ std::string s = "example"; utf8::iterator i (s.begin(), s.begin(), s.end()); ``` + ### Functions From utf8::unchecked Namespace + #### utf8::unchecked::append Available in version 1.0 and later. @@ -1517,6 +1717,7 @@ assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0); This is a faster but less safe version of `utf8::append`. It does not check for validity of the supplied code point, and may produce an invalid UTF-8 sequence. + #### utf8::unchecked::append16 Available in version 4.0 and later. @@ -1544,6 +1745,7 @@ assert(u[1], 0x0000); This is a faster but less safe version of `utf8::append`. It does not check for validity of the supplied code point, and may produce an invalid UTF-8 sequence. + #### utf8::unchecked::next Available in version 1.0 and later. @@ -1570,6 +1772,7 @@ assert (w == twochars + 3); This is a faster but less safe version of `utf8::next`. It does not check for validity of the supplied UTF-8 sequence. + #### utf8::next16 Available in version 4.0 and later. @@ -1601,6 +1804,7 @@ This function is typically used to iterate through a UTF-16 encoded string. This is a faster but less safe version of `utf8::next16`. It does not check for validity of the supplied UTF-8 sequence. + #### utf8::unchecked::peek_next Available in version 2.1 and later. @@ -1627,6 +1831,7 @@ assert (w == twochars); This is a faster but less safe version of `utf8::peek_next`. It does not check for validity of the supplied UTF-8 sequence. + #### utf8::unchecked::prior Available in version 1.02 and later. 
@@ -1653,6 +1858,7 @@ assert (w == twochars); This is a faster but less safe version of `utf8::prior`. It does not check for validity of the supplied UTF-8 sequence and offers no boundary checking. + #### utf8::unchecked::advance Available in version 1.0 and later. @@ -1678,6 +1884,7 @@ assert (w == twochars + 5); This is a faster but less safe version of `utf8::advance`. It does not check for validity of the supplied UTF-8 sequence and offers no boundary checking. + #### utf8::unchecked::distance Available in version 1.0 and later. @@ -1703,6 +1910,7 @@ assert (dist == 2); This is a faster but less safe version of `utf8::distance`. It does not check for validity of the supplied UTF-8 sequence. + #### utf8::unchecked::utf16to8 Available in version 1.0 and later. @@ -1730,6 +1938,7 @@ assert (utf8result.size() == 10); This is a faster but less safe version of `utf8::utf16to8`. It does not check for validity of the supplied UTF-16 sequence. + #### utf8::unchecked::utf8to16 Available in version 1.0 and later. @@ -1758,6 +1967,7 @@ assert (utf16result[3] == 0xdd1e); This is a faster but less safe version of `utf8::utf8to16`. It does not check for validity of the supplied UTF-8 sequence. + #### utf8::unchecked::utf32to8 Available in version 1.0 and later. @@ -1785,6 +1995,7 @@ assert (utf8result.size() == 9); This is a faster but less safe version of `utf8::utf32to8`. It does not check for validity of the supplied UTF-32 sequence. + #### utf8::unchecked::utf8to32 Available in version 1.0 and later. @@ -1812,6 +2023,7 @@ assert (utf32result.size() == 2); This is a faster but less safe version of `utf8::utf8to32`. It does not check for validity of the supplied UTF-8 sequence. + #### utf8::unchecked::replace_invalid Available in version 3.1 and later. @@ -1849,8 +2061,10 @@ assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), Unlike `utf8::replace_invalid`, this function does not verify validity of the replacement marker. 
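To illustrate what skipping that verification can mean, the following standalone sketch shows a code point validity test next to an encoder that does not perform it. This is an assumption-laden illustration rather than the library's code: `sketch_is_code_point_valid`, `sketch_encode_utf8_unchecked` and `sketch_marker_examples` are hypothetical names, and the code needs C++11 for `char32_t`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of UTF8-CPP): a code point is acceptable if it
// is at most U+10FFFF and is not a UTF-16 surrogate.
bool sketch_is_code_point_valid(char32_t cp) {
    const bool is_surrogate = (cp >= 0xd800 && cp <= 0xdfff);
    return cp <= 0x10ffff && !is_surrogate;
}

// Hypothetical helper: encodes cp as UTF-8 octets without validating it first,
// so an invalid code point silently yields an ill-formed sequence.
std::string sketch_encode_utf8_unchecked(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>((cp >> 6) | 0xc0);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {
        out += static_cast<char>((cp >> 12) | 0xe0);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else {
        out += static_cast<char>((cp >> 18) | 0xf0);
        out += static_cast<char>(((cp >> 12) & 0x3f) | 0x80);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    }
    return out;
}

// Usage example: U+FFFD is a valid marker, a lone surrogate is not.
void sketch_marker_examples() {
    assert(sketch_is_code_point_valid(0xfffd));
    assert(!sketch_is_code_point_valid(0xd800));
    assert(sketch_encode_utf8_unchecked(0x448) == "\xd1\x88");
}
```

A caller who wants the speed of an unchecked path but cannot vouch for the marker could run a check like `sketch_is_code_point_valid` once, up front, before entering the hot loop.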
+ ### Types From utf8::unchecked Namespace + #### utf8::iterator Available in version 2.0 and later. @@ -1862,6 +2076,7 @@ template class iterator; ``` + ##### Member functions `iterator();` the deafult constructor; the underlying octet_iterator is constructed with its default constructor. @@ -1907,9 +2122,3 @@ assert (*un_it == 0x10346); This is an unchecked version of `utf8::iterator`. It is faster in many cases, but offers no validity or range checks. -## Links - -1. [The Unicode Consortium](http://www.unicode.org/). -2. [ICU Library](http://icu.sourceforge.net/). -3. [UTF-8 at Wikipedia](http://en.wikipedia.org/wiki/UTF-8) -4. [UTF-8 and Unicode FAQ for Unix/Linux](http://www.cl.cam.ac.uk/~mgk25/unicode.html)