Death::Utf8 namespace
#include <Utf8.h>
Unicode (UTF-8, UTF-16 and UTF-32) utilities.
Functions
- auto GetLength(Containers:: ArrayView<const char> text) -> std:: size_t - Number of characters in a UTF-8 string.
- auto NextChar(Containers:: ArrayView<const char> text, std:: size_t cursor) -> Containers:: Pair<char32_t, std:: size_t> - Next UTF-8 character.
- template<std:: size_t size> auto NextChar(const char(&text)[size], const std:: size_t cursor) -> Containers:: Pair<char32_t, std:: size_t>
- auto PrevChar(Containers:: ArrayView<const char> text, std:: size_t cursor) -> Containers:: Pair<char32_t, std:: size_t> - Previous UTF-8 character.
- template<std:: size_t size> auto PrevChar(const char(&text)[size], const std:: size_t cursor) -> Containers:: Pair<char32_t, std:: size_t>
- auto FromCodePoint(char32_t character, Containers:: StaticArrayView<4, char> result) -> std:: size_t - Converts a UTF-32 character to UTF-8.
- auto ToUtf16(const char* source, std:: int32_t sourceSize) -> Containers:: Array<wchar_t> - Widens a UTF-8 string to UTF-16 for use with Windows® Unicode APIs.
- auto ToUtf16(Containers:: StringView source) -> Containers:: Array<wchar_t>
- auto ToUtf16(wchar_t* destination, std:: int32_t destinationSize, const char* source, std:: int32_t sourceSize = -1) -> std:: int32_t
- template<std:: int32_t size> auto ToUtf16(wchar_t(&destination)[size], const char* source, std:: int32_t sourceSize = -1) -> std:: int32_t
- auto FromUtf16(const wchar_t* source, std:: int32_t sourceSize) -> Containers:: String - Narrows a UTF-16 string to UTF-8 for use with Windows® Unicode APIs.
- auto FromUtf16(Containers:: ArrayView<const wchar_t> source) -> Containers:: String
- auto FromUtf16(char* destination, std:: int32_t destinationSize, const wchar_t* source, std:: int32_t sourceSize = -1) -> std:: int32_t
- template<std:: int32_t size> auto FromUtf16(char(&destination)[size], const wchar_t* source, std:: int32_t sourceSize = -1) -> std:: int32_t
Properties
- const Containers:: StaticArray<256, std:: uint8_t> BytesOfLead - Lookup table mapping each possible UTF-8 lead byte (0x00–0xFF) to the expected number of bytes in the encoded UTF-8 sequence.
Function documentation
Containers:: Pair<char32_t, std:: size_t> Death:: Utf8:: NextChar(Containers:: ArrayView<const char> text,
std:: size_t cursor)
Next UTF-8 character.
Returns the Unicode codepoint of the character at the cursor and the position of the following character. If the character is invalid, returns 0xffffffffu as the codepoint and the position of the next byte. It is then up to the caller whether this is treated as a fatal error or whether the invalid character is simply skipped or replaced.
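The contract described above can be illustrated with a minimal standalone sketch. This is not the library's implementation: `DecodeNext`, `CountValid` and `InvalidCodePoint` are hypothetical names, and `std::string_view` and `std::pair` stand in for `Containers::ArrayView` and `Containers::Pair`. The sketch only detects invalid lead bytes, truncated sequences and missing continuation bytes; whether the real function also rejects overlong forms or surrogates is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the documented invalid-codepoint marker.
constexpr char32_t InvalidCodePoint = 0xffffffffu;

// Decodes the UTF-8 sequence starting at `cursor`. Returns the codepoint and
// the position of the following character, or {InvalidCodePoint, cursor + 1}
// if the sequence is malformed or truncated.
std::pair<char32_t, std::size_t> DecodeNext(std::string_view text, std::size_t cursor) {
    assert(cursor < text.size());
    const auto byteAt = [&](std::size_t i) { return static_cast<unsigned char>(text[i]); };

    const unsigned char lead = byteAt(cursor);
    std::size_t length = 0;
    char32_t codePoint = 0;
    if (lead < 0x80)                { length = 1; codePoint = lead; }
    else if ((lead & 0xE0) == 0xC0) { length = 2; codePoint = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { length = 3; codePoint = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { length = 4; codePoint = lead & 0x07; }
    else return {InvalidCodePoint, cursor + 1};    // continuation or invalid lead byte

    if (cursor + length > text.size())
        return {InvalidCodePoint, cursor + 1};     // truncated sequence

    for (std::size_t i = 1; i < length; ++i) {
        const unsigned char c = byteAt(cursor + i);
        if ((c & 0xC0) != 0x80)
            return {InvalidCodePoint, cursor + 1}; // not a continuation byte
        codePoint = (codePoint << 6) | (c & 0x3F);
    }
    return {codePoint, cursor + length};
}

// Walks a whole string, counting decodable characters and skipping invalid bytes.
std::size_t CountValid(std::string_view text) {
    std::size_t count = 0;
    for (std::size_t cursor = 0; cursor < text.size(); ) {
        const auto [codePoint, next] = DecodeNext(text, cursor);
        if (codePoint != InvalidCodePoint) ++count;
        cursor = next;
    }
    return count;
}
```

`CountValid` shows the "skip" policy; a caller that prefers a fatal error or a replacement character would branch on the same marker instead.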
template<std:: size_t size>
Containers:: Pair<char32_t, std:: size_t> Death:: Utf8:: NextChar(const char(&text)[size],
const std:: size_t cursor)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Containers:: Pair<char32_t, std:: size_t> Death:: Utf8:: PrevChar(Containers:: ArrayView<const char> text,
std:: size_t cursor)
Previous UTF-8 character.
Returns the Unicode codepoint of the character before the cursor and its position. If the character is invalid, returns 0xffffffffu as the codepoint and the position of the previous byte. It is then up to the caller whether this is treated as a fatal error or whether the invalid character is simply skipped or replaced.
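Backward stepping as described above can be sketched as follows. The sketch is standalone and hypothetical: `DecodePrev` and `InvalidCodePoint` are illustrative names, `std::string_view` and `std::pair` replace the `Containers` types, and the validation is limited to checking that the lead byte matches the number of bytes stepped over.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the documented invalid-codepoint marker.
constexpr char32_t InvalidCodePoint = 0xffffffffu;

// Decodes the character that ends right before `cursor`. Returns the codepoint
// and the position where it starts, or {InvalidCodePoint, cursor - 1} if no
// well-formed sequence ends at `cursor`.
std::pair<char32_t, std::size_t> DecodePrev(std::string_view text, std::size_t cursor) {
    assert(cursor > 0 && cursor <= text.size());
    const auto byteAt = [&](std::size_t i) { return static_cast<unsigned char>(text[i]); };

    // Step back over at most three continuation bytes (10xxxxxx) to reach the lead byte.
    std::size_t begin = cursor - 1;
    while (begin > 0 && cursor - begin < 4 && (byteAt(begin) & 0xC0) == 0x80)
        --begin;

    const unsigned char lead = byteAt(begin);
    const std::size_t length = cursor - begin;
    char32_t codePoint = 0;
    if (length == 1 && lead < 0x80)                codePoint = lead;
    else if (length == 2 && (lead & 0xE0) == 0xC0) codePoint = lead & 0x1F;
    else if (length == 3 && (lead & 0xF0) == 0xE0) codePoint = lead & 0x0F;
    else if (length == 4 && (lead & 0xF8) == 0xF0) codePoint = lead & 0x07;
    else return {InvalidCodePoint, cursor - 1};    // lead byte does not match the length

    // Every byte between the lead and the cursor was verified as a continuation byte.
    for (std::size_t i = begin + 1; i < cursor; ++i)
        codePoint = (codePoint << 6) | (byteAt(i) & 0x3F);
    return {codePoint, begin};
}
```

Calling it repeatedly with the returned position as the new cursor walks the string from the end towards the beginning.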
template<std:: size_t size>
Containers:: Pair<char32_t, std:: size_t> Death:: Utf8:: PrevChar(const char(&text)[size],
const std:: size_t cursor)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std:: size_t Death:: Utf8:: FromCodePoint(char32_t character,
Containers:: StaticArrayView<4, char> result)
Converts a UTF-32 character to UTF-8.
Parameters | |
---|---|
character (in) | UTF-32 character to convert |
result (out) | Where to put the UTF-8 result |
Returns the length of the encoding (1, 2, 3 or 4). If character is outside of the UTF-32 range, returns 0.
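The encoding step can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: `EncodeCodePoint` and `EncodeToString` are hypothetical names, `std::array<char, 4>` stands in for `Containers::StaticArrayView<4, char>`, and "outside of the UTF-32 range" is assumed to mean a value above U+10FFFF.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Writes the UTF-8 encoding of `character` into `result` and returns its length
// in bytes, or 0 if the value is above U+10FFFF (assumed meaning of "outside range").
std::size_t EncodeCodePoint(char32_t character, std::array<char, 4>& result) {
    if (character < 0x80) {
        result[0] = static_cast<char>(character);
        return 1;
    }
    if (character < 0x800) {
        result[0] = static_cast<char>(0xC0 | (character >> 6));
        result[1] = static_cast<char>(0x80 | (character & 0x3F));
        return 2;
    }
    if (character < 0x10000) {
        result[0] = static_cast<char>(0xE0 | (character >> 12));
        result[1] = static_cast<char>(0x80 | ((character >> 6) & 0x3F));
        result[2] = static_cast<char>(0x80 | (character & 0x3F));
        return 3;
    }
    if (character < 0x110000) {
        result[0] = static_cast<char>(0xF0 | (character >> 18));
        result[1] = static_cast<char>(0x80 | ((character >> 12) & 0x3F));
        result[2] = static_cast<char>(0x80 | ((character >> 6) & 0x3F));
        result[3] = static_cast<char>(0x80 | (character & 0x3F));
        return 4;
    }
    return 0;
}

// Convenience wrapper for callers that know the character is encodable.
std::string EncodeToString(char32_t character) {
    std::array<char, 4> buffer{};
    const std::size_t length = EncodeCodePoint(character, buffer);
    assert(length != 0 && "character is not encodable");
    return std::string(buffer.data(), length);
}
```

The fixed four-byte output buffer is sufficient because no codepoint needs more than four bytes in UTF-8.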
Containers:: Array<wchar_t> Death:: Utf8:: ToUtf16(const char* source,
std:: int32_t sourceSize)
Widens a UTF-8 string to UTF-16 for use with Windows® Unicode APIs.
Converts a UTF-8 string to a wide-string (UTF-16) representation. The primary purpose of this API is easy interaction with Windows® Unicode APIs, which is why the function returns wchar_t rather than char16_t. If the text is not empty, the returned array contains a sentinel null terminator (one that is not counted in its size).
Containers:: Array<wchar_t> Death:: Utf8:: ToUtf16(Containers:: StringView source)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std:: int32_t Death:: Utf8:: ToUtf16(wchar_t* destination,
std:: int32_t destinationSize,
const char* source,
std:: int32_t sourceSize = -1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-16 characters. The required destinationSize is never larger than sourceSize. If sourceSize is not provided, the source string must be null-terminated.
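The size bound stated above follows from how UTF-8 sequences map to UTF-16 code units: sequences of one to three bytes become one unit, and four-byte sequences become a surrogate pair. The sketch below counts the units for well-formed input; `Utf16UnitsFor` is a hypothetical helper, not part of this namespace, and whether the real function needs extra room for a terminator is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Counts the UTF-16 code units needed to represent well-formed UTF-8 text.
std::size_t Utf16UnitsFor(std::string_view utf8) {
    std::size_t units = 0;
    for (const char c : utf8) {
        const unsigned char byte = static_cast<unsigned char>(c);
        if ((byte & 0xC0) == 0x80)
            continue;                      // continuation byte, belongs to an earlier lead
        units += (byte >= 0xF0) ? 2 : 1;   // 4-byte sequences need a surrogate pair
    }
    // Each lead byte contributes at most two units and 4-byte leads bring three
    // continuation bytes along, so the count never exceeds the byte count.
    assert(units <= utf8.size());
    return units;
}
```

A caller with a stack buffer of at least as many `wchar_t` elements as there are source bytes therefore has enough room for the converted characters themselves.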
template<std:: int32_t size>
std:: int32_t Death:: Utf8:: ToUtf16(wchar_t(&destination)[size],
const char* source,
std:: int32_t sourceSize = -1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-16 characters. The required destinationSize is never larger than sourceSize. If sourceSize is not provided, the source string must be null-terminated.
Containers:: String Death:: Utf8:: FromUtf16(const wchar_t* source,
std:: int32_t sourceSize)
Narrows a UTF-16 string to UTF-8 for use with Windows® Unicode APIs.
Converts a wide-string (UTF-16) to a UTF-8 representation. The primary purpose is easy interaction with Windows® Unicode APIs, which is why the function takes wchar_t rather than char16_t.
Containers:: String Death:: Utf8:: FromUtf16(Containers:: ArrayView<const wchar_t> source)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std:: int32_t Death:: Utf8:: FromUtf16(char* destination,
std:: int32_t destinationSize,
const wchar_t* source,
std:: int32_t sourceSize = -1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-8 characters. The required destinationSize is never larger than 4× sourceSize. If sourceSize is not provided, the source string must be null-terminated.
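To see why a destination of 4× sourceSize is always enough, the following sketch counts the UTF-8 bytes produced per UTF-16 code unit. `Utf8BytesFor` is a hypothetical helper operating on `std::u16string_view` rather than `wchar_t`, and treating an unpaired surrogate as a three-byte sequence is an assumption of the sketch, not documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Counts the UTF-8 bytes needed to represent UTF-16 text.
std::size_t Utf8BytesFor(std::u16string_view utf16) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        const char16_t unit = utf16[i];
        if (unit < 0x80) {
            bytes += 1;
        } else if (unit < 0x800) {
            bytes += 2;
        } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < utf16.size() &&
                   utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            bytes += 4;   // surrogate pair: two code units, one 4-byte sequence
            ++i;
        } else {
            bytes += 3;   // rest of the BMP (unpaired surrogates assumed to land here)
        }
    }
    assert(bytes <= 4 * utf16.size());
    return bytes;
}
```

In this sketch no single code unit contributes more than three bytes, so the result always stays within the documented 4× bound.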
template<std:: int32_t size>
std:: int32_t Death:: Utf8:: FromUtf16(char(&destination)[size],
const wchar_t* source,
std:: int32_t sourceSize = -1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-8 characters. The required destinationSize is never larger than 4× sourceSize. If sourceSize is not provided, the source string must be null-terminated.
Variable documentation
const Containers:: StaticArray<256, std:: uint8_t> Death:: Utf8:: BytesOfLead
Lookup table mapping each possible UTF-8 lead byte (0x00–0xFF) to the expected number of bytes in the encoded UTF-8 sequence.
Each entry corresponds to one value:
- Values 0x00–0x7F: ASCII (single-byte characters)
- Values 0x80–0xBF: Continuation bytes (not valid as lead bytes)
- Values 0xC0–0xDF: Start of 2-byte sequences
- Values 0xE0–0xEF: Start of 3-byte sequences
- Values 0xF0–0xF4: Start of 4-byte sequences
- Values 0xF5–0xFF: Invalid in UTF-8 (beyond Unicode range)
This table allows lookup of the sequence length given the first byte of a UTF-8 encoded character.
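A table with the layout listed above could be built and used roughly as follows. The names `MakeLeadTable`, `LeadTable` and `CountCharacters` are hypothetical, and storing 0 for continuation and invalid bytes is an assumption of this sketch, since the values the real table holds for those ranges are not given here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Builds a 256-entry table following the ranges listed above.
// Assumption: continuation and invalid lead bytes are stored as 0.
constexpr std::array<std::uint8_t, 256> MakeLeadTable() {
    std::array<std::uint8_t, 256> table{};
    for (std::size_t i = 0x00; i <= 0x7F; ++i) table[i] = 1;
    for (std::size_t i = 0xC0; i <= 0xDF; ++i) table[i] = 2;
    for (std::size_t i = 0xE0; i <= 0xEF; ++i) table[i] = 3;
    for (std::size_t i = 0xF0; i <= 0xF4; ++i) table[i] = 4;
    return table;
}

constexpr std::array<std::uint8_t, 256> LeadTable = MakeLeadTable();
static_assert(LeadTable[0xE2] == 3, "0xE2 starts a 3-byte sequence");

// Counts characters by hopping from lead byte to lead byte.
std::size_t CountCharacters(std::string_view text) {
    std::size_t count = 0;
    for (std::size_t cursor = 0; cursor < text.size(); ++count) {
        const std::uint8_t length = LeadTable[static_cast<unsigned char>(text[cursor])];
        assert(length <= 4);
        cursor += (length != 0) ? length : 1;   // count an unexpected byte as one unit
    }
    return count;
}
```

The single table lookup replaces a chain of bit-mask comparisons on the lead byte, which is the main reason to keep such a table around.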