Death::Utf8 namespace
#include <Utf8.h>

Unicode (UTF-8, UTF-16 and UTF-32) utilities.

Functions

auto GetLength(Containers::ArrayView<const char> text) -> std::size_t
Number of characters in a UTF-8 string.
auto NextChar(Containers::ArrayView<const char> text, std::size_t cursor) -> Containers::Pair<char32_t, std::size_t>
Next UTF-8 character.
template<std::size_t size>
auto NextChar(const char(&text)[size], const std::size_t cursor) -> Containers::Pair<char32_t, std::size_t>
auto PrevChar(Containers::ArrayView<const char> text, std::size_t cursor) -> Containers::Pair<char32_t, std::size_t>
Previous UTF-8 character.
template<std::size_t size>
auto PrevChar(const char(&text)[size], const std::size_t cursor) -> Containers::Pair<char32_t, std::size_t>
auto FromCodePoint(char32_t character, Containers::StaticArrayView<4, char> result) -> std::size_t
Converts a UTF-32 character to UTF-8.
auto ToUtf16(const char* source, std::int32_t sourceSize) -> Containers::Array<wchar_t>
Widens a UTF-8 string to UTF-16 for use with Windows® Unicode APIs.
auto ToUtf16(Containers::StringView source) -> Containers::Array<wchar_t>
auto ToUtf16(wchar_t* destination, std::int32_t destinationSize, const char* source, std::int32_t sourceSize = -1) -> std::int32_t
template<std::int32_t size>
auto ToUtf16(wchar_t(&destination)[size], const char* source, std::int32_t sourceSize = -1) -> std::int32_t
auto FromUtf16(const wchar_t* source, std::int32_t sourceSize) -> Containers::String
Narrows a UTF-16 string to UTF-8 for use with Windows® Unicode APIs.
auto FromUtf16(Containers::ArrayView<const wchar_t> source) -> Containers::String
auto FromUtf16(char* destination, std::int32_t destinationSize, const wchar_t* source, std::int32_t sourceSize = -1) -> std::int32_t
template<std::int32_t size>
auto FromUtf16(char(&destination)[size], const wchar_t* source, std::int32_t sourceSize = -1) -> std::int32_t

Properties

const Containers::StaticArray<256, std::uint8_t> BytesOfLead
Lookup table mapping each possible UTF-8 lead byte (0x00–0xFF) to the expected number of bytes in the encoded UTF-8 sequence.

Function documentation

Containers::Pair<char32_t, std::size_t> Death::Utf8::NextChar(Containers::ArrayView<const char> text, std::size_t cursor)

Next UTF-8 character.

Returns the Unicode codepoint of the character at the cursor and the position of the following character. If the character is invalid, returns 0xffffffffu as the codepoint and the position of the next byte. It's then up to the caller whether this is treated as a fatal error or whether the invalid character is simply skipped or replaced.
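The contract above can be illustrated with a minimal standalone sketch. NextCharSketch, CountCodePoints and InvalidCodePoint are hypothetical names, and the decoder uses std::string and std::pair instead of the Containers types. It is not the library's implementation and, as a simplification, it does not reject overlong encodings or surrogate codepoints.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in following the documented contract of NextChar();
// NOT the library's implementation
constexpr char32_t InvalidCodePoint = 0xffffffffu;

std::pair<char32_t, std::size_t> NextCharSketch(const std::string& text, std::size_t cursor) {
    assert(cursor < text.size());

    const unsigned char lead = static_cast<unsigned char>(text[cursor]);
    std::size_t length;
    char32_t codePoint;
    if (lead < 0x80)                       { length = 1; codePoint = lead; }
    else if (lead >= 0xC0 && lead <= 0xDF) { length = 2; codePoint = lead & 0x1F; }
    else if (lead >= 0xE0 && lead <= 0xEF) { length = 3; codePoint = lead & 0x0F; }
    else if (lead >= 0xF0 && lead <= 0xF4) { length = 4; codePoint = lead & 0x07; }
    else return {InvalidCodePoint, cursor + 1};     // continuation or invalid lead byte

    // A truncated sequence is reported as invalid, advancing by one byte
    if (cursor + length > text.size()) return {InvalidCodePoint, cursor + 1};

    for (std::size_t i = 1; i < length; ++i) {
        const unsigned char next = static_cast<unsigned char>(text[cursor + i]);
        if ((next & 0xC0) != 0x80) return {InvalidCodePoint, cursor + 1};
        codePoint = (codePoint << 6) | (next & 0x3F);
    }
    return {codePoint, cursor + length};
}

// Walks the whole string; each invalid byte is counted as one character,
// which is one of the policies the caller is free to choose
std::size_t CountCodePoints(const std::string& text) {
    std::size_t count = 0;
    for (std::size_t cursor = 0; cursor < text.size(); ++count) {
        cursor = NextCharSketch(text, cursor).second;
    }
    return count;
}
```

Because the returned position always advances by at least one byte, a loop of this shape terminates even on malformed input.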

template<std::size_t size>
Containers::Pair<char32_t, std::size_t> Death::Utf8::NextChar(const char(&text)[size], const std::size_t cursor)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Containers::Pair<char32_t, std::size_t> Death::Utf8::PrevChar(Containers::ArrayView<const char> text, std::size_t cursor)

Previous UTF-8 character.

Returns the Unicode codepoint of the character before the cursor and its position. If the character is invalid, returns 0xffffffffu as the codepoint and the position of the previous byte. It's then up to the caller whether this is treated as a fatal error or whether the invalid character is simply skipped or replaced.
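A backward step with the same contract can be sketched as follows. PrevCharSketch and InvalidCodePoint are hypothetical names, std::string and std::pair stand in for the Containers types, and the code is an illustration rather than the library's implementation. The idea is to step back over at most three continuation bytes to find the lead byte, then check that the lead byte announces exactly the number of bytes that were skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in following the documented contract of PrevChar();
// NOT the library's implementation
constexpr char32_t InvalidCodePoint = 0xffffffffu;

std::pair<char32_t, std::size_t> PrevCharSketch(const std::string& text, std::size_t cursor) {
    assert(cursor > 0 && cursor <= text.size());

    // Step back over at most three continuation bytes (10xxxxxx)
    std::size_t begin = cursor - 1;
    while (begin > 0 && cursor - begin < 4 &&
           (static_cast<unsigned char>(text[begin]) & 0xC0) == 0x80) {
        --begin;
    }

    // The candidate lead byte has to match the length that was walked
    const unsigned char lead = static_cast<unsigned char>(text[begin]);
    const std::size_t length = cursor - begin;
    char32_t codePoint;
    if (length == 1 && lead < 0x80)                               codePoint = lead;
    else if (length == 2 && (lead & 0xE0) == 0xC0)                codePoint = lead & 0x1F;
    else if (length == 3 && (lead & 0xF0) == 0xE0)                codePoint = lead & 0x0F;
    else if (length == 4 && (lead & 0xF8) == 0xF0 && lead <= 0xF4) codePoint = lead & 0x07;
    else return {InvalidCodePoint, cursor - 1};    // invalid: step back one byte only

    // Bytes between the lead byte and the cursor are known to be continuations
    for (std::size_t i = begin + 1; i < cursor; ++i) {
        codePoint = (codePoint << 6) | (static_cast<unsigned char>(text[i]) & 0x3F);
    }
    return {codePoint, begin};
}
```

Starting with the cursor at the end of the text and feeding the returned position back in walks the string from the last character to the first.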

template<std::size_t size>
Containers::Pair<char32_t, std::size_t> Death::Utf8::PrevChar(const char(&text)[size], const std::size_t cursor)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

std::size_t Death::Utf8::FromCodePoint(char32_t character, Containers::StaticArrayView<4, char> result)

Converts a UTF-32 character to UTF-8.

Parameters
character in UTF-32 character to convert
result out Where to put the UTF-8 result

Returns length of the encoding (1, 2, 3 or 4). If character is outside of the UTF-32 range, returns 0.
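The encoding rule behind the four possible lengths can be sketched in standalone form. EncodeUtf8Sketch is a hypothetical name and std::array<char, 4> stands in for Containers::StaticArrayView<4, char>. This is not the library's implementation, and whether the library rejects surrogate codepoints (0xD800 to 0xDFFF) is not stated here, so the sketch leaves them unchecked.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in following the documented contract of FromCodePoint();
// NOT the library's implementation. Returns the number of bytes written, or 0
// if the value is above 0x10FFFF.
std::size_t EncodeUtf8Sketch(char32_t character, std::array<char, 4>& result) {
    if (character < 0x80) {                 // 0xxxxxxx
        result[0] = static_cast<char>(character);
        return 1;
    }
    if (character < 0x800) {                // 110xxxxx 10xxxxxx
        result[0] = static_cast<char>(0xC0 | (character >> 6));
        result[1] = static_cast<char>(0x80 | (character & 0x3F));
        return 2;
    }
    if (character < 0x10000) {              // 1110xxxx 10xxxxxx 10xxxxxx
        result[0] = static_cast<char>(0xE0 | (character >> 12));
        result[1] = static_cast<char>(0x80 | ((character >> 6) & 0x3F));
        result[2] = static_cast<char>(0x80 | (character & 0x3F));
        return 3;
    }
    if (character < 0x110000) {             // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        result[0] = static_cast<char>(0xF0 | (character >> 18));
        result[1] = static_cast<char>(0x80 | ((character >> 12) & 0x3F));
        result[2] = static_cast<char>(0x80 | ((character >> 6) & 0x3F));
        result[3] = static_cast<char>(0x80 | (character & 0x3F));
        return 4;
    }
    return 0;                               // outside of the Unicode range
}
```

For example, U+20AC encodes to the three bytes E2 82 AC, and U+1F600 to the four bytes F0 9F 98 80.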

Containers::Array<wchar_t> Death::Utf8::ToUtf16(const char* source, std::int32_t sourceSize)

Widens a UTF-8 string to UTF-16 for use with Windows® Unicode APIs.

Converts a UTF-8 string to a wide-string (UTF-16) representation. The primary purpose of this API is easy interaction with Windows® Unicode APIs, which is why the function returns wchar_t rather than char16_t. If the text is not empty, the returned array is followed by a sentinel null terminator that is not counted in its size.

Containers::Array<wchar_t> Death::Utf8::ToUtf16(Containers::StringView source)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

std::int32_t Death::Utf8::ToUtf16(wchar_t* destination, std::int32_t destinationSize, const char* source, std::int32_t sourceSize = -1)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-16 characters. The required destinationSize is never larger than sourceSize. If sourceSize is not provided, the source string must be null-terminated.

template<std::int32_t size>
std::int32_t Death::Utf8::ToUtf16(wchar_t(&destination)[size], const char* source, std::int32_t sourceSize = -1)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-16 characters. The required size of the destination array is never larger than sourceSize. If sourceSize is not provided, the source string must be null-terminated.
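The size bound used by the preallocated overloads follows from the encodings themselves: a UTF-8 sequence of 1 to 3 bytes becomes one UTF-16 code unit and a 4-byte sequence becomes two. The helper below is a hypothetical illustration, not part of the library, and assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of the library: number of UTF-16 code units
// needed for a well-formed UTF-8 string (terminator not included)
std::size_t Utf16UnitsNeeded(const std::string& utf8) {
    std::size_t units = 0;
    for (const char ch : utf8) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if ((byte & 0xC0) == 0x80) continue;    // continuation byte adds nothing
        units += (byte >= 0xF0) ? 2 : 1;        // 4-byte sequence needs a surrogate pair
    }
    return units;
}
```

Every counted code unit is backed by at least one source byte, so the result can never exceed the byte length of the input.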

Containers::String Death::Utf8::FromUtf16(const wchar_t* source, std::int32_t sourceSize)

Narrows a UTF-16 string to UTF-8 for use with Windows® Unicode APIs.

Converts a wide-string (UTF-16) to a UTF-8 representation. The primary purpose is easy interaction with Windows® Unicode APIs, thus the function doesn't take char16_t but rather a wchar_t.

Containers::String Death::Utf8::FromUtf16(Containers::ArrayView<const wchar_t> source)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

std::int32_t Death::Utf8::FromUtf16(char* destination, std::int32_t destinationSize, const wchar_t* source, std::int32_t sourceSize = -1)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-8 characters. The required destinationSize is never larger than 4× sourceSize. If sourceSize is not provided, the source string must be null-terminated.

template<std::int32_t size>
std::int32_t Death::Utf8::FromUtf16(char(&destination)[size], const wchar_t* source, std::int32_t sourceSize = -1)

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

This overload is suitable if the destination memory is already preallocated (e.g., on the stack). The return value represents the number of converted UTF-8 characters. The required size of the destination array is never larger than 4× sourceSize. If sourceSize is not provided, the source string must be null-terminated.
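The 4× bound can be checked against the encodings with a small counting sketch. Utf8BytesNeeded is a hypothetical helper, not part of the library. It takes std::u16string rather than wchar_t so that it behaves the same on platforms where wchar_t is 32-bit, and it assumes a lone surrogate would be written as three bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of the library: number of UTF-8 bytes needed
// for a UTF-16 string (terminator not included)
std::size_t Utf8BytesNeeded(const std::u16string& utf16) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        const char16_t unit = utf16[i];
        if (unit < 0x80) {
            bytes += 1;
        } else if (unit < 0x800) {
            bytes += 2;
        } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < utf16.size() &&
                   utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            bytes += 4;     // surrogate pair: two code units, four bytes
            ++i;
        } else {
            bytes += 3;     // rest of the BMP; lone surrogates assumed to take 3 bytes
        }
    }
    return bytes;
}
```

Under these assumptions a single code unit never produces more than three bytes, so a destination of 4× the source length is always sufficient.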

Variable documentation

const Containers::StaticArray<256, std::uint8_t> Death::Utf8::BytesOfLead

Lookup table mapping each possible UTF-8 lead byte (0x00–0xFF) to the expected number of bytes in the encoded UTF-8 sequence.

Each entry corresponds to one value:

  • Values 0x00–0x7F: ASCII (single-byte characters)
  • Values 0x80–0xBF: Continuation bytes (not valid as lead bytes)
  • Values 0xC0–0xDF: Start of 2-byte sequences
  • Values 0xE0–0xEF: Start of 3-byte sequences
  • Values 0xF0–0xF4: Start of 4-byte sequences
  • Values 0xF5–0xFF: Invalid in UTF-8 (beyond Unicode range)

This table allows $ \mathcal{O}(1) $ lookup of the sequence length given the first byte of a UTF-8 encoded character.
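A table with the layout listed above can be built as in the following sketch. MakeLeadTableSketch and SequenceLengthSketch are hypothetical names and std::array stands in for Containers::StaticArray. The value stored for continuation bytes and invalid lead bytes is an assumption (0 here); the actual BytesOfLead table may use a different value for those entries.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a lead-byte table; NOT the library's actual data.
// Assumption: bytes that cannot start a sequence map to 0.
std::array<std::uint8_t, 256> MakeLeadTableSketch() {
    std::array<std::uint8_t, 256> table{};
    for (std::size_t b = 0; b < 256; ++b) {
        if (b <= 0x7F)      table[b] = 1;   // ASCII
        else if (b <= 0xBF) table[b] = 0;   // continuation byte, not a lead byte
        else if (b <= 0xDF) table[b] = 2;   // start of a 2-byte sequence
        else if (b <= 0xEF) table[b] = 3;   // start of a 3-byte sequence
        else if (b <= 0xF4) table[b] = 4;   // start of a 4-byte sequence
        else                table[b] = 0;   // invalid in UTF-8
    }
    return table;
}

// Constant-time lookup of the expected sequence length at the cursor
std::size_t SequenceLengthSketch(const std::string& text, std::size_t cursor) {
    static const std::array<std::uint8_t, 256> table = MakeLeadTableSketch();
    assert(cursor < text.size());
    // Cast through unsigned char so that bytes >= 0x80 index correctly
    return table[static_cast<unsigned char>(text[cursor])];
}
```

The cast to unsigned char matters on platforms where char is signed, because a negative index would otherwise read outside the table.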