Quick ‘n’ Dirty Tutorial

Setup

Use of this library is officially supported through CMake. Getting an up-to-date CMake can be difficult on non-Windows machines, especially when it comes from your system’s package manager, whose version tends to be several (dozen?) minor revisions out of date or an entire major revision behind. To get a very close to up-to-date CMake, Python maintains a version that works across all systems. You can get it (and the ninja build system) by using the following command in your favorite command line application (assuming Python is already installed):

python -m pip install --user --upgrade cmake ninja

If you depend on calling these executables using shorthand and not their full path, make sure that the Python “downloaded binaries” folder is contained within the PATH environment variable. Usually this is already done, but if you have trouble invoking cmake --version on your typical command line, please see the Python pip install documentation for more information, in particular about the --user option.

If you do not have Python or CMake/ninja, you must get a recent enough version directly from CMake and build/install it.

Using CMake

Here’s a sample of the CMakeLists.txt to create a new project and pull in ztd.text in the simplest possible way:

project(my_app
	VERSION 1.0.0
	DESCRIPTION "My application."
	HOMEPAGE_URL "https://ztdtext.readthedocs.io/en/latest/quick.html"
	LANGUAGES C CXX
)

include(FetchContent)

FetchContent_Declare(ztd.text
	GIT_REPOSITORY https://github.com/soasis/text.git
	GIT_SHALLOW    ON
	GIT_TAG        main)
FetchContent_MakeAvailable(ztd.text)

This will automatically download and set up all the dependencies ztd.text needs (in this case, simply ztd.cmake, ztd.platform, ztd.idk, and ztd.cuneicode). You can override how ztd.text gets these dependencies using the standard FetchContent mechanisms described in the CMake FetchContent Documentation. Once that happens, simply use CMake’s target_link_libraries(…) to add it to the code:

# …

file(GLOB_RECURSE my_app_sources
	LIST_DIRECTORIES OFF
	CONFIGURE_DEPENDS
	source/*.cpp
)

add_executable(my_app ${my_app_sources})

target_link_libraries(my_app PRIVATE ztd::text)

Once you have everything configured and set up the way you like, you can then use ztd.text in your code, as shown below:

#include <ztd/text.hpp>

#include <iostream>

int main(int, char*[]) {
	// overlong encoded null
	// (https://ztdtext.rtfd.io/en/latest/api/encodings/mutf8.html)
	const char mutf8_text[]
	     = { 'm', 'e', 'o', 'w', '\xc0', '\x80', 'm', 'e', 'o', 'w', '!' };
	const auto is_valid_mutf8_text
	     = ztd::text::validate_decodable_as(mutf8_text, ztd::text::compat_mutf8);

	std::cout << "The input text is "
	          << (is_valid_mutf8_text.valid ? "valid " : "not valid ")
	          << "MUTF-8 text!" << std::endl;

	return 0;
}

Let’s get started by digging into some examples!

Note

If you would like to see more examples and additional changes besides what is covered below, please do feel free to make requests for them here! This is not a very full-on tutorial, and there is still a lot of functionality that needs explanation!

Transcoding

Transcoding is the action of taking data in one encoding and transforming it to another. ztd.text offers many ways to do this; here are a few different ways that have different expectations, needs, meanings, and tradeoffs.

Transcode between Unicode Encodings

Going from a Unicode Encoding to another Unicode Encoding just requires going through the ztd::text::transcode API. All you have to do after that is provide the appropriate ztd::text::utf8, ztd::text::utf16, or ztd::text::utf32 encoding object:

#include <ztd/text.hpp>

#include <string>

int main(int, char*[]) {
	constexpr const auto& input      = U"🐶🐶";
	constexpr const auto& wide_input = L"안녕하세요";
	// properly-typed input picks the right encoding automatically
	std::u16string utf16_emoji_string
	     = ztd::text::transcode(input, ztd::text::utf16);
	// explicitly pick the input encoding
	std::u16string utf16_emoji_string_explicit
	     = ztd::text::transcode(input, ztd::text::utf32, ztd::text::utf16);
	// must use explicit handler because "wide execution" may be
	// a lossy encoding! See:
	// https://ztdtext.rtfd.io/en/latest/design/error%20handling/lossy%20protection.html
	std::u16string utf16_korean_string_explicit
	     = ztd::text::transcode(wide_input, ztd::text::wide_execution,
	          ztd::text::utf16, ztd::text::replacement_handler);
	// result in the same strings, but different encodings!
	ZTD_TEXT_ASSERT(utf16_emoji_string == utf16_emoji_string_explicit);
	ZTD_TEXT_ASSERT(utf16_emoji_string == u"🐶🐶");
	ZTD_TEXT_ASSERT(utf16_korean_string_explicit == u"안녕하세요");
	return 0;
}
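To give a feel for what a Unicode-to-Unicode conversion has to do, here is a minimal, standalone sketch of the UTF-32 to UTF-16 step. It is not ztd.text's implementation: the function name utf32_to_utf16_sketch is hypothetical, it uses only the standard library, and it assumes every input value is already a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper, for illustration only (not part of ztd.text).
// Converts UTF-32 to UTF-16, assuming the input holds valid scalar values.
inline std::u16string utf32_to_utf16_sketch(std::u32string_view input) {
	std::u16string output;
	for (char32_t code_point : input) {
		// Assumption: no surrogates and nothing beyond U+10FFFF in the input.
		assert(code_point <= 0x10FFFF
		     && !(code_point >= 0xD800 && code_point <= 0xDFFF));
		if (code_point < 0x10000) {
			// Basic Multilingual Plane: one UTF-16 code unit.
			output.push_back(static_cast<char16_t>(code_point));
		}
		else {
			// Supplementary planes: split into a surrogate pair.
			const char32_t offset = code_point - 0x10000;
			output.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
			output.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
		}
	}
	return output;
}
```

Calling utf32_to_utf16_sketch(U"🐶🐶") yields four UTF-16 code units (two surrogate pairs), which is why the emoji strings in the example above compare equal even though their code unit counts differ.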

Transcode from Execution Encoding to UTF-8

The execution encoding is the encoding that comes with the system. It is typically the encoding that all locale data comes in, especially things like command line parameters on Windows. To encode from such an encoding to the highly successful and popular UTF-8, you may use the same ztd::text::transcode as above with the appropriate ztd::text::(compat_)utf8:

#include <ztd/text.hpp>

#include <string>
#include <string_view>
#include <iostream>

int main(int argc, char* argv[]) {
	if (argc < 1) {
		return 0;
	}
	for (int i = 0; i < argc; ++i) {
		// print each argument as its UTF-8 version
		// the default error handler is the "replacement" error handler:
		// anything unrecognized will use the usual replacement "�".
		std::string_view input = argv[i];
		std::string utf8_string
		     = ztd::text::transcode(input, ztd::text::execution,
		          ztd::text::compat_utf8, ztd::text::replacement_handler);
		// directly write to output (terminal) to prevent any internal conversions
		// to/from an internal encoding while writing output
		std::cout.write(utf8_string.data(), utf8_string.size());
		// newline + flush
		std::cout << std::endl;
	}
	return 0;
}

The compat_ prefix makes sure we are using the typedef of the templated ztd::text::basic_utf8 that uses char units. This is helpful for working with legacy data streams. We use std::cout.write(…) explicitly to prevent as much direct interference from the terminal or locales as possible when writing the data, so that competent systems with reasonably up-to-date terminals will display the UTF-8 data untouched (and, hopefully, properly).

Transcoding with Output Container Controls

Occasionally, you need to:

serialize to a container that isn’t a std::basic_string/std::(u8/16/32)string;
OR, you need to serialize to a container but you need to know if anything went wrong.

This is where the functions suffixed with _to, and the template argument provided to the non-suffixed ztd::text::transcode<…>(…), come into play.

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[] = U"🐶⛄🐶🔔";
	constexpr const std::u16string_view utf16_expected_output = u"🐶⛄🐶🔔";

	// a vector instead of a std::u16string
	std::vector<char16_t> utf16_emoji_vector
	     = ztd::text::transcode<std::vector<char16_t>>(input, ztd::text::utf16);

	// a list (doubly-linked list) instead of a std::u16string
	std::list<char16_t> utf16_emoji_list
	     = ztd::text::transcode<std::list<char16_t>>(input, ztd::text::utf16);

	// insert into a std::deque, with additional return information
	auto utf16_emoji_deque_result
	     = ztd::text::transcode_to<std::deque<char16_t>>(input, ztd::text::utf16);
	// transcode_to returns a ztd::text::transcode_result<…>
	// which we can inspect for error codes and more!
	// the error_code should be "ok"
	ZTD_TEXT_ASSERT(
	     utf16_emoji_deque_result.error_code == ztd::text::encoding_error::ok);
	// No errors should have occurred, even if they were "handled" and still
	// returned "ok"
	ZTD_TEXT_ASSERT(!utf16_emoji_deque_result.errors_were_handled());
	// The input should be completely empty
	ZTD_TEXT_ASSERT(utf16_emoji_deque_result.input.empty());

	// The results should all be the same, despite the container!
	ZTD_TEXT_ASSERT(
	     ztd::ranges::equal(utf16_emoji_vector, utf16_expected_output));
	ZTD_TEXT_ASSERT(ztd::ranges::equal(utf16_emoji_list, utf16_expected_output));
	ZTD_TEXT_ASSERT(ztd::ranges::equal(
	     utf16_emoji_deque_result.output, utf16_expected_output));
	return 0;
}

The returned ztd::text::transcode_result from the _to-suffixed function gives more information about what went wrong, including the error count and any other pertinent information.

Transcoding into any Output View/Range

Sometimes, just picking the container to serialize into isn’t enough. After all, in the above examples, space will be automatically allocated as the container is added to. This may not be desirable for memory-constrained environments, for places with strict performance requirements that cannot risk touching an allocator, and within tight loops even under normal desktop and server environments.

Therefore, the _into-suffixed functions allow explicitly passing in a range to be written into. They will keep writing into the available space between the range’s begin and end (e.g., from a std::vector’s .data() to its .data() + .size()).

#include <ztd/text.hpp>

#include <ztd/idk/span.hpp>

#include <string>
#include <string_view>
#include <deque>

int main(int, char*[]) {
	constexpr const ztd_char8_t input[] = u8"bark🐶⛄🐶🔔bark!";
	constexpr const std::u16string_view expected_output = u"bark🐶⛄🐶🔔bark!";

	// Get a deque with a pre-ordained size.
	std::deque<char16_t> utf16_deque(expected_output.size());
	// Subrange indicating available space to write into
	auto utf16_deque_output_view
	     = ztd::ranges::make_subrange(utf16_deque.begin(), utf16_deque.end());
	// SAFE by default: if the container runs out of space, will not write more!
	auto utf16_deque_result = ztd::text::transcode_into(input, ztd::text::utf8,
	     utf16_deque_output_view, ztd::text::utf16, ztd::text::pass_handler,
	     ztd::text::pass_handler);

	// Ensure that the error code indicates success.
	ZTD_TEXT_ASSERT(
	     utf16_deque_result.error_code == ztd::text::encoding_error::ok);
	// there were no errors handled for us while processing
	ZTD_TEXT_ASSERT(!utf16_deque_result.errors_were_handled());
	// We had (exactly enough) space.
	ZTD_TEXT_ASSERT(ztd::ranges::equal(expected_output, utf16_deque));
	// There is no more input or output space left
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf16_deque_result.input));
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf16_deque_result.output));

	return 0;
}

The returned ztd::text::transcode_result from the _into-suffixed function gives more information about what went wrong, including the error count and any other pertinent information. If a pivot is not used (described in a later section), it will return a ztd::text::pivotless_transcode_result or a ztd::text::stateless_transcode_result, which simply has fewer data members to describe what happened.

If there is not enough space, then extra writing will not be done and it will stop and return an error of ztd::text::encoding_error::insufficient_output_space:

#include <ztd/text.hpp>

#include <ztd/idk/span.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const ztd_char8_t input[] = u8"bark🐶⛄🐶🔔bark!";
	constexpr std::size_t input_last_exclamation_mark_index
	     = ztdc_c_string_array_size(input) - 1;
	constexpr const std::u16string_view full_expected_output
	     = u"bark🐶⛄🐶🔔bark!";
	constexpr std::size_t truncated_input_size = 15;
	// string_view containing: "bark🐶⛄🐶🔔bark" (no ending exclamation point)
	constexpr const std::u16string_view truncated_expected_output
	     = full_expected_output.substr(0, truncated_input_size);

	// SAFE by default: if the string runs out of space, will not write more!
	std::u16string truncated_utf16_string(truncated_input_size, u'\0');
	// Span indicating available space to write into
	ztd::span<char16_t> truncated_utf16_string_output(truncated_utf16_string);
	auto truncated_utf16_string_result = ztd::text::transcode_into(input,
	     ztd::text::compat_utf8, truncated_utf16_string_output, ztd::text::utf16,
	     ztd::text::pass_handler, ztd::text::pass_handler);

	// We only had space for fifteen UTF-16 code units; expect as much from output
	ZTD_TEXT_ASSERT(truncated_expected_output == truncated_utf16_string);
	// The sequence was correct, but there wasn't enough output space for the full
	// sequence!
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.error_code
	     == ztd::text::encoding_error::insufficient_output_space);
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.errors_were_handled());
	// There is no more output space
	ZTD_TEXT_ASSERT(ztd::ranges::empty(truncated_utf16_string_result.output));
	// There is still input left
	ZTD_TEXT_ASSERT(!ztd::ranges::empty(truncated_utf16_string_result.input));
	// We left only enough space for everything except the last '!':
	// check to see if that's what happened in the input
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.input[0] == '!');
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.input[0]
	     == input[input_last_exclamation_mark_index]);
	// No copies of the input were made:
	// points to the same data as it was given.
	ZTD_TEXT_ASSERT(&truncated_utf16_string_result.input[0]
	     == &input[input_last_exclamation_mark_index]);

	return 0;
}
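The "stop when the output is full" behavior can be sketched in plain standard C++. This is only an illustration under stated assumptions: bounded_result_sketch and utf32_to_utf16_bounded_sketch are hypothetical names, the input is assumed to be valid UTF-32, and none of this is how ztd.text is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type: what was written and what input is left over.
struct bounded_result_sketch {
	std::u32string_view remaining_input;
	std::size_t written;
	bool out_of_space;
};

// Hypothetical helper: writes UTF-16 into a fixed-size buffer and never
// writes past its end. Assumes the input holds valid scalar values.
inline bounded_result_sketch utf32_to_utf16_bounded_sketch(
     std::u32string_view input, char16_t* output, std::size_t output_size) {
	assert(output != nullptr || output_size == 0);
	std::size_t written = 0;
	while (!input.empty()) {
		const char32_t code_point = input.front();
		const std::size_t needed = code_point < 0x10000 ? 1 : 2;
		if (output_size - written < needed) {
			// Not enough room for the whole code point: stop before it,
			// so a surrogate pair is never split in half.
			return { input, written, true };
		}
		if (needed == 1) {
			output[written++] = static_cast<char16_t>(code_point);
		}
		else {
			const char32_t offset = code_point - 0x10000;
			output[written++] = static_cast<char16_t>(0xD800 + (offset >> 10));
			output[written++] = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
		}
		input.remove_prefix(1);
	}
	return { input, written, false };
}
```

The returned remaining_input is a view into the caller's original data, mirroring the "no copies of the input were made" check in the example above.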

Transcoding with Errors

Very often, text contains errors. Whether it’s being interpreted as the wrong encoding or it contains file names or data mangled during a system crash, or it’s just plain incorrect, bad data is a firm staple and constant reality for text processing. ztd.text offers many kinds of error handlers. They have many different behaviors, from doing nothing and stopping the desired encoding operation, to skipping over bad text and not doing anything, to adding replacement characters, and more.

The ztd::text::default_handler, unless configured differently, uses replacement characters:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[] = U"Ba\xD800rk!";
	// Equivalent to: u8"Ba�rk!"
	constexpr const char expected_default_output[] = "Ba\xef\xbf\xbdrk!";

	std::string utf8_string_with_default
	     = ztd::text::transcode(input, ztd::text::compat_utf8);

	ZTD_TEXT_ASSERT(utf8_string_with_default == expected_default_output);

	auto utf8_string_with_default_result
	     = ztd::text::transcode_to(input, ztd::text::compat_utf8);
	ZTD_TEXT_ASSERT(utf8_string_with_default_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf8_string_with_default_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_default_result.error_count == 1);
	ZTD_TEXT_ASSERT(
	     utf8_string_with_default_result.output == expected_default_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf8_string_with_default_result.input));

	return 0;
}

The ztd::text::replacement_handler explicitly inserts replacement characters where the failure occurs:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[]                   = U"Ba\xD800rk!";
	constexpr const char expected_replacement_output[] = "Ba\xef\xbf\xbdrk!";

	std::string utf8_string_with_replacement
	     = ztd::text::transcode(input, ztd::text::utf32, ztd::text::compat_utf8,
	          ztd::text::replacement_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_replacement == expected_replacement_output);

	auto utf8_string_with_replacement_result
	     = ztd::text::transcode_to(input, ztd::text::utf32,
	          ztd::text::compat_utf8, ztd::text::replacement_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.output
	     == expected_replacement_output);
	ZTD_TEXT_ASSERT(
	     ztd::ranges::empty(utf8_string_with_replacement_result.input));

	return 0;
}

To simply skip over bad input without outputting any replacement characters, use ztd::text::skip_handler:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[]            = U"Ba\xD800rk!";
	constexpr const char expected_skip_output[] = "Bark!";

	std::string utf8_string_with_skip = ztd::text::transcode(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::skip_handler);
	ZTD_TEXT_ASSERT(utf8_string_with_skip == expected_skip_output);

	auto utf8_string_with_skip_result = ztd::text::transcode_to(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::skip_handler);
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.output == expected_skip_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf8_string_with_skip_result.input));

	return 0;
}

To stop in the middle of the operation and return immediately, employ the ztd::text::pass_handler. This will leave text unprocessed, but offer a chance to inspect what is left and any corrective action that might need to be taken afterwards:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[]                        = U"Ba\xD800rk!";
	constexpr const char expected_pass_output[]             = "Ba";
	constexpr const char32_t expected_pass_leftover_input[] = U"\xD800rk!";

	std::string utf8_string_with_pass = ztd::text::transcode(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::pass_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_pass == expected_pass_output);

	auto utf8_string_with_pass_result = ztd::text::transcode_to(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::pass_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.error_code
	     == ztd::text::encoding_error::invalid_sequence);
	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.output == expected_pass_output);
	ZTD_TEXT_ASSERT(ztd::ranges::equal(utf8_string_with_pass_result.input,
	     std::u32string(expected_pass_leftover_input)));

	return 0;
}
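For intuition, the "stop and hand back the leftovers" strategy can be sketched with only the standard library. The names stop_result_sketch, append_utf8_sketch, and encode_utf8_stop_on_error_sketch are hypothetical and exist only for this illustration; the real ztd::text::pass_handler works through the library's own result types.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical result type for this sketch.
struct stop_result_sketch {
	std::string output;
	std::u32string_view remaining_input;
	bool ok;
};

// Appends one valid scalar value as UTF-8.
inline void append_utf8_sketch(std::string& output, char32_t c) {
	if (c < 0x80) {
		output.push_back(static_cast<char>(c));
	}
	else if (c < 0x800) {
		output.push_back(static_cast<char>(0xC0 | (c >> 6)));
		output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
	}
	else if (c < 0x10000) {
		output.push_back(static_cast<char>(0xE0 | (c >> 12)));
		output.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
	}
	else {
		output.push_back(static_cast<char>(0xF0 | (c >> 18)));
		output.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
	}
}

// Encodes UTF-32 as UTF-8 and stops at the first invalid value
// (a surrogate or anything above U+10FFFF), leaving the rest untouched.
inline stop_result_sketch encode_utf8_stop_on_error_sketch(
     std::u32string_view input) {
	stop_result_sketch result { {}, input, true };
	while (!result.remaining_input.empty()) {
		const char32_t c = result.remaining_input.front();
		if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
			result.ok = false;
			break;
		}
		append_utf8_sketch(result.output, c);
		result.remaining_input.remove_prefix(1);
	}
	// Either everything was consumed, or we stopped with input left over.
	assert(result.ok == result.remaining_input.empty());
	return result;
}
```

With the same U"Ba\xD800rk!" input used above, this sketch produces "Ba" and leaves four code points (starting at the lone surrogate) for the caller to inspect.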

Error handlers like the ztd::text::skip_handler and ztd::text::replacement_handler (and potentially the ztd::text::default_handler) are smart enough to not output multiple replacement characters for every single 8, 16, or 32-bit unit that contains an error, folding them down into one replacement character per distinct failure location:

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8
	constexpr const char input[]                    = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_skip_output[] = U"Meow!";

	std::u32string utf32_string_with_skip = ztd::text::transcode(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::skip_handler);

	ZTD_TEXT_ASSERT(utf32_string_with_skip == expected_skip_output);

	auto utf32_string_with_skip_result = ztd::text::transcode_to(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::skip_handler);
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.output == expected_skip_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf32_string_with_skip_result.input));

	return 0;
}

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8!
	constexpr const char input[]                       = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_default_output[] = U"Me�ow!";

	std::u32string utf32_string_with_default
	     = ztd::text::transcode(input, ztd::text::compat_utf8, ztd::text::utf32);

	ZTD_TEXT_ASSERT(utf32_string_with_default == expected_default_output);

	auto utf32_string_with_default_result = ztd::text::transcode_to(
	     input, ztd::text::compat_utf8, ztd::text::utf32);
	ZTD_TEXT_ASSERT(utf32_string_with_default_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf32_string_with_default_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_default_result.error_count == 1);
	ZTD_TEXT_ASSERT(
	     utf32_string_with_default_result.output == expected_default_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf32_string_with_default_result.input));

	return 0;
}

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8
	constexpr const char input[] = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_replacement_output[] = U"Me�ow!";

	std::u32string utf32_string_with_replacement
	     = ztd::text::transcode(input, ztd::text::compat_utf8, ztd::text::utf32,
	          ztd::text::replacement_handler);

	ZTD_TEXT_ASSERT(utf32_string_with_replacement == expected_replacement_output);

	auto utf32_string_with_replacement_result
	     = ztd::text::transcode_to(input, ztd::text::compat_utf8,
	          ztd::text::utf32, ztd::text::replacement_handler);
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.output
	     == expected_replacement_output);
	ZTD_TEXT_ASSERT(
	     ztd::ranges::empty(utf32_string_with_replacement_result.input));

	return 0;
}

Compare this to the ztd::text::pass_handler, which will stop at the first potential error:

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8
	constexpr const char input[]                        = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_pass_output[]     = U"Me";
	constexpr const char expected_pass_leftover_input[] = "\xC0\x9F\x90\xB1ow!";

	std::u32string utf32_string_with_pass = ztd::text::transcode(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::pass_handler);

	ZTD_TEXT_ASSERT(utf32_string_with_pass == expected_pass_output);

	auto utf32_string_with_pass_result = ztd::text::transcode_to(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::pass_handler);
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.error_code
	     == ztd::text::encoding_error::invalid_sequence);
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.output == expected_pass_output);
	ZTD_TEXT_ASSERT(ztd::ranges::equal(utf32_string_with_pass_result.input,
	     std::string_view(expected_pass_leftover_input)));

	return 0;
}
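The folding of several bad code units into a single replacement character can be sketched as follows. This is one possible strategy written with only the standard library, not ztd.text's actual algorithm: decode_utf8_folding_sketch and is_continuation_sketch are hypothetical names, and the decoder is deliberately simplified (it does not reject overlong three- or four-byte forms or encoded surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

inline bool is_continuation_sketch(unsigned char byte) {
	return (byte & 0xC0) == 0x80;
}

// Hypothetical, simplified UTF-8 decoder: emits one U+FFFD per failure
// location instead of one per bad byte.
inline std::u32string decode_utf8_folding_sketch(std::string_view input) {
	std::u32string output;
	std::size_t i = 0;
	while (i < input.size()) {
		const unsigned char lead = static_cast<unsigned char>(input[i]);
		std::size_t length = 0; // stays 0 for an illegal lead byte
		char32_t code_point = 0;
		if (lead < 0x80) { length = 1; code_point = lead; }
		else if (lead >= 0xC2 && lead <= 0xDF) { length = 2; code_point = lead & 0x1F; }
		else if (lead >= 0xE0 && lead <= 0xEF) { length = 3; code_point = lead & 0x0F; }
		else if (lead >= 0xF0 && lead <= 0xF4) { length = 4; code_point = lead & 0x07; }
		bool ok = length != 0 && i + length <= input.size();
		for (std::size_t j = 1; ok && j < length; ++j) {
			const unsigned char byte = static_cast<unsigned char>(input[i + j]);
			ok = is_continuation_sketch(byte);
			code_point = (code_point << 6) | (byte & 0x3F);
		}
		if (ok) {
			output.push_back(code_point);
			i += length;
			continue;
		}
		// Failure: one replacement character, then skip the bad byte and
		// every continuation byte trailing it so they are not reported again.
		output.push_back(static_cast<char32_t>(0xFFFD));
		++i;
		while (i < input.size()
		     && is_continuation_sketch(static_cast<unsigned char>(input[i]))) {
			++i;
		}
	}
	assert(i == input.size());
	return output;
}
```

Fed the same "Me\xC0\x9F\x90\xB1ow!" bytes as the examples above, the sketch yields U"Me\uFFFDow!": the illegal C0 lead byte and the three continuation bytes after it collapse into a single replacement character.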

You can even write your own custom error handlers.

Transcoding with Input, Output and Pivot Controls

Occasionally, you need to perform a transcoding operation that has no extension point and needs to go through an intermediate transition phase first (like UTF-8 ➡ Intermediate UTF-32 ➡ Shift-JIS). Normally, ztd.text will create an internal, stack-based buffer (controllable with preprocessor definitions) to use as the intermediate. But sometimes you need to exercise control over even that, to keep memory usage predictable and stable in all situations. Enter the pivot parameter, which a user can use to give a custom buffer (any custom range) as the intermediate data holder:

#include <ztd/text.hpp>

#include <string>
#include <iterator>

int main(int, char*[]) {
	constexpr const char16_t u16_data[]
	     = u"この国の歴史は世界がまだ未完成で、神様すらいなかったところから始まる"
	       u"。現在の日本は『神様の住む天界』『人間の住む地上』『死者の住む冥界』"
	       u"の三層に分かれているけれど、その頃はまだ気体と固体すら分かれていなく"
	       u"て、カオス状態の世界がどこまでもどこまでも広がっているだけだった。そ"
	       u"れから長ーーい長ーーーい時間が経ったある日、ふいに天と地が分かれた。"
	       u"すると、どこからともなく天に1人の神様が、なりなりと生まれてきた。";

	// must provide all arguments to get to the "pivot" part.
	// decode and encode states to use
	auto utf16_decode_state = ztd::text::make_decode_state(ztd::text::utf16);
	auto shift_jis_encode_state
	     = ztd::text::make_encode_state(ztd::text::shift_jis_x0208);
	// the output we're going to serialize into! We're using a std::back_inserter
	// to just fill up our desired container (in this case, a std::string)
	std::string shift_jis_string;
	auto output_view
	     = ztd::ranges::unbounded_view(std::back_inserter(shift_jis_string));
	// we're going to use a static buffer, but anything
	// would work just fine, really, as the "pivot"
	char32_t my_intermediate_buffer[256];
	ztd::span<char32_t> pivot(my_intermediate_buffer);

	// Perform the conversion!
	auto shift_jis_result = ztd::text::transcode_into(u16_data, ztd::text::utf16,
	     output_view, ztd::text::shift_jis_x0208, ztd::text::replacement_handler,
	     ztd::text::replacement_handler, utf16_decode_state,
	     shift_jis_encode_state, pivot);


	// Verify everything is in a state we expect it to be in!
	// A Shift-JIS encoded character string.
	constexpr const char expected_shift_jis_string[]
	     = "\x82\xb1\x82\xcc\x8d\x91\x82\xcc\x97\xf0\x8e\x6a\x82\xcd\x90\xa2\x8a"
	       "\x45\x82\xaa\x82\xdc\x82\xbe\x96\xa2\x8a\xae\x90\xac\x82\xc5\x81\x41"
	       "\x90\x5f\x97\x6c\x82\xb7\x82\xe7\x82\xa2\x82\xc8\x82\xa9\x82\xc1\x82"
	       "\xbd\x82\xc6\x82\xb1\x82\xeb\x82\xa9\x82\xe7\x8e\x6e\x82\xdc\x82\xe9"
	       "\x81\x42\x8c\xbb\x8d\xdd\x82\xcc\x93\xfa\x96\x7b\x82\xcd\x81\x77\x90"
	       "\x5f\x97\x6c\x82\xcc\x8f\x5a\x82\xde\x93\x56\x8a\x45\x81\x78\x81\x77"
	       "\x90\x6c\x8a\xd4\x82\xcc\x8f\x5a\x82\xde\x92\x6e\x8f\xe3\x81\x78\x81"
	       "\x77\x8e\x80\x8e\xd2\x82\xcc\x8f\x5a\x82\xde\x96\xbb\x8a\x45\x81\x78"
	       "\x82\xcc\x8e\x4f\x91\x77\x82\xc9\x95\xaa\x82\xa9\x82\xea\x82\xc4\x82"
	       "\xa2\x82\xe9\x82\xaf\x82\xea\x82\xc7\x81\x41\x82\xbb\x82\xcc\x8d\xa0"
	       "\x82\xcd\x82\xdc\x82\xbe\x8b\x43\x91\xcc\x82\xc6\x8c\xc5\x91\xcc\x82"
	       "\xb7\x82\xe7\x95\xaa\x82\xa9\x82\xea\x82\xc4\x82\xa2\x82\xc8\x82\xad"
	       "\x82\xc4\x81\x41\x83\x4a\x83\x49\x83\x58\x8f\xf3\x91\xd4\x82\xcc\x90"
	       "\xa2\x8a\x45\x82\xaa\x82\xc7\x82\xb1\x82\xdc\x82\xc5\x82\xe0\x82\xc7"
	       "\x82\xb1\x82\xdc\x82\xc5\x82\xe0\x8d\x4c\x82\xaa\x82\xc1\x82\xc4\x82"
	       "\xa2\x82\xe9\x82\xbe\x82\xaf\x82\xbe\x82\xc1\x82\xbd\x81\x42\x82\xbb"
	       "\x82\xea\x82\xa9\x82\xe7\x92\xb7\x81\x5b\x81\x5b\x82\xa2\x92\xb7\x81"
	       "\x5b\x81\x5b\x81\x5b\x82\xa2\x8e\x9e\x8a\xd4\x82\xaa\x8c\x6f\x82\xc1"
	       "\x82\xbd\x82\xa0\x82\xe9\x93\xfa\x81\x41\x82\xd3\x82\xa2\x82\xc9\x93"
	       "\x56\x82\xc6\x92\x6e\x82\xaa\x95\xaa\x82\xa9\x82\xea\x82\xbd\x81\x42"
	       "\x82\xb7\x82\xe9\x82\xc6\x81\x41\x82\xc7\x82\xb1\x82\xa9\x82\xe7\x82"
	       "\xc6\x82\xe0\x82\xc8\x82\xad\x93\x56\x82\xc9\x31\x90\x6c\x82\xcc\x90"
	       "\x5f\x97\x6c\x82\xaa\x81\x41\x82\xc8\x82\xe8\x82\xc8\x82\xe8\x82\xc6"
	       "\x90\xb6\x82\xdc\x82\xea\x82\xc4\x82\xab\x82\xbd\x81\x42";

	ZTD_TEXT_ASSERT(shift_jis_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!shift_jis_result.errors_were_handled());
	ZTD_TEXT_ASSERT(
	     shift_jis_result.pivot_error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(shift_jis_result.pivot_error_count == 0);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(shift_jis_result.input));
	ZTD_TEXT_ASSERT(shift_jis_string == expected_shift_jis_string);
	return 0;
}

Here, we use an exceptionally small buffer to keep memory usage down. Note that the buffer should be at least as large as ztd::text::max_code_points_v<FromEncoding> (FromEncoding in this case being the ztd::text::utf16 encoding) so that no "insufficient output size" errors occur during translation. (For ztd::text::encode operations, the buffer should be at least as large as ztd::text::max_code_units_v<FromEncoding>.) If the pivot buffer is too small, this can produce unpredictable failures and unexpected behavior from unanticipated errors, so make sure to always provide a suitably-sized pivot buffer! Alternatively, just let the implementation use its defaults, which are (generally) tuned to work out well enough for most conversion routines and platforms.
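To make the role of the pivot more concrete, here is a rough, standalone sketch of a two-stage conversion that reuses one small caller-provided buffer as the intermediate. It goes from UTF-16 through a UTF-32 pivot to UTF-8 (UTF-8 stands in for Shift-JIS so that no lookup tables are needed). The names utf16_to_utf8_via_pivot_sketch and encode_pivot_chunk_sketch are hypothetical, the input is assumed to be well-formed UTF-16, and ztd.text's real transcoding loop is more involved than this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Stage 2: encode the code points currently held in the pivot as UTF-8.
inline void encode_pivot_chunk_sketch(
     const char32_t* pivot, std::size_t count, std::string& output) {
	for (std::size_t i = 0; i < count; ++i) {
		const char32_t c = pivot[i];
		if (c < 0x80) {
			output.push_back(static_cast<char>(c));
		}
		else if (c < 0x800) {
			output.push_back(static_cast<char>(0xC0 | (c >> 6)));
			output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
		}
		else if (c < 0x10000) {
			output.push_back(static_cast<char>(0xE0 | (c >> 12)));
			output.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
			output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
		}
		else {
			output.push_back(static_cast<char>(0xF0 | (c >> 18)));
			output.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
			output.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
			output.push_back(static_cast<char>(0x80 | (c & 0x3F)));
		}
	}
}

// Hypothetical helper: UTF-16 -> UTF-32 pivot -> UTF-8, in pivot-sized chunks.
// Assumes well-formed UTF-16 (every high surrogate is followed by a low one).
inline std::string utf16_to_utf8_via_pivot_sketch(
     std::u16string_view input, char32_t* pivot, std::size_t pivot_size) {
	// One UTF-16 decode step produces one code point, so the pivot must
	// hold at least one element.
	assert(pivot != nullptr && pivot_size >= 1);
	std::string output;
	while (!input.empty()) {
		// Stage 1: decode until the pivot is full or the input runs out.
		std::size_t filled = 0;
		while (filled < pivot_size && !input.empty()) {
			char32_t code_point = input.front();
			input.remove_prefix(1);
			if (code_point >= 0xD800 && code_point <= 0xDBFF && !input.empty()) {
				const char32_t low = input.front();
				input.remove_prefix(1);
				code_point = 0x10000 + ((code_point - 0xD800) << 10) + (low - 0xDC00);
			}
			pivot[filled++] = code_point;
		}
		encode_pivot_chunk_sketch(pivot, filled, output);
	}
	return output;
}
```

Memory use for the intermediate stays fixed at pivot_size code points no matter how long the input is, which is the property the pivot parameter is meant to give you control over.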

Encoding & Decoding

Encoding and decoding look identical to transcoding, just using the ztd::text::decode and ztd::text::encode functions. ztd::text::decode will always produce a sequence of the encoding’s code point type (ztd::text::code_point_t<some_encoding_type>). ztd::text::encode will always produce a sequence of the encoding’s code unit type (ztd::text::code_unit_t<some_encoding_type>), and the lower-level functions ending in _to and _into will produce a ztd::text::encode_result (for encoding) or ztd::text::decode_result (for decoding):

#include <ztd/text.hpp>

#include <iostream>
#include <string>
#include <string_view>

int main(int, char*[]) {
	const char input[]
	     = "\xbe\xc8\xb3\xe7\x2c\x20\xbf\xc0\xb4\xc3\xc0\xba\x20\xc1\xc1\xc0\xba"
	       "\x20\xb3\xaf\xc0\xcc\xbf\xa1\xbf\xe4\x21";

	// Decode, with result to check!
	auto korean_decoded_output_result
	     = ztd::text::decode_to(input, ztd::text::euc_kr_uhc);
	ZTD_TEXT_ASSERT(korean_decoded_output_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!korean_decoded_output_result.errors_were_handled());
	ZTD_TEXT_ASSERT(ztd::ranges::empty(korean_decoded_output_result.input));
	const std::u32string& korean_decoded_output
	     = korean_decoded_output_result.output;

	// Take decoded Unicode code points and encode it into UTF-8
	auto korean_utf8_output_result
	     = ztd::text::encode_to(korean_decoded_output, ztd::text::compat_utf8);
	ZTD_TEXT_ASSERT(
	     korean_utf8_output_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!korean_utf8_output_result.errors_were_handled());
	ZTD_TEXT_ASSERT(ztd::ranges::empty(korean_utf8_output_result.input));
	const std::string& korean_utf8_output = korean_utf8_output_result.output;
	// verify that what we got out in UTF-8 would be the same if we converted
	// it back to EUC-KR.
	ZTD_TEXT_ASSERT(ztd::ranges::equal(std::string_view(input),
	     ztd::text::transcode(korean_utf8_output, ztd::text::compat_utf8,
	          ztd::text::euc_kr_uhc, ztd::text::pass_handler)));
	// A korean greeting!
	std::cout.write(korean_utf8_output.data(), korean_utf8_output.size());
	std::cout << std::endl;

	return 0;
}

Encode and decode operations are part of each encoding, represented by its encoding type. Every encoding object natively understands how to go from a sequence of its encoded data to its decoded data, and vice-versa, with the encode_one and decode_one functions. One should not call these functions directly, however, and should instead use the above-provided functions. Because decode and encode operations do not feature intermediate steps, there is no ztd::text::pivot<…> for these functions.

Counting

Counting is done using the ztd::text::count_as_decoded, ztd::text::count_as_encoded, and ztd::text::count_as_transcoded functions. As the names imply, they yield the number of code points or code units that will result from an attempted encode, decode, or transcode operation on a sequence of text. Each returns a ztd::text::count_result detailing that information:

#include <ztd/text.hpp>

#include <iostream>
#include <string_view>
#include <vector>

int main(int, char*[]) {
	const char input[]
	     = " OSSL   s  s  RFC ss   "
	       "-cqj0qgheba6zgdehhb85bfc31d5m2evf4423k0a7nd6abq3flcampfa17ac5froq64c0"
	       "a2a7nbcyjnb1b7yp96t0e31nkf95i";
	std::vector<char32_t> output(256);
	auto counting_result
	     = ztd::text::count_as_decoded(input, ztd::text::punycode);
	ZTD_TEXT_ASSERT(counting_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!counting_result.errors_were_handled());
	if (counting_result.count > 256) {
		std::cerr << "The input punycode exceeds the IDNA limited size buffer: "
		             "change parameters to allocate a larger one!"
		          << std::endl;
		return 1;
	}
	output.resize(counting_result.count);
	auto decoding_result = ztd::text::decode_into_raw(
	     input, ztd::text::punycode, ztd::span<char32_t>(output));
	std::size_t decoding_result_count
	     = decoding_result.output.data() - output.data();
	ZTD_TEXT_ASSERT(decoding_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!decoding_result.errors_were_handled());
	ZTD_TEXT_ASSERT(ztd::ranges::empty(decoding_result.input));
	ZTD_TEXT_ASSERT(decoding_result_count == counting_result.count);

	// Show decoded punycode (translate to UTF-8 to print to console)
	std::cout << "Decoded punycode code points:\n\t";
	for (const auto& code_point : output) {
		std::cout << std::hex << std::showbase
		          << static_cast<uint_least32_t>(code_point) << " ";
	}
	std::cout << "\n" << std::endl;

	std::cout << "Decoded punycode as UTF-8:\n\t";
	ztd::text::encode_view<ztd::text::utf8_t> print_view(
	     std::u32string_view(output.data(), decoding_result_count));
	for (auto u8_code_unit : print_view) {
		std::cout.write(reinterpret_cast<const char*>(&u8_code_unit), 1);
	}
	std::cout << std::endl;

	return 0;
}

Getting counts makes it possible to size allocated buffers for exactly what is necessary, or to make use of small buffer optimizations by checking sizes before potentially spilling over into larger allocations.
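
The count-first, small-buffer idea can be sketched without the library. The following is a plain C++17 illustration, not ztd.text code: count_utf8, encode_utf8, with_utf8, and the 32-code-unit small buffer are hypothetical names and sizes picked for this sketch, with count_utf8 standing in for a count_as_encoded-style call. It assumes the input contains only valid Unicode scalar values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Counting pass: how many UTF-8 code units the input needs.
constexpr std::size_t count_utf8(std::u32string_view input) noexcept {
	std::size_t count = 0;
	for (char32_t c : input) {
		count += c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
	}
	return count;
}

// Writing pass: encodes into `out` and returns the code units written.
// The caller guarantees that `out` has room for count_utf8(input) units.
inline std::size_t encode_utf8(std::u32string_view input, char* out) noexcept {
	char* p = out;
	for (char32_t c : input) {
		if (c < 0x80) {
			*p++ = static_cast<char>(c);
		}
		else if (c < 0x800) {
			*p++ = static_cast<char>(0xC0 | (c >> 6));
			*p++ = static_cast<char>(0x80 | (c & 0x3F));
		}
		else if (c < 0x10000) {
			*p++ = static_cast<char>(0xE0 | (c >> 12));
			*p++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
			*p++ = static_cast<char>(0x80 | (c & 0x3F));
		}
		else {
			*p++ = static_cast<char>(0xF0 | (c >> 18));
			*p++ = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
			*p++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
			*p++ = static_cast<char>(0x80 | (c & 0x3F));
		}
	}
	return static_cast<std::size_t>(p - out);
}

// Count first, then use either a fixed stack buffer or an exactly-sized
// heap allocation. Returns true when the heap allocation was needed.
template <typename Sink>
bool with_utf8(std::u32string_view input, Sink&& sink) {
	constexpr std::size_t small_size = 32;
	const std::size_t needed = count_utf8(input);
	if (needed <= small_size) {
		std::array<char, small_size> small {};
		const std::size_t written = encode_utf8(input, small.data());
		assert(written == needed);
		sink(std::string_view(small.data(), written));
		return false;
	}
	std::vector<char> large(needed);
	const std::size_t written = encode_utf8(input, large.data());
	assert(written == needed);
	sink(std::string_view(large.data(), written));
	return true;
}
```

A caller passes any callable that accepts a std::string_view, for example a lambda that writes the bytes to std::cout. The view is only valid during the call, since it points into a buffer owned by with_utf8.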

Validation

Validation checks whether the input sequence can be encoded, decoded, or transcoded by (and to) the given encoding. It works through the ztd::text::validate_decodable_as, ztd::text::validate_encodable_as, and ztd::text::validate_transcodable_as functions:

#include <ztd/text.hpp>

#include <iostream>
#include <string_view>

int main(int, char*[]) {
	constexpr const std::u32string_view input = U"meow🐱moew🐱!";

	// At compile-time: returns a structure with (explicit) operator bool
	// to allow it to be used with ! and if() statements
	static_assert(!ztd::text::validate_encodable_as(input, ztd::text::ascii),
	     "Unfortunately, ASCII does not support emoji.");

	// At run-time: returns a structure
	auto validate_result
	     = ztd::text::validate_encodable_as(input, ztd::text::ascii);

	// Check if the result is valid (should not be valid).
	if (validate_result) {
		// Everything was verified (not expected! ❌)
		std::cerr << "Unexpectedly, the input text was all valid ASCII."
		          << std::endl;
		return 1;
	}

	// Otherwise, everything was not verified (expected! ✅)
	std::cout << "As expected, the input text was not valid ASCII."
	          << "\n"
	          << "Here are the unicode hex values of the unvalidated UTF-32 code "
	             "points:\n\t";
	// use the structure to know where we left off.
	std::u32string_view unused_input(
	     validate_result.input.data(), validate_result.input.size());
	for (const auto& u32_codepoint : unused_input) {
		std::cout << "0x" << std::hex
		          << static_cast<uint_least32_t>(u32_codepoint);
		if (&u32_codepoint != &unused_input.back()) {
			std::cout << " ";
		}
	}
	std::cout << std::endl;
	return 0;
}
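
The example above leans on two properties of the result: it converts to bool in conditions, and it keeps the input that was not validated. A minimal sketch of that shape in plain C++17 might look like the following. It is not the library's implementation: toy_validate_result, validate_ascii, and valid_prefix_length are hypothetical names, and having the leftover input start at the first failing code point is a choice made for this toy.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch only.
struct toy_validate_result {
	bool valid;
	// Unread remainder; in this toy it starts at the first failing code point.
	std::u32string_view input;

	// explicit: usable in if(), !, and static_assert, but never silently
	// converted to an integer.
	constexpr explicit operator bool() const noexcept {
		return valid;
	}
};

// Toy validator: every code point must fit in 7-bit ASCII.
constexpr toy_validate_result validate_ascii(std::u32string_view text) {
	for (std::size_t i = 0; i < text.size(); ++i) {
		if (text[i] > 0x7F) {
			return { false, text.substr(i) };
		}
	}
	return { true, text.substr(text.size()) };
}

// Compile-time use, as in the static_assert of the example above.
static_assert(validate_ascii(U"meow"), "plain ASCII should validate");
static_assert(!validate_ascii(U"meow\U0001F431"), "emoji are not ASCII");

// Run-time use: the leftover input tells us how far validation got.
inline std::size_t valid_prefix_length(std::u32string_view text) {
	const toy_validate_result result = validate_ascii(text);
	assert(result.input.size() <= text.size());
	return text.size() - result.input.size();
}
```

Making the conversion explicit is what lets the result sit directly inside if (...) and static_assert(!...) while still carrying extra fields such as the remaining input.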

There’s More!

There is more you can do with this library, from authoring your own encoding objects/types to taking control of the performance of conversions. More will be added to this Getting Started as time goes on, but if you have any inkling of something that should work, give it a try! If it fails in a way you don’t think is helpful, please let us know through any of our available communication channels so we can assist you!