Quick ‘n’ Dirty Tutorial

Setup

Use of this library is officially supported through CMake. Getting an up-to-date CMake can be difficult on non-Windows machines, especially when it comes from your system’s package manager, whose version tends to be several (dozen?) minor revisions out of date or an entire major revision behind. To get a very close to up-to-date CMake, use the version distributed as a Python package, which works across all systems. You can get it (and the ninja build system) by using the following command in your favorite command line application (assuming Python is already installed):

python -m pip install --user --upgrade cmake ninja

If you depend on calling these executables using shorthand and not their full path, make sure that the Python “downloaded binaries” folder is contained within the PATH environment variable. Usually this is already done, but if you have trouble invoking cmake --version on your typical command line, please see the Python pip install documentation for more information, in particular about the --user option.

If you do not have Python or CMake/ninja, you must get a recent enough version directly from CMake and build/install it.

Using CMake

Here’s a sample of the CMakeLists.txt to create a new project and pull in ztd.text in the simplest possible way:

project(my_app
	VERSION 1.0.0
	DESCRIPTION "My application."
	HOMEPAGE_URL "https://ztdtext.readthedocs.io/en/latest/quick.html"
	LANGUAGES C CXX
)

include(FetchContent)

FetchContent_Declare(ztd.text
	GIT_REPOSITORY https://github.com/soasis/text.git
	GIT_SHALLOW    ON
	GIT_TAG        main)
FetchContent_MakeAvailable(ztd.text)

This will automatically download and set up all the dependencies ztd.text needs (in this case, simply ztd.cmake, ztd.platform, ztd.idk, and ztd.cuneicode). You can override how ztd.text gets these dependencies using the standard FetchContent mechanisms described in the CMake FetchContent Documentation. Once that happens, simply use CMake’s target_link_libraries(…) to add it to the code:

# …

file(GLOB_RECURSE my_app_sources
	LIST_DIRECTORIES OFF
	CONFIGURE_DEPENDS
	source/*.cpp
)

add_executable(my_app ${my_app_sources})

target_link_libraries(my_app PRIVATE ztd::text)

Once you have everything configured and set up the way you like, you can then use ztd.text in your code, as shown below:

#include <ztd/text.hpp>

#include <iostream>

int main(int, char*[]) {
	// overlong encoded null
	// (https://ztdtext.rtfd.io/en/latest/api/encodings/mutf8.html)
	const char mutf8_text[]
	     = { 'm', 'e', 'o', 'w', '\xc0', '\x80', 'm', 'e', 'o', 'w', '!' };
	const auto is_valid_mutf8_text
	     = ztd::text::validate_decodable_as(mutf8_text, ztd::text::compat_mutf8);

	std::cout << "The input text is "
	          << (is_valid_mutf8_text.valid ? "valid " : "not valid ")
	          << "MUTF-8 text!" << std::endl;

	return 0;
}

Let’s get started by digging into some examples!

Note

If you would like to see more examples and additional changes besides what is covered below, please do feel free to make requests for them here! This is not a very full-on tutorial, and there is a lot of functionality that still needs explanation!

Transcoding

Transcoding is the action of taking data in one encoding and transforming it to another. ztd.text offers many ways to do this; here are a few different ways that have different expectations, needs, meanings, and tradeoffs.

Transcode between Unicode Encodings

Going from a Unicode Encoding to another Unicode Encoding just requires going through the ztd::text::transcode API. All you have to do after that is provide the appropriate ztd::text::utf8, ztd::text::utf16, or ztd::text::utf32 encoding object:

#include <ztd/text.hpp>

#include <string>

int main(int, char*[]) {
	constexpr const auto& input      = U"🐶🐶";
	constexpr const auto& wide_input = L"안녕하세요";
	// properly-typed input picks the right encoding automatically
	std::u16string utf16_emoji_string
	     = ztd::text::transcode(input, ztd::text::utf16);
	// explicitly pick the input encoding
	std::u16string utf16_emoji_string_explicit
	     = ztd::text::transcode(input, ztd::text::utf32, ztd::text::utf16);
	// must use explicit handler because "wide execution" may be
	// a lossy encoding! See:
	// https://ztdtext.rtfd.io/en/latest/design/error%20handling/lossy%20protection.html
	std::u16string utf16_korean_string_explicit
	     = ztd::text::transcode(wide_input, ztd::text::wide_execution,
	          ztd::text::utf16, ztd::text::replacement_handler);
	// result in the same strings, but different encodings!
	ZTD_TEXT_ASSERT(utf16_emoji_string == utf16_emoji_string_explicit);
	ZTD_TEXT_ASSERT(utf16_emoji_string == u"🐶🐶");
	ZTD_TEXT_ASSERT(utf16_korean_string_explicit == u"안녕하세요");
	return 0;
}
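
To make concrete what a UTF-32 to UTF-16 conversion has to do for each code point, here is a minimal sketch that uses only the standard library. The helper name utf32_to_utf16_sketch is hypothetical and is not part of ztd.text; it assumes the input contains only valid Unicode scalar values and performs no error handling, which is exactly the part the library takes care of for you.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper for illustration only (not a ztd.text API).
// Assumes every input value is a valid Unicode scalar value.
std::u16string utf32_to_utf16_sketch(std::u32string_view input) {
	std::u16string output;
	for (char32_t code_point : input) {
		if (code_point < 0x10000) {
			// Basic Multilingual Plane: a single UTF-16 code unit
			output.push_back(static_cast<char16_t>(code_point));
		}
		else {
			// Supplementary planes: split into a surrogate pair
			const char32_t offset = code_point - 0x10000;
			output.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
			output.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
		}
	}
	return output;
}
```

Each 🐶 (U+1F436) lies outside the Basic Multilingual Plane, so it becomes two UTF-16 code units, while each Korean syllable in the example above fits in one.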

Transcode from Execution Encoding to UTF-8

The execution encoding is the encoding that comes with the system. It is typically the encoding that all locale data comes in, especially for e.g. command line parameters on Windows. To encode from such an encoding to the highly successful and popular UTF-8, you may use the same ztd::text::transcode as above with the appropriate ztd::text::(compat_)utf8:

#include <ztd/text.hpp>

#include <string>
#include <string_view>
#include <iostream>

int main(int argc, char* argv[]) {
	if (argc < 1) {
		return 0;
	}
	for (int i = 0; i < argc; ++i) {
		// print each argument as its UTF-8 version
		// the default error handler is the "replacement" error handler:
		// anything unrecognized will use the usual replacement "�".
		std::string_view input = argv[i];
		// from the execution encoding, to UTF-8
		std::string utf8_string
		     = ztd::text::transcode(input, ztd::text::execution,
		          ztd::text::compat_utf8, ztd::text::replacement_handler);
		// directly write to output (terminal) to prevent any internal conversions
		// to/from an internal encoding while writing output
		std::cout.write(utf8_string.data(), utf8_string.size());
		// newline + flush
		std::cout << std::endl;
	}
	return 0;
}

The compat_ prefix makes sure we are using the typedef of the templated ztd::text::basic_utf8 that uses char code units. This is helpful for working with legacy data streams. We use std::cout.write(…) explicitly to prevent as much interference from the terminal or locale as possible when writing the data, so that competent systems with reasonably up-to-date terminals display the UTF-8 data untouched (and, hopefully, properly).

Transcoding with Output Container Controls

Occasionally, you need to:

  • serialize to a container that isn’t a std::basic_string/std::(u8/16/32)string;

  • OR, you need to serialize to a container but you need to know if anything went wrong.

This is where the _to-suffixed functions come into play, along with the template argument provided to the non-suffixed ztd::text::transcode<…>(…):

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[] = U"🐶⛄🐶🔔";
	constexpr const std::u16string_view utf16_expected_output = u"🐶⛄🐶🔔";

	// a vector instead of a std::u16string
	std::vector<char16_t> utf16_emoji_vector
	     = ztd::text::transcode<std::vector<char16_t>>(input, ztd::text::utf16);

	// a list (doubly-linked list) instead of a std::u16string
	std::list<char16_t> utf16_emoji_list
	     = ztd::text::transcode<std::list<char16_t>>(input, ztd::text::utf16);

	// insert into a std::deque, with additional return information
	auto utf16_emoji_deque_result
	     = ztd::text::transcode_to<std::deque<char16_t>>(input, ztd::text::utf16);
	// transcode_to returns a ztd::text::transcode_result<…>
	// which we can inspect for error codes and more!
	// the error_code should be "ok"
	ZTD_TEXT_ASSERT(
	     utf16_emoji_deque_result.error_code == ztd::text::encoding_error::ok);
	// No errors should have occurred, even if they were "handled" and still
	// returned "ok"
	ZTD_TEXT_ASSERT(!utf16_emoji_deque_result.errors_were_handled());
	// The input should be completely empty
	ZTD_TEXT_ASSERT(utf16_emoji_deque_result.input.empty());

	// The results should all be the same, despite the container!
	ZTD_TEXT_ASSERT(
	     ztd::ranges::equal(utf16_emoji_vector, utf16_expected_output));
	ZTD_TEXT_ASSERT(ztd::ranges::equal(utf16_emoji_list, utf16_expected_output));
	ZTD_TEXT_ASSERT(ztd::ranges::equal(
	     utf16_emoji_deque_result.output, utf16_expected_output));
	return 0;
}

The returned ztd::text::transcode_result from the _to-suffixed function gives more information about what went wrong, including the error count and any other pertinent information.

Transcoding into any Output View/Range

Sometimes, just picking the container to serialize into isn’t enough. After all, in the above examples, space will be automatically allocated as the container is added to. This may not be desirable for memory-constrained environments, for places with strict performance requirements that cannot risk touching an allocator, and within tight loops even under normal desktop and server environments.

Therefore, the _into-suffixed functions allow explicitly passing in a range to be written into; they will keep writing into the available space between the range’s begin and end (e.g., from a std::vector’s .data() to its .data() + .size()).

#include <ztd/text.hpp>

#include <ztd/idk/span.hpp>

#include <string>
#include <string_view>
#include <deque>

int main(int, char*[]) {
	constexpr const ztd_char8_t input[] = u8"bark🐶⛄🐶🔔bark!";
	constexpr const std::u16string_view expected_output = u"bark🐶⛄🐶🔔bark!";

	// Get a deque with a pre-ordained size.
	std::deque<char16_t> utf16_deque(expected_output.size());
	// Subrange indicating available space to write into
	auto utf16_deque_output_view
	     = ztd::ranges::make_subrange(utf16_deque.begin(), utf16_deque.end());
	// SAFE by default: if the container runs out of space, will not write more!
	auto utf16_deque_result = ztd::text::transcode_into(input, ztd::text::utf8,
	     utf16_deque_output_view, ztd::text::utf16, ztd::text::pass_handler,
	     ztd::text::pass_handler);

	// Ensure that the error code indicates success.
	ZTD_TEXT_ASSERT(
	     utf16_deque_result.error_code == ztd::text::encoding_error::ok);
	// there were no errors handled for us while processing
	ZTD_TEXT_ASSERT(!utf16_deque_result.errors_were_handled());
	// We had (exactly enough) space.
	ZTD_TEXT_ASSERT(ztd::ranges::equal(expected_output, utf16_deque));
	// There is no more input or output space left
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf16_deque_result.input));
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf16_deque_result.output));

	return 0;
}

The returned ztd::text::transcode_result from the _into-suffixed function gives more information about what went wrong, including the error count and any other pertinent information. If a pivot is not used (described in a section below), it will return a ztd::text::pivotless_transcode_result or a ztd::text::stateless_transcode_result, which just has a few fewer data members to describe what happened.

If there is not enough space, then extra writing will not be done and it will stop and return an error of ztd::text::encoding_error::insufficient_output_space:

#include <ztd/text.hpp>

#include <ztd/idk/span.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const ztd_char8_t input[] = u8"bark🐶⛄🐶🔔bark!";
	constexpr std::size_t input_last_exclamation_mark_index
	     = ztdc_c_string_array_size(input) - 1;
	constexpr const std::u16string_view full_expected_output
	     = u"bark🐶⛄🐶🔔bark!";
	constexpr std::size_t truncated_input_size = 15;
	// string_view containing: "bark🐶⛄🐶🔔bark" (no ending exclamation point)
	constexpr const std::u16string_view truncated_expected_output
	     = full_expected_output.substr(0, truncated_input_size);

	// SAFE by default: if the string runs out of space, will not write more!
	std::u16string truncated_utf16_string(truncated_input_size, u'\0');
	// Span indicating available space to write into
	ztd::span<char16_t> truncated_utf16_string_output(truncated_utf16_string);
	auto truncated_utf16_string_result = ztd::text::transcode_into(input,
	     ztd::text::compat_utf8, truncated_utf16_string_output, ztd::text::utf16,
	     ztd::text::pass_handler, ztd::text::pass_handler);

	// We only had space for fifteen UTF-16 code units; expect as much from output
	ZTD_TEXT_ASSERT(truncated_expected_output == truncated_utf16_string);
	// The sequence was correct, but there wasn't enough output space for the full
	// sequence!
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.error_code
	     == ztd::text::encoding_error::insufficient_output_space);
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.errors_were_handled());
	// There is no more output space
	ZTD_TEXT_ASSERT(ztd::ranges::empty(truncated_utf16_string_result.output));
	// There is still input left
	ZTD_TEXT_ASSERT(!ztd::ranges::empty(truncated_utf16_string_result.input));
	// We left only enough space for everything except the final '!':
	// check to see if that's what happened in the input
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.input[0] == '!');
	ZTD_TEXT_ASSERT(truncated_utf16_string_result.input[0]
	     == input[input_last_exclamation_mark_index]);
	// No copies of the input were made:
	// points to the same data as it was given.
	ZTD_TEXT_ASSERT(&truncated_utf16_string_result.input[0]
	     == &input[input_last_exclamation_mark_index]);

	return 0;
}
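
The "stop when the output is full and hand back what is left" behavior can be sketched without the library. The following is an illustration only: sketch_error, bounded_result, and utf32_to_utf16_into_sketch are hypothetical names, the input is assumed to hold valid Unicode scalar values, and ztd.text's real result types carry more information than this.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical names for illustration only (not ztd.text APIs).
enum class sketch_error { ok, insufficient_output_space };

struct bounded_result {
	std::u32string_view input; // input that was not consumed
	char16_t* output;          // first output slot that was not written
	sketch_error error_code;
};

// Writes UTF-16 into [out_first, out_last) and never past out_last.
// Assumes `input` contains only valid Unicode scalar values.
bounded_result utf32_to_utf16_into_sketch(
     std::u32string_view input, char16_t* out_first, char16_t* out_last) {
	while (!input.empty()) {
		const char32_t code_point = input.front();
		const std::size_t needed  = code_point < 0x10000 ? 1 : 2;
		if (static_cast<std::size_t>(out_last - out_first) < needed) {
			// stop BEFORE writing a partial code point
			return { input, out_first, sketch_error::insufficient_output_space };
		}
		if (needed == 1) {
			*out_first++ = static_cast<char16_t>(code_point);
		}
		else {
			const char32_t offset = code_point - 0x10000;
			*out_first++ = static_cast<char16_t>(0xD800 + (offset >> 10));
			*out_first++ = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
		}
		input.remove_prefix(1);
	}
	return { input, out_first, sketch_error::ok };
}
```

Because the returned input is a view into the caller's data rather than a copy, the caller can provide more output space and resume from exactly where the conversion stopped.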

Transcoding with Errors

Very often, text contains errors. Whether it’s being interpreted as the wrong encoding or it contains file names or data mangled during a system crash, or it’s just plain incorrect, bad data is a firm staple and constant reality for text processing. ztd.text offers many kinds of error handlers. They have many different behaviors, from doing nothing and stopping the desired encoding operation, to skipping over bad text and not doing anything, to adding replacement characters, and more.

The ztd::text::default_handler, unless configured differently, uses replacement characters:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[] = U"Ba\xD800rk!";
	// Equivalent to: u8"Ba�rk!"
	constexpr const char expected_default_output[] = "Ba\xef\xbf\xbdrk!";

	std::string utf8_string_with_default
	     = ztd::text::transcode(input, ztd::text::compat_utf8);

	ZTD_TEXT_ASSERT(utf8_string_with_default == expected_default_output);

	auto utf8_string_with_default_result
	     = ztd::text::transcode_to(input, ztd::text::compat_utf8);
	ZTD_TEXT_ASSERT(utf8_string_with_default_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf8_string_with_default_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_default_result.error_count == 1);
	ZTD_TEXT_ASSERT(
	     utf8_string_with_default_result.output == expected_default_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf8_string_with_default_result.input));

	return 0;
}

The ztd::text::replacement_handler explicitly inserts replacement characters where the failure occurs:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[]                   = U"Ba\xD800rk!";
	constexpr const char expected_replacement_output[] = "Ba\xef\xbf\xbdrk!";

	std::string utf8_string_with_replacement
	     = ztd::text::transcode(input, ztd::text::utf32, ztd::text::compat_utf8,
	          ztd::text::replacement_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_replacement == expected_replacement_output);

	auto utf8_string_with_replacement_result
	     = ztd::text::transcode_to(input, ztd::text::utf32,
	          ztd::text::compat_utf8, ztd::text::replacement_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf8_string_with_replacement_result.output
	     == expected_replacement_output);
	ZTD_TEXT_ASSERT(
	     ztd::ranges::empty(utf8_string_with_replacement_result.input));

	return 0;
}

To simply skip over bad input without outputting any replacement characters, use ztd::text::skip_handler:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[]            = U"Ba\xD800rk!";
	constexpr const char expected_skip_output[] = "Bark!";

	std::string utf8_string_with_skip = ztd::text::transcode(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::skip_handler);
	ZTD_TEXT_ASSERT(utf8_string_with_skip == expected_skip_output);

	auto utf8_string_with_skip_result = ztd::text::transcode_to(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::skip_handler);
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf8_string_with_skip_result.output == expected_skip_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf8_string_with_skip_result.input));

	return 0;
}

To stop in the middle of the operation and return immediately, employ the ztd::text::pass_handler. This will leave text unprocessed, but offer a chance to inspect what is left and any corrective action that might need to be taken afterwards:

#include <ztd/text.hpp>

#include <string>
#include <string_view>

int main(int, char*[]) {
	constexpr const char32_t input[]                        = U"Ba\xD800rk!";
	constexpr const char expected_pass_output[]             = "Ba";
	constexpr const char32_t expected_pass_leftover_input[] = U"\xD800rk!";

	std::string utf8_string_with_pass = ztd::text::transcode(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::pass_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_pass == expected_pass_output);

	auto utf8_string_with_pass_result = ztd::text::transcode_to(input,
	     ztd::text::utf32, ztd::text::compat_utf8, ztd::text::pass_handler);

	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.error_code
	     == ztd::text::encoding_error::invalid_sequence);
	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf8_string_with_pass_result.output == expected_pass_output);
	ZTD_TEXT_ASSERT(ztd::ranges::equal(utf8_string_with_pass_result.input,
	     std::u32string(expected_pass_leftover_input)));

	return 0;
}

Error handlers like the ztd::text::skip_handler and ztd::text::replacement_handler (and potentially the ztd::text::default_handler) are smart enough to not output multiple replacement characters for every single 8, 16, or 32-bit unit that contains an error, folding them down into one replacement character per distinct failure location:

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8
	constexpr const char input[]                    = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_skip_output[] = U"Meow!";

	std::u32string utf32_string_with_skip = ztd::text::transcode(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::skip_handler);

	ZTD_TEXT_ASSERT(utf32_string_with_skip == expected_skip_output);

	auto utf32_string_with_skip_result = ztd::text::transcode_to(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::skip_handler);
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf32_string_with_skip_result.output == expected_skip_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf32_string_with_skip_result.input));

	return 0;
}

The same input with the default handler:

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8!
	constexpr const char input[]                       = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_default_output[] = U"Me�ow!";

	std::u32string utf32_string_with_default
	     = ztd::text::transcode(input, ztd::text::compat_utf8, ztd::text::utf32);

	ZTD_TEXT_ASSERT(utf32_string_with_default == expected_default_output);

	auto utf32_string_with_default_result = ztd::text::transcode_to(
	     input, ztd::text::compat_utf8, ztd::text::utf32);
	ZTD_TEXT_ASSERT(utf32_string_with_default_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf32_string_with_default_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_default_result.error_count == 1);
	ZTD_TEXT_ASSERT(
	     utf32_string_with_default_result.output == expected_default_output);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(utf32_string_with_default_result.input));

	return 0;
}

And with the explicit ztd::text::replacement_handler:

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8
	constexpr const char input[] = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_replacement_output[] = U"Me�ow!";

	std::u32string utf32_string_with_replacement
	     = ztd::text::transcode(input, ztd::text::compat_utf8, ztd::text::utf32,
	          ztd::text::replacement_handler);

	ZTD_TEXT_ASSERT(utf32_string_with_replacement == expected_replacement_output);

	auto utf32_string_with_replacement_result
	     = ztd::text::transcode_to(input, ztd::text::compat_utf8,
	          ztd::text::utf32, ztd::text::replacement_handler);
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf32_string_with_replacement_result.output
	     == expected_replacement_output);
	ZTD_TEXT_ASSERT(
	     ztd::ranges::empty(utf32_string_with_replacement_result.input));

	return 0;
}

Compare this to the ztd::text::pass_handler, which will stop at the first potential error:

#include <ztd/text.hpp>

#include <vector>
#include <list>
#include <deque>
#include <string>
#include <string_view>

int main(int, char*[]) {
	// Scuffed UTF-8 input: 'C0' is not a legal sequence starter
	// for regular, pure UTF-8
	constexpr const char input[]                        = "Me\xC0\x9F\x90\xB1ow!";
	constexpr const char32_t expected_pass_output[]     = U"Me";
	constexpr const char expected_pass_leftover_input[] = "\xC0\x9F\x90\xB1ow!";

	std::u32string utf32_string_with_pass = ztd::text::transcode(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::pass_handler);

	ZTD_TEXT_ASSERT(utf32_string_with_pass == expected_pass_output);

	auto utf32_string_with_pass_result = ztd::text::transcode_to(input,
	     ztd::text::compat_utf8, ztd::text::utf32, ztd::text::pass_handler);
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.error_code
	     == ztd::text::encoding_error::invalid_sequence);
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.errors_were_handled());
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.error_count == 1);
	ZTD_TEXT_ASSERT(utf32_string_with_pass_result.output == expected_pass_output);
	ZTD_TEXT_ASSERT(ztd::ranges::equal(utf32_string_with_pass_result.input,
	     std::string_view(expected_pass_leftover_input)));

	return 0;
}

You can even write your own custom error handlers.
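
To illustrate the idea behind the replace, skip, and pass behaviors described in this section, here is a minimal standard-library-only sketch of a UTF-32 to UTF-8 encoder that takes an error policy. This is not ztd.text's error handler interface (handlers there are callable objects, documented separately); on_error, encode_sketch_result, append_utf8, and utf32_to_utf8_sketch are all hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical policy enum for illustration; ztd.text uses handler objects.
enum class on_error { replace, skip, pass };

struct encode_sketch_result {
	std::string output;
	std::u32string_view input; // unread input (non-empty only after `pass`)
	std::size_t error_count;
};

// Appends the UTF-8 encoding of a valid Unicode scalar value.
void append_utf8(std::string& output, char32_t cp) {
	if (cp < 0x80) {
		output.push_back(static_cast<char>(cp));
	}
	else if (cp < 0x800) {
		output.push_back(static_cast<char>(0xC0 | (cp >> 6)));
		output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
	else if (cp < 0x10000) {
		output.push_back(static_cast<char>(0xE0 | (cp >> 12)));
		output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
	else {
		output.push_back(static_cast<char>(0xF0 | (cp >> 18)));
		output.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
}

encode_sketch_result utf32_to_utf8_sketch(
     std::u32string_view input, on_error policy) {
	encode_sketch_result result { {}, input, 0 };
	while (!result.input.empty()) {
		const char32_t cp       = result.input.front();
		const bool is_surrogate = cp >= 0xD800 && cp <= 0xDFFF;
		if (is_surrogate || cp > 0x10FFFF) {
			++result.error_count;
			if (policy == on_error::pass) {
				return result; // stop; the bad input stays unread
			}
			if (policy == on_error::replace) {
				append_utf8(result.output, 0xFFFD); // U+FFFD replacement character
			}
			// on_error::skip writes nothing and moves on
		}
		else {
			append_utf8(result.output, cp);
		}
		result.input.remove_prefix(1);
	}
	return result;
}
```

Fed the U"Ba\xD800rk!" input used earlier in this section, the three policies produce the replaced, skipped, and truncated outputs shown in the corresponding ztd.text examples.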

Transcoding with Input, Output and Pivot Controls

Occasionally, you need to perform a transcoding operation that has no extension point and needs to go through an intermediate transition phase first (like UTF-8 ➡ Intermediate UTF-32 ➡ Shift-JIS). Normally, ztd.text will create an internal, stack-based buffer (controllable with preprocessor definitions) to use as the intermediate. But sometimes you need to exercise control over even that, to keep memory usage predictable and stable in all situations. Enter the pivot parameter, which a user can use to give a custom buffer (any custom range) as the intermediate data holder:

#include <ztd/text.hpp>

#include <string>
#include <iterator>

int main(int, char*[]) {
	constexpr const char16_t u16_data[]
	     = u"この国の歴史は世界がまだ未完成で、神様すらいなかったところから始まる"
	       u"。現在の日本は『神様の住む天界』『人間の住む地上』『死者の住む冥界』"
	       u"の三層に分かれているけれど、その頃はまだ気体と固体すら分かれていなく"
	       u"て、カオス状態の世界がどこまでもどこまでも広がっているだけだった。そ"
	       u"れから長ーーい長ーーーい時間が経ったある日、ふいに天と地が分かれた。"
	       u"すると、どこからともなく天に1人の神様が、なりなりと生まれてきた。";

	// must provide all arguments to get to the "pivot" part.
	// decode and encode states to use
	auto utf16_decode_state = ztd::text::make_decode_state(ztd::text::utf16);
	auto shift_jis_encode_state
	     = ztd::text::make_encode_state(ztd::text::shift_jis_x0208);
	// the output we're going to serialize into! We're using a std::back_inserter
	// to just fill up our desired container (in this case, a std::string)
	std::string shift_jis_string;
	auto output_view
	     = ztd::ranges::unbounded_view(std::back_inserter(shift_jis_string));
	// we're going to use a static buffer, but anything
	// would work just fine, really, as the "pivot"
	char32_t my_intermediate_buffer[256];
	ztd::span<char32_t> pivot(my_intermediate_buffer);

	// Perform the conversion!
	auto shift_jis_result = ztd::text::transcode_into(u16_data, ztd::text::utf16,
	     output_view, ztd::text::shift_jis_x0208, ztd::text::replacement_handler,
	     ztd::text::replacement_handler, utf16_decode_state,
	     shift_jis_encode_state, pivot);

	// Verify everything is in a state we expect it to be in!
	// A Shift-JIS encoded character string.
	constexpr const char expected_shift_jis_string[]
	     = "\x82\xb1\x82\xcc\x8d\x91\x82\xcc\x97\xf0\x8e\x6a\x82\xcd\x90\xa2\x8a"
	       "\x45\x82\xaa\x82\xdc\x82\xbe\x96\xa2\x8a\xae\x90\xac\x82\xc5\x81\x41"
	       "\x90\x5f\x97\x6c\x82\xb7\x82\xe7\x82\xa2\x82\xc8\x82\xa9\x82\xc1\x82"
	       "\xbd\x82\xc6\x82\xb1\x82\xeb\x82\xa9\x82\xe7\x8e\x6e\x82\xdc\x82\xe9"
	       "\x81\x42\x8c\xbb\x8d\xdd\x82\xcc\x93\xfa\x96\x7b\x82\xcd\x81\x77\x90"
	       "\x5f\x97\x6c\x82\xcc\x8f\x5a\x82\xde\x93\x56\x8a\x45\x81\x78\x81\x77"
	       "\x90\x6c\x8a\xd4\x82\xcc\x8f\x5a\x82\xde\x92\x6e\x8f\xe3\x81\x78\x81"
	       "\x77\x8e\x80\x8e\xd2\x82\xcc\x8f\x5a\x82\xde\x96\xbb\x8a\x45\x81\x78"
	       "\x82\xcc\x8e\x4f\x91\x77\x82\xc9\x95\xaa\x82\xa9\x82\xea\x82\xc4\x82"
	       "\xa2\x82\xe9\x82\xaf\x82\xea\x82\xc7\x81\x41\x82\xbb\x82\xcc\x8d\xa0"
	       "\x82\xcd\x82\xdc\x82\xbe\x8b\x43\x91\xcc\x82\xc6\x8c\xc5\x91\xcc\x82"
	       "\xb7\x82\xe7\x95\xaa\x82\xa9\x82\xea\x82\xc4\x82\xa2\x82\xc8\x82\xad"
	       "\x82\xc4\x81\x41\x83\x4a\x83\x49\x83\x58\x8f\xf3\x91\xd4\x82\xcc\x90"
	       "\xa2\x8a\x45\x82\xaa\x82\xc7\x82\xb1\x82\xdc\x82\xc5\x82\xe0\x82\xc7"
	       "\x82\xb1\x82\xdc\x82\xc5\x82\xe0\x8d\x4c\x82\xaa\x82\xc1\x82\xc4\x82"
	       "\xa2\x82\xe9\x82\xbe\x82\xaf\x82\xbe\x82\xc1\x82\xbd\x81\x42\x82\xbb"
	       "\x82\xea\x82\xa9\x82\xe7\x92\xb7\x81\x5b\x81\x5b\x82\xa2\x92\xb7\x81"
	       "\x5b\x81\x5b\x81\x5b\x82\xa2\x8e\x9e\x8a\xd4\x82\xaa\x8c\x6f\x82\xc1"
	       "\x82\xbd\x82\xa0\x82\xe9\x93\xfa\x81\x41\x82\xd3\x82\xa2\x82\xc9\x93"
	       "\x56\x82\xc6\x92\x6e\x82\xaa\x95\xaa\x82\xa9\x82\xea\x82\xbd\x81\x42"
	       "\x82\xb7\x82\xe9\x82\xc6\x81\x41\x82\xc7\x82\xb1\x82\xa9\x82\xe7\x82"
	       "\xc6\x82\xe0\x82\xc8\x82\xad\x93\x56\x82\xc9\x31\x90\x6c\x82\xcc\x90"
	       "\x5f\x97\x6c\x82\xaa\x81\x41\x82\xc8\x82\xe8\x82\xc8\x82\xe8\x82\xc6"
	       "\x90\xb6\x82\xdc\x82\xea\x82\xc4\x82\xab\x82\xbd\x81\x42";

	ZTD_TEXT_ASSERT(shift_jis_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!shift_jis_result.errors_were_handled());
	ZTD_TEXT_ASSERT(
	     shift_jis_result.pivot_error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(shift_jis_result.pivot_error_count == 0);
	ZTD_TEXT_ASSERT(ztd::ranges::empty(shift_jis_result.input));
	ZTD_TEXT_ASSERT(shift_jis_string == expected_shift_jis_string);
	return 0;
}

Here, we use an exceptionally small buffer to keep memory usage down. Note that the buffer should be at least as large as ztd::text::max_code_points_v<FromEncoding> (FromEncoding in this case being the ztd::text::utf16 encoding used above) so that no "insufficient output size" errors occur during translation. (For ztd::text::encode operations, the buffer should be at least as large as ztd::text::max_code_units_v<FromEncoding>.) A pivot buffer that is too small can produce unpredictable failures and unexpected behavior from unanticipated errors, so make sure to always provide a suitably-sized pivot buffer! Or, alternatively, just let the implementation use its defaults, which are (generally) tuned to work out well enough for most conversion routines and platforms.
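
The role of the pivot can be sketched with the standard library alone: decode as many code points as fit into a caller-provided intermediate buffer, encode that chunk, and repeat. This is an illustration under stated assumptions, not ztd.text's actual algorithm: utf16_to_utf8_via_pivot_sketch and append_utf8 are hypothetical names, the target is UTF-8 rather than Shift-JIS to keep the sketch short, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of a valid Unicode scalar value.
void append_utf8(std::string& output, char32_t cp) {
	if (cp < 0x80) {
		output.push_back(static_cast<char>(cp));
	}
	else if (cp < 0x800) {
		output.push_back(static_cast<char>(0xC0 | (cp >> 6)));
		output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
	else if (cp < 0x10000) {
		output.push_back(static_cast<char>(0xE0 | (cp >> 12)));
		output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
	else {
		output.push_back(static_cast<char>(0xF0 | (cp >> 18)));
		output.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
}

// Hypothetical helper (not a ztd.text API). Assumes well-formed UTF-16.
// The pivot must hold at least one code point, or no progress can be made.
std::string utf16_to_utf8_via_pivot_sketch(
     std::u16string_view input, char32_t* pivot, std::size_t pivot_size) {
	assert(pivot_size > 0);
	std::string output;
	while (!input.empty()) {
		// Step 1: decode as many code points as fit into the pivot
		std::size_t filled = 0;
		while (filled < pivot_size && !input.empty()) {
			char32_t cp = input.front();
			input.remove_prefix(1);
			if (cp >= 0xD800 && cp <= 0xDBFF && !input.empty()) {
				// high surrogate: combine with the low surrogate that follows
				cp = 0x10000 + ((cp - 0xD800) << 10) + (input.front() - 0xDC00);
				input.remove_prefix(1);
			}
			pivot[filled++] = cp;
		}
		// Step 2: encode everything currently held in the pivot
		for (std::size_t i = 0; i < filled; ++i) {
			append_utf8(output, pivot[i]);
		}
	}
	return output;
}
```

The size of the pivot changes only how many round trips the loop makes, not the result, which is why a small fixed buffer keeps memory usage predictable.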

Encoding & Decoding

Encoding and decoding look identical to transcoding, just using the ztd::text::decode and ztd::text::encode functions. ztd::text::decode will always produce a sequence of the encoding’s code point type (ztd::text::code_point_t<some_encoding_type>). ztd::text::encode will always produce a sequence of the encoding’s code unit type (ztd::text::code_unit_t<some_encoding_type>), and the lower-level functions ending in _to and _into will produce a ztd::text::encode_result (for encoding) or ztd::text::decode_result (for decoding):

#include <ztd/text.hpp>

#include <iostream>
#include <string>
#include <string_view>

int main(int, char*[]) {
	const char input[]
	     = "\xbe\xc8\xb3\xe7\x2c\x20\xbf\xc0\xb4\xc3\xc0\xba\x20\xc1\xc1\xc0\xba"
	       "\x20\xb3\xaf\xc0\xcc\xbf\xa1\xbf\xe4\x21";

	// Decode, with result to check!
	auto korean_decoded_output_result
	     = ztd::text::decode_to(input, ztd::text::euc_kr_uhc);
	ZTD_TEXT_ASSERT(korean_decoded_output_result.error_code
	     == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!korean_decoded_output_result.errors_were_handled());
	ZTD_TEXT_ASSERT(ztd::ranges::empty(korean_decoded_output_result.input));
	const std::u32string& korean_decoded_output
	     = korean_decoded_output_result.output;

	// Take decoded Unicode code points and encode them into UTF-8
	auto korean_utf8_output_result
	     = ztd::text::encode_to(korean_decoded_output, ztd::text::compat_utf8);
	ZTD_TEXT_ASSERT(
	     korean_utf8_output_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!korean_utf8_output_result.errors_were_handled());
	ZTD_TEXT_ASSERT(ztd::ranges::empty(korean_utf8_output_result.input));
	const std::string& korean_utf8_output = korean_utf8_output_result.output;
	// verify that what we got out in UTF-8 would be the same if we converted
	// it back to EUC-KR.
	ZTD_TEXT_ASSERT(ztd::ranges::equal(std::string_view(input),
	     ztd::text::transcode(korean_utf8_output, ztd::text::compat_utf8,
	          ztd::text::euc_kr_uhc, ztd::text::pass_handler)));
	// A Korean greeting!
	std::cout.write(korean_utf8_output.data(), korean_utf8_output.size());
	std::cout << std::endl;

	return 0;
}

Encode and decode operations are part of each encoding, represented by its encoding type. Every encoding object natively understands how to go from a sequence of its encoded data to its decoded data, and vice-versa, with the encode_one and decode_one functions. One should not call these functions directly, however, and should instead use the functions shown above. Because decode and encode operations do not feature intermediate steps, there is no ztd::text::pivot<…> for these functions.
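To make the idea of a single encode or decode step concrete, here is a minimal standalone sketch of what one such step could look like for UTF-8. It does not use ztd.text at all: the names utf8_encode_one and utf8_decode_one are hypothetical, errors are reduced to a bool return value, and the library's own encode_one and decode_one are more general than this.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of ztd.text): appends the UTF-8 code units
// for exactly one code point. Returns false for surrogates and for values
// above U+10FFFF, leaving `out` untouched.
inline bool utf8_encode_one(char32_t cp, std::string& out) {
	if (cp < 0x80) {
		out.push_back(static_cast<char>(cp));
	} else if (cp < 0x800) {
		out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
		out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	} else if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			return false; // surrogates are not encodable
		}
		out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
		out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	} else if (cp <= 0x10FFFF) {
		out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
		out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	} else {
		return false; // outside the Unicode range
	}
	return true;
}

// Hypothetical helper (not part of ztd.text): reads exactly one code point
// from the front of `in`. On success the consumed code units are removed
// from `in`; on failure `in` and `out` are left untouched.
inline bool utf8_decode_one(std::string_view& in, char32_t& out) {
	if (in.empty()) {
		return false;
	}
	const auto b0 = static_cast<unsigned char>(in[0]);
	std::size_t length = 0;
	char32_t cp = 0;
	char32_t smallest = 0; // smallest value allowed for this length
	if (b0 < 0x80) {
		length = 1; cp = b0; smallest = 0;
	} else if ((b0 & 0xE0) == 0xC0) {
		length = 2; cp = b0 & 0x1F; smallest = 0x80;
	} else if ((b0 & 0xF0) == 0xE0) {
		length = 3; cp = b0 & 0x0F; smallest = 0x800;
	} else if ((b0 & 0xF8) == 0xF0) {
		length = 4; cp = b0 & 0x07; smallest = 0x10000;
	} else {
		return false; // stray continuation byte or invalid lead byte
	}
	if (in.size() < length) {
		return false; // truncated sequence
	}
	for (std::size_t i = 1; i < length; ++i) {
		const auto b = static_cast<unsigned char>(in[i]);
		if ((b & 0xC0) != 0x80) {
			return false; // not a continuation byte
		}
		cp = (cp << 6) | (b & 0x3F);
	}
	// Reject overlong forms, out-of-range values, and surrogates.
	if (cp < smallest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
		return false;
	}
	in.remove_prefix(length);
	out = cp;
	return true;
}
```

A whole-string conversion is then a loop over such single steps plus error handling and output management, which is the bookkeeping the higher-level functions take care of for you.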

Counting

Counting is done using the ztd::text::count_as_decoded, ztd::text::count_as_encoded, and ztd::text::count_as_transcoded functions. As the names imply, they yield the number of code points or code units that will result from an attempted decode, encode, or transcode operation on a sequence of text. Each returns a ztd::text::count_result detailing that information:

#include <ztd/text.hpp>

#include <cstdint>
#include <iostream>
#include <string_view>
#include <vector>

int main(int, char*[]) {
	const char input[]
	     = " OSSL   s  s  RFC ss   "
	       "-cqj0qgheba6zgdehhb85bfc31d5m2evf4423k0a7nd6abq3flcampfa17ac5froq64c0"
	       "a2a7nbcyjnb1b7yp96t0e31nkf95i";
	std::vector<char32_t> output(256);
	auto counting_result
	     = ztd::text::count_as_decoded(input, ztd::text::punycode);
	ZTD_TEXT_ASSERT(counting_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!counting_result.errors_were_handled());
	if (counting_result.count > 256) {
		std::cerr << "The input punycode exceeds the IDNA limited size buffer: "
		             "change parameters to allocate a larger one!"
		          << std::endl;
		return 1;
	}
	output.resize(counting_result.count);
	auto decoding_result = ztd::text::decode_into_raw(
	     input, ztd::text::punycode, ztd::span<char32_t>(output));
	std::size_t decoding_result_count
	     = decoding_result.output.data() - output.data();
	ZTD_TEXT_ASSERT(decoding_result.error_code == ztd::text::encoding_error::ok);
	ZTD_TEXT_ASSERT(!decoding_result.errors_were_handled());
	ZTD_TEXT_ASSERT(ztd::ranges::empty(decoding_result.input));
	ZTD_TEXT_ASSERT(decoding_result_count == counting_result.count);

	// Show decoded punycode (translate to UTF-8 to print to console)
	std::cout << "Decoded punycode code points:\n\t";
	for (const auto& code_point : output) {
		std::cout << std::hex << std::showbase
		          << static_cast<uint_least32_t>(code_point) << " ";
	}
	std::cout << "\n" << std::endl;

	std::cout << "Decoded punycode as UTF-8:\n\t";
	ztd::text::encode_view<ztd::text::utf8_t> print_view(
	     std::u32string_view(output.data(), decoding_result_count));
	for (auto u8_code_unit : print_view) {
		std::cout.write(reinterpret_cast<const char*>(&u8_code_unit), 1);
	}
	std::cout << std::endl;

	return 0;
}

Getting counts is essential for sizing allocated buffers to exactly what is necessary, or for making use of small buffer optimizations by checking sizes before potentially spilling over into larger allocations.
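The count-first, allocate-once pattern can also be shown without the library. The following is a minimal standalone sketch, assuming the input is already valid UTF-32 (no surrogates, nothing above U+10FFFF); utf8_units_for, utf8_count_units, and utf8_encode_exact are hypothetical names and not part of ztd.text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: how many UTF-8 code units one code point needs.
// Assumes `cp` is a valid Unicode scalar value.
constexpr std::size_t utf8_units_for(char32_t cp) noexcept {
	return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Pass 1: count only. Nothing is written and nothing is allocated.
constexpr std::size_t utf8_count_units(std::u32string_view input) noexcept {
	std::size_t count = 0;
	for (char32_t cp : input) {
		count += utf8_units_for(cp);
	}
	return count;
}

// Pass 2: reserve exactly the counted size once, then fill the buffer.
inline std::string utf8_encode_exact(std::u32string_view input) {
	const std::size_t expected = utf8_count_units(input);
	std::string output;
	output.reserve(expected);
	for (char32_t cp : input) {
		switch (utf8_units_for(cp)) {
		case 1:
			output.push_back(static_cast<char>(cp));
			break;
		case 2:
			output.push_back(static_cast<char>(0xC0 | (cp >> 6)));
			output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
			break;
		case 3:
			output.push_back(static_cast<char>(0xE0 | (cp >> 12)));
			output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
			output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
			break;
		default:
			output.push_back(static_cast<char>(0xF0 | (cp >> 18)));
			output.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
			output.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
			output.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
			break;
		}
	}
	// The counting pass and the writing pass must agree.
	assert(output.size() == expected);
	return output;
}
```

The same count can instead be compared against a fixed limit before any work is done, as the punycode example does with its 256-element buffer.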

Validation

Validation checks whether the input sequence can be decoded, encoded, or transcoded with the given encoding (or, for transcoding, the given pair of encodings). It works through the ztd::text::validate_decodable_as, ztd::text::validate_encodable_as, and ztd::text::validate_transcodable_as functions:

#include <ztd/text.hpp>

#include <cstdint>
#include <iostream>
#include <string_view>

int main(int, char*[]) {
	constexpr const std::u32string_view input = U"meow🐱moew🐱!";

	// At compile-time: returns a structure with (explicit) operator bool
	// to allow it to be used with ! and if() statements
	static_assert(!ztd::text::validate_encodable_as(input, ztd::text::ascii),
	     "Unfortunately, ASCII does not support emoji.");

	// At run-time: returns a structure
	auto validate_result
	     = ztd::text::validate_encodable_as(input, ztd::text::ascii);

	// Check if the result is valid (should not be valid).
	if (validate_result) {
		// Everything was verified (not expected! ❌)
		std::cerr << "Unexpectedly, the input text was all valid ASCII."
		          << std::endl;
		return 1;
	}

	// Otherwise, everything was not verified (expected! ✅)
	std::cout << "As expected, the input text was not valid ASCII."
	          << "\n"
	          << "Here are the unicode hex values of the unvalidated UTF-32 code "
	             "points:\n\t";
	// Use the structure to know where we left off.
	std::u32string_view unused_input(
	     validate_result.input.data(), validate_result.input.size());
	for (const auto& u32_codepoint : unused_input) {
		std::cout << "0x" << std::hex
		          << static_cast<uint_least32_t>(u32_codepoint);
		if (&u32_codepoint != &unused_input.back()) {
			std::cout << " ";
		}
	}
	std::cout << std::endl;
	return 0;
}
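The shape of such a result, a structure with an explicit operator bool that also records where validation stopped, can be sketched in a few lines of standalone C++. This is only an illustration of the pattern, not the library's implementation: ascii_validate_result and validate_ascii_encodable are hypothetical names, and the check is hard-coded to 7-bit ASCII.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type (not part of ztd.text): records whether the
// whole input validated, plus the part of the input that was not consumed.
struct ascii_validate_result {
	bool valid;
	// Empty on success; otherwise starts at the first offending code point.
	std::u32string_view input;

	// Explicit, so it works in if() and with ! but never converts silently.
	constexpr explicit operator bool() const noexcept {
		return valid;
	}
};

// Hypothetical helper: checks that every code point fits in 7-bit ASCII.
constexpr ascii_validate_result validate_ascii_encodable(
     std::u32string_view input) noexcept {
	for (std::size_t i = 0; i < input.size(); ++i) {
		if (input[i] > 0x7F) {
			return { false, input.substr(i) };
		}
	}
	return { true, input.substr(input.size()) };
}

// Usable at compile-time, just like at run-time.
static_assert(validate_ascii_encodable(U"meow"), "plain ASCII validates");
static_assert(!validate_ascii_encodable(U"meow\U0001F431"),
     "emoji are outside of ASCII");
```

Keeping the unread input in the result is what lets the caller report, skip, or replace the offending text instead of merely learning that validation failed.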

There’s More!

There is more you can do with this library, from authoring your own encoding objects/types to taking control of the performance of conversions. More will be added to this Getting Started as time goes on, but if you have any inkling of something that should work, give it a try! If it fails in a way you don’t think is helpful, please let us know through any of our available communication channels so we can assist you!