State, Completion, Runtime Data, and More

Some states need extra functionality or additional information to function properly. This can manifest as:

needing extra data on a per-conversion basis that you can maintain yourself;
needing 2 different types for encode/decode operations;
or needing runtime-dependent, conversion-dependent information for a specific conversion.

Extra Data and Completion

State objects are always passed into the function by non-const l-value reference (e.g. void f(state_type& state);). This means that, once a state is created, it can be used to influence how a specific algorithm works. While most encodings strive to have little to no meaningful state, others can have very meaningful state that should not be discarded between function calls or that may contribute meaningfully to the encoding or decoding process.

To aid with this, a state type can have a callable function of the form is_complete():

class encode_state {
public:
        encode_state_handle_t handle;

        bool is_complete() const noexcept {
                return state_handle_has_no_more_output(handle);
        }
};

The encode_state_handle_t type and the state_handle_has_no_more_output function are fictitious, but they represent how the given encode_state type would signal that it has no more work to do. This is useful for algorithms that need to be told a stream has no more data, and should therefore produce an error if the final bits of data do not form a complete sequence, and for encoding algorithms (such as punycode) that need to collect all input before producing any output. When the state has this function, a user can call ztd::text::is_state_complete(some_state) as part of a condition to check whether a given conversion sequence and its state have fully serialized all possible data.
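To make the idea concrete, here is a minimal, standard-library-only sketch of a state that carries data between calls and reports completion. Every name in it (pair_decode_state, empty_state, decode_step, has_is_complete, sketch_is_state_complete) is hypothetical and invented for illustration; it is not how ztd::text implements is_state_complete. It also assumes that a state without an is_complete() member should be treated as always complete.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <type_traits>
#include <utility>

// Hypothetical toy state: remembers the first byte of a two-byte unit
// between calls, so it must not be discarded mid-stream.
struct pair_decode_state {
	std::optional<std::uint8_t> pending;

	bool is_complete() const noexcept {
		return !pending.has_value();
	}
};

// A state with nothing to track and no is_complete() member.
struct empty_state {};

// Toy single-step decoder: consumes one byte per call and yields a
// 16-bit value once both halves have been seen. The state is taken by
// non-const l-value reference so it can carry the pending byte forward.
inline std::optional<std::uint16_t> decode_step(
     std::uint8_t byte, pair_decode_state& state) {
	if (!state.pending) {
		state.pending = byte; // stash the high byte for the next call
		return std::nullopt;
	}
	std::uint16_t value
	     = static_cast<std::uint16_t>((*state.pending << 8) | byte);
	state.pending.reset();
	return value;
}

// Detection idiom: does State have a callable is_complete()?
template <typename State, typename = void>
struct has_is_complete : std::false_type {};

template <typename State>
struct has_is_complete<State,
     std::void_t<decltype(std::declval<const State&>().is_complete())>>
: std::true_type {};

// Assumption: a state without is_complete() counts as always complete.
template <typename State>
bool sketch_is_state_complete(const State& state) noexcept {
	if constexpr (has_is_complete<State>::value) {
		return state.is_complete();
	}
	else {
		return true;
	}
}

// Usage: feed one byte, observe the incomplete state, then finish the unit.
inline void completion_demo() {
	pair_decode_state state;
	assert(!decode_step(0x12, state).has_value());
	assert(!sketch_is_state_complete(state)); // half a unit is pending
	assert(decode_step(0x34, state).has_value());
	assert(sketch_is_state_complete(state)); // nothing left over
}
```

A caller that runs out of input while the state still reports incomplete knows the stream ended in the middle of a sequence and can raise an error instead of silently dropping the pending byte.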

Separate Encode/Decode States

It is no secret that encoding and decoding may carry separate states with them. While converting from a legacy encoding to Unicode may require maintenance of a shift state or code unit modifier, the opposite direction may not need any at all. Therefore, as an optimization, an encoding object can define both an encode_state and a decode_state, separate from each other. As an example, here is a (simplified) version of how ztd::text::execution, the encoding for the Locale-based Runtime Execution Encoding, has two separate states that need to be initialized in different manners:

class runtime_locale {
public:
	struct decode_state {
		ztd_mbstate_t c_stdlib_state;

		decode_state() noexcept : c_stdlib_state() {
			// properly set for mbrtoc32 state
			code_point ghost_output[2] {};
			UCHAR_ACCESS mbrtoc32(ghost_output, "\0", 1, &c_stdlib_state);
		}

		bool is_complete() const noexcept {
			return UCHAR_ACCESS mbsinit(&c_stdlib_state) != 0;
		}
	};

	struct encode_state {
		ztd_mbstate_t c_stdlib_state;

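		encode_state() noexcept : c_stdlib_state() {
			// properly set for c32rtomb state, mirroring decode_state above
			code_unit ghost_output[MB_LEN_MAX] {};
			UCHAR_ACCESS c32rtomb(ghost_output, U'\0', &c_stdlib_state);
		}
	};

	// ... rest of the implementation
};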

This is the proper way to initialize a std::mbstate_t from the C standard library. Then, you can use it! Here’s a complete implementation using the new encode_state and decode_state types:

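class runtime_locale {
public:
	// ... decode_state, encode_state, and the members before this point
	// ... tail of the function that returns the replacement code units:
			// Uh... well, it probably has this? ¯\_(ツ)_/¯
			static const char replacement[1] = { '?' };
			return replacement;
		}
	}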

private:
	using rtl_decode_result = ztd::text::decode_result<ztd::span<const code_unit>,
	     ztd::span<code_point>, decode_state>;
	using rtl_encode_result = ztd::text::encode_result<ztd::span<const code_point>,
	     ztd::span<code_unit>, encode_state>;
	using rtl_decode_error_handler
	     = std::function<rtl_decode_result(const runtime_locale&, rtl_decode_result,
	          ztd::span<const char>, ztd::span<const char32_t>)>;
	using rtl_encode_error_handler
	     = std::function<rtl_encode_result(const runtime_locale&, rtl_encode_result,
	          ztd::span<const char32_t>, ztd::span<const char>)>;

	using empty_code_unit_span  = ztd::span<const code_unit, 0>;
	using empty_code_point_span = ztd::span<const code_point, 0>;

public:
	rtl_decode_result decode_one(ztd::span<const code_unit> input,
	     ztd::span<code_point> output, rtl_decode_error_handler error_handler,
	     decode_state& current // decode-based state
	) const {
		if (output.size() < 1) {
			return error_handler(*this,
			     rtl_decode_result(input, output, current,
			          ztd::text::encoding_error::insufficient_output_space),
			     empty_code_unit_span(), empty_code_point_span());
		}
		std::size_t result = UCHAR_ACCESS mbrtoc32(
		     output.data(), input.data(), input.size(), &current.c_stdlib_state);
		switch (result) {
		case (std::size_t)0:
			// '\0' was encountered in the input
			// current.c_stdlib_state was "cleared"
			// '\0' character was written to output
			return rtl_decode_result(
			     input.subspan(1), output.subspan(1), current);
			break;
		case (std::size_t)-3:
			// no input read, pre-stored character
			// was written out
			return rtl_decode_result(input, output.subspan(1), current);
		case (std::size_t)-2:
			// input was an incomplete sequence
			return error_handler(*this,
			     rtl_decode_result(input, output, current,
			          ztd::text::encoding_error::incomplete_sequence),
			     empty_code_unit_span(), empty_code_point_span());
			break;
		case (std::size_t)-1:
			// invalid sequence!
			return error_handler(*this,
			     rtl_decode_result(input, output, current,
			          ztd::text::encoding_error::invalid_sequence),
			     empty_code_unit_span(), empty_code_point_span());
		}
		// everything was fine, then
		return rtl_decode_result(
		     input.subspan(result), output.subspan(1), current);
	}

	rtl_encode_result encode_one(ztd::span<const code_point> input,
	     ztd::span<code_unit> output, rtl_encode_error_handler error_handler,
	     encode_state& current // encode-based state
	) const {
		// saved, in case we need to go
		// around multiple times to get
		// an output character
		ztd::span<const code_point> original_input = input;
		// The C standard library assumes
		// it can write out MB_CUR_MAX characters to the buffer:
		// we have no guarantee our output buffer is that big, so it
		// needs to go into an intermediate buffer instead
		code_unit intermediate_buffer[MB_LEN_MAX];

		for ([[maybe_unused]] int times_around = 0;; ++times_around) {
			if (input.size() < 1) {
				// no more input: everything is fine
				return rtl_encode_result(input, output, current);
			}
			std::size_t result = UCHAR_ACCESS c32rtomb(
			     intermediate_buffer, *input.data(), &current.c_stdlib_state);
			if (result == (std::size_t)-1) {
				// invalid sequence!
				return error_handler(*this,
				     rtl_encode_result(original_input, output, current,
				          ztd::text::encoding_error::invalid_sequence),
				     empty_code_point_span(), empty_code_unit_span());
			}
			else if (result == (std::size_t)0) {
				// this means nothing was output
				// we should probably go-around again,
				// after modifying input
				input = input.subspan(1);
				continue;
			}
			// otherwise, we got something written out!
			if (output.size() < result) {
				// can't fit!!
				return error_handler(*this,
				     rtl_encode_result(original_input, output, current,
				          ztd::text::encoding_error::insufficient_output_space),
				     empty_code_point_span(), empty_code_unit_span());
			}
			::std::memcpy(output.data(), intermediate_buffer,
			     sizeof(*intermediate_buffer) * result);
			input  = input.subspan(1);
			output = output.subspan(result);
			break;
		}
		return rtl_encode_result(input, output, current);
	}
};

int main(int argc, char* argv[]) {
	if (argc < 1) {
		// ...
	}
	// ...
}

This allows you to maintain 2 different states, initialized in 2 different ways, one for each of the encode_one and decode_one function paths.
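The two-state idea can also be tried in isolation with nothing but the C standard library. The sketch below uses hypothetical names (sketch_decode_state, sketch_encode_state, decode_one_char, encode_one_char, round_trip_demo) and plain std::mbstate_t instead of the ztd_mbstate_t and UCHAR_ACCESS helpers from the listing above. It assumes the program runs in the default "C" locale, where an ASCII character converts 1:1, and it simply value-initializes each std::mbstate_t rather than priming it with a null character.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>

// Hypothetical stand-in for a decode-only state.
struct sketch_decode_state {
	std::mbstate_t c_stdlib_state {}; // value-initialized: initial conversion state

	bool is_complete() const noexcept {
		return std::mbsinit(&c_stdlib_state) != 0;
	}
};

// Hypothetical stand-in for an encode-only state.
struct sketch_encode_state {
	std::mbstate_t c_stdlib_state {};

	bool is_complete() const noexcept {
		return std::mbsinit(&c_stdlib_state) != 0;
	}
};

// Decode path: only ever touches the decode state.
inline std::size_t decode_one_char(const char* input, std::size_t size,
     char32_t& output, sketch_decode_state& state) {
	return std::mbrtoc32(&output, input, size, &state.c_stdlib_state);
}

// Encode path: only ever touches the encode state.
inline std::size_t encode_one_char(char32_t input,
     char (&output)[MB_LEN_MAX], sketch_encode_state& state) {
	return std::c32rtomb(output, input, &state.c_stdlib_state);
}

// Round-trips a single ASCII character through both paths.
inline void round_trip_demo() {
	sketch_decode_state decoding;
	sketch_encode_state encoding;

	char32_t code_point = 0;
	std::size_t read = decode_one_char("A", 1, code_point, decoding);
	assert(read == 1);
	assert(code_point == U'A');

	char bytes[MB_LEN_MAX] {};
	std::size_t written = encode_one_char(code_point, bytes, encoding);
	assert(written == 1);
	assert(bytes[0] == 'A');

	// Neither direction has anything left over.
	assert(decoding.is_complete());
	assert(encoding.is_complete());
}
```

Keeping the two std::mbstate_t objects apart means a half-finished decode can never corrupt an encode in progress, which is the same separation the encode_one and decode_one paths above rely on.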

Encoding-Dependent States

Some states need additional information in order to be constructed and used properly. This can be the case when the encoding stores type-erased information, as ztd::text::any_encoding does, or as a variant_encoding<utf8le, utf16be, ...> that you wrote yourself would. For example, given a type_erased_encoding like so:

class type_erased_encoding {
private:
        struct erased_state {
                virtual ~erased_state () {}
        };

        struct erased_encoding {
                virtual std::unique_ptr<erased_state> create_decode_state() = 0;
                virtual std::unique_ptr<erased_state> create_encode_state() = 0;

                virtual ~erased_encoding () {}
        };

        template <typename Encoding>
        struct typed_encoding : erased_encoding {
                Encoding encoding;

                struct decode_state : erased_state {
                        using state_type = ztd::text::decode_state_t<Encoding>;
                        state_type state;

                        decode_state(const Encoding& some_encoding)
                        : state(ztd::text::make_decode_state(some_encoding)) {
                                // get a decode state from the given encoding
                        }
                };

                struct encode_state : erased_state {
                        using state_type = ztd::text::encode_state_t<Encoding>;
                        state_type state;

                        encode_state(const Encoding& some_encoding)
                        : state(ztd::text::make_encode_state(some_encoding)) {
                                // get an encode state from the given encoding
                        }
                };

                typed_encoding(Encoding&& some_encoding)
                : encoding(std::move(some_encoding)) {
                        // move encoding in
                }

                typed_encoding(const Encoding& some_encoding)
                : encoding(some_encoding) {
                        // copy encoding in
                }

                virtual std::unique_ptr<erased_state> create_decode_state() override {
                        return std::make_unique<decode_state>(encoding);
                }

                virtual std::unique_ptr<erased_state> create_encode_state() override {
                        return std::make_unique<encode_state>(encoding);
                }
        };

        std::unique_ptr<erased_encoding> stored;

public:
        template <typename AnyEncoding>
        type_erased_encoding(AnyEncoding&& some_encoding)
        : stored(std::make_unique<typed_encoding<std::remove_cvref_t<AnyEncoding>>>(
                std::forward<AnyEncoding>(some_encoding))
        ) {
                // store any encoding in the member unique pointer
        }

        // ... rest of the implementation
};

We can see that creating a state with a default constructor no longer works, because the state requires more information than a default constructor can know. It needs access to the wrapped encoding. The solution is to opt in by giving your state type a constructor that takes the encoding type:

class type_erased_encoding {
        // from above, etc. …
public:
        // public-facing wrappers
        struct type_erased_decode_state {
        public:
                // special constructor!!
                type_erased_decode_state (const type_erased_encoding& encoding)
                : stored(encoding.stored->create_decode_state()) {
                        // hold onto type-erased state
                }
        private:
                std::unique_ptr<erased_state> stored;
        };

        struct type_erased_encode_state {
        public:
                // special constructor!!
                type_erased_encode_state (const type_erased_encoding& encoding)
                : stored(encoding.stored->create_encode_state()) {
                        // hold onto type-erased state
                }
        private:
                std::unique_ptr<erased_state> stored;
        };

        using decode_state = type_erased_decode_state;
        using encode_state = type_erased_encode_state;

        // ... rest of the Lucky 7 members
};

These special constructors will create the necessary state using information from the type_erased_encoding to do it properly. This will allow us to have states that properly reflect what was erased when we perform a given higher-level conversion operation or algorithm.

This encoding-aware state-construction behavior is detected by the ztd::text::is_state_independent, ztd::text::is_decode_state_independent, and ztd::text::is_encode_state_independent classifications.

These classifications are used by ztd::text::make_decode_state and ztd::text::make_encode_state to correctly construct a state object, and those functions are what the API uses to make states for its higher-level function calls. If you are working in a generic context, you should use these functions too when working at this level of detail. However, if you’re not working with templates, consider simply using the already-provided ztd::text::any_encoding, which does exactly what this example shows, with some extra attention to detail and internal optimizations done on your behalf.
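As a rough illustration of how such a classification can drive state construction, the sketch below picks between default construction and encoding-aware construction at compile time. All names here (simple_encoding, configured_encoding, sketch_is_decode_state_independent_v, sketch_make_decode_state) are hypothetical, and the trait is an assumed approximation built on std::is_constructible; the real ztd::text classifications and make_decode_state may be defined differently.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical encoding whose state can simply be default-constructed.
struct simple_encoding {
	struct decode_state {
		int shift = 0;
	};
};

// Hypothetical encoding whose state must be built from the encoding itself.
struct configured_encoding {
	int table_id;

	struct decode_state {
		int table_id;

		// the "special constructor": takes the encoding
		explicit decode_state(const configured_encoding& encoding) noexcept
		: table_id(encoding.table_id) {
		}
	};
};

// Assumption: "independent" means the state does NOT take the encoding
// as a constructor argument.
template <typename Encoding>
inline constexpr bool sketch_is_decode_state_independent_v
     = !std::is_constructible_v<typename Encoding::decode_state,
          const Encoding&>;

// Builds a decode state the right way for either kind of encoding.
template <typename Encoding>
typename Encoding::decode_state sketch_make_decode_state(
     const Encoding& encoding) {
	using state_type = typename Encoding::decode_state;
	if constexpr (sketch_is_decode_state_independent_v<Encoding>) {
		(void)encoding; // not needed for independent states
		return state_type();
	}
	else {
		return state_type(encoding);
	}
}

// Usage: generic code never needs to know which kind of state it got.
inline void make_state_demo() {
	auto plain = sketch_make_decode_state(simple_encoding {});
	assert(plain.shift == 0);

	auto configured = sketch_make_decode_state(configured_encoding { 7 });
	assert(configured.table_id == 7);
}
```

Generic code that always goes through a factory like this keeps working when an encoding later switches from an independent state to an encoding-dependent one.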