State, Completion, Runtime Data, and More

Some states need extra functionality or additional information to function properly. This can manifest as:

  • needing extra data on a per-conversion basis that you can maintain yourself;

  • needing two different state types for encode and decode operations;

  • or needing runtime-dependent, conversion-dependent information for a specific conversion.

Extra Data and Completion

State objects are always passed into the function by non-const l-value reference (e.g. void f(state_type& state);). This means that, once a state is created, it can be used to influence how a specific algorithm works. While most encodings strive to have little to no meaningful state, others can have very meaningful state that should not be discarded between function calls or that may contribute meaningfully to the encoding or decoding process.

To aid with this, a state type can have a callable function of the form is_complete():

class encode_state {
public:
        encode_state_handle_t handle;

        bool is_complete() const noexcept {
                return state_handle_has_no_more_output(handle);
        }
};

The encode_state_handle_t type and the state_handle_has_no_more_output function are fictitious, but they represent how the given encode_state type would signal that it has no more work to do. This is useful for algorithms that need to be told a stream has no more data, so that they can produce an error if the final bits of data do not form a complete sequence, and for encoding algorithms (such as punycode) that need to collect all input before producing any output. When the state has this function, a user can call ztd::text::is_state_complete(some_state) as part of a condition to check whether a given conversion sequence and its state have fully serialized all possible data.
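As a rough illustration of how such a check can work, here is a minimal sketch of a completion query built with the C++17 detection idiom. The sketch namespace, has_is_complete, buffering_state, and plain_state are hypothetical names invented for this example, and this is not the actual implementation of ztd::text::is_state_complete. In particular, treating a state without is_complete() as always complete is an assumption of this sketch.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

namespace sketch {
	// Primary template: assume the state has no is_complete() member.
	template <typename State, typename = void>
	struct has_is_complete : std::false_type {};

	// Selected when `state.is_complete()` is a valid expression
	// on a const-qualified State.
	template <typename State>
	struct has_is_complete<State,
	     std::void_t<decltype(std::declval<const State&>().is_complete())>>
	: std::true_type {};

	// Hypothetical stand-in for a completion query.
	// ASSUMPTION: a state without is_complete() is always "complete".
	template <typename State>
	constexpr bool is_state_complete(const State& state) noexcept {
		if constexpr (has_is_complete<State>::value) {
			return state.is_complete();
		}
		else {
			return true;
		}
	}
} // namespace sketch

// A state that tracks buffered output and reports when it has drained.
struct buffering_state {
	int pending_output = 0;

	bool is_complete() const noexcept {
		return pending_output == 0;
	}
};

// A state with no notion of completion at all.
struct plain_state {};

static_assert(sketch::has_is_complete<buffering_state>::value,
     "buffering_state opts in by providing is_complete()");
static_assert(!sketch::has_is_complete<plain_state>::value,
     "plain_state does not opt in");
```

A caller working in a loop could then keep going while !sketch::is_state_complete(state) holds, without needing to know whether the state type opted in.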

Separate Encode/Decode States

It is no secret that encoding and decoding may carry separate states with them. While converting from a legacy encoding to Unicode may require maintenance of a shift state or code unit modifier, the opposite direction may not need any at all. Therefore, as an optimization, an encoding object can define both an encode_state and a decode_state, separate from each other. As an example, here is a (simplified) version of how ztd::text::execution, the encoding for the Locale-based Runtime Execution Encoding, has two separate states that need to be initialized in different manners:

class runtime_locale {
public:
	struct decode_state {
		std::mbstate_t c_stdlib_state;

		decode_state() noexcept : c_stdlib_state() {
			// properly set for mbrtoc32 state
			code_point ghost_output[2] {};
			UCHAR_ACCESS mbrtoc32(ghost_output, "\0", 1, &c_stdlib_state);
		}
	};

	struct encode_state {
		std::mbstate_t c_stdlib_state;

		encode_state() noexcept : c_stdlib_state() {
			// properly set for c32rtomb state
			code_unit ghost_output[MB_LEN_MAX] {};
			UCHAR_ACCESS c32rtomb(ghost_output, U'\0', &c_stdlib_state);
		}
	};

	// ... rest of the implementation
};

This is the proper way to initialize a std::mbstate_t from the C standard library. Then, you can use it! Here’s a complete implementation using the new encode_state and decode_state types:

class runtime_locale {
	// ... decode_state and encode_state from above ...

	// result types for the two directions; each carries its own state type
	using rtl_decode_result = ztd::text::decode_result<ztd::span<const code_unit>,
	     ztd::span<code_point>, decode_state>;
	using rtl_encode_result = ztd::text::encode_result<ztd::span<const code_point>,
	     ztd::span<code_unit>, encode_state>;
	using rtl_decode_error_handler
	     = std::function<rtl_decode_result(const runtime_locale&, rtl_decode_result,
	          ztd::span<const char>, ztd::span<const char32_t>)>;
	using rtl_encode_error_handler
	     = std::function<rtl_encode_result(const runtime_locale&, rtl_encode_result,
	          ztd::span<const char32_t>, ztd::span<const char>)>;

	using empty_code_unit_span  = ztd::span<const code_unit, 0>;
	using empty_code_point_span = ztd::span<const code_point, 0>;

public:
	rtl_decode_result decode_one(ztd::span<const code_unit> input,
	     ztd::span<code_point> output, rtl_decode_error_handler error_handler,
	     decode_state& current // decode-based state
	) const {
		if (output.size() < 1) {
			return error_handler(*this,
			     rtl_decode_result(input, output, current,
			          ztd::text::encoding_error::insufficient_output_space),
			     empty_code_unit_span(), empty_code_point_span());
		}
		std::size_t result = UCHAR_ACCESS mbrtoc32(
		     output.data(), input.data(), input.size(), &current.c_stdlib_state);
		switch (result) {
		case (std::size_t)0:
			// '\0' was encountered in the input
			// current.c_stdlib_state was "cleared"
			// '\0' character was written to output
			return rtl_decode_result(
			     input.subspan(1), output.subspan(1), current);
			break;
		case (std::size_t)-3:
			// no input read, pre-stored character
			// was written out
			return rtl_decode_result(input, output.subspan(1), current);
		case (std::size_t)-2:
			// input was an incomplete sequence
			return error_handler(*this,
			     rtl_decode_result(input, output, current,
			          ztd::text::encoding_error::incomplete_sequence),
			     empty_code_unit_span(), empty_code_point_span());
			break;
		case (std::size_t)-1:
			// invalid sequence!
			return error_handler(*this,
			     rtl_decode_result(input, output, current,
			          ztd::text::encoding_error::invalid_sequence),
			     empty_code_unit_span(), empty_code_point_span());
		}
		// everything is fine, then:
		// `result` code units were consumed to produce one code point
		return rtl_decode_result(
		     input.subspan(result), output.subspan(1), current);
	}

	rtl_encode_result encode_one(ztd::span<const code_point> input,
	     ztd::span<code_unit> output, rtl_encode_error_handler error_handler,
	     encode_state& current // encode-based state
	) const {
		// saved, in case we need to go
		// around multiple times to get
		// an output character
		ztd::span<const code_point> original_input = input;
		// The C standard library assumes
		// it can write out MB_CUR_MAX characters to the buffer:
		// we have no guarantee our output buffer is that big, so it
		// needs to go into an intermediate buffer instead
		code_unit intermediate_buffer[MB_LEN_MAX];

		for ([[maybe_unused]] int times_around = 0;; ++times_around) {
			if (input.size() < 1) {
				// no more input: everything is fine
				return rtl_encode_result(input, output, current);
			}
			std::size_t result = UCHAR_ACCESS c32rtomb(
			     intermediate_buffer, *input.data(), &current.c_stdlib_state);
			if (result == (std::size_t)-1) {
				// invalid sequence!
				return error_handler(*this,
				     rtl_encode_result(original_input, output, current,
				          ztd::text::encoding_error::invalid_sequence),
				     empty_code_point_span(), empty_code_unit_span());
			}
			else if (result == (std::size_t)0) {
				// this means nothing was output
				// we should probably go-around again,
				// after modifying input
				input = input.subspan(1);
				continue;
			}
			// otherwise, we got something written out!
			if (output.size() < result) {
				// can't fit!!
				return error_handler(*this,
				     rtl_encode_result(original_input, output, current,
				          ztd::text::encoding_error::insufficient_output_space),
				     empty_code_point_span(), empty_code_unit_span());
			}
			::std::memcpy(output.data(), intermediate_buffer,
			     sizeof(*intermediate_buffer) * result);
			input  = input.subspan(1);
			output = output.subspan(result);
			break;
		}
		return rtl_encode_result(input, output, current);
	}
};

int main(int argc, char* argv[]) {
	if (argc < 1) {
		return 0;
	}
	// Text coming in from the command line / program arguments
	// is (usually) encoded by the runtime locale
	runtime_locale encoding {};
	std::string_view first_arg       = argv[0];
	std::u32string decoded_first_arg = ztd::text::decode(
	     first_arg, encoding, ztd::text::replacement_handler_t {});
	// ... rest of the program
}

This allows you to maintain 2 different states, initialized in 2 different ways, one for each of the encode_one and decode_one function paths.
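To see the asymmetry in isolation, without the C standard library or ztd::text in the way, here is a minimal self-contained sketch, assuming C++17. The toy_shift_encoding class, its shift bytes, and the decode_all and encode_all helpers are all made up for this illustration and do not follow the real ztd::text encoding interface. The decoder has to remember a shift mode between calls, while the encoder needs no state at all.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical shift bytes for the toy encoding below.
constexpr char shift_in  = '\x0E'; // start of "upper case" mode
constexpr char shift_out = '\x0F'; // end of "upper case" mode

// Hypothetical toy encoding, ASCII letters only: upper case letters are
// written as lower case letters wrapped in shift_in / shift_out.
class toy_shift_encoding {
public:
	// Decoding must remember whether a shift is active.
	struct decode_state {
		bool upper_mode = false;
	};
	// Encoding always emits complete shift sequences, so it is stateless.
	struct encode_state {};

	// Consumes one code unit; may or may not append a code point.
	void decode_one(char unit, std::u32string& output, decode_state& state) const {
		if (unit == shift_in) {
			state.upper_mode = true;
			return;
		}
		if (unit == shift_out) {
			state.upper_mode = false;
			return;
		}
		if (state.upper_mode && unit >= 'a' && unit <= 'z') {
			unit = static_cast<char>(unit - 'a' + 'A');
		}
		output.push_back(static_cast<char32_t>(static_cast<unsigned char>(unit)));
	}

	// Consumes one code point; appends one or three code units.
	void encode_one(char32_t point, std::string& output, encode_state&) const {
		if (point >= U'A' && point <= U'Z') {
			output.push_back(shift_in);
			output.push_back(static_cast<char>(point - U'A' + U'a'));
			output.push_back(shift_out);
			return;
		}
		output.push_back(static_cast<char>(point));
	}
};

// The empty encode state costs nothing to create or carry around.
static_assert(std::is_empty_v<toy_shift_encoding::encode_state>);

inline std::u32string decode_all(const toy_shift_encoding& encoding, std::string_view input) {
	toy_shift_encoding::decode_state state {}; // one state for the whole run
	std::u32string output;
	for (char unit : input) {
		encoding.decode_one(unit, output, state);
	}
	return output;
}

inline std::string encode_all(const toy_shift_encoding& encoding, std::u32string_view input) {
	toy_shift_encoding::encode_state state {};
	std::string output;
	for (char32_t point : input) {
		encoding.encode_one(point, output, state);
	}
	return output;
}
```

Because the decode state is passed by non-const reference, the shift mode set by one decode_one call is still visible to the next one, which is what makes a shifted run of several letters decode correctly.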

Encoding-Dependent States

Some states need additional information in order to be constructed and used properly. This can be the case when the encoding has stored some type-erased information, as ztd::text::any_encoding does, or as a hand-written variant_encoding<utf8le, utf16be, ...> would. For example, given a type_erased_encoding like so:

class type_erased_encoding {
private:
        struct erased_state {
                virtual ~erased_state () {}
        };

        struct erased_encoding {
                virtual std::unique_ptr<erased_state> create_decode_state() = 0;
                virtual std::unique_ptr<erased_state> create_encode_state() = 0;

                virtual ~erased_encoding () {}
        };

        template <typename Encoding>
        struct typed_encoding : erased_encoding {
                Encoding encoding;

                struct decode_state : erased_state {
                        using state_type = ztd::text::decode_state_t<Encoding>;
                        state_type state;

                        decode_state(const Encoding& some_encoding)
                        : state(ztd::text::make_decode_state(some_encoding)) {
                                // get a decode state from the given encoding
                        }
                };

                struct encode_state : erased_state {
                        using state_type = ztd::text::encode_state_t<Encoding>;
                        state_type state;

                        encode_state(const Encoding& some_encoding)
                        : state(ztd::text::make_encode_state(some_encoding)) {
                                // get an encode state from the given encoding
                        }
                };

                typed_encoding(Encoding&& some_encoding)
                : encoding(std::move(some_encoding)) {
                        // move encoding in
                }

                typed_encoding(const Encoding& some_encoding)
                : encoding(some_encoding) {
                        // copy encoding in
                }

                virtual std::unique_ptr<erased_state> create_decode_state() override {
                        return std::make_unique<decode_state>(encoding);
                }

                virtual std::unique_ptr<erased_state> create_encode_state() override {
                        return std::make_unique<encode_state>(encoding);
                }
        };

        std::unique_ptr<erased_encoding> stored;

public:
        template <typename AnyEncoding>
        type_erased_encoding(AnyEncoding&& some_encoding)
        : stored(std::make_unique<typed_encoding<std::remove_cvref_t<AnyEncoding>>>(
                std::forward<AnyEncoding>(some_encoding))
        ) {
                // store any encoding in the member unique pointer
        }

        // ... rest of the implementation
};

We can see that creating a state with a default constructor no longer works, because the state requires more information than a default constructor has available: it needs access to the wrapped encoding. The solution to this problem is an opt-in: give your state type a constructor that takes the encoding type:

class type_erased_encoding {
        // from above, etc. …
public:
        // public-facing wrappers
        struct type_erased_decode_state {
        public:
                // special constructor!!
                type_erased_decode_state (const type_erased_encoding& encoding)
                : stored(encoding.stored->create_decode_state()) {
                        // hold onto type-erased state
                }
        private:
                std::unique_ptr<erased_state> stored;
        };

        struct type_erased_encode_state {
        public:
                // special constructor!!
                type_erased_encode_state (const type_erased_encoding& encoding)
                : stored(encoding.stored->create_encode_state()) {
                        // hold onto type-erased state
                }
        private:
                std::unique_ptr<erased_state> stored;
        };

        using decode_state = type_erased_decode_state;
        using encode_state = type_erased_encode_state;

        // ... rest of the Lucky 7 members
};

These special constructors will create the necessary state using information from the type_erased_encoding to do it properly. This will allow us to have states that properly reflect what was erased when we perform a given higher-level conversion operation or algorithm.

This encoding-aware state-construction behavior is detected by the ztd::text::is_state_independent, ztd::text::is_decode_state_independent, and ztd::text::is_encode_state_independent classifications.

These classifications are used in the ztd::text::make_decode_state and ztd::text::make_encode_state function calls to correctly construct a state object, which is what the API uses to make states for its higher-level function calls. If you are working in a generic context, you should use these functions too when working at this level of detail. However, if you’re not working with templates, consider simply using the already-provided ztd::text::any_encoding to do exactly what this example shows, with some extra attention to detail and internal optimizations done on your behalf.
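The selection between default construction and encoding-aware construction can be sketched in a few lines, assuming C++17. Everything in the sketch namespace below, along with simple_encoding and configured_encoding, is hypothetical and written only for this illustration. The real ztd::text traits and factory functions may use different conditions; the rule used here (prefer a constructor taking the encoding when one exists) is an assumption of the sketch.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {
	// Hypothetical trait: a decode state is "independent" of its encoding
	// when it cannot be constructed from the encoding.
	// ASSUMPTION: this mirrors the idea, not the library's exact definition.
	template <typename Encoding>
	inline constexpr bool is_decode_state_independent_v
	     = !std::is_constructible_v<typename Encoding::decode_state, const Encoding&>;

	// Hypothetical factory: default-construct independent states,
	// otherwise hand the encoding to the state's special constructor.
	template <typename Encoding>
	typename Encoding::decode_state make_decode_state(const Encoding& encoding) {
		using state_type = typename Encoding::decode_state;
		if constexpr (is_decode_state_independent_v<Encoding>) {
			(void)encoding; // not needed for this kind of state
			return state_type();
		}
		else {
			return state_type(encoding);
		}
	}
} // namespace sketch

// An encoding whose state needs nothing from the encoding object.
struct simple_encoding {
	struct decode_state {
		int shift = 0;
	};
};

// An encoding whose state must be seeded from runtime data
// stored inside the encoding object.
struct configured_encoding {
	int initial_shift;

	struct decode_state {
		int shift;

		// special constructor: opts in to encoding-aware construction
		explicit decode_state(const configured_encoding& encoding)
		: shift(encoding.initial_shift) {
		}
	};
};

static_assert(sketch::is_decode_state_independent_v<simple_encoding>);
static_assert(!sketch::is_decode_state_independent_v<configured_encoding>);
```

Generic code that always goes through a factory like this never has to know which kind of state it is dealing with, which is the reason to prefer the library's own factory functions over constructing states by hand.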