Modified UTF-8

Modified Unicode Transformation Format 8 (MUTF-8) is a UTF-8 variant employed by some Android components and other ecosystems. Its special property is that it encodes the null character ('\0'), which terminates C-style strings, as an overlong sequence. This is normally illegal in UTF-8, but is permitted here to ease interoperation with these systems.
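The overlong null described above can be illustrated with a small standalone sketch. It does not use ztd.text and is not its implementation: the name sketch_encode_mutf8 is hypothetical, and the sketch assumes every code point other than U+0000 is encoded exactly as in ordinary UTF-8.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of ztd.text: encodes one Unicode scalar value
// as UTF-8 bytes, except that U+0000 becomes the overlong pair 0xC0 0x80.
std::vector<std::uint8_t> sketch_encode_mutf8(std::uint32_t cp) {
    // Precondition for this sketch: a Unicode scalar value (no surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    auto byte = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };

    if (cp == 0) {
        return {0xC0, 0x80}; // overlong two-byte form instead of a raw 0x00
    }
    if (cp < 0x80) {
        return {byte(cp)};
    }
    if (cp < 0x800) {
        return {byte(0xC0 | (cp >> 6)), byte(0x80 | (cp & 0x3F))};
    }
    if (cp < 0x10000) {
        return {byte(0xE0 | (cp >> 12)), byte(0x80 | ((cp >> 6) & 0x3F)),
                byte(0x80 | (cp & 0x3F))};
    }
    return {byte(0xF0 | (cp >> 18)), byte(0x80 | ((cp >> 12) & 0x3F)),
            byte(0x80 | ((cp >> 6) & 0x3F)), byte(0x80 | (cp & 0x3F))};
}
```

With this sketch, no output ever contains a 0x00 byte, so the result can be stored in a null-terminated buffer without being cut short.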

Aliases

constexpr mutf8_t ztd::text::mutf8 = {}

An instance of the MUTF-8 type for ease of use.

using ztd::text::mutf8_t = basic_mutf8<uchar8_t>

A Modified UTF-8 Encoding that traffics in char8_t. See ztd::text::basic_mutf8 for more details.

Base Template

template<typename _CodeUnit, typename _CodePoint = unicode_code_point>
class basic_mutf8 : public __utf8_with<basic_mutf8<_CodeUnit, unicode_code_point>, _CodeUnit, unicode_code_point, __txt_detail::__empty_state, __txt_detail::__empty_state, true, false, true>

A Modified UTF-8 Encoding that traffics in, specifically, the desired code unit type provided as a template argument.

Remark

This type has a maximum of 6 input code units and a maximum of 1 output code point. Null values are encoded as an overlong sequence to specifically avoid problems with C-style strings, which is useful for working with bad implementations sitting on top of POSIX or other Operating System APIs. For a strict, Unicode-compliant UTF-8 Encoding, see ztd::text::basic_utf8.
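To show why this matters for C-style strings, here is a minimal standalone sketch (the function name sketch_count_code_points is hypothetical and not part of ztd.text). It walks a null-terminated buffer the way a C API would and counts code points by counting every byte that is not a continuation byte, assuming well-formed input.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of ztd.text: counts code points in a
// null-terminated byte string by counting non-continuation bytes.
// Assumes the input is well-formed (Modified) UTF-8.
std::size_t sketch_count_code_points(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        // Continuation bytes look like 10xxxxxx; everything else starts a sequence.
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Given the text "a", U+0000, "b", the strict UTF-8 bytes 61 00 62 are seen as a one-character string, because traversal stops at the embedded 00. The Modified UTF-8 bytes 61 C0 80 62 are traversed in full and yield three code points.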

Template Parameters:
  • _CodeUnit – The code unit type to use.

  • _CodePoint – The code point type to use.

Public Types

using is_unicode_encoding = ::std::true_type

Whether or not this encoding can encode all of Unicode.

using self_synchronizing_code = ::std::true_type

The start of a sequence can be found unambiguously when dropped into the middle of a sequence or after an error in reading has occurred for encoded text.

Remark

UTF-8 has definitive bit patterns which mark the start and continuation of a sequence. The bit pattern 0xxxxxxx indicates a lone byte, and 1xxxxxxx indicates a potential start byte for UTF-8. In particular, if 0 is not the first bit, it must be a sequence of 1s followed immediately by a 0 (e.g., 10xxxxxx, 110xxxxx, 1110xxxx, or 11110xxx).
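The bit patterns in this remark can be checked with simple masks. The following is a standalone sketch, not ztd.text code; the names sketch_byte_kind and sketch_classify are hypothetical, and only the patterns listed above are recognized.

```cpp
#include <cstdint>

// Hypothetical classification of a single code unit by its leading bits.
enum class sketch_byte_kind { single, continuation, lead2, lead3, lead4, other };

// Hypothetical helper, not part of ztd.text.
constexpr sketch_byte_kind sketch_classify(std::uint8_t b) {
    if ((b & 0x80) == 0x00) return sketch_byte_kind::single;       // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return sketch_byte_kind::continuation; // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return sketch_byte_kind::lead2;        // 110xxxxx
    if ((b & 0xF0) == 0xE0) return sketch_byte_kind::lead3;        // 1110xxxx
    if ((b & 0xF8) == 0xF0) return sketch_byte_kind::lead4;        // 11110xxx
    return sketch_byte_kind::other; // patterns not listed in the remark
}
```

Because a continuation byte can never be mistaken for a lone or leading byte, a reader dropped at an arbitrary position only needs to move forward until the classification is no longer continuation.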

using decode_state = __txt_detail::__empty_state

The state that can be used between calls to the encoder and decoder. It is normally an empty struct because there is no shift state to preserve between complete units of encoded information.

using encode_state = __txt_detail::__empty_state

The state that can be used between calls to the encoder and decoder. It is normally an empty struct because there is no shift state to preserve between complete units of encoded information.

using code_unit = _CodeUnit

The individual units that result from an encode operation or are used as input to a decode operation. For UTF-8 formats, this is usually char8_t, but this can change (see ztd::text::basic_utf8).

using code_point = unicode_code_point

The individual units that result from a decode operation or are used as input to an encode operation. For most encodings, this is going to be a Unicode Code Point or a Unicode Scalar Value.

using is_decode_injective = ::std::true_type

Whether or not the decode operation can process all forms of input into code point values. This is true for all Unicode Transformation Formats (UTFs), which can encode and decode without a loss of information from a valid collection of code units.

using is_encode_injective = ::std::true_type

Whether or not the encode operation can process all forms of input into code unit values. This is true for all Unicode Transformation Formats (UTFs), which can encode and decode without loss of information from a valid input code point.

Public Static Functions

static inline constexpr ::ztd::span<const code_unit, 3> replacement_code_units() noexcept

Returns the replacement code units to use for the ztd::text::replacement_handler_t error handler.

static inline constexpr ::ztd::span<const code_point, 1> replacement_code_points() noexcept

Returns the replacement code point to use for the ztd::text::replacement_handler_t error handler.

static inline constexpr auto skip_input_error(decode_result<_Input, _Output, _State> __result, const _InputProgress &__input_progress, const _OutputProgress &__output_progress) noexcept

Allows an encoding to discard input characters if an error occurs, taking in both the state and the input sequence to modify through the result type.

Remark

This will skip every input value until a proper starting byte is found.
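The skipping behavior in this remark can be sketched on a plain byte buffer. This is an illustration only, not the ztd.text implementation: the name sketch_skip_to_next_start and its index-based interface are assumptions, whereas the real function works on result objects and ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of ztd.text: given the index of the code unit
// where decoding failed, returns the index of the next code unit that is not a
// continuation byte (10xxxxxx), or `size` if there is none.
std::size_t sketch_skip_to_next_start(const std::uint8_t* data, std::size_t size,
                                      std::size_t error_pos) {
    assert(error_pos < size);
    std::size_t i = error_pos + 1; // always drop at least the offending code unit
    while (i < size && (data[i] & 0xC0) == 0x80) {
        ++i; // continuation bytes cannot begin a sequence
    }
    return i;
}
```

For the bytes E2 82 41, an error reported at index 0 (a truncated three-byte sequence) resumes at index 2, where the lone byte 41 begins.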

static inline constexpr auto skip_input_error(encode_result<_Input, _Output, _State> __result, const _InputProgress &__input_progress, const _OutputProgress &__output_progress) noexcept

Allows an encoding to discard input characters if an error occurs, taking in both the state and the input sequence (by reference) to modify.

Remark

This will skip every input value until a proper UTF-32 Unicode scalar value (or code point) is found.
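On the encode side, the equivalent idea is to move past input values that are not Unicode scalar values. A minimal standalone sketch follows; the names sketch_is_scalar_value and sketch_skip_to_scalar are hypothetical, and whether the library accepts lone surrogates as input is not stated here, so this sketch assumes it does not.

```cpp
#include <cstddef>

// Hypothetical helper: true for U+0000..U+10FFFF excluding the surrogate
// range U+D800..U+DFFF.
constexpr bool sketch_is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper, not part of ztd.text: returns the index of the first
// scalar value at or after `pos`, or `size` if none remains.
constexpr std::size_t sketch_skip_to_scalar(const char32_t* data, std::size_t size,
                                            std::size_t pos) {
    while (pos < size && !sketch_is_scalar_value(data[pos])) {
        ++pos;
    }
    return pos;
}
```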

static inline constexpr auto encode_one(_Input &&__input, _Output &&__output, _ErrorHandler &&__error_handler, encode_state &__s)

Encodes a single complete unit of information as code units and produces a result with the input and output ranges moved past what was successfully read and written; or, produces an error and returns the input and output ranges untouched.

Remark

To the best ability of the implementation, the iterators will be returned untouched (e.g., the input models at least a view and a forward_range). If it is not possible, returned ranges may be incremented even if an error occurs due to the semantics of any view that models an input_range.

Parameters:
  • __input – [in] The input view to read code points from.

  • __output – [in] The output view to write code units into.

  • __error_handler – [in] The error handler to invoke if encoding fails.

  • __s – [inout] The necessary state information. For this encoding, the state is empty and means very little.

Returns:

A ztd::text::encode_result object that contains the reconstructed input range, reconstructed output range, error handler, and a reference to the passed-in state.

static inline constexpr auto decode_one(_Input &&__input, _Output &&__output, _ErrorHandler &&__error_handler, decode_state &__s)

Decodes a single complete unit of information as code points and produces a result with the input and output ranges moved past what was successfully read and written; or, produces an error and returns the input and output ranges untouched.

Remark

To the best ability of the implementation, the iterators will be returned untouched (e.g., the input models at least a view and a forward_range). If it is not possible, returned ranges may be incremented even if an error occurs due to the semantics of any view that models an input_range.

Parameters:
  • __input – [in] The input view to read code units from.

  • __output – [in] The output view to write code points into.

  • __error_handler – [in] The error handler to invoke if decoding fails.

  • __s – [inout] The necessary state information. For this encoding, the state is empty and means very little.

Returns:

A ztd::text::decode_result object that contains the reconstructed input range, reconstructed output range, error handler, and a reference to the passed-in state.
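As a rough illustration of the "advance on success, leave untouched on error" contract, here is a standalone sketch of decoding one sequence from a byte buffer. It is not the ztd.text implementation and does not use its range or error handler machinery. The names sketch_decode_result and sketch_decode_one are hypothetical, and the sketch makes several assumptions: only the overlong form C0 80 is tolerated, a raw 00 byte is accepted as U+0000, surrogates are rejected, and sequences longer than 4 code units are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch.
struct sketch_decode_result {
    char32_t code_point;  // meaningful only when ok is true
    std::size_t consumed; // code units read; 0 means the input was left untouched
    bool ok;
};

// Hypothetical helper, not part of ztd.text: decodes one code point.
sketch_decode_result sketch_decode_one(const std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    const sketch_decode_result failure{0, 0, false};
    if (size == 0) return failure;

    const std::uint8_t b0 = data[0];
    if (b0 < 0x80) return {b0, 1, true}; // lone byte, 0xxxxxxx

    std::size_t length = 0;
    char32_t cp = 0;
    char32_t minimum = 0; // smallest value allowed for this length
    if ((b0 & 0xE0) == 0xC0) { length = 2; cp = b0 & 0x1F; minimum = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { length = 3; cp = b0 & 0x0F; minimum = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { length = 4; cp = b0 & 0x07; minimum = 0x10000; }
    else return failure; // stray continuation byte or unsupported lead byte

    if (size < length) return failure; // truncated sequence
    for (std::size_t i = 1; i < length; ++i) {
        if ((data[i] & 0xC0) != 0x80) return failure; // not 10xxxxxx
        cp = (cp << 6) | (data[i] & 0x3F);
    }

    // The one overlong form this sketch accepts: C0 80 for U+0000.
    const bool overlong_null = (length == 2 && cp == 0);
    if (cp < minimum && !overlong_null) return failure;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return failure;
    return {cp, length, true};
}
```

A caller would advance its read position by consumed on success. On failure it would hand control to whatever error policy it uses, which is the role the error handler parameter plays in the real interface.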

Public Static Attributes

static constexpr ::std::size_t max_code_points

The maximum number of code points a single complete operation of decoding can produce. This is 1 for all Unicode Transformation Format (UTF) encodings.

static constexpr ::std::size_t max_code_units

The maximum number of code units a single complete operation of encoding can produce. If overlong sequences are allowed, this is 6; otherwise, this is 4.

static constexpr ::ztd::text_encoding_id encoded_id

The encoding ID for this type. Used for optimization purposes.

static constexpr ::ztd::text_encoding_id decoded_id

The encoding ID for this type. Used for optimization purposes.