Replacement Characters

Replacement characters are a way to communicate to the end-user that something went wrong, without having to throw an exception that may stop the world or stop the encoding/decoding process altogether. The default error handler for text (ztd::text::default_handler, unless configured otherwise) leaves room for your own encoding types to supply their replacement characters. An encoding can do so in two ways that are recognized by the library:

Always Has A Replacement

If your type always has a replacement character, regardless of the situation, it can signal this by writing one or both of the following functions:

  • replacement_code_units() (for any failed encode step)

  • replacement_code_points() (for any failed decode step)

These functions return a contiguous range of either code_units or code_points, typically a std::span<const code_unit> or a std::span<const code_point>.

class runtime_locale {
public:
	// Assumed for this excerpt: the encoding works on plain char code units.
	using code_unit = char;

	ztd::span<const code_unit> replacement_code_units() const noexcept {
		if (this->contains_unicode_encoding()) {
			// Probably CESU-8 or UTF-8! These bytes are U+FFFD in both.
			static const char replacement[3] = { '\xEF', '\xBF', '\xBD' };
			return replacement;
		}
		else {
			// Uh... well, it probably has this? ¯\_(ツ)_/¯
			static const char replacement[1] = { '?' };
			return replacement;
		}
	}

	// ... the rest of the encoding, including contains_unicode_encoding(), is omitted here ...
};

If the returned replacement range is empty, then nothing is inserted at all, as this is treated as a deliberate choice by the user. See the next section for how to provide such a function but gracefully report “no replacement” under certain runtime conditions.

These always-available functions are employed, for example, in the ztd::text::ascii encoding, which uses a ‘?’ as its replacement code_unit and code_point value.
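As a rough illustration of the two functions and the empty-range rule described in this section, here is a minimal sketch. The types question_mark_encoding and silent_encoding and the helper append_replacement_code_units are hypothetical names made up for this example and are not part of ztd.text. The helper only mimics what a replacement-style handler might do with the returned range. std::basic_string_view stands in for the span types so that the sketch compiles as C++17; it is assumed here that any contiguous range is acceptable.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical encoding-like type; only the replacement functions are shown.
struct question_mark_encoding {
	using code_unit  = char;
	using code_point = char32_t;

	// Consulted when an encode step fails.
	std::basic_string_view<code_unit> replacement_code_units() const noexcept {
		static const code_unit replacement[1] = { '?' };
		return { replacement, 1 };
	}

	// Consulted when a decode step fails.
	std::basic_string_view<code_point> replacement_code_points() const noexcept {
		static const code_point replacement[1] = { U'?' };
		return { replacement, 1 };
	}
};

// Hypothetical encoding that deliberately asks for nothing to be inserted.
struct silent_encoding {
	using code_unit = char;

	std::basic_string_view<code_unit> replacement_code_units() const noexcept {
		return {}; // empty range: insert nothing
	}
};

// Hypothetical stand-in for a handler: appends whatever range the encoding
// returns. An empty range appends nothing.
template <typename Encoding>
void append_replacement_code_units(const Encoding& encoding, std::string& output) {
	for (char unit : encoding.replacement_code_units()) {
		output.push_back(unit);
	}
}

// Usage example: a failed encode step after "ab" was already written.
inline void replacement_demo() {
	std::string output = "ab";
	append_replacement_code_units(question_mark_encoding {}, output);
	assert(output == "ab?");
	append_replacement_code_units(silent_encoding {}, output);
	assert(output == "ab?"); // unchanged by the empty range
}
```

Returning a view over a static array keeps the function noexcept and allocation-free, which matches the style of the runtime_locale excerpt above.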

Maybe Has A Replacement

If your type might not have a range of replacement characters, and you will not know whether it does until run time, the encoding type can signal this by writing a different pair of functions:

  • maybe_replacement_code_units() (for any failed encode step)

  • maybe_replacement_code_points() (for any failed decode step)

These functions return a std::optional of a contiguous range of either code_units or code_points, typically a std::optional<std::span<const code_unit>> or a std::optional<std::span<const code_point>>. If the optional is not engaged (it does not have a value stored), then the replacement algorithm uses its default logic to insert a replacement character, if possible. Otherwise, if it does have a value, it uses that range. If it has a value but the range is empty, it uses that empty range (and inserts nothing).

This is useful for encodings which provide runtime-erased wrappers or that wrap platform APIs like Win32, whose CPINFOEXW structure contains both a WCHAR UnicodeDefaultChar; and a BYTE DefaultChar[MAX_DEFAULTCHAR];. Such values can be stored on the encoding object and then returned as the replacement ranges.
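The three outcomes described in this section (disengaged optional, engaged with a range, engaged with an empty range) can be sketched as follows. The class runtime_encoding and the helper append_maybe_replacement are hypothetical names for this example only. The stored string stands in for a value that a real wrapper would query from the platform at run time, and the ‘?’ fallback is merely a placeholder for whatever the library's default logic does. As before, std::basic_string_view is used instead of a span so the sketch compiles as C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical runtime-configured encoding; only the replacement query is shown.
class runtime_encoding {
public:
	using code_unit = char;

	// Nothing is known about a replacement: defer to the default logic.
	runtime_encoding() = default;

	// A replacement was obtained at run time (it may be empty on purpose).
	explicit runtime_encoding(std::string replacement)
	: m_replacement(std::move(replacement)) {
	}

	std::optional<std::basic_string_view<code_unit>> maybe_replacement_code_units() const noexcept {
		if (!m_replacement.has_value()) {
			return std::nullopt;
		}
		return std::basic_string_view<code_unit>(*m_replacement);
	}

private:
	std::optional<std::string> m_replacement;
};

// Hypothetical stand-in for the handler's three-way decision.
template <typename Encoding>
void append_maybe_replacement(const Encoding& encoding, std::string& output) {
	const auto maybe_replacement = encoding.maybe_replacement_code_units();
	if (!maybe_replacement.has_value()) {
		// Disengaged: placeholder for the library's default logic.
		output.push_back('?');
		return;
	}
	// Engaged: use exactly the given range, even if it is empty.
	for (char unit : *maybe_replacement) {
		output.push_back(unit);
	}
}

// Usage example covering all three outcomes.
inline void maybe_replacement_demo() {
	std::string output;
	append_maybe_replacement(runtime_encoding(), output);    // disengaged
	append_maybe_replacement(runtime_encoding("_"), output); // engaged, one unit
	append_maybe_replacement(runtime_encoding(""), output);  // engaged, empty
	assert(output == "?_");
}
```

Keeping "no answer" (std::nullopt) distinct from "an answer of nothing" (an empty range) is what lets a wrapper stay silent on purpose without disabling the default behavior for everyone else.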

The Default

When none of the above applies, the ztd::text::replacement_handler_t will attempt to insert a Unicode Replacement Character (�, U'\uFFFD') or the ‘?’ character into the stream, in various ways. See ztd::text::replacement_handler_t for more details on that process!
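For reference, the three bytes used in the runtime_locale excerpt earlier are simply U+FFFD written in UTF-8. The sketch below shows where they come from; encode_utf8 is a hypothetical helper written for this example, not a ztd.text function, and it says nothing about how replacement_handler_t performs its own insertion.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode scalar value as UTF-8 (standard bit layout).
inline std::string encode_utf8(char32_t code_point) {
	// Surrogates and values above U+10FFFF are not scalar values.
	assert(code_point <= 0x10FFFF && !(code_point >= 0xD800 && code_point <= 0xDFFF));
	std::string out;
	if (code_point < 0x80) {
		out.push_back(static_cast<char>(code_point));
	}
	else if (code_point < 0x800) {
		out.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
		out.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
	}
	else if (code_point < 0x10000) {
		out.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
		out.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
	}
	else {
		out.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
		out.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
	}
	return out;
}
```

Calling encode_utf8(U'\uFFFD') yields the bytes EF BF BD, while encode_utf8(U'?') yields the single byte 3F, which is why ‘?’ is a workable fallback for encodings that cannot represent U+FFFD.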