Need for Speed: Extension Points

The core encoding/decoding loops and the Lucky 7 design, while flexible, can come with performance degradation due to their one-by-one nature. There are many well-researched speedups for validating, counting, and converting UTF and other kinds of text. To accommodate these, ztd.text has a number of places where the core behavior can be overloaded by way of named Argument Dependent Lookup (ADL, or Koenig Lookup, named after Andrew Koenig) functions that serve as extension points. They are listed here, with their expected argument forms and counts.

Extension Points: Arguments

For all extension points, arguments are given based on what was input to one of the original higher-level functions. They have these forms and general requirements:

  • tag - The first argument to every extension point that takes a single encoding. The tag type is ztd::text::tag<decltype(encoding)> with any const, volatile, or references (& and &&) removed from the decltype of the encoding.

  • duo_tag - The first argument to every extension point that takes 2 encodings. The tag type is ztd::text::tag<decltype(from_encoding), decltype(to_encoding)> with any const, volatile, or references (& and &&) removed from the decltype of the two encodings.

  • encoding - The encoding used to perform the operation. Can be prefixed with from_ or to_ in the argument list to show it is one of two encodings used to perform e.g. a transcode operation.

  • input - The input range. Can be of any type. Most encodings should at the very least handle basic iterator-iterator pairs correctly. These are allowed to have const-correct iterators that produce const-correct references, so never assume you can write to the input, and appropriately const-qualify any std::spans you use.

  • output - The output range. Can be of any output range type, such as an unbounded_view<> with a back_inserter or a std::span for direct memory writes. The type's only requirement is that you can write to it by getting an iterator from begin(...) and calling *it = value;.

  • handler - The error handler used to perform the operation. Can be prefixed with from_ or to_ in the argument list to show it is one of two error handlers used to perform e.g. a transcode operation.

  • state - The state objects used to perform the operation. States are always passed by non-const, l-value reference. Can be prefixed with from_ or to_ in the argument list to show it is one of two states associated with an encoding with the same prefix.
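
The tag rules in the list above can be sketched in a few lines. This is only an illustration, not the library's definition: sketch::tag, bare_t, tag_for, duo_tag_for, and the two encoding structs are hypothetical stand-ins for ztd::text::tag and real encoding types.

```cpp
#include <type_traits>

namespace sketch {
    // Hypothetical stand-in for ztd::text::tag (an assumption, not the real type).
    template <typename... Encodings>
    struct tag {};

    // Remove references first, then const/volatile, as the tag rules describe.
    template <typename T>
    using bare_t = std::remove_cv_t<std::remove_reference_t<T>>;

    // Tag for extension points that take a single encoding.
    template <typename Encoding>
    using tag_for = tag<bare_t<Encoding>>;

    // Tag for extension points that take two encodings.
    template <typename FromEncoding, typename ToEncoding>
    using duo_tag_for = tag<bare_t<FromEncoding>, bare_t<ToEncoding>>;

    struct my_utf8 {};  // hypothetical encoding types
    struct my_utf16 {};
} // namespace sketch

// However the encoding was passed in, the same tag type comes out.
static_assert(std::is_same_v<sketch::tag_for<const sketch::my_utf8&>,
                  sketch::tag<sketch::my_utf8>>,
    "const and & are stripped");
static_assert(std::is_same_v<sketch::duo_tag_for<sketch::my_utf8&&, volatile sketch::my_utf16&>,
                  sketch::tag<sketch::my_utf8, sketch::my_utf16>>,
    "both encodings are stripped");
```

Because the tag carries the stripped encoding type as a template argument, the encoding's namespace becomes an associated namespace for ADL, so an overload declared next to the encoding can be found without opening the library's namespace.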

Extension Points: Forms & Return Types

Overriding any one of these extension points allows you to hook that behavior. It is very much required that you either use concrete types to provide these ADL extension points, or heavily constrain them using SFINAE (preferred for C++17 and below) or Concepts (only C++20 and above).
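
The two options named above, concrete types and SFINAE constraints, might look roughly like the following. Everything in namespace lib is a hypothetical stand-in written for this sketch (including the simplified validate_result and the has_hook detection trait); it is not how ztd.text dispatches internally. Only the argument order, with the tag first, follows the arguments list above.

```cpp
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

namespace lib { // hypothetical stand-in for the library side
    template <typename... Encodings>
    struct tag {};

    // Simplified stand-in for ztd::text::validate_result (assumption).
    struct validate_result {
        bool valid;
        bool used_hook; // only here so the sketch can show which path ran
    };

    // Detects whether ADL finds a hook for exactly these argument types.
    template <typename Encoding, typename Input, typename State, typename = void>
    struct has_hook : std::false_type {};

    template <typename Encoding, typename Input, typename State>
    struct has_hook<Encoding, Input, State,
        std::void_t<decltype(text_validate_decodable_as(tag<Encoding> {},
            std::declval<const Input&>(), std::declval<const Encoding&>(),
            std::declval<State&>()))>> : std::true_type {};

    template <typename Input, typename Encoding, typename State>
    validate_result validate_decodable_as(const Input& input, const Encoding& encoding, State& state) {
        if constexpr (has_hook<Encoding, Input, State>::value) {
            return text_validate_decodable_as(tag<Encoding> {}, input, encoding, state);
        }
        else {
            // The generic one-by-one loop would run here; this sketch skips it.
            return validate_result { true, false };
        }
    }
} // namespace lib

namespace user {
    struct ascii_like {}; // hypothetical encodings and state
    struct other_encoding {};
    struct state {};

    // Option 1: concrete types only. Matches one encoding, one state type,
    // and inputs convertible to std::string_view.
    inline lib::validate_result text_validate_decodable_as(
        lib::tag<ascii_like>, std::string_view input, const ascii_like&, state&) {
        for (char c : input) {
            if (static_cast<unsigned char>(c) > 0x7F) return { false, true };
        }
        return { true, true };
    }

    // Option 2: a template, constrained with SFINAE to ranges of exactly char.
    template <typename Input,
        std::enable_if_t<std::is_same_v<typename Input::value_type, char>, int> = 0>
    lib::validate_result text_validate_decodable_as(
        lib::tag<other_encoding>, const Input&, const other_encoding&, state&) {
        return { true, true }; // placeholder for a real fast path
    }
} // namespace user
```

In this sketch a std::vector<int> input matches neither overload, so the generic path runs. An unconstrained template in the same spot would be selected for every input type, including ones the fast path cannot handle, which is why the constraint matters.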

text_decode

Form: text_decode(tag, input, encoding, output, handler, state).

An extension point to speed up decoding operations for a given encoding, its input and output ranges, and the associated error handler and state. This can be helpful for encodings which may need to hide certain parts of their state.

Must return a ztd::text::decode_result.

text_encode

Form: text_encode(tag, input, encoding, output, handler, state).

An extension point to speed up encoding operations for a given encoding, its input and output ranges, and the associated error handler and state. This can be helpful for encodings which may need to hide certain parts of their state.

Must return a ztd::text::encode_result.

text_transcode

Form: text_transcode(duo_tag, input, from_encoding, output, to_encoding, from_handler, to_handler, from_state, to_state).

An extension point to speed up transcoding in bulk, for a given encoding pair, its input and output ranges, and its error handlers and states. Useful for known encoding pairs that have faster conversion paths between them.

Must return a ztd::text::transcode_result.
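
As a rough illustration of a "faster conversion path" for a known pair, the sketch below widens Latin-1 directly to UTF-32. All names here are assumptions made for the example: lib::tag, lib::empty_state, the simplified lib::transcode_result, and the user::latin1 and user::utf32 encoding types are hypothetical, and the output is taken as a plain output iterator for brevity rather than as an output range. The first argument is the two-encoding tag described in the arguments list.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>

namespace lib { // hypothetical stand-ins for library types
    template <typename... Encodings>
    struct tag {};

    struct empty_state {};

    // Simplified stand-in for ztd::text::transcode_result (assumption).
    template <typename Output>
    struct transcode_result {
        std::size_t consumed; // number of input code units read
        Output output;        // output iterator after the last write
    };
} // namespace lib

namespace user {
    struct latin1 {}; // hypothetical encoding types
    struct utf32 {};

    // Every Latin-1 byte maps to the code point with the same value, so the
    // input can be widened in one pass with no intermediate decode/encode
    // step and no error handling. The handlers are accepted but never called.
    template <typename Output, typename FromHandler, typename ToHandler>
    lib::transcode_result<Output> text_transcode(lib::tag<latin1, utf32>,
        std::string_view input, const latin1&, Output output, const utf32&,
        FromHandler&&, ToHandler&&, lib::empty_state&, lib::empty_state&) {
        for (char c : input) {
            *output = static_cast<char32_t>(static_cast<unsigned char>(c));
            ++output;
        }
        return { input.size(), output };
    }
} // namespace user
```

The overload is pinned down by the concrete tag, encoding, input, and state types; only the output iterator and the handlers stay generic.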

text_transcode_one

Form: text_transcode_one(duo_tag, input, from_encoding, output, to_encoding, from_handler, to_handler, from_state, to_state).

An extension point to provide faster one-by-one encoding transformations for a given encoding pair, its input and output ranges, and its error handlers and states. This is not a bulk conversion extension point. It is used in the ztd::text::transcode_view type to increase the speed of iteration, where possible.

Must return a ztd::text::transcode_result.

text_validate_encodable_as_one

Form: text_validate_encodable_as_one(tag, input, encoding, state).

An extension point to provide faster one-by-one validation. Provides a shortcut that avoids performing both a decode_one and an encode_one step during the basic validation loop.

Must return a ztd::text::validate_result.

text_validate_decodable_as_one

Form: text_validate_decodable_as_one(tag, input, encoding, state).

An extension point to provide faster one-by-one validation. Provides a shortcut that avoids performing both an encode_one and a decode_one step during the basic validation loop.

Must return a ztd::text::validate_result.

text_validate_transcodable_as_one

Form: text_validate_transcodable_as_one(duo_tag, input, from_encoding, to_encoding, decode_state, encode_state).

An extension point to provide faster one-by-one validation. Provides a shortcut that avoids performing both an encode_one and a decode_one step during the basic validation loop.

Must return a ztd::text::validate_transcode_result.

text_validate_encodable_as

Form: text_validate_encodable_as(tag, input, encoding, state).

An extension point to provide faster bulk code point validation. There are many tricks to speed up validation of text, such as bit twiddling of the input sequence.

Must return a ztd::text::validate_result.

text_validate_decodable_as

Form: text_validate_decodable_as(tag, input, encoding, state).

An extension point to provide faster bulk code unit validation. There are many tricks to speed up validation of text, such as bit twiddling of the input sequence.

Must return a ztd::text::validate_result.
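
One well-known bit-twiddling trick of this kind is to test eight code units at once for the ASCII range. Input that is entirely ASCII is also valid UTF-8, so a bulk validator could try this first and only fall back to a slower path when it fails. The function name sketch::all_ascii is hypothetical; this is a standalone helper that a text_validate_decodable_as overload might call, not part of ztd.text.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

namespace sketch {
    // Returns true when every code unit in the input is ASCII (below 0x80).
    // Eight bytes are checked per step by testing all high bits with one mask.
    inline bool all_ascii(std::string_view input) noexcept {
        constexpr std::uint64_t high_bits = 0x8080808080808080ULL;
        const char* p = input.data();
        std::size_t n = input.size();
        while (n >= sizeof(std::uint64_t)) {
            std::uint64_t word;
            std::memcpy(&word, p, sizeof(word)); // safe for unaligned input
            if ((word & high_bits) != 0) return false;
            p += sizeof(word);
            n -= sizeof(word);
        }
        // Fewer than eight bytes remain: check them one at a time.
        for (; n != 0; ++p, --n) {
            if ((static_cast<unsigned char>(*p) & 0x80u) != 0) return false;
        }
        return true;
    }
} // namespace sketch
```

A false result here does not mean the text is invalid, only that it is not pure ASCII and needs the full validation logic.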

text_count_as_encoded_one

Form: text_count_as_encoded_one(tag, input, encoding, handler, state).

An extension point to provide faster one-by-one counting. Computation cycles can be saved by only needing to check a subset of things. For example, specific code point ranges can be used to get a count for UTF-16 faster than by encoding into an empty buffer.

Must return a ztd::text::count_result.
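
The UTF-16 range check mentioned above can be sketched as a single function. The name sketch::utf16_length is hypothetical, and the std::optional return is a simplification for this example; a real overload would report the count and any error through ztd::text::count_result instead.

```cpp
#include <cstddef>
#include <optional>

namespace sketch {
    // Number of UTF-16 code units needed to encode one code point, or
    // std::nullopt if the value is not a Unicode scalar value.
    constexpr std::optional<std::size_t> utf16_length(char32_t cp) noexcept {
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // lone surrogate
        if (cp > 0x10FFFF) return std::nullopt;                // beyond Unicode
        // Basic Multilingual Plane: one unit. Everything above: a surrogate pair.
        return std::size_t(cp < 0x10000 ? 1 : 2);
    }
} // namespace sketch
```

No output buffer is touched: three comparisons replace a full encode step.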

text_count_as_decoded_one

Form: text_count_as_decoded_one(tag, input, encoding, handler, state).

An extension point to provide faster one-by-one counting. Computation cycles can be saved by only needing to check a subset of things. For example, the leading byte in UTF-8 immediately tells you how many trailing bytes follow, leading to a faster counting algorithm.

Must return a ztd::text::count_result.
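
The leading-byte idea can be sketched as a small classification function. The name sketch::utf8_sequence_length is hypothetical, and the function only inspects the first byte: it does not check that the trailing bytes are present or well-formed, which a complete overload would still have to consider.

```cpp
#include <cstddef>

namespace sketch {
    // Total length, in code units, of the UTF-8 sequence that starts with
    // `lead`, or 0 if `lead` cannot start a well-formed sequence.
    constexpr std::size_t utf8_sequence_length(unsigned char lead) noexcept {
        if (lead < 0x80) return 1; // 0xxxxxxx: ASCII, no trailing bytes
        if (lead < 0xC2) return 0; // continuation byte, or overlong lead 0xC0/0xC1
        if (lead < 0xE0) return 2; // 110xxxxx: one trailing byte
        if (lead < 0xF0) return 3; // 1110xxxx: two trailing bytes
        if (lead < 0xF5) return 4; // 11110xxx up to 0xF4: three trailing bytes
        return 0;                  // 0xF5..0xFF never appear in UTF-8
    }
} // namespace sketch
```

The number of trailing bytes is the returned length minus one, so a counting loop can skip ahead by that many code units once it has decided how to treat malformed tails.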

text_count_as_encoded

Form: text_count_as_encoded(tag, input, encoding, handler, state).

An extension point for faster bulk counting.

Must return a ztd::text::count_result.

text_count_as_decoded

Form: text_count_as_decoded(tag, input, encoding, handler, state).

An extension point for faster bulk counting.

Must return a ztd::text::count_result.
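
For UTF-8, one common bulk approach is to count every byte that is not a continuation byte, since each code point contributes exactly one such byte. The sketch below assumes the input has already been validated as well-formed UTF-8; on malformed input the result is meaningless. The name sketch::count_utf8_code_points is hypothetical, and a real overload would wrap the count in ztd::text::count_result.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {
    // Counts code points in well-formed UTF-8 by counting the bytes that
    // start a sequence. Continuation bytes have the bit pattern 10xxxxxx.
    // Precondition (assumption): the input is valid UTF-8.
    inline std::size_t count_utf8_code_points(std::string_view input) noexcept {
        std::size_t count = 0;
        for (char c : input) {
            const unsigned char byte = static_cast<unsigned char>(c);
            if ((byte & 0xC0u) != 0x80u) {
                ++count;
            }
        }
        return count;
    }
} // namespace sketch
```

The loop has no data-dependent jumps between iterations, which makes it a reasonable candidate for compiler auto-vectorization.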

That’s All of Them

Each of these extension points is important to one person or another. For example, Daniel Lemire spends a lot of time optimizing UTF-8 routines, including fast validation and fast Deterministic Finite Automata (DFA) decoding of UTF-8. There are many more sped-up counting, validating, encoding, and decoding routines. It is therefore critical that any library writer or application developer can provide them for their encodings and, on occasion, override the base behavior and the implementation-defined internal speedups written by ztd.text itself.