Need for Speed: Extension Points
The core encoding/decoding loops and the Lucky 7 design, while flexible, can come with performance degradation due to its one-by-one nature. There are many well-researched speedups to validating, counting, and converting UTF and other kinds of text. In order to accommodate these, ztd.text has a number of places to overload the core behavior by way of named Argument Dependent Lookup (ADL or Koenig Lookup, named after Andrew Koenig) functions that serve as extension points. They are listed, with their expected argument forms / counts, here.
Extension points: Arguments
For all extension points, arguments are given based on what was input to one of the original higher-level functions. They have these forms and general requirements:
tag
- The first argument to every extension point that takes a single encoding. Thetag
type isztd::tag<...>
with anyconst
,volatile
, or references (&
and&&
) removed from thedecltype
of theencoding
.duo_tag
- The first argument to every extension point that takes 2 encodings. Thetag
type isztd::tag<...>
with anyconst
,volatile
, or references (&
and&&
) removed from thedecltype
of the twoencoding
s.encoding
- The encoding used to perform the operation. Can be prefixed withfrom_
orto_
in the argument list to show it is one of two encodings used to perform e.g. a transcode operation.input
- The input range. Can be of any type. Most encodings should at the very least handle basic iterator-iterator pairs correctly. These are allowed to haveconst
-correct iterators that produceconst
-correct references, so never assume you can write to the input, and appropriatelyconst
-qualify anystd::span
s you use.output
- The output range. Can be of any output range type, such as aunbounded_view<>
with aback_inserter
or astd::span
for direct memory writes. The types only requirement is that you can write to it by getting an iterator frombegin(...)
, and calling*it = value;
.handler
- The error handler used to perform the operation. Can be prefixed withfrom_
orto_
in the argument list to show it is one of two error handlers used to perform e.g. a transcode operation.state
- The state objects used to perform the operation. States are always passed by non-const
, l-value reference. Can be prefixed withfrom_
orto_
in the argument list to show it is one of two states associated with an encoding with the same prefix.pivot
- The pivot range that can be used for any intermediary storage; should be used in place of allocating data within a routine so an end-user can avoid allocation if he/she desires for any intermediate productions.
Extension Points: Forms & Return Types
Overriding any one of these extension points allows you to hook that behavior. It is very much required that you either use concrete types to provide these ADL extension points, or heavily constrain them using SFINAE (preferred for C++17 and below) or Concepts (only C++20 and above).
text_decode
Form: text_decode(tag, input, encoding, output, handler, state)
.
An extension point to speed up decoding operations for a given encoding, its input and output ranges, and the associated error handler and state. This can be helpful for encodings which may need to hide certain parts of their state.
Must return a ztd::text::decode_result.
text_encode
Form: text_encode(input, encoding, output, handler, state)
.
An extension point to speed up encoding operations for a given encoding, its input and output ranges, and the associated error handler and state. This can be helpful for encodings which may need to hide certain parts of their state.
Must return a ztd::text::encode_result.
text_transcode
Form: text_transcode(input, from_encoding, output, to_encoding, from_handler, to_handler,
from_state, to_state, pivot)
An extension point to speed up transcoding in bulk, for a given encoding pair, its input and output ranges, and its error handlers and states. Useful for known encoding pairs that have faster conversion paths between them.
Must return a ztd::text::transcode_result.
text_transcode_one
Form: text_transcode_one(input, from_encoding, output, to_encoding, from_handler, to_handler,
from_state, to_state)
An extension point to provide faster one-by-one encoding transformations for a given encoding pair, its input and output ranges, and its error handlers and states. This is not a bulk extension point conversion. It is used in the ztd::text::transcode_view type to increase the speed of iteration, where possible.
Must return a ztd::text::transcode_result.
text_validate_encodable_as_one
Form: text_validate_encodable_as_one(input, encoding, state)
An extension point to provide faster one-by-one validation. Provides a shortcut to not needing to perform both a decode_one
and an encode_one
step during the basic validation loop.
Must return a ztd::text::validate_result.
text_validate_decodable_as_one
Form: text_validate_decodable_as_one(input, encoding, state)
An extension point to provide faster one-by-one validation. Provides a shortcut to not needing to perform both a encode_one
and an decode_one
step during the basic validation loop.
Must return a ztd::text::validate_result.
text_validate_transcodable_as_one
Form: text_validate_transcodable_as_one(input, from_encoding, to_encoding, decode_state, encode_state, pivot)
An extension point to provide faster one-by-one validation. Provides a shortcut to not needing to perform both a encode_one
and an decode_one
step during the basic validation loop.
Must return a ztd::text::validate_transcode_result.
text_validate_encodable_as
Form: text_validate_encodable_as(input, encoding, state)
An extension point to provide faster bulk code point validation. There are many tricks to speed up validating of text using bit twiddling of the input sequence and more.
Must return a ztd::text::validate_result.
text_validate_decodable_as
Form: text_validate_decodable_as(input, encoding, state)
An extension point to provide faster bulk code unit validation. There are many tricks to speed up validating of text using bit twiddling of the input sequence and more.
Must return a ztd::text::validate_result.
text_count_as_encoded_one
Form: text_count_as_encoded_one(input, encoding, handler, state, pivot)
An extension point to provide faster one-by-one counting. Computation cycles can be saved by only needing to check a subset of things. For example, specific code point ranges can be used to get a count for UTF-16 faster than by encoding into an empty buffer.
Must return a ztd::text::count_result.
text_count_as_decoded_one
Form: text_count_as_decoded_one(input, encoding, handler, state)
An extension point to provide faster one-by-one counting. Computation cycles can be saved by only needing to check a subset of things. For example, the leading byte in UTF-8 can provide an immediate count for how many trailing bytes, leading to a faster counting algorithm.
Must return a ztd::text::count_result.
text_count_as_encoded
Form: text_count_as_encoded(input, encoding, handler, state)
An extension point for faster bulk code point validation.
Must return a ztd::text::count_result.
text_count_as_decoded
Form: text_count_as_decoded(input, encoding, handler, state)
An extension point for faster bulk code point validation.
Must return a ztd::text::count_result.
That’s All of Them
Each of these extension points are important to one person, or another. For example, Daniel Lemire spends a lot of time optimizing UTF-8 routines for fast validation or Fast Deterministic Finite Automata (DFA) decoding of UTF-8 and more. There are many more sped up counting, validating, encoding, and decoding routines: therefore it is critical that any library writer or application developer can produce those for their encodings and, on occasion, override the base behavior and implementation-defined internal speed up written by ztd.text itself.