Decode

Decoding is the action of converting a sequence of encoded information into a sequence of decoded information. The formula given for Decoding is effectively just the top half of the diagram shown in the main Lucky 7 documentation, reproduced here with emphasis added:

Encode and Decode paths, split up and labeled with "Decode" on the top side, "Encode" on the bottom side, and the respective operation loops in each of the image's corners.

The generic pathway between 2 encodings, but modified to show the exact difference between the encoding step and the decoding step.

In particular, we are interested in the operation that takes us from the encoded input to the decoded output, which is the top half of the diagram. The output we are interested in is labeled as an “intermediate”, because that is often what it is. Still, there are many uses for working directly with the decoded data. Many Unicode algorithms are specified to work over Unicode code points or Unicode scalar values. To identify Word Breaks, classify Uppercase vs. Lowercase, perform Casefolding, match regular expressions over certain properties, Normalize text for search and other operations, and much more, one needs to work with code points as the basic unit of operation.

Thus, we use the algorithm below to do the work. Given an input of code_units with an encoding, a target output, and any necessary additional state, we can generically bulk convert the input sequence to a sequence of code_points in the output:

  • ⏩ Is the input value empty? If so, is the state finished, with nothing left to output? If both are true, return the current results with the empty input, the output, and the state: everything is okay ✅!

  • ⏩ Otherwise,

    1. Do the decode_one step from input (using its begin() and end()) into the output code_point storage location.

      • 🛑 If it failed, return with the current input (unmodified from before this iteration, if possible), output, and state.

  • ⏩ Update input’s begin() value to point to after what was read by the decode_one step.

  • ⤴️ Go back to the start.
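
The steps above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the ztd.text implementation: the names decode_error, decode_one_result, decode_result, and bulk_decode are hypothetical, the encoding is hard-coded to UTF-8 over char, and because UTF-8 carries no state between code points, the state parameter from the description is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// All names below are hypothetical and exist only for this sketch.
enum class decode_error { ok, invalid_sequence, incomplete_sequence };

struct decode_one_result {
    std::size_t read;     // number of code units consumed
    char32_t code_point;  // the decoded value (meaningful only on success)
    decode_error error;
};

struct decode_result {
    std::string_view input;  // whatever was left unread
    std::u32string output;   // code points decoded so far
    decode_error error;
};

// Decodes exactly one code point from the front of a non-empty UTF-8 input.
inline decode_one_result decode_one(std::string_view input) {
    const auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(input[i]);
    };
    const unsigned char lead = byte(0);
    std::size_t length = 0;
    char32_t value = 0;
    if (lead < 0x80) { length = 1; value = lead; }
    else if ((lead & 0xE0) == 0xC0) { length = 2; value = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { length = 3; value = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { length = 4; value = lead & 0x07; }
    else { return {0, 0, decode_error::invalid_sequence}; }

    if (input.size() < length) {
        return {0, 0, decode_error::incomplete_sequence};
    }
    for (std::size_t i = 1; i < length; ++i) {
        if ((byte(i) & 0xC0) != 0x80) {
            return {0, 0, decode_error::invalid_sequence};
        }
        value = (value << 6) | (byte(i) & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    constexpr char32_t minimum[] = {0, 0, 0x80, 0x800, 0x10000};
    if (value < minimum[length] || (value >= 0xD800 && value <= 0xDFFF)
        || value > 0x10FFFF) {
        return {0, 0, decode_error::invalid_sequence};
    }
    return {length, value, decode_error::ok};
}

// The bulk loop: run decode_one until the input is empty or a step fails.
inline decode_result bulk_decode(std::string_view input) {
    std::u32string output;
    while (!input.empty()) {
        const decode_one_result one = decode_one(input);
        if (one.error != decode_error::ok) {
            // 🛑 Failure: input is still positioned at the offending sequence.
            return {input, std::move(output), one.error};
        }
        output.push_back(one.code_point);
        // ⏩ Advance past the code units that decode_one consumed.
        input.remove_prefix(one.read);
    }
    // ✅ Input exhausted and nothing is pending.
    return {input, std::move(output), decode_error::ok};
}
```

Calling bulk_decode on well-formed UTF-8 returns an empty remaining input and decode_error::ok. On ill-formed input it returns the code points decoded so far, with the remaining input still starting at the sequence that failed, so a caller can inspect it, skip it, or substitute a replacement. A stateful encoding would thread its state object through both functions and check it in the empty-input case.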

This involves a single encoding type, and so does not need cooperation with any other encoding to go from the code_units to the code_points. Notably, the encoding’s code_point type will hopefully be some sort of Unicode code point type (see: ztd::text::is_code_point for a more code-based classification). It does not have to be, though, for many different (and very valid) reasons.

Check out the API documentation for ztd::text::decode to learn more.