Bibliography

These are all the resources that this documentation links to, in alphabetical order.

encoding_rs

Henri Sivonen. “encoding_rs”. February 2021. URL: https://github.com/hsivonen/encoding_rs. A Rust library for performing encoding and decoding tasks. Takes a byte-based approach to handling encoding and decoding. The developer of this library worked on text in Mozilla Firefox for a very long time, and shares great insight into the field of text on their blog, https://hsivonen.fi.

Fast UTF-8

Bob Steagall. “Fast Conversion from UTF-8 with C++, DFAs, and SSE Intrinsics”. September 26th, 2019. URL: https://www.youtube.com/watch?v=5FQ87-Ecb-A. This presentation demonstrates one way to write a fast underlying decoder for UTF-8, rather than relying on the default implementation. Such a decoder can be hooked in through the conversion function extension points.

Fast UTF-8 Validation

Daniel Lemire. “Ridiculously fast unicode (UTF-8) validation”. October 20th, 2020. URL: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/. This blog post is one of many that present a faster, more optimized way to validate that UTF-8 is in its correct form.

glibc-25744

Tom Honermann and Carlos O’Donell. “mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters”. URL: https://sourceware.org/bugzilla/show_bug.cgi?id=25744. This bug report details a problem with the C standard library’s ability to handle multibyte characters that convert to more than one wide character. This problem is also present in the “1:N” and “N:1” rules in the C++ standard library.

iconv

Bruno Haible and Daiki Ueno. “libiconv”. August 2020. URL: https://savannah.gnu.org/git/?group=libiconv. A software library for working with and converting text. Ships on most, if not all, POSIX and Linux systems.

ICU

Unicode Consortium. “International Components for Unicode”. April 17th, 2019. URL: http://site.icu-project.org/. The premier library for not only performing encoding conversions, but also running other Unicode-related algorithms on sequences of text.

libogonek

R. Martinho Fernandes. “libogonek: A C++11 Library for Unicode”. September 29th, 2019. URL: https://github.com/libogonek/ogonek. One of the first influential C++11 libraries to bring the concept of iterators and ranges to not only encoding, but also normalization and other algorithms. Its great design was limited only by how incapable C++11 as a language was for what its author was trying to do.

n2282

Philip K. Krause. “N2282 - Additional multibyte/wide string conversion functions”. June 2018. URL: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2282.htm. This paper attempted to add a few Unicode conversion functions to C without changing any existing functionality.

Non-Unicode in C++

Henri Sivonen. “It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++”. URL: https://hsivonen.fi/non-unicode-in-cpp/. A rebuttal to P0244’s “strong code points” and “strong code units” opinion. This is discussed in depth in the design documentation for strong vs. weak code point and code unit types.

p0244

Tom Honermann. “P0244 - Text_view: A C++ concepts and range based character encoding and code point enumeration library”. URL: https://wg21.link/p0244. A C++ proposal presenting some of the first ideas for an extensible text encoding interface and lightweight ranges built on top of it. Reference implementation: https://github.com/tahonermann/text_view.

p1041

R. Martinho Fernandes. “P1041: Make char16_t/char32_t string literals be UTF-16/32”. February 2019. URL: https://wg21.link/p1041. This accepted paper enabled C++ to strongly associate all char16_t and char32_t string literals with UTF-16 and UTF-32. This is not the case for C.