The absolute minimum every software developer must know about Unicode in 2023

12 October 2023 Programming

Last night I stumbled across this wonderful post about Unicode, and I wanted to share it with you.

It contains plenty of curious details about Unicode that I did not know, and I believe it may be useful (up to a point) for those of you who struggle to understand Unicode and UTF.

TL;DR

To sum it up:

Unicode has won.
UTF-8 is the most popular encoding for data in transfer and at rest.
UTF-16 is still sometimes used as an in-memory representation.
The two most important views for strings are bytes (allocate memory/copy/encode/decode) and extended grapheme clusters (all semantic operations).
Using code points for iterating over a string is wrong. They are not the basic unit of writing. One grapheme could consist of multiple code points.
To detect grapheme boundaries, you need Unicode tables.
Use a Unicode library for everything Unicode, even boring stuff like strlen, indexOf and substring.
Unicode updates every year, and rules sometimes change.
Unicode strings need to be normalized before they can be compared.
Unicode depends on locale for some operations and for rendering.
All this is important even for pure English text.
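
To make a few of these points concrete, here is a minimal sketch, assuming Python 3 and nothing beyond its standard library. The helper names `describe` and `equal_normalized` are my own and do not come from the post. It shows that the same text has a different "length" depending on the view (UTF-8 bytes, UTF-16 code units, code points), and that two strings that look identical only compare equal after normalization. Python's standard library does not segment extended grapheme clusters, so counting graphemes would need a third-party Unicode library; that part is deliberately left out.

```python
import unicodedata


def describe(s: str) -> dict:
    """Return the size of a string under three different views."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # utf-16-le avoids the byte order mark; each code unit is 2 bytes.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # len() on a Python str counts code points, not graphemes.
        "code_points": len(s),
    }


def equal_normalized(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The letter "é" written two ways: precomposed, and "e" + combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# One visible emoji (man facepalming, medium-light skin tone) made of
# five code points joined with a zero width joiner.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

if __name__ == "__main__":
    print(describe(composed))
    print(describe(decomposed))
    print(describe(facepalm))
    print(composed == decomposed)                  # raw comparison
    print(equal_normalized(composed, decomposed))  # comparison after NFC
```

The single facepalm emoji comes out as 17 UTF-8 bytes, 7 UTF-16 code units and 5 code points, and none of those numbers is the "1" a user would expect. That is exactly why the post recommends a proper Unicode library even for boring things like strlen.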