Last night I stumbled across this wonderful post about Unicode that I wanted to share with you guys.
It contains a lot of curious information about Unicode that I did not know, and I believe it may be useful (up to a point) for those of you devs who struggle to understand Unicode and UTF.
TL;DR
To sum it up:
- Unicode has won.
- UTF-8 is the most popular encoding for data in transfer and at rest.
- UTF-16 is still sometimes used as an in-memory representation.
- The two most important views for strings are bytes (allocate memory/copy/encode/decode) and extended grapheme clusters (all semantic operations).
- Using code points for iterating over a string is wrong. They are not the basic unit of writing. One grapheme could consist of multiple code points.
- To detect grapheme boundaries, you need Unicode tables.
- Use a Unicode library for everything Unicode, even boring stuff like strlen, indexOf and substring.
- Unicode updates every year, and rules sometimes change.
- Unicode strings need to be normalized before they can be compared.
- Unicode depends on locale for some operations and for rendering.
- All this is important even for pure English text.
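To make a couple of these points concrete, here is a minimal Python sketch (my own illustration, not taken from the post). It uses only the standard library's `unicodedata` module; the helper names `describe` and `nfc_equal` are made up for this example. It shows that the same visible character "é" can be stored as one or two code points, that byte length differs from code point count, and that a plain comparison fails until both strings are normalized. It does not segment grapheme clusters, since that needs the Unicode tables mentioned above.

```python
import unicodedata

# "é" written two ways: one precomposed code point,
# or a plain "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"


def describe(s):
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(describe(composed))               # (1, 2)
print(describe(decomposed))             # (2, 3)
print(composed == decomposed)           # False: different code point sequences
print(nfc_equal(composed, decomposed))  # True: equal once normalized
```

Both strings render as a single grapheme, yet neither `len()` nor `==` reflects that on its own, which is why the list above recommends a Unicode library even for the boring operations.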
Overall, yes, Unicode is not perfect, but the fact that
- an encoding exists that covers all possible languages at once,
- the entire world agrees to use it,
- we can completely forget about encodings and conversions and all that stuff
is a miracle. Send this to your fellow programmers so they can learn about it, too.