
I love reading stuff like this, because it reminds me that there are two entire universes of IT, and both are mostly filled with people blissfully unaware of the other.

UTF-8 is a great hack that works wonderfully on Linux and BSD, because neither actually supported internationalisation properly until recently. They clung to 8-bit ASCII with white knuckles until they could bear it no longer, but then UTF-8 came to the rescue and there was much rejoicing. "It's the inevitable future!" cried millions of Linux devs... in English. I mention this because UTF-8 is a bit... shit... if you're from Asia.
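To put rough numbers on the Asia point, here is a minimal sketch that compares encoded sizes for a few short strings. The `encoded_sizes` helper and the sample strings are my own illustration, not anything from a real codebase, and it only looks at raw text with no markup around it.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes `text` occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte-order mark.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Hypothetical sample strings, chosen only for illustration.
samples = {
    "english": "hello",
    "japanese": "こんにちは",
    "chinese": "你好世界",
}

for name, text in samples.items():
    sizes = encoded_sizes(text)
    print(f"{name}: utf-8={sizes['utf-8']} bytes, utf-16={sizes['utf-16']} bytes")
```

ASCII text doubles in size under UTF-16, while common CJK characters take three bytes each in UTF-8 against two in UTF-16.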

Meanwhile, in the other universe, UCS-2 and UTF-16 have been around forever, because in that universe people do things for money and had to take internationalisation seriously. Not just recently, but decades ago. Before some Linux developers were born. In this universe, an ungodly amount of Real Important Code was written by Big Business and Big Government. The type of code that processes trillions of dollars, not the type used to call MySQL unreliably from some Python ML bullshit running in a container or whatever the kids are doing these days.

So, yes. Clearly UTF-16 has to "die" because it's inconvenient for C developers who never figured out how to deal with strings based on more than one encoding.

PS: There are several Unicode compression formats, such as SCSU and BOCU-1, that blow UTF-8 out of the water if used in the right way. If you can support those, then you can support UTF-16. If you can't, then you can't claim that you chose UTF-8 because you care about performance.