Hacker News new | ask | show | jobs
by qalmakka 1797 days ago
This is an unfortunate consequence of the poor choice of keeping UCS-2 alive as UTF-16 for way too long. The plug in 16 bit encodings should have been pulled a long time ago, but some people were and still are so focused on backwards compatibility that they didn't see they were just pushing the issue to another decade. UTF-8 has won, completely. UTF-16 is basically a zombie nobody wants anymore, kept artificially alive by the fear of big 90s frameworks of clean breaks with the past.

We must get rid of legacy encodings no matter the cost, I'm tired of seeing Java and Qt apps wasting millions of CPU cycles mindlessly converting stuff back and forth from UTF-16. It's plain madness, and sometimes you just need the courage to destroy everything and start again.

2 comments

I love reading stuff like this, because it reminds me that there are two entire universes of IT, and both are mostly filled with people blissfully unaware of the other.

UTF-8 is a great hack that works wonderfully on Linux and BSD, because neither actually supported internationalisation properly until recently. They clung to 8-bit ASCII with white knuckles until they could bear it no longer, but then UTF-8 came to the rescue and there was much rejoicing. "It's the inevitable future!" cried millions of Linux devs... in English. I mention this because UTF-8 is a bit... shit... if you're from Asia.

Meanwhile, in the other universe, UCS-2 or UTF-16 have been around for forever because in that Universe people do things for money and had to take internationalisation seriously. Not just recently, but decades ago. Before some Linux developers were born. In this Universe, an ungodly amount of Real Important Code was written by Big Business and Big Government. The type of code that processes trillions of dollars, not the type used to call MySQL unreliably from some Python ML bullshit running in a container or whatever the kids are doing these days.

So, yes. Clearly UTF-16 has to "die" because it's inconvenient for C developers that never figured out how to deal with strings based on more than encoding.

PS: There are several Unicode compression formats that blow UTF-8 out of the water if used in the right way. If you can support those, then you can support UTF-16. If you can't, then you can't claim that you chose UTF-8 because you care about performance.

Are you sure UTF-8 is the ideal format? After all, we have grapheme clusters that cannot be rendered as text units using UTF-8. Maybe UTF-8 is already obsolete and never took over the world? I am more than sure that soon we will see the new Unicode format ;)