HN Mirror


by jrochkind1 2392 days ago

Unicode is pretty amazing.

People REALLY like to complain about Unicode, but where it's complicated, it's because the _problem space_ is complicated. Which it is. People are actually complaining that handling global text is complicated at all: they wish humans had been a lot simpler and more limited in inventing alphabets and in how they used them in typesetting and printing and whatnot, and that legacy digital text encodings had developed differently than they did. They're not actually complaining about Unicode, which had to play the hand of cards it was dealt.

That Unicode worked out as nice a solution as it is to storing global text is pretty amazing. There were some really smart and competent people working on it. When you dig into the details, you will be continually amazed at how nice it is, and how well-documented.

One real testament to this is how _well adopted_ Unicode is. There is no actual guarantee that just because you make a standard anyone will use it. Nobody forced anyone to move from whatever they did to Unicode. (And in fact most internet standards, for example, don't force Unicode and are technically agnostic as to text encoding.) That it has become so universal is because it was so well-designed: it solved real problems, with a feasible migration path for developers that had a cost justified by its benefits. (When people complain about aspects of UTF-8 required by its multifaceted compatibility with ASCII, they are missing that this is what led to Unicode actually winning.)
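To make the ASCII-compatibility point concrete, here is a minimal Python sketch. The sample strings and the helper name `non_ascii_bytes` are made up for illustration. It shows two of the properties usually meant here: pure ASCII text is byte-for-byte identical in UTF-8, and every byte of a multi-byte UTF-8 sequence has its high bit set, so it can't be mistaken for an ASCII character.

```python
# Sample strings are arbitrary; any ASCII / non-ASCII text would do.
ascii_text = "Hello, world"
mixed_text = "naïve café"

# Pure ASCII text produces the same bytes under ASCII and UTF-8,
# so existing ASCII data is already valid UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")


def non_ascii_bytes(data):
    """Return the bytes that have the high bit set (values >= 0x80)."""
    return [b for b in data if b >= 0x80]


encoded = mixed_text.encode("utf-8")
high = non_ascii_bytes(encoded)

# Bytes below 0x80 in UTF-8 output are always real ASCII characters,
# never a fragment of a longer sequence.
low_as_text = bytes(b for b in encoded if b < 0x80).decode("ascii")

print(same_bytes)               # ASCII and UTF-8 agree on ASCII text
print([hex(b) for b in high])   # lead and continuation bytes for ï and é
print(low_as_text)              # the ASCII characters, untouched
```

That second property is why byte-oriented code that only looks for ASCII delimiters (slashes, quotes, newlines) tends to keep working on UTF-8 input.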

The OP, despite the title, doesn't actually serve as a great argument/explanation for how Unicode is awesome. But I'd read some of the Unicode "annex" docs -- they are also great docs!

4 comments

cryptonector 2392 days ago

If we could go back in time to Unicode's beginning and start over but with all that we know today... Unicode would still look a lot like what it looks like today, except that:

  - UTF-8 would have been specified first
  - we'd not have had UCS-2, nor UTF-16
  - we'd have more than 21 bits of codespace
  - CJK unification would not have been attempted
  - we might or might not have pre-composed codepoints[0]
  - a few character-specific mistakes would have gone unmade

which is to say, again, that Unicode would mostly come out the same as it is today.
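For anyone wondering how the UTF-16 and "21 bits" items in that list connect: the ceiling of U+10FFFF is exactly what UTF-16 surrogate pairs can address. A rough Python sketch of the arithmetic follows; the constant and function names are made up for illustration.

```python
# Surrogate ranges reserved in the Basic Multilingual Plane (BMP).
HIGH_START = 0xD800      # high (lead) surrogates: 0xD800..0xDBFF
LOW_START = 0xDC00       # low (trail) surrogates: 0xDC00..0xDFFF
SURROGATES_PER_HALF = 0x400  # 1024 values in each range

# 1024 * 1024 pairs cover the supplementary planes; add the BMP itself.
codespace_size = 0x10000 + SURROGATES_PER_HALF ** 2
max_code_point = codespace_size - 1


def to_surrogate_pair(cp):
    """Split a supplementary code point into UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


pair = to_surrogate_pair(0x1F600)  # U+1F600 GRINNING FACE

print(hex(max_code_point), max_code_point.bit_length())
print([hex(unit) for unit in pair])
```

Python itself enforces the same ceiling: `chr(0x10FFFF)` works, while `chr(0x110000)` raises `ValueError`.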

Everything to do with normalization, graphemes, and all the things that make Unicode complex and painful would still have to be there because they are necessary and not mistakes. Unicode's complexity derives from the complexity of human scripts.
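A small illustration of why normalization and graphemes can't be wished away, using Python's standard `unicodedata` module. The helper `count_simple_clusters` is a made-up, deliberately crude approximation (it only skips combining marks); real grapheme segmentation follows the rules in UAX #29 and needs more than this.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same text to a reader, different code point sequences to a computer.
equal_raw = precomposed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
equal_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

# Some sequences have no precomposed form at all, so even NFC text
# can contain one visible character spread over several code points.
n_diaeresis = unicodedata.normalize("NFC", "n\u0308")


def count_simple_clusters(text):
    """Crude stand-in for grapheme counting: ignore combining marks.

    Assumption for illustration only; it does not implement UAX #29.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(equal_raw, equal_nfc, equal_nfd)
print(len(n_diaeresis), count_simple_clusters(n_diaeresis))
```

So comparing strings needs normalization, and "how many characters is this" needs grapheme rules, no matter how the encoding had been designed.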

[0] Going back further to create Unicode before there was a crushing need for it would be impossible -- try convincing computer scientists in the 60s to use Unicode... Or IBM in the 30s. For this reason, pre-composed codepoints would still have proven to be very useful, so we'd probably still have them if we started over, and we'd still end up with NFC/NFKC being closed to new additions, which would leave NFD as the better NF just as it is today.