| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by carapace 3431 days ago

Unicode is a conflation of two ideas, one good and the other impossible.

The good idea is to have a standard mapping from numbers to little pictures (glyphs, symbols, kanji, ideograms, cuneiform pokings in dried clay, scratches on a rock, whatever.) This is really all ASCII was.

The impossible idea is to encode human languages into bits. This can't be done and will only continue to cause heartache in those who try.

ASCII had English letters but wasn't an encoding for English, although you can and everyone did and does use it for that.

1 comments

Manishearth 3430 days ago

I hate this argument every time I see it because it's invariably used in the wrong place.

Yes, the goal of encoding all human languages into bits is one that's near impossible. Unicode tries, and has broken half-solutions in many places. Lots of heartache everywhere.

This is completely irrelevant to the discussion here. The issue of code points not always mapping to graphemes is only an issue because programmers ignore it. It's a completely solved problem, theoretically speaking. It's necessary to be able to handle many scripts, but it's not something that "breaks" unicode.

link

carapace 3429 days ago

> It's a completely solved problem, theoretically speaking.

lol.

Unicode was ambitious for its time, but naive. Today we know better. It "jumped the shark" when the pizza slice showed up and has only been getting stupider since. Eventually it will go the way of XML (yes, I know XML hasn't gone anywhere, shut up) and we will be using some JSON hottness (forgive the labored metaphor please!) that probably consist of a wad of per-language standards and ML/AI/NLP stuff, etc.. blah blah hand-wave.)

Unicode just sucks.

link

Manishearth 3429 days ago

Again.

Yes, "it jumped the shark when the pizza slice showed up". However, that doesn't imply that it did everything wrong. The notion of multi-codepoint characters is necessary to handle other languages. that is a solved problem, it's just that programmers mess up when dealing with it. Emoji may be a mistake, but the underlying "problems" caused by emoji existed anyway, and they're not really problems, just programmers being stupid.

We had multiple per-language encodings. It sucked.

link

carapace 3427 days ago

I don't agree that the notion of multi-codepoint characters is necessary, I don't think it was a good idea at all. I submit [1] as evidence.

Whatever this mess is, it's a whole thing that isn't a byte-stream and it isn't "characters" and it isn't human language. Burn it with fire and let's do something else.

[1] http://stackoverflow.com/documentation/unicode/6485/characte...

(In reality I am slightly less hard-core, I see some value in Unicode. And I really like Z̨͖̱̟̺̈̒̌̿̔̐̚̕͟͡a̵̭͕͔̬̞̞͚̘͗̀̋̉̋̈̓̏͟͞l̸̛̬̝͎̖̏̊̈́̆̂̓̀̚͢͡ǵ̝̠̰̰̙̘̰̪̏̋̓̉͝o̲̺̹̮̞̓̄̈́͂͑͡ T̜̤͖̖̣̽̓͋̑̕͢͢e̻̝͎̳̖͓̤̎̂͊̀͋̓̽̕͞x̴̛̝͎͔̜͇̾̅͊́̔̀̕t̸̺̥̯͇̯̄͂͆̌̀͞ it is an obvious win.). Even when it doesn't quite work... (I think I'm back to "fuck Unicode" now.)

link

Manishearth 3427 days ago

I submit Hangul and all the Indic scripts as counterevidence. Not all scripts can easily avoid multicodepoint chars.

link