Hacker News
by jcranmer 2394 days ago
If you think Unicode is a "bloated monstrosity and a source of endless nightmare," what would you remove from Unicode?

And if you're going to respond "emoji", I'll point out that removing emoji doesn't actually remove anything that makes text processing with Unicode difficult, just makes it more likely that people will assume that what works for English works for everybody.

(Side note: it is not possible to accurately represent modern English text solely with ASCII, as English does contain several words with accented characters, such as façade and résumé).

6 comments

How about removing variation selectors? For example it's possible to turn an emoji back into text by appending a code point!

They are very painful to implement and most don't get it right.

See https://twitter.com/ridiculous_fish/status/10894210337932369...
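For anyone who hasn't run into these, here is a minimal sketch of what "appending a code point" looks like in practice. The choice of U+2764 as the base character is just an illustration; whether the two strings actually render differently depends on the font and renderer, which is part of why this is hard to get right.

```python
import unicodedata

TEXT_VS = "\uFE0E"   # VARIATION SELECTOR-15: asks for text presentation
EMOJI_VS = "\uFE0F"  # VARIATION SELECTOR-16: asks for emoji presentation

heart = "\u2764"     # HEAVY BLACK HEART, chosen only as an example base

as_text = heart + TEXT_VS
as_emoji = heart + EMOJI_VS

# Both strings are two code points long and share the same base character;
# only the trailing selector differs.
for label, s in [("text", as_text), ("emoji", as_emoji)]:
    print(label, len(s), [unicodedata.name(ch) for ch in s])
```

Any code that measures, truncates, or compares such strings has to treat the selector as part of the preceding character.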

Unicode doesn't solve the underlying complexity of human languages, as you noted. I think the main contribution of the Unicode Consortium is that they brought all the nitty-gritty problems of human languages into one central repository and made them visible to everyone. That itself is an awesome effort, and I have personally benefited a lot from it (my native language is Japanese). But that doesn't make Unicode as a standard "awesome". Maybe we should be thankful for how messy it is? That's more or less a view that I can agree with.

‘Remove’ is too strong, since Unicode is entrenched. But there are things that should have been done differently. For instance, combining characters and operators should have been placed before the base character rather than after, so that (a) it would be possible to know when you've reached the end of a character^W glyph^W grapheme cluster without reading ahead, and (b) dead keys would be identical to the corresponding characters.
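A rough sketch of the read-ahead problem in point (a), assuming a deliberately simplified notion of "cluster" (a base character plus any following combining marks). The `clusters` helper is hypothetical, and real grapheme segmentation has many more rules than this.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified illustration only: full grapheme cluster segmentation
    handles far more cases than combining marks.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            # A combining mark belongs to the cluster already started,
            # so that cluster was not finished when its base was read.
            out[-1] += ch
        else:
            out.append(ch)
    return out

decomposed = "re\u0301sume\u0301"  # "résumé" with combining acute accents
print(clusters(decomposed))
```

Because the marks come after the base, the loop can only tell that a cluster is complete once it has looked at the next code point (or hit the end of the input).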

> façade and résumé

ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦ é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08 2C.

> ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦ é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08 2C.

Doesn't work for ñ, since the ASCII ~ is often typeset in the middle of the box instead of in a position to appear above an 'n' character. And " is a pretty poor substitute for ◌̈, especially when you're trying to write ï as in naïve. And then there's the æ of archæology, which doesn't work with overwriting.

I'll also point out that ç is U+00E7 in Unicode and C3 A7 in UTF-8, not 63 CC A7, since it's a precomposed character (and NFC form is usually understood to be the preferred way to normalize Unicode unless there's a reason to do something else).
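The byte sequences being argued about are easy to check. A small sketch using only the standard library; `hex_bytes` is a helper made up for this example.

```python
import unicodedata

def hex_bytes(data):
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

precomposed = "\u00e7"   # ç as a single code point
decomposed = "c\u0327"   # c followed by COMBINING CEDILLA
overstrike = "c\b,"      # c, BACKSPACE, comma (the ASCII overstrike form)

print(hex_bytes(precomposed.encode("utf-8")))  # c3 a7
print(hex_bytes(decomposed.encode("utf-8")))   # 63 cc a7
print(hex_bytes(overstrike.encode("ascii")))   # 63 08 2c

# NFC composes the pair into the single code point; NFD splits it again.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

So both sides are describing real encodings of the same letter: one is the NFC form, the other the NFD form.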

Tilde exists in ASCII because of its use as an accent. (In 1967 the non-diacritic interpretation was an overline.) The use in programming languages, and lowering to fit other mathematical operators, came later.

There was never any requirement that ‘n BS ~’ have the same appearance as ‘n’ overprinted with ‘~’, although terminals capable of making the distinction didn't appear until the 70s.

Precomposed characters aren't relevant to illustrating composition mechanisms.

If you extend ASCII to CP1252, which was the most common encoding before UTF-8 took over, then you do get those accented characters (and that's likely responsible for the popularity of '1252).

In fact, the first 256 characters of Unicode are almost identical to CP1252. I'm pretty sure that's not a coincidence.

> the first 256 characters of Unicode are almost identical to CP1252. I'm pretty sure that's not a coincidence.

That depends on whether you consider the fact that Windows CP 1252 is almost identical to Latin-1 (ISO-8859-1), which is exactly the first 256 characters of Unicode, to be a coincidence.

> This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.

* https://en.wikipedia.org/wiki/Windows-1252
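This is straightforward to verify with Python's built-in codecs. A quick sketch; note that Python's `cp1252` codec leaves a handful of bytes undefined, which the code records as `None` rather than guessing a mapping.

```python
# Latin-1 maps byte N to code point N for all 256 values.
latin1 = bytes(range(256)).decode("latin-1")
print(all(ord(ch) == i for i, ch in enumerate(latin1)))  # True

# Collect every byte where CP1252 disagrees with Latin-1.
differences = {}
for b in range(256):
    raw = bytes([b])
    try:
        cp = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp = None  # undefined in Python's cp1252 codec
    if cp != raw.decode("latin-1"):
        differences[b] = cp

print(f"{min(differences):#x}..{max(differences):#x}", len(differences))

# The accented letters from the thread sit outside that range,
# so both encodings produce the same bytes for them.
sample = "façade résumé"
print(sample.encode("cp1252") == sample.encode("latin-1"))  # True
```

The only disagreements are in the 80 to 9F range, which matches the Wikipedia description quoted above.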

Most resumes I’ve seen don’t even bother with the accents.

Most are written in Word on Windows, and I’d guess that most people don’t even know how to access the accented characters.

> such as façade and résumé

That's simple: just url encode.

Compare:

www.façebook.com

to

www.fa%C3%A7ebook.com

The second one is way easier to comprehend than the first.

You mean www.xn--faebook-vxa.com of course :P
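For the record, both spellings can be reproduced with the Python standard library. A minimal sketch; the built-in `idna` codec implements the older IDNA 2003 rules, so treat it as an illustration rather than what a current browser does.

```python
from urllib.parse import quote

host = "www.façebook.com"

# Percent-encoding works on the UTF-8 bytes of each non-ASCII character.
percent_encoded = quote(host)
print(percent_encoded)  # www.fa%C3%A7ebook.com

# IDNA converts each dot-separated label to an ASCII-compatible form
# using Punycode, prefixed with "xn--".
ace = host.encode("idna").decode("ascii")
print(ace)  # www.xn--faebook-vxa.com

# The raw Punycode of the single label, without the "xn--" prefix.
label = "façebook".encode("punycode").decode("ascii")
print(label)  # faebook-vxa
```

Percent-encoding is what you would see in a URL path or query string; host names go through the `xn--` form instead.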