Hacker News
by euske 2396 days ago
I mean, awesome for whom? It might be awesome for end users as you can type in or copy/paste things without caring which language you're using. But for programmers, Unicode is a bloated monstrosity and a source of endless nightmare. Eventually, it's not going to be awesome for end users either because it will be plagued by a lot of (subtle) inconsistencies. Unicode looks a lot like a leaky abstraction to me (because of poor foresight), and it's getting worse each year.
5 comments

If you think Unicode is a "bloated monstrosity and a source of endless nightmare," what would you remove from Unicode?

And if you're going to respond "emoji", I'll point out that removing emoji doesn't actually remove anything that makes text processing with Unicode difficult, just makes it more likely that people will assume that what works for English works for everybody.

(Side note: it is not possible to accurately represent modern English text solely with ASCII, as English does contain several words with accented characters, such as façade and résumé).

How about removing variation selectors? For example it's possible to turn an emoji back into text by appending a code point!

They are very painful to implement and most don't get it right.

See https://twitter.com/ridiculous_fish/status/10894210337932369...
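
For anyone who hasn't run into variation selectors, here is a minimal Python sketch of what "appending a code point" means. The heart character is just my choice of example, and how each sequence actually renders depends on the font and platform.

```python
import unicodedata

HEART = "\u2764"        # HEAVY BLACK HEART
TEXT_STYLE = "\ufe0e"   # VARIATION SELECTOR-15: request text presentation
EMOJI_STYLE = "\ufe0f"  # VARIATION SELECTOR-16: request emoji presentation

as_text = HEART + TEXT_STYLE
as_emoji = HEART + EMOJI_STYLE

# Both strings are two code points long and differ only in the selector.
for label, s in (("text", as_text), ("emoji", as_emoji)):
    print(label, len(s), [unicodedata.name(ch) for ch in s])
```

The two strings compare unequal, so anything that matches or measures text has to decide what to do with the selector.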

Unicode doesn't solve the underlying complexity of human languages, as you noted. I think the main contribution of Unicode Consortium is that they brought all the nitty-gritty problems of human languages into one central repository and made them visible to everyone. That itself is an awesome effort, and I personally had a lot of benefit from it (my native language is Japanese). But that doesn't make Unicode as a standard "awesome". Maybe we should be thankful for how messy it is? That's more or less a view that I can agree with.
‘Remove’ is too strong, since Unicode is entrenched. But there are things that should have been done differently. For instance, combining characters and operators should have been placed before the base character rather than after, so that (a) it would be possible to know when you've reached the end of a character^W glyph^W grapheme cluster without reading ahead, and (b) dead keys would be identical to the corresponding characters.
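
To illustrate point (a), here is a deliberately simplified Python sketch. The function name naive_clusters is made up, and it only handles "base + combining marks", not the full segmentation rules of UAX #29. It shows that a cluster can only be emitted once the next base character (or end of input) has been seen.

```python
import unicodedata

def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified on purpose: real grapheme segmentation has many more rules.
    """
    cluster = ""
    for ch in text:
        # The current cluster is known to be complete only when a
        # non-combining code point arrives, i.e. after reading ahead.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster  # end of input closes the last cluster

print(list(naive_clusters("e\u0301a")))
```

With marks placed before the base character, the base itself would end the cluster and no lookahead would be needed.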

> façade and résumé

ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦ é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08 2C.

> ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦ é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08 2C.

Doesn't work for ñ, since the ASCII ~ is often typeset in the middle of the box instead of in a position to appear above an 'n' character. " is a pretty poor substitute for ◌̈ though, especially when you're trying to write ï as in naïve. And then there's the æ of archæology, which doesn't work with overwriting.

I'll also point out that ç is U+00E7 in Unicode and C3 A7 in UTF-8, not 63 CC A7, since it's a precomposed character (and NFC form is usually understood to be the preferred way to normalize Unicode unless there's a reason to do something else).
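
The two byte sequences being discussed can be checked with a few lines of Python using only the standard library (the helper name hex_bytes is mine):

```python
import unicodedata

precomposed = "\u00e7"   # ç as a single code point
decomposed = "c\u0327"   # c followed by COMBINING CEDILLA

def hex_bytes(s):
    """Return the UTF-8 encoding of s as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))

print(hex_bytes(precomposed))  # c3 a7
print(hex_bytes(decomposed))   # 63 cc a7

# NFC composes, NFD decomposes; both forms denote the same text.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```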

Tilde exists in ASCII because of its use as an accent. (In 1967 the non-diacritic interpretation was an overline.) The use in programming languages, and lowering to fit other mathematical operators, came later.

There was never any requirement that ‘n BS ~’ have the same appearance as ‘n’ overprinted with ‘~’, although terminals capable of making the distinction didn't appear until the 70s.

Precomposed characters aren't relevant to illustrating composition mechanisms.

If you extend ASCII to CP1252, which was the most common encoding before UTF-8 took over, then you do get those accented characters (and that's likely responsible for the popularity of CP1252).

In fact, the first 256 characters of Unicode are almost identical to CP1252. I'm pretty sure that's not a coincidence.

> the first 256 characters of Unicode are almost identical to CP1252. I'm pretty sure that's not a coincidence.

That depends on whether you consider the fact that Windows CP 1252 is almost identical to Latin-1 (ISO-8859-1), which is exactly the first 256 characters of Unicode, to be a coincidence.

> This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.

* https://en.wikipedia.org/wiki/Windows-1252
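
A quick way to see this for yourself, as a sketch that relies on Python's built-in "latin-1" and "cp1252" codecs (note that Python's cp1252 codec treats five bytes in that range as undefined, hence errors="replace"):

```python
# ISO-8859-1 maps every byte to the code point with the same number.
latin1_text = bytes(range(256)).decode("latin-1")
same_as_code_points = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# Collect the bytes on which the two encodings disagree.
differing = []
for b in range(256):
    raw = bytes([b])
    as_latin1 = raw.decode("latin-1")
    # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Python's cp1252
    # codec; errors="replace" turns them into U+FFFD instead of raising.
    as_cp1252 = raw.decode("cp1252", errors="replace")
    if as_latin1 != as_cp1252:
        differing.append(b)

print(same_as_code_points)
print(f"{differing[0]:#04x}..{differing[-1]:#04x}", len(differing))
```

The differences are confined to 0x80 through 0x9F, where Latin-1 has C1 control characters and CP1252 has printable ones such as the euro sign.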

Most resumes I’ve seen don’t even bother with the accents.

Most are written in Word on Windows, and I’d guess that most people don’t even know how to access the accented characters.

> such as façade and résumé

That's simple: just url encode.

Compare:

www.façebook.com

to

www.fa%C3%A7ebook.com

The second one is way easier to comprehend than the first.

You mean www.xn--faebook-vxa.com of course :P
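
Joking aside, both encodings are one-liners in Python. This sketch uses urllib.parse.quote for percent-encoding and the built-in "idna" codec, which as far as I know implements the older IDNA 2003 rules:

```python
from urllib.parse import quote

host = "www.façebook.com"

# Percent-encoding operates on the UTF-8 bytes of the string.
percent_encoded = quote(host)

# IDNA encodes each non-ASCII label with Punycode and an "xn--" prefix.
idna_encoded = host.encode("idna").decode("ascii")

print(percent_encoded)
print(idna_encoded)
```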
> but for programmers

I think that depends on what level of the stack you work at. I'm a programmer, but strictly at an end-user-facing level. I'm not implementing Unicode support, I'm using programming languages that already have Unicode support. And Unicode support is an absolute godsend. It's amazing to not have to think about any of that, and just treat it as a solved problem.

25 or 50 examples of inconsistencies would help support your tone.
Unicode was originally designed to fit in 16 bits, and this is memorialized in Java APIs that make it easy to mess up.
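
To make the 16-bit problem concrete, here is a small sketch in Python rather than Java (utf16_units is a made-up helper). It counts UTF-16 code units, which is what a Java char holds:

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

bmp = "\u00e9"          # é, inside the Basic Multilingual Plane
astral = "\U0001F600"   # an emoji outside the BMP

print(utf16_units(bmp))     # one code unit
print(utf16_units(astral))  # two code units: a surrogate pair

# The pair itself, big-endian for readability: high then low surrogate.
print(astral.encode("utf-16-be").hex())
```

Code that assumes one 16-bit unit per character will split that pair in half when it indexes, truncates, or reverses a string.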

The Unicode character does not specify the glyph to draw. Han unification is the best-known, but not the only, source of this challenge.

The glyph does not specify the Unicode character. Precombined vs combining characters is a source of this challenge. The result is that a name can be entered into a database and then be unfindable by a search.
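
A minimal sketch of the failure and the usual mitigation, in Python. The name and the helper same_text are illustrative, and normalizing both sides to NFC is one common choice, not the only one:

```python
import unicodedata

stored = "Jose\u0301"  # entered with a combining acute accent
query = "Jos\u00e9"    # searched with precomposed é

# The two strings look identical on screen but compare unequal.
print(stored == query)

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(stored, query))
```

This only helps if every write path and every query path normalizes the same way.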

This feature has also been a source of security holes. See https://appcheck-ng.com/unicode-normalization-vulnerabilitie... for an explanation of how.

You would think that you could avoid this through banning control and combining characters and not lose anything. Indeed at one point the authors of Go (who included the inventors of UTF-8) thought this. But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters.

There are also lots and lots of invisible characters. This has been used to "fingerprint" text. (Each person gets a different invisible signature. The forwarded email includes the signature.) That's an interesting feature but complicates matching text documents even more.

Need I go on? When I see Unicode, I know that there lie dragons that programmers don't necessarily expect.

One of your points is that an encoding designed to handle languages has support for more than one kind of white space. Given that languages use more than one kind of white space, this is sort of a necessity.

Another one is that a standard designed to support all languages has a feature necessary for supporting some languages.

Those aren't inconsistencies, so do feel free to go on.

> One of your points is that an encoding designed to handle languages has support for more than one kind of white space.

No. It is that there is more than one kind of invisible character. No language has invisible characters.

> Another one is that a standard designed to support all languages has a feature necessary for supporting some languages.

Not sure what point you are misreading here. But that was not among my points.

You said "But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters."

I suppose I didn't consider that they could be written without combining characters given a different design.

As far as invisible characters, I'm not interested in arguing about it. English, as written, has all sorts of different structural uses of white space, it isn't all just style.

> I suppose I didn't consider that they could be written without combining characters given a different design.

They could be.

Likewise European languages can be written without precombined characters. The fact that é can be written in multiple ways was my point.

> As far as invisible characters, I'm not interested in arguing about it. English, as written, has all sorts of different structural uses of white space, it isn't all just style.

You still don't understand. I am not talking about whitespace. I am talking about invisible zero-width characters that can be slipped into text with no sign that they are there. Characters like U+180E, U+200B, U+200C, U+200D, and U+FEFF. Not to mention that you can achieve the same thing with control characters like U+200F U+200E. (The undetectability of the last one is language dependent.)

As I said, this can be used to invisibly sign a document. But I don't see any other particular point to having so many ways to accomplish what looks like nothing.
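
For what it's worth, finding these is not hard once you know to look. A rough Python sketch follows; using the Unicode "Cf" (format) general category as a proxy for "invisible" is my assumption, and it will also flag things like the soft hyphen:

```python
import unicodedata

def invisible_code_points(text):
    """Return (index, 'U+XXXX') for every format-category code point."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# A string carrying a zero-width space and a zero-width joiner.
marked = "pass\u200bword\u200d"
print(invisible_code_points(marked))

# Stripping them recovers the visible text.
cleaned = "".join(ch for ch in marked if unicodedata.category(ch) != "Cf")
print(cleaned)
```

Blindly stripping them is not safe in general, since some scripts and emoji sequences depend on the joiners.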

Without arguing the details, I have to agree with your statement because the article never really supported its claim of being awesome.
I think Unicode is terrible. Remove everything. Use ASCII and other character sets.

Unicode is OK for searching for data using many different languages (if you omit much of the junk such as emoji and compatibility characters), although it might not be the best at that either.

You can't effectively use one character set well for everything; different applications have different requirements. Unicode is equally bad for everything, whereas ASCII, for example, is good for some things and unusable for others, and the same goes for other character sets. Many things you just can't do accurately with Unicode.

> You can't effectively use one character set well for everything; different applications have different requirements.

In our application, our users get data from systems around the world, and might have to change some of it before sending a file with the data to some official system. The data includes names of people and places. How would you do this using character sets?

One file might need to contain names with Cyrillic characters and with Norwegian characters. There's no character set with both. Should each string in the file have an attribute saying which character set the string is encoded in? What are the odds that people implementing that won't mess that up when oh so many can't even get a single encoding attribute right[1]?

Or, just maybe, strings in the file could be Unicode, encoded in say UTF-8, so that the handling of all of them is uniform...

[1]: https://www.w3.org/TR/xml/#charencoding
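
The Cyrillic-plus-Norwegian case is easy to try in Python. The names below are only illustrative, and cp1251 and cp1252 stand in for "a Cyrillic code page" and "a Western European code page":

```python
names = ["Александр Пушкин", "Bjørnstjerne Bjørnson"]

def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# For each encoding, report which of the two names it can represent.
results = {
    enc: [can_encode(name, enc) for name in names]
    for enc in ("cp1251", "cp1252", "utf-8")
}
for enc, flags in results.items():
    print(enc, flags)
```

Each legacy code page handles one of the names and fails on the other; only UTF-8 handles both.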

> Or, just maybe, strings in the file could be Unicode, encoded in say UTF-8, so that the handling of all of them is uniform...

Actually, that won't work. There are cases where a character may be different according to the language, where capitalization may differ depending on the language, where sort order may depend on the language, etc.

If your application is allowing users to edit the text, or if you know which languages will be used, or if you don't care about capitalization, then you don't have to worry about any of those edge cases, and Unicode is useful.
Unicode solves all that. It has case folding rules to handle capitalization differences. It has collation rules to handle sorting differences.
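
A small Python sketch of what the default rules give you. Note the hedge: str.casefold() applies the language-independent folding only, and language-specific tailorings (Turkish dotless i, locale-aware collation) need a library such as ICU, which is not shown here.

```python
# Default case folding makes caseless comparison work across ß and SS.
folded_equal = "Straße".casefold() == "STRASSE".casefold()
lowered_equal = "Straße".lower() == "STRASSE".lower()
print(folded_equal, lowered_equal)

# Lowercasing is not tailored by language: "I" always becomes "i",
# which is not what Turkish orthography expects.
print("I".lower())

# Plain sorted() orders by code point, not by any language's alphabet,
# so "Äpfel" lands after "zebra".
words = ["zebra", "Äpfel", "apple"]
print(sorted(words))
```

So the rules exist in the standard, but you still have to tell the software which language's tailoring to apply.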
Well if you write an application for a 'non-technical' international audience, you'll have to support international text output. And representing text as one of the universal Unicode encodings is still much better than the codepage mess and region-specific multi-byte encodings like Shift-JIS we had before.

UTF-8 is usually the best choice both for simple tools and 'user-facing applications' since it is backward-compatible with 7-bit ASCII (e.g. usually you don't need to change a thing in your code, at least if you just pass strings around).

If you encounter a byte in a UTF-8 encoded string which has the topmost bit cleared, it's an ASCII character and definitely not part of a multi-byte sequence. If the topmost bit is set, the byte is part of a multi-byte sequence, and such sequences must remain intact.
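
That rule can be sketched in a few lines of Python. The function name classify is made up, and it does not validate anything: bytes that can never appear in well-formed UTF-8 would still be labeled "lead" here.

```python
def classify(byte):
    """Classify one byte of UTF-8 by its top bits (no validation)."""
    if (byte & 0x80) == 0:
        return "ascii"         # 0xxxxxxx: a complete ASCII character
    if (byte & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: inside a multi-byte sequence
    return "lead"              # 11xxxxxx: starts a multi-byte sequence

data = "naïve".encode("utf-8")
kinds = [classify(b) for b in data]
print(list(zip([f"{b:02x}" for b in data], kinds)))
```

Code that only splits or truncates at "ascii" or "lead" bytes will never cut a sequence in half.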

UTF-8 isn't such a bad encoding (although it isn't ideal for fixed-pitch text; I invented a character set and encoding which would be better for fixed-pitch text). But I was not talking about the encoding; I was talking about the Unicode character set.
> UTF-8 isn't such a bad encoding (although it isn't ideal for fixed-pitch text; I invented a character set and encoding which would be better for fixed-pitch text).

This is utterly incoherent.

Can anyone explain how the statement I responded to makes sense?

I must be wrong, getting so many disagreements.

Well, UTF-8 is an encoding of Unicode, which allows for combining characters and all that jazz, which can be a bad fit for fixed-pitch text.

For example, take a Zalgo text generator[1] and try to make the result make sense in a fixed-pitch (monospace) setting.

At least that's my interpretation of what he tried to convey.

[1]: http://eeemo.net/

> Use ASCII and other character sets.
We have tried that before. It did not work, and it was not pretty. You may not know, but there is a huge demand to be able to use characters from different sets in the same document. How do you do Wikipedia without Unicode? (E.g. this: Alexander Sergeyevich Pushkin (English: /ˈpʊʃkɪn/;[1] Russian: Александр Сергеевич Пушкин[note 1]).)

How would you implement any chat/messaging app for an international audience? In my current company, for example, I am sure at least five languages, each with its own alphabet, are used to communicate in Slack.

For me, an app that doesn't support Unicode is broken.

Wikipedia didn't use Unicode originally; the en, da, sv, and nl language Wikipedias all used windows-1252. This all changed somewhere around 2004, I think, but there is still legacy code to deal with edits from before the switchover point.

I imagine the answer is that it kind of sucked, but people made do the best they could with the limited set of allowed characters. It's not like IPA notation is a critical feature.

> You can't effectively use one character set well for everything; different applications have different requirements

So how about an application like Twitter, which has the requirement "has to support all globally currently written languages, often right next to each other", what character set aside from Unicode is appropriate?

And for what application in 2019 is Unicode inappropriate and why?