Hacker News
by zzo38computer 2394 days ago
I think Unicode is terrible. Remove everything. Use ASCII and other character sets.

Unicode is OK for searching for data using many different languages (if you omit much of the junk such as emoji and compatibility characters), although it might not be the best for that either.

You can't effectively use one character set well for everything; different applications have different requirements. Unicode is equally bad for everything, whereas e.g. ASCII is good for some stuff and not usable for other stuff, and the same goes for other character sets. Many things you just can't do accurately with Unicode.

4 comments

> You can't effectively use one character set well for everything; different applications have different requirements.

In our application, our users get data from systems around the world, and might have to change some of it before sending a file with the data to some official system. The data includes names of people and places. How would you do this using character sets?

One file might need to contain names with Cyrillic characters and with Norwegian characters. There's no character set with both. Should each string in the file have an attribute saying which character set the string is encoded in? What are the odds that people implementing that won't mess that up when oh so many can't even get a single encoding attribute right[1]?

Or, just maybe, strings in the file could be Unicode, encoded in say UTF-8, so that the handling of all of them is uniform...

[1]: https://www.w3.org/TR/xml/#charencoding
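
To make the mixed-names problem concrete, here is a minimal Python sketch. The helper `encodable` and the sample names are made up for illustration; the codec names are the ones Python ships with. It only checks which encodings can represent each name at all.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if `text` can be encoded with `codec` without loss."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample data: one Cyrillic name, one Norwegian place name.
names = ["Александр", "Tromsø"]

for codec in ("iso8859_5", "latin_1", "utf_8"):
    # One True/False per name, in the same order as `names`.
    print(codec, [encodable(name, codec) for name in names])
```

Under these assumptions, ISO-8859-5 handles only the Cyrillic name, Latin-1 handles only the Norwegian one, and UTF-8 handles both, so a single-encoding file needs no per-string attribute.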

> Or, just maybe, strings in the file could be Unicode, encoded in say UTF-8, so that the handling of all of them is uniform...

Actually, that won't work. There are cases where a character may be different according to the language, where capitalization may differ depending on the language, where sort order may depend on the language, etc.
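
A small Python sketch of the capitalization case, using Turkish dotted and dotless i as the example. `lower_turkish` is a hypothetical helper written for this comment, not a standard-library function; Python's built-in `str.lower()` applies the language-neutral mapping only.

```python
def lower_turkish(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish rules for I and İ."""
    # Map the two Turkish-specific capitals first, then lowercase the rest.
    # İ (U+0130) lowercases to i, and I lowercases to dotless ı (U+0131).
    return text.replace("İ", "i").replace("I", "ı").lower()


word = "DİYARBAKIR"

# Language-neutral lowercasing maps I to i, which is wrong for Turkish text.
print("I".lower())

# The Turkish-aware helper keeps the dotted/dotless distinction.
print(lower_turkish(word))
```

The same code points need different treatment depending on the language of the text, so the application has to know (or be told) which language it is handling.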

If your application is allowing users to edit the text, or if you know which languages will be used, or if you don't care about capitalization, then you don't have to worry about any of those edge cases, and Unicode is useful.

Unicode solves all that. It has case folding rules to handle capitalization differences. It has collation rules to handle sorting differences.

Well, if you write an application for a 'non-technical' international audience, you'll have to support international text output. And representing text in one of the universal Unicode encodings is still much better than the codepage mess and region-specific multi-byte encodings like Shift-JIS we had before.

UTF-8 is usually the best choice both for simple tools and 'user-facing applications' since it is backward-compatible with 7-bit ASCII (e.g. usually you don't need to change a thing in your code, at least if you just pass strings around).

If you encounter a byte in a UTF-8-encoded string which has the topmost bit cleared, it's an ASCII character and definitely not part of a multi-byte sequence. If the topmost bit is set, the byte is part of a multi-byte sequence, and such sequences must remain intact.
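
A rough Python sketch of that rule. The function name `classify` and the sample string are made up for illustration, and the code assumes its input is already valid UTF-8.

```python
def classify(data: bytes) -> list[str]:
    """Label each byte of (assumed valid) UTF-8 data by its role."""
    kinds = []
    for b in data:
        if b < 0x80:
            kinds.append("ascii")          # top bit clear: a complete ASCII character
        elif (b & 0xC0) == 0x80:
            kinds.append("continuation")   # 10xxxxxx: inside a multi-byte sequence
        else:
            kinds.append("lead")           # 11xxxxxx: starts a multi-byte sequence
    return kinds


encoded = "a,ø,я".encode("utf-8")
print(classify(encoded))

# Splitting on an ASCII delimiter never cuts a multi-byte sequence in half,
# because the delimiter byte cannot occur inside one.
fields = [part.decode("utf-8") for part in encoded.split(b",")]
print(fields)
```

This is why code that only passes strings around, or splits them on ASCII delimiters, usually keeps working unchanged on UTF-8 data.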

UTF-8 isn't such a bad encoding (although it isn't ideal for fixed-pitch text; I invented a character set and encoding which would be better for fixed-pitch text). But I was not talking about the encoding; I was talking about the Unicode character set.

> UTF-8 isn't such a bad encoding (although it isn't ideal for fixed-pitch text; I invented a character set and encoding which would be better for fixed-pitch text).

This is utterly incoherent.

Can anyone explain how the statement I responded to makes sense?

I must be wrong, getting so many disagreements.

Well, UTF-8 is an encoding of Unicode, which allows for combining characters and all that jazz, which can be a bad fit for fixed-pitch text.

For example, take a Zalgo text generator[1] and try to make the result make sense in a fixed-pitch (monospace) setting.

At least that's my interpretation of what he tried to convey.

[1]: http://eeemo.net/
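
A minimal Python sketch of why that is awkward for a monospace grid. Zalgo-style text stacks combining marks on a base letter, so code points, bytes, and screen columns all disagree. `naive_columns` is a hypothetical helper and a deliberate simplification: it ignores wide East Asian characters and any marks whose combining class is 0.

```python
import unicodedata


def naive_columns(text: str) -> int:
    """Rough column count: code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# One base letter followed by three combining marks (acute accent,
# dot below, short stroke overlay), in the style of a Zalgo generator.
zalgo = "e\u0301\u0323\u0335"

print(len(zalgo))                   # number of code points
print(len(zalgo.encode("utf-8")))   # number of UTF-8 bytes
print(naive_columns(zalgo))         # estimated monospace columns
```

With this input there are four code points and seven bytes, but only one column, so neither string length nor byte length tells a fixed-pitch renderer how much space the text takes.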

> Use ASCII and other character sets.

We have tried that before. It did not work, and it was not pretty. You may not know it, but there is a huge demand for using characters from different sets in the same document. How do you do Wikipedia without Unicode? (E.g. this: Alexander Sergeyevich Pushkin (English: /ˈpʊʃkɪn/;[1] Russian: Александр Сергеевич Пушкин[note 1]).)

How would you implement any chat/messaging app for an international audience? In my current company, for example, I am sure at least five languages, each with its own alphabet, are used to communicate in Slack.

For me, an app that doesn't support Unicode is broken.

Wikipedia didn't use Unicode originally; the en, da, sv, and nl Wikipedias all used Windows-1252. This all changed somewhere around 2004, I think, but there is still legacy code to deal with edits from before the switchover point.

I imagine the answer is that it kind of sucked, but people made do as best they could with the limited set of allowed characters. It's not like IPA notation is a critical feature.

> You can't effectively use one character set well for everything; different applications have different requirements

So how about an application like Twitter, which has the requirement "has to support all languages currently written anywhere in the world, often right next to each other"? What character set aside from Unicode is appropriate?

And for what application in 2019 is Unicode inappropriate and why?