by wongarsu 2669 days ago
Mostly scholars. But even if nobody at all were using it currently, the explicit goal of Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again (which has mostly worked so far). That goal can only be reached and maintained if every script anyone might plausibly want to use is contained in Unicode.

More specifically, scripts and glyphs that have documented and valid use cases. If you made up a script today, you would have to start using it first (and gain acceptance of it in some community) before it would be eligible for inclusion in the Unicode standard. A good example is the power symbol (⏻, Unicode 9.0). The proposal for it neatly documented that it was in wide use already — in manuals in particular.

Emoji are a slightly different beast though. Those seem to get included based on projected use cases.

They used to be included because the Japanese had them in their encoding systems, but the situation now is far more fuzzy. Which is odd for a standard.
> but the situation now is far more fuzzy

It's basically: "text/social comment/chat apps are big, let's add more BS icons for our Facebook/Apple/Google/MS/etc chat apps"

> Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again

Technically speaking Unicode is not an encoding, but otherwise your point is mostly correct.

I guess UTF-8 is technically what we would call the encoding (with alternatives like UTF-32 that make different tradeoffs). But what would be the correct word for Unicode, if not encoding? I guess I could always say Unicode standard, but that feels like just avoiding the issue (for example we usually say SMTP protocol, not SMTP standard).
"Character Set" is usually the phrase.

A character set can be encoded in a variety of ways. For Unicode / ISO/IEC 10646, UTF-8 is the most popular encoding, for a variety of reasons that I'm sure will one day be an exciting historical artefact for HN readers to remark upon.
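
To make the character set vs. encoding split concrete, here is a minimal Python sketch. It takes the power symbol mentioned upthread (U+23FB) and serializes that one code point with three different encodings. The variable names are mine, purely for illustration.

```python
# One abstract code point from the character set ...
power = "\u23fb"  # U+23FB POWER SYMBOL

# ... and three different byte serializations (encodings) of it.
# The "-be" variants are big-endian and emit no byte order mark.
encodings = {
    name: power.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

for name, raw in encodings.items():
    print(f"{name:10} {len(raw)} bytes  {raw.hex(' ')}")

# Every encoding decodes back to the same single code point.
roundtrips = all(raw.decode(name) == power for name, raw in encodings.items())
```

The code point stays the same in every case; only the bytes on the wire differ, which is why "Unicode" by itself does not tell you what a file contains.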

I don't like the word character, because it tends to cause idiots to build software that thinks Unicode codepoints are the indivisible unit out of which strings are made, and that's no more true than for bytes. I prefer the nice fuzzy word "squiggle" when I mean the thing you as a human are perhaps imagining when saying "character" and to use nice technical terms like "pictogram", "grapheme", "glyph", "code point", "code unit", "symbol", and so on when I mean those specific technical things. But in the phrase "character set" that's what we ended up with, so be it.
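
A rough Python sketch of why code points are not the indivisible unit, using only the standard library. The sample strings are my own picks. Note that Python's `len()` counts code points, and the standard library has no grapheme segmentation, so the "one squiggle" count is stated in the comments rather than computed.

```python
import unicodedata

def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2

# One squiggle on screen: "e" followed by U+0301 COMBINING ACUTE ACCENT.
e_acute = "e\u0301"
code_points = len(e_acute)                   # 2 code points
utf8_bytes = len(e_acute.encode("utf-8"))    # 3 UTF-8 code units (bytes)
utf16 = utf16_units(e_acute)                 # 2 UTF-16 code units

# NFC normalization composes the pair into the single code point U+00E9,
# so the same squiggle can also be 1 code point.
composed = unicodedata.normalize("NFC", e_acute)

# Also one squiggle: thumbs up (U+1F44D) plus a skin tone modifier (U+1F3FD).
thumbs = "\U0001F44D\U0001F3FD"
thumbs_counts = (len(thumbs), utf16_units(thumbs), len(thumbs.encode("utf-8")))

# Treating code points as atoms breaks things: a naive reversal moves the
# combining accent in front of the "e", detaching it from its base letter.
reversed_naive = e_acute[::-1]
```

So depending on which unit you count, the same visible squiggle is 1, 2, 3, 4, or 8 of something, and slicing or reversing by any of those units can split it.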