| HN Mirror

	by wongarsu 2669 days ago
	Mostly scholars. But even if nobody at all were using it currently, the explicit goal of Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again (which has mostly worked so far). That goal can only be reached and maintained if every script anyone might plausibly want to use is contained in Unicode.

Freak_NL 2669 days ago

More specifically, scripts and glyphs that have documented and valid use cases. If you made up a script today, you would have to start using it first (and gain acceptance of it in some community) before it would be eligible for inclusion in the Unicode standard. A good example is the power symbol (⏻, Unicode 9.0). The proposal for it neatly documented that it was in wide use already — in manuals in particular.

Emoji are a slightly different beast though. Those seem to get included based on projected use cases.

epse 2669 days ago

They used to be included because the Japanese had them in their encoding systems, but the situation now is far more fuzzy. Which is odd for a standard.

coldtea 2668 days ago

> but the situation now is far more fuzzy

It's basically: "text/social comment/chat apps are big, let's add more BS icons for our Facebook/Apple/Google/MS/etc chat apps"

josteink 2669 days ago

> Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again

Technically speaking Unicode is not an encoding, but otherwise your point is mostly correct.

wongarsu 2669 days ago

I guess UTF-8 is technically what we would call the encoding (with alternatives like UTF-32 with other tradeoffs). But what would be the correct word for Unicode, if not encoding? I guess I could always say Unicode standard, but that feels like just avoiding the issue (for example we usually say SMTP protocol, not SMTP standard).

tialaramex 2669 days ago

"Character Set" is usually the phrase.

A character set can be encoded in a variety of ways. For Unicode / ISO-10646, UTF-8 is the most popular encoding, for a variety of reasons that I'm sure will one day be an exciting historical artefact for HN readers to remark upon.
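A minimal Python sketch of that split, using the power symbol mentioned upthread as the example. The variable names are only illustrative; `ord` and `str.encode` are standard Python. The character set gives the symbol one number, and each encoding turns that number into a different byte sequence.

```python
# The character set assigns each character a number (its code point).
power = "\u23fb"  # the power symbol, U+23FB
code_point = ord(power)

# An encoding turns that number into bytes; different encodings give
# different bytes for the same code point.
encoded = {
    name: power.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

print(f"U+{code_point:04X}")
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}  ({len(raw)} bytes)")
```

Here the one code point takes three bytes in UTF-8, two in UTF-16 and four in UTF-32, which is the kind of tradeoff mentioned above.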

I don't like the word character, because it tends to cause idiots to build software that thinks Unicode codepoints are the indivisible unit out of which strings are made, and that's no more true than for bytes. I prefer the nice fuzzy word "squiggle" when I mean the thing you as a human are perhaps imagining when saying "character" and to use nice technical terms like "pictogram", "grapheme", "glyph", "code point", "code unit", "symbol", and so on when I mean those specific technical things. But in the phrase "character set" that's what we ended up with, so be it.
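A rough Python illustration of why code points are not the indivisible unit of a string. The helper `counts` and the two sample strings are made up for this sketch. It only counts code points and code units; it does not segment graphemes, which needs the rules from Unicode Standard Annex #29 and is not something the Python standard library does for you.

```python
import unicodedata


def counts(s):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for s."""
    code_points = len(s)  # Python's len() counts code points
    utf16_units = len(s.encode("utf-16-le")) // 2  # each code unit is 2 bytes
    utf8_bytes = len(s.encode("utf-8"))  # UTF-8 code units are single bytes
    return code_points, utf16_units, utf8_bytes


# One squiggle on screen, two code points: "e" plus a combining acute accent.
e_acute = "e\u0301"
# One squiggle, two code points: thumbs up plus a skin tone modifier.
thumbs_up = "\U0001F44D\U0001F3FD"

print(counts(e_acute))    # (2, 2, 3)
print(counts(thumbs_up))  # (2, 4, 8)

# Normalisation can merge some sequences into a single code point, but not all.
print(len(unicodedata.normalize("NFC", e_acute)))    # 1
print(len(unicodedata.normalize("NFC", thumbs_up)))  # 2
```

Slicing either string at index 1 would cut a single squiggle in half, which is exactly the bug that treating code points as "characters" produces.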
