Hacker News
by daoxid 2669 days ago
Are the more fancy scripts supported by Unicode used by real people in production? By scholars? With special fonts? Or is it more like Unicode just wanting to support everything, even though the target audience is actually using something else?

Asking because I'm impressed by the aim of the whole Unicode project but have no real experience with it beyond the basics.

2 comments

You will need a "special" font to visualise the text, but depending on the writing system it may be enough for someone to simply make one new glyph for each of the characters in your system and add it to a general purpose "everything" font. For some writing systems you need more powerful technology because, for example, the system has complicated rules about how shapes fit together and are transformed by adjacent shapes.

For practical purposes there isn't "something else". We're well past the point where Unicode was adding things that worked fine on a specially modified edition of Microsoft Windows for the specific language (like Dungan, which needs extra characters not normally used in Cyrillic) or whatever. These are now often _really obscure_ writing systems where previously you'd only put them "on a computer" by uploading a picture of the writing. Now the computer can handle them as text because they're in Unicode.

For all the historical writing systems, and for some of the minority systems with very few users, many of whom know another, more widely used language that is more useful to them in practice (imagine going on a forum to ask a question about maintaining the motor sledge you use: you know Russian and also Dungan, so obviously you will ask in Russian, because that's a LOT more people who might answer), the new scripts in Unicode will in practice only be used by academics to transcribe stuff. It still makes that easier, because they can use Unicode everywhere, not just in specialist tools that maybe another researcher built for the language they care about.

Mostly scholars. But even if nobody at all were using it currently, the explicit goal of Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again (which has mostly worked so far). That goal can only be reached/maintained if every script anyone might plausibly want to use is contained in Unicode.
More specifically, scripts and glyphs that have documented and valid use cases. If you made up a script today, you would have to start using it first (and gain acceptance of it in some community) before it would be eligible for inclusion in the Unicode standard. A good example is the power symbol (⏻, Unicode 9.0). The proposal for it neatly documented that it was in wide use already — in manuals in particular.

Emoji are a slightly different beast though. Those seem to get included based on projected use cases.

They used to be included because the Japanese had them in their encoding systems, but the situation now is far more fuzzy. Which is odd for a standard.
> but the situation now is far more fuzzy

It's basically: "text/social comment/chat apps are big, let's add more BS icons for our Facebook/Apple/Google/MS/etc chat apps"

> Unicode is to support all scripts. Unicode is meant to make all other text encodings obsolete so the world never has to think about text encodings again

Technically speaking Unicode is not an encoding, but otherwise your point is mostly correct.

I guess UTF-8 is technically what we would call the encoding (with alternatives like UTF-32 with other tradeoffs). But what would be the correct word for Unicode, if not encoding? I guess I could always say Unicode standard, but that feels like just avoiding the issue (for example we usually say SMTP protocol, not SMTP standard).
"Character Set" is usually the phrase.

A character set can be encoded in a variety of ways. For Unicode / ISO-10646, UTF-8 is the most popular encoding, for a variety of reasons that I'm sure will one day be an exciting historical artefact for HN readers to remark upon.
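
To make the character set vs. encoding distinction concrete, here is a minimal Python sketch. It takes one code point (the power symbol mentioned upthread, U+23FB) and serializes it with three different Unicode encodings. The variable names are just for illustration.

```python
# One abstract code point from the Unicode character set...
ch = "\u23fb"  # U+23FB POWER SYMBOL
code_point = ord(ch)

# ...and three different byte-level encodings of that same code point.
utf8 = ch.encode("utf-8")       # variable width, 1 to 4 bytes per code point
utf16 = ch.encode("utf-16-be")  # 2 or 4 bytes per code point
utf32 = ch.encode("utf-32-be")  # always 4 bytes per code point

print(f"U+{code_point:04X}")
print("UTF-8 :", utf8.hex(), f"({len(utf8)} bytes)")
print("UTF-16:", utf16.hex(), f"({len(utf16)} bytes)")
print("UTF-32:", utf32.hex(), f"({len(utf32)} bytes)")

# Decoding any of them gives back the same character.
roundtrip_ok = (
    utf8.decode("utf-8") == ch
    and utf16.decode("utf-16-be") == ch
    and utf32.decode("utf-32-be") == ch
)
print("round trip ok:", roundtrip_ok)
```

The code point is the same in all three cases; only the bytes on the wire differ, which is the tradeoff between UTF-8 and UTF-32 mentioned above.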

I don't like the word character, because it tends to cause idiots to build software that thinks Unicode codepoints are the indivisible unit out of which strings are made, and that's no more true than for bytes. I prefer the nice fuzzy word "squiggle" when I mean the thing you as a human are perhaps imagining when saying "character" and to use nice technical terms like "pictogram", "grapheme", "glyph", "code point", "code unit", "symbol", and so on when I mean those specific technical things. But in the phrase "character set" that's what we ended up with, so be it.
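
A small Python sketch of why code points are not the indivisible unit of a string. It assumes nothing beyond the standard library, and the example strings are my own picks: an "é" written two ways, and a thumbs-up with a skin tone modifier.

```python
import unicodedata

# The same squiggle, "é", spelled two different ways.
composed = "\u00e9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: "e" + COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # Python's len() counts code points
print(composed == decomposed)          # naive comparison treats them as different

# Normalization maps both spellings to the same code point sequence.
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed
print(normalized_equal)

# One squiggle on screen, several code points, even more code units.
thumbs = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN + skin tone modifier
n_code_points = len(thumbs)
n_utf16_units = len(thumbs.encode("utf-16-le")) // 2  # 16-bit code units
n_utf8_units = len(thumbs.encode("utf-8"))            # 8-bit code units
print(n_code_points, n_utf16_units, n_utf8_units)

# Slicing by code point can split a squiggle apart.
base_only = decomposed[:1]  # just "e"; the accent has been cut off
print(base_only)
```

Counting what a human would call one "character" needs grapheme cluster segmentation, which the Python standard library does not provide; this sketch only shows that code point counts and code unit counts are not that number.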