Hacker News
by arp242 302 days ago
These are not "unresolved issues"; these are opinionated views. That's okay, but please don't fool yourself into thinking this is somehow objective fact, because it's just not. Encoding all human scripts in one encoding was always going to involve some choices, and no matter which choices were made, some people were going to disagree with them.

I have no idea which "characters common in ordinary books" are missing; the explicit goal of Unicode is to encode exactly that sort of thing.


Unicode's original mission was to encode all characters needed for written communication in the world.

Han Unification fails this mission and that is not a matter of opinion.

"I have no idea which "characters common in ordinary books" are missing; the explicit goal of Unicode is to encode exactly that sort of thing."

In

«Günther a souligné l’ambigüité de son discours.»

there is an umlaut and a dieresis.

They are different characters with different functions. In traditional book printing they used to look different, and quality fonts still have both. Unfortunately, Unicode does not encode them as separate characters.
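
To make the encoding point concrete, here is a minimal Python sketch that inspects the two letters in the sentence above. It is only an illustration: it uses the standard `unicodedata` module, and the variable names (`sentence`, `umlaut`, `trema`) are made up for the example.

```python
import unicodedata

# The example sentence, with non-ASCII letters written as escapes so the
# exact code points are unambiguous.
sentence = "G\u00fcnther a soulign\u00e9 l\u2019ambig\u00fcit\u00e9 de son discours."

umlaut = sentence[1]                           # the "ü" in "Günther"
trema = sentence[sentence.index("ambig") + 5]  # the "ü" in "ambigüité"

# Both letters are the same code point with the same character name.
print(f"U+{ord(umlaut):04X}", unicodedata.name(umlaut))
# U+00FC LATIN SMALL LETTER U WITH DIAERESIS
print(f"U+{ord(trema):04X}", unicodedata.name(trema))
# U+00FC LATIN SMALL LETTER U WITH DIAERESIS
print(umlaut == trema)  # True

# Canonical decomposition yields the same combining mark in either case.
decomposed = unicodedata.normalize("NFD", umlaut)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0075', 'U+0308']
```

In plain text, then, nothing at the code point level tells a program which of the two marks was meant.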

To claim that these letters "cannot be represented" is just outright bizarre. You literally did so yourself. Expecting Unicode to contain a codepoint for every single rendering variation is not realistic and the line must be drawn somewhere, with other rendering information provided in another way (e.g. lang=de, font-style, whatnot).

You can disagree with how Unicode does this (or how other encodings do it, for that matter) but this is just an utterly disingenuous thing to say. I no longer believe you are engaging in good faith. You have either not understood Unicode or you're intentionally misrepresenting it. Goodbye.

"To claim that these letters "cannot be represented" is just outright bizarre. You literally did so yourself."

I did not. In every book printed before 1950 and every quality book printed now, the different characters would actually look different. This is not about rendering variations but about different characters (linguistically and functionally, e.g. wrt collation) that coincidentally look similar and that Unicode conflates.
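
The collation point can be sketched in a few lines of Python. This is a simplified stand-in, not a real collation implementation: `phonebook_key` is a hypothetical helper loosely modelled on German phone-book ordering, where umlauted vowels are sorted as ae/oe/ue.

```python
def phonebook_key(word: str) -> str:
    """Hypothetical sort key: expand umlauted vowels the way German
    phone-book ordering does (a simplification for illustration)."""
    expansions = str.maketrans({
        "\u00e4": "ae",  # ä
        "\u00f6": "oe",  # ö
        "\u00fc": "ue",  # ü
        "\u00df": "ss",  # ß
    })
    return word.lower().translate(expansions)

# Correct for the German name: the umlaut expands to "ue".
print(phonebook_key("G\u00fcnther"))        # guenther

# Wrong for the French word: the tréma only marks that the "u" is
# pronounced, so expanding it to "ue" makes no linguistic sense.
print(phonebook_key("ambig\u00fcit\u00e9"))  # ambigueité
```

Because both words carry the same code point, the key function cannot treat them differently unless the language is supplied from outside the text.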

Here is a source from DIN (Deutsches Institut für Normung) with more background:

https://www.unicode.org/L2/L2003/03215-n2593-umlaut-trema.pd...

If you think it's just crazy Germans arguing a moot point, Yannis Haralambous has a paragraph specifically about the umlaut/trema issue in his O'Reilly book "Fonts & Encodings".

Haven't read the book yet, but isn't that more a matter of the font/rendering engine? I have a murky notion that for Cyrillic, for example, there are some nuances in rendering certain glyphs in cursive between languages [1], but these nuances are usually resolved by the font and the client cooperating to interpret language hints, so not in the "physical" text.

(Not saying I see this as a good thing or anything: it is way beyond my expertise; I definitely can see the motivation for introducing as many variants in the Unicode register as there are in the real world)

Isn't the umlaut vs trema/diaeresis in a similar situation?

[1] made me test it and cobble together a demo. (Sadly, I don't speak any of these languages, so I cannot verify it is correct; I just wanted to see the difference in practice):

    data:text/html;charset=utf-8;verbatim,<style>
    @import url("https://fonts.googleapis.com/css2?family=Noto+Sans:ital@0;1");
    body { font-family: 'Noto Sans'; }
    dl:hover i { font-style: normal; }
    </style>
    <dl>
    <dt>lang="ru"
    <dd lang="ru"><i>грипп, практика, график, типа</i>
    <dt>lang="sr"
    <dd lang="sr"><i>грипп, практика, график, типа</i>
    </dl>

Arguably, depending on a wide (physical text ↔ specific font ↔ rendering agent) ecosystem feels quite fragile, but I cannot tell if there is any better alternative for this particular case.

https://myfonj.github.io/sandbox.html#%3C!doctype%20html%3E%...

>Expecting Unicode to contain a codepoint for every single rendering variation

It's not just rendering variations. While they are etymologically related, they are made with different strokes and it is incorrect to substitute one for another.

Technically, Unicode has variation selectors that can be used to select between variants of a character, but they do not have sufficient adoption. That means pretty much everything has to annotate what language it is written in so it can be rendered correctly; otherwise the system has to check its settings to guess which language the user likely wants to see things rendered as.
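
As a rough illustration of what a variation sequence looks like in plain text, here is a small Python sketch. The helper names are hypothetical, and the choice of U+845B followed by U+E0100 is only an example; I am assuming that sequence is registered, and whether a given font honors it is outside what this code can show.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF (the two main
    variation selector blocks; Mongolian selectors are not covered)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

base = "\u845b"                   # a unified ideograph
variant = base + "\U000e0100"     # same ideograph plus a variation selector

print(len(base), len(variant))                    # 1 2
print(unicodedata.name(variant[1]))               # VARIATION SELECTOR-17
print(strip_variation_selectors(variant) == base)  # True
```

The selector is an extra code point that software is free to ignore, which is part of why text without it still depends on language annotation or system settings to pick a glyph.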