| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by patrickthebold 3211 days ago
	I get the variable byte encodings. And I know that Unicode has things like U+0301 as you say, and so code points are not the same as characters/glyphs. But I don't understand why it was designed that way. Why is Unicode not simply an enumeration of characters.

3 comments

acdha 3211 days ago

It's important to distinguish between Unicode the spec, where people definitely do make that distinction, and implementations. Most of the problems are due to history: we have half a century of mostly English-speaking programmers assuming one character is one byte, especially bad when that's baked into APIs, and treating the problem as simpler than it really is.

Combining accents are a great example: if you're an American, especially in the 80s, it's easy to assume that you only need a couple of accents like you used in Spanish and French classes and that's really simple for converting old data to a new encoding. Later, it becomes obvious that far more are needed but by then there's a ton of code and data in the wild so you end up needing the concept of normalization for compatibility.

(That's the same lapse which lead to things like UCS-2 assuming 2^16 characters even though that's not enough for a full representation of Chinese alone.)

I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s. I remember impassioned rants about how nobody needed anything more than ASCII from programmers who didn't want to have to deal with iconv, thought encoding was too much hassle, claimed it was too slow, etc. as if that excused not being able to handle valid requests. About a decade ago I worked at a major university where the account management system crashed on apostrophes or accents (in a heavily Italian town!) and it was just excused as the natural order of things so the team could work on more interesting problems.

link

Avernar 3211 days ago

One reason is because it would take a lot more code points to describe all the possible combinations.

Take the country flag emoji. They're actually two seperate code points. The 26 code points used are just special country code letters A to Z. The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points.

Another example is the new skin tone emoji. The new codes are just the colour and are put in front of the existing emoji codes. Existing software just shows the normal coloured emoji but you may see a square box or question mark symbol in front of it.

link

coldtea 3211 days ago

>The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points. Another example is the new skin tone emoji.

Still not answering the question though.

For one, when the unicode standard was originally designed it didn't have emoji in it.

Second, if it was limitations to the arbitrary addition of thousands of BS symbols like emoji that necessitate such a design, we could rather do without emojis in unicode at all (or klingon or whatever).

So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Using less memory (like utf-8 allows) I guess is a valid concern.

link

Avernar 3211 days ago

It didn't have emoji but it did have other combining characters. While some langages it's feasable to normalize them to single code points but other langagues it would not be.

Plus the fact that some visible characters are made up of many graphemes the number of single code points would be huge.

As to your second point it seems to me to be a little close minded. The whole point of a universal character set was that languages can be added to it whether they be textual, symbolic or pictographic.

link

coldtea 3211 days ago

>As to your second point it seems to me to be a little close minded. The whole point of a universal character set was that languages can be added to it whether they be textual, symbolic or pictographic.

Representing all languages is ok as a goal -- adding klingon and BS emojis not so much (from a sanity perspective, if adding them meddled with having a logical and simple representation of characters).

So, it comes to "the fact that some visible characters are made up of many graphemes the number of single code points would be huge" and "while some languages it's feasable to normalize them to single code points but other langagues it would not be".

Wouldn't 32 bits be enough for all possible valid combinations? I see e.g. that: "The largest corpus of modern Chinese words is as listed in the Chinese Hanyucidian (汉语辞典), with 370,000 words derived from 23,000 characters".

And how many combinations are there of stuff like Hangul? I see that's 11,172. Accents in languages like Russian, Hungarian, Greek should be even easier.

Now, having each accented character as a separate might take some lookup tables -- but we already require tons of complicated lookup tables for string manipulation in UTF-8 implementations IIRC.

link

Avernar 3211 days ago

You might be correct and 32 bits could have been enough but Unicode has restricted code points to 21 bits. Why? Because of stupid UTF-16 and surrogate pairs.

I'm curious why you think that UTF-8 requires complicated lookup tables.

link

coldtea 3209 days ago

>I'm curious why you think that UTF-8 requires complicated lookup tables.

Because in the end it's still a Unicode encoding, and still has to deal with BS like "equivalence", right?

Which is not mechanically encoded in the err, encoding (e.g. all characters with the same bit pattern there are equivalent) but needs external tables for that.

link

WorldMaker 3211 days ago

> So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Because language is messy. At some point you have to start getting into the raw philosophy of language and it's not just a technical problem at that point but a political problem and an emotional problem.

Take accents as one example: in English a diaresis is a rare but sometimes useful accent mark to distinguish digraphs (coöperate should be pronounced as two Os, not one OOOH sound like in chicken coop) the letter stays the same it just has "bonus information"; in German an umlaut version of a letter (ö versus o) is considered an entirely different letter, with a different pronunciation and alphabet order (though further complicated by conversions to digraphs in some situations such as ö to oe).

Which language is "right"? The one that thinks that diaresis is merely a modifier or the one that thinks of an accented letter as a different letter from the unmodified? There isn't a right and wrong here, there's just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar looking concepts for vastly different needs.

Similarly the Spanish ñ is single letter to Spanish but the ~ accent may be a tone marker in another language that is important to the pronunciation of the word and a modifier to the letter rather a letter on its own.

There's the case of the overlaps where different alphabets diverged from similar origins. Are the letters that still look alike the same letters? [1]

Math is a language with a merged alphabet of latin characters, arabic characters, greek characters, monastery manuscript-derived shorthands, etc. Is the modern Greek Pi the same as the mathematical symbol Pi anymore? Do they need different representations? Do you need to distinguish, say in the context of modern Greek mathematical discussions the usage of Pi in the alphabet versus the usage of mathematical Pi?

These are just the easy examples in the mostly EFIGS space most of HN will be aware of. Multiply those sorts of philosophical complications across the spectrum of languages written across the world, the diversity of Asian scripts, and the wonder of ancient scripts, and yes the modern joy of emoji. Even "normalization" is a hack where you don't care about the philosophical meaning of a symbol, you just need to know if the symbols vaguely look alike, and even then there are so many different kinds of normalization available in Unicode because everyone can't always agree which things look alike either, because that changes with different perspectives from different languages.

[1] An excellent Venn diagram: https://en.wikipedia.org/wiki/File:Venn_diagram_showing_Gree...

link

pseudalopex 3211 days ago

Some languages use multiple marks on a single base character. The combining character system is more flexible, uses fewer code points, and doesn't require lookup tables to search or sort by base character.

link