| HN Mirror



	by panpog 390 days ago
	It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8. It's also not clear to me that the code point is a good abstraction in the design of UTF-8. Usually, what you want is either the byte or the grapheme cluster.
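For context on the byte-wise idea, here is a minimal sketch of how case pairs look in UTF-8 as it exists today. The helper name `utf8_hex` is made up for illustration. Within the Latin-1 Supplement the pairs happen to share the C3 lead byte and differ by 0x20 in the trailing byte, but that is not a general rule: uppercasing "ÿ" (U+00FF) lands in a different block with a different lead byte.

```python
def utf8_hex(s: str) -> str:
    """Return the UTF-8 encoding of s as space-separated uppercase hex bytes."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))

# Latin-1 Supplement pairs: same lead byte, trailing byte differs by 0x20.
for lower in ("\u00e9", "\u00f1", "\u00fc"):  # é, ñ, ü
    upper = lower.upper()
    print(lower, utf8_hex(lower), "->", upper, utf8_hex(upper))

# Counter-example: ÿ (U+00FF) uppercases to Ÿ (U+0178), so the lead byte changes too.
print(utf8_hex("\u00ff"), "->", utf8_hex("\u00ff".upper()))
```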

2 comments

karteum 390 days ago

> Usually, what you want is either the byte or the grapheme cluster.

Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/

"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."

I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been made with a dedicated 32-bit codepoint...)
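To make the quoted point concrete, a small sketch that counts UTF-32 code units. The keycap sequence used here is my own pick, not necessarily the grapheme the article had in mind, and `utf32_units` is a hypothetical helper name. The sequence displays as one character but needs three 4-byte units, and NFC normalization does not shrink it.

```python
import unicodedata

def utf32_units(s: str) -> int:
    """Number of 4-byte code units needed to encode s in UTF-32 (no BOM)."""
    return len(s.encode("utf-32-le")) // 4

# Keycap "1": DIGIT ONE + VARIATION SELECTOR-16 + COMBINING ENCLOSING KEYCAP.
keycap_one = "1\ufe0f\u20e3"

print(utf32_units(keycap_one))                                  # code units in UTF-32
print(utf32_units(unicodedata.normalize("NFC", keycap_one)))    # unchanged after NFC
```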


panpog 390 days ago

Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.


eviks 390 days ago

But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composing rules?


panpog 390 days ago

You still get the combinatoric explosion, but you have more bits to work with. Imagine if you could combine any 9 jamo into a single hangul syllable block. (The real combinatorics is more complicated, and I don't know if it's this bad.) Encoding just the 24 jamo and a control character requires 25 codepoints. Giving each syllable block its own codepoint would require 24^9>2^32 codepoints.
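A quick check of the arithmetic in the hypothetical above, next to the real numbers for modern Hangul as far as I know them: Unicode's precomposed Hangul Syllables block is laid out arithmetically from 19 leading consonants, 21 vowels, and 28 trailing slots (27 consonants plus "none"), starting at U+AC00. The function name `compose_syllable` is just for this sketch.

```python
# The hypothetical: any 9 of 24 jamo per block.
hypothetical_blocks = 24 ** 9
print(hypothetical_blocks > 2 ** 32)   # True, it does not fit in 32 bits

# Modern precomposed Hangul: 19 leads x 21 vowels x 28 trailing slots.
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28
print(L_COUNT * V_COUNT * T_COUNT)     # number of precomposed syllables

def compose_syllable(lead: int, vowel: int, trail: int = 0) -> str:
    """Map jamo indices (lead 0-18, vowel 0-20, trail 0-27) to a syllable."""
    if not (0 <= lead < L_COUNT and 0 <= vowel < V_COUNT and 0 <= trail < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + trail)

# Lead index 18 (hieut), vowel index 0 (a), trail index 4 (nieun) gives U+D55C.
print(compose_syllable(18, 0, 4))
```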


eviks 390 days ago

> Giving each syllable block its own codepoint

That's the thing - you wouldn't do that! Only a small subset of frequently used combos would get its own ID; the rest would only be composable.
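This is roughly how Unicode normalization already behaves, so here is a small sketch using Python's standard `unicodedata` module. Common combinations have a precomposed code point that NFC composes to, while a rarer combination such as "x" plus a combining acute accent simply stays as two code points.

```python
import unicodedata

def nfc(s: str) -> str:
    """Canonical composition (NFC) of s."""
    return unicodedata.normalize("NFC", s)

# "e" + COMBINING ACUTE ACCENT has a precomposed form, U+00E9.
print(len(nfc("e\u0301")))

# "x" + COMBINING ACUTE ACCENT has none, so it remains two code points.
print(len(nfc("x\u0301")))

# Conjoining Hangul jamo compose to a single precomposed syllable, U+D55C.
print(nfc("\u1112\u1161\u11ab") == "\ud55c")
```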


duskwuff 390 days ago

Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.

For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, these are considered two separate characters with their own uppercase and lowercase versions: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i").
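A minimal sketch of the difference in Python, whose `str.upper()` and `str.lower()` apply the default, locale-independent mappings. The `turkish_upper` and `turkish_lower` helpers are hypothetical names for this example and only handle the dotted/dotless i pairs, not full Turkish tailoring.

```python
# Default (untailored) mappings pair I with i.
print("I".lower() == "i", "i".upper() == "I")

# Hypothetical Turkish tailoring: remap the special pairs first, then
# fall back to the default mappings for everything else.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})   # i -> İ, ı -> I
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})   # I -> ı, İ -> i

def turkish_upper(s: str) -> str:
    return s.translate(_TR_UPPER).upper()

def turkish_lower(s: str) -> str:
    return s.translate(_TR_LOWER).lower()

print(turkish_upper("istanbul"))   # dotted capital İ at the start
print(turkish_lower("ISPARTA"))    # dotless ı at the start
```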


panpog 390 days ago

Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailored casing would make tailored casing harder.
