| HN Mirror



	by Avernar 3212 days ago
	You might be correct that 32 bits would have been enough, but Unicode has restricted code points to 21 bits. Why? Because of stupid UTF-16 and surrogate pairs. I'm curious why you think that UTF-8 requires complicated lookup tables.
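
For readers following the 21-bit point, here is a minimal sketch of the surrogate-pair arithmetic behind that limit. The names `to_surrogates` and `MAX_CODE_POINT` are made up for illustration and are not from any library; the constants are the standard UTF-16 surrogate ranges.

```python
# Sketch: why surrogate pairs cap Unicode at U+10FFFF (which fits in 21 bits).
HIGH_START = 0xD800  # first of 1024 high (lead) surrogates
LOW_START = 0xDC00   # first of 1024 low (trail) surrogates


def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000  # a 20-bit value
    high = HIGH_START + (offset >> 10)    # top 10 bits select the high surrogate
    low = LOW_START + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


# The BMP ends at U+FFFF; 1024 * 1024 pair combinations extend past it.
MAX_CODE_POINT = 0xFFFF + 1024 * 1024

if __name__ == "__main__":
    print([hex(u) for u in to_surrogates(0x1F600)])
    print(hex(MAX_CODE_POINT), MAX_CODE_POINT.bit_length())
```

A pair carries only 10 + 10 bits of payload, so the highest value UTF-16 can reach is U+10FFFF, and the other encodings are held to the same ceiling.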


coldtea 3210 days ago

>I'm curious why you think that UTF-8 requires complicated lookup tables.

Because in the end it's still a Unicode encoding, and still has to deal with BS like "equivalence", right?

Which is not mechanically encoded in the, err, encoding (e.g. by giving all equivalent characters the same bit pattern), so it needs external tables.
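
To make the "equivalence" point concrete, here is a small sketch using Python's standard `unicodedata` module. The helper name `canonically_equal` is made up for illustration. It shows two canonically equivalent spellings of "é" that differ at the byte level in UTF-8, UTF-16 and UTF-32 alike, and only compare equal after normalization, which relies on the Unicode character database rather than on anything in the byte encoding.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def encodings_differ(a, b):
    """Return True if the two strings have different bytes in every encoding tried."""
    return all(
        a.encode(enc) != b.encode(enc)
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )


def canonically_equal(a, b):
    """Compare after NFC normalization, which uses the Unicode data tables."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(encodings_differ(composed, decomposed))
    print(canonically_equal(composed, decomposed))
```

The table lookup happens at the code point level, so the choice of UTF-8 versus a fixed-width encoding does not change whether it is needed.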


Avernar 3210 days ago

But that's the same for UTF-16 and UTF-32. That's why I was wondering why you singled UTF-8 out, implying it needed extra handling.


coldtea 3209 days ago

Nah, I didn't single it out. I asked why we don't have an encoding with fixed-size 32-bit code points, no surrogate-pair BS, etc.

And I added that while this might need some lookup tables, we already have those in UTF-8 too anyway (a non-fixed-width encoding).

So the reason I didn't mention UTF-16 and UTF-32 is that those are already fixed-size to begin with (and increasingly less used nowadays, except on platforms stuck with them for legacy reasons), so the "competitor" encoding would be UTF-8, not them.
