Hacker News
by Avernar 3212 days ago
You might be correct, and 32 bits could have been enough, but Unicode has restricted code points to 21 bits (nothing above U+10FFFF). Why? Because of stupid UTF-16 and surrogate pairs.
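
For anyone wondering where the 21 bits come from, here is a rough sketch of the arithmetic in Python. The surrogate ranges are the standard ones; the names (`HIGH_SURROGATES`, `to_surrogates`, etc.) are made up for illustration.

```python
# Sketch: how UTF-16 surrogate pairs cap the code space at U+10FFFF.
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # 1024 possible lead units
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # 1024 possible trail units
BMP_SIZE = 0x10000                      # code points reachable with one 16-bit unit

# Each (high, low) pair addresses one code point beyond the BMP.
supplementary = HIGH_SURROGATES * LOW_SURROGATES   # 2**20
max_code_point = BMP_SIZE + supplementary - 1
bits_needed = max_code_point.bit_length()


def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                 # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(max_code_point), bits_needed)
print([hex(u) for u in to_surrogates(0x1F600)])
```

So the limit is exactly what a pair of 10-bit halves plus the BMP can address, and no more.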

I'm curious why you think that UTF-8 requires complicated lookup tables.


>I'm curious why you think that UTF-8 requires complicated lookup tables.

Because in the end it's still a Unicode encoding, and still has to deal with BS like "equivalence", right?

And that isn't mechanically captured in the, err, encoding (i.e. it's not the case that two characters are equivalent exactly when they have the same bit pattern); it needs external tables.
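
A small Python sketch of what "equivalence needs tables" looks like in practice. It uses the standard-library `unicodedata` module, which ships the Unicode data tables; the helper name `canonically_equal` is just an illustration, and NFC is only one of the normalization forms you could pick.

```python
import unicodedata

composed = "\u00e9"       # "é" as a single code point
decomposed = "e\u0301"    # "e" followed by a combining acute accent

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# The byte patterns differ in every encoding, so comparing bytes
# cannot detect that the two strings are canonically equivalent.
for enc in ENCODINGS:
    print(enc, composed.encode(enc).hex(), decomposed.encode(enc).hex())


def canonically_equal(a, b):
    """Compare two strings after NFC normalization (table-driven)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                   # raw comparison
print(canonically_equal(composed, decomposed))  # comparison after normalization
```

The table lookup happens on code points, after decoding, so it looks the same whichever encoding the bytes arrived in.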

But that's the same for UTF-16 and UTF-32. That's why I was wondering why you singled UTF-8 out, implying it needed extra handling.

Nah, I didn't single it out. I asked why we don't have a 32-bit encoding with fixed-size code points, no surrogate-pair BS, etc.

And I added that while this might need some lookup tables, we already have those in UTF-8 too anyway (a non-fixed-width encoding).

So the reason I didn't mention UTF-16 and UTF-32 is that those are already fixed-size to begin with (and increasingly less used nowadays, except on platforms stuck with them for legacy reasons) -- so the "competitor" encoding would be UTF-8, not them.
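
To make the fixed-width vs. variable-width point concrete, here is a minimal Python sketch that measures how many bytes each code point takes in the three encodings. The sample string and the helper `nth_code_point_utf32` are made up for illustration.

```python
# Sample: "a", "é" (U+00E9), "€" (U+20AC), and an emoji outside the BMP (U+1F600).
text = "a\u00e9\u20ac\U0001F600"

# Bytes per code point in each encoding (little-endian variants, no BOM).
utf8_widths = [len(ch.encode("utf-8")) for ch in text]
utf16_widths = [len(ch.encode("utf-16-le")) for ch in text]
utf32_widths = [len(ch.encode("utf-32-le")) for ch in text]

print("utf-8 ", utf8_widths)
print("utf-16", utf16_widths)
print("utf-32", utf32_widths)


def nth_code_point_utf32(data, n):
    """Fetch the n-th code point from UTF-32-LE bytes by direct offset."""
    return chr(int.from_bytes(data[4 * n:4 * n + 4], "little"))


# With a fixed 4-byte width, indexing is plain arithmetic, no scanning needed.
print(nth_code_point_utf32(text.encode("utf-32-le"), 3) == text[3])
```

Worth noting that in this sketch only UTF-32 comes out constant-width: UTF-16 needs 4 bytes (a surrogate pair) for anything above U+FFFF, so it is fixed-size only within the BMP.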