by snogglethorpe 4947 days ago
The problem is that handling 1 unit is very different from handling 2+ units in terms of coding patterns, whereas handling 3 is not so different from handling 4+. In the latter case there's probably already a loop to handle multi-unit characters, which will in many cases work without change for longer sequences (and if not, the code probably requires very little change to do so).

So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.
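
To make that concrete, here is a rough sketch of the kind of loop I mean. The function names (decode_utf8, utf16_units) are made up for illustration, and the decoder assumes well-formed input and does no validation, so it is not something to ship. The point is only that once the continuation-byte loop exists for 2-byte sequences, 3- and 4-byte sequences fall out of the same loop.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode well-formed UTF-8 into code points (illustrative, no validation)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte says how many continuation bytes follow.
        if lead < 0x80:
            cp, extra = lead, 0          # 1-unit case (ASCII)
        elif lead >> 5 == 0b110:
            cp, extra = lead & 0x1F, 1   # 2-unit sequence
        elif lead >> 4 == 0b1110:
            cp, extra = lead & 0x0F, 2   # 3-unit sequence
        else:
            cp, extra = lead & 0x07, 3   # 4-unit sequence (assumed well-formed)
        # The same loop serves every multi-unit length.
        for b in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += 1 + extra
    return code_points


def utf16_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


sample = "a\u00e9\u20ac\U0001F600"  # 1-, 2-, 3- and 4-byte characters in UTF-8
print([hex(cp) for cp in decode_utf8(sample.encode("utf-8"))])
print(utf16_units("\u20ac"), utf16_units("\U0001F600"))
```

By contrast, code that treats every UTF-16 unit as one character has no such loop to begin with: utf16_units reports 2 for the last character in the sample, and a program that never anticipated that case will split it in half.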