by ajross
4040 days ago
And as the linked article explains, UTF-16 is a huge mess of complexity, with back-dated validation rules that had to be added because it stopped being a fixed-width wide-character encoding once code points beyond U+FFFF were added. Implemented correctly, UTF-16 is actually significantly more complicated to get right than UTF-8. UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used for ASCII-heavy text. I don't know of anything that uses it in practice, though surely something does. Again: wide characters are a hugely flawed idea.
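To make the size and validation points concrete, here is a rough Python sketch. The helper name `encoded_sizes` and the sample strings are mine, purely for illustration. It only shows how the three encodings behave on one ASCII letter and one code point outside the BMP, plus one example of the kind of validation UTF-16 needs.

```python
def encoded_sizes(text):
    """Return how much space `text` takes in each encoding (hypothetical helper)."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units are 2 bytes each; "-le" avoids emitting a BOM.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf32_bytes": len(text.encode("utf-32-le")),
    }

ascii_sizes = encoded_sizes("A")            # plain ASCII letter
astral_sizes = encoded_sizes("\U0001F600")  # one code point above U+FFFF

# A high surrogate followed by an ordinary "A" instead of a low surrogate:
# malformed UTF-16 that a correct decoder has to reject.
lone_surrogate = b"\x00\xd8\x41\x00"
try:
    lone_surrogate.decode("utf-16-le")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The ASCII letter costs 1 byte in UTF-8 and 4 in UTF-32, which is where the 4x figure comes from. The astral code point takes two UTF-16 code units (a surrogate pair), so UTF-16 is variable-width just like UTF-8, with the extra burden of rejecting unpaired surrogates.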
Namely it won't save you from the following problems:
And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that UTF-16, and even more so UTF-8, break the "1 character, 1 glyph" rule, because it gets you in the mindset that the rule is bogus. In Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.
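One illustration of why UCS-4 doesn't rescue the rule, as a minimal Python sketch (the variable names are mine, and this is just one example of the problem, not a full list): the same visible "é" can be one code point or two, and a fixed-width encoding faithfully preserves that difference.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Both render as one glyph, but they differ in length and compare unequal.
composed_len = len(composed)
decomposed_len = len(decomposed)
naive_equal = composed == decomposed

# Even in UTF-32 the decomposed form occupies two 4-byte units.
decomposed_utf32_units = len(decomposed.encode("utf-32-le")) // 4

# Normalizing to NFC makes the two spellings compare equal.
normalized_equal = unicodedata.normalize("NFC", decomposed) == composed
```

So indexing by code point, which is all UTF-32 buys you, still doesn't mean indexing by what a user would call a character.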