by asveikau
4040 days ago
Sure, go to 32 bits per character. But it cannot be said to be "simple", and it will not let you assume that 1 integer = 1 glyph. Namely, it won't save you from the following problems:

* Precomposed vs. multi-codepoint diacritics (do you write á with one 32-bit char or with two? In Unicode the answer is both)
* Variation selectors (see also Han unification)
* Bidi, RTL and LTR embedding chars

And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that UTF-16, and more so UTF-8, break the "1 character, 1 glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.
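To make the precomposed vs. multi-codepoint point concrete, here is a minimal Python sketch. It is an illustration, not anything from a particular codebase: the variable names are made up, and it relies only on the standard `unicodedata` module. Python 3 strings are sequences of code points, so `len()` here behaves like counting 32-bit units would.

```python
import unicodedata

# Two spellings of the same visible glyph (variable names are illustrative).
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# One glyph on screen, but a different number of code points.
print(len(precomposed), len(decomposed))   # 1 2

# Even at 32 bits per code point the two spellings differ in size.
print(len(precomposed.encode("utf-32-le")), len(decomposed.encode("utf-32-le")))  # 4 8

# A naive comparison treats them as different strings.
print(precomposed == decomposed)           # False

# Normalizing both sides to the same form makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)   # True True
```

Normalization only addresses this one bullet. It does nothing for variation selectors or bidi controls, and not every base-plus-combining sequence has a precomposed form to fold into.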
Perl6 calls this NFG [1].
[1] http://design.perl6.org/S15.html
^ The link is currently broken; the plain-text version is at https://raw.githubusercontent.com/perl6/specs/master/S15-uni...