by targonca
2493 days ago
That's... a very bad idea. First of all, graphemes are inherently not one-to-one with code points, e.g. Á = A + U+0301 (combining acute accent). There's simply no Unicode encoding that will let you safely index into an array without paying attention to the meaning of the underlying code points (and no, using NFC won't solve this either, because there are combinations for which there's no composed equivalent). Secondly, general formatting info won't fit into 11 bits (italic, bold, underline, strikethrough: that's already 4 bits, and we haven't talked about color, font weights other than bold, etc.), so why bother baking a limited, intentionally gimped version into your character encoding?
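To make the grapheme point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The `nfc_len` helper and the choice of "q + combining acute" as a sequence with no precomposed form are illustrative assumptions, not anything from the thread.

```python
import unicodedata

composed = "\u00C1"        # Á as a single code point
decomposed = "A\u0301"     # A followed by COMBINING ACUTE ACCENT
no_precomposed = "q\u0301" # q + acute: Unicode has no single code point for this

def nfc_len(s: str) -> int:
    """Hypothetical helper: number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))

# Same visible character, different code point counts.
print(len(composed), len(decomposed))

# NFC folds A + U+0301 into the single code point U+00C1 ...
print(nfc_len(decomposed))

# ... but leaves q + U+0301 as two code points, so indexing by code point
# can still split one user-perceived character.
print(nfc_len(no_precomposed))
print(repr(no_precomposed[0]), repr(no_precomposed[1]))
```

Python strings index by code point, which is roughly what a UTF-32 array gives you, so the last line shows the base letter and the accent coming apart.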
It is not a "very bad idea"; it is "an idea you do not like." Those are different things.
The way you're describing UTF-32, it can't work at all, and it definitely does.
Trying to save space by using UTF-8 over UTF-32 seems like a very small gain to me, is all. UTF-32 is simpler, for text created in that encoding.
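For a rough sense of the trade-off, here is a small Python sketch comparing the two encodings on one sample string. The sample text and variable names are made up for illustration, and the size ratio depends heavily on the script being encoded.

```python
text = "na\u00EFve \U0001F600"    # "naïve" plus an emoji: 7 code points

utf8 = text.encode("utf-8")       # variable width: 1 to 4 bytes per code point
utf32 = text.encode("utf-32-le")  # fixed width: 4 bytes per code point, no BOM

print(len(text), len(utf8), len(utf32))

# With UTF-32, code point i always sits at byte offset 4 * i.
i = 6
code_point = utf32[4 * i : 4 * i + 4].decode("utf-32-le")
print(code_point == text[i])
```

The fixed offset is where the simplicity comes from. It addresses code points only, not user-perceived characters.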