by acdha
1379 days ago
> It seems to me it would be trivial to enumerate these combinations, and assign code points to them. For example, the Germanic umlaut is only used with vowels, so that's at most 5 code points.

Well, 10 code points because vowels can be capitalized, and 12 because ÿ is used in other languages. That's one of the easiest cases. Now you need to go through _every_ other language which has _ever_ been used in human history and repeat that process for every combining character. Note also that in some languages it's valid to keep stacking a fair number of combining modifiers, so you'd need to cover every permutation allowed in each of them, and spend a lot of time working with linguists and classicists to make sure you weren't removing obscure combinations which are actually needed. At the end of years of work, you'd have an encoding which is easier for C programmers to think about but means all of your documents require substantially more storage than they used to.
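To make the precomposed-versus-combining distinction concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper `describe` and the two sample characters are my own choices for illustration, not something from the thread.

```python
import unicodedata

def describe(text):
    """Return (list of code points, UTF-8 byte length) for a string."""
    return [f"U+{ord(ch):04X}" for ch in text], len(text.encode("utf-8"))

# German u-umlaut: one precomposed code point, or base letter + combining diaeresis.
precomposed = "\u00fc"
decomposed = unicodedata.normalize("NFD", precomposed)

# Vietnamese e with circumflex and dot below: two marks stacked on one base letter.
stacked = "\u1ec7"
stacked_nfd = unicodedata.normalize("NFD", stacked)

print(describe(precomposed))   # (['U+00FC'], 2)
print(describe(decomposed))    # (['U+0075', 'U+0308'], 3)
print(describe(stacked))       # (['U+1EC7'], 3)
print(describe(stacked_nfd))   # (['U+0065', 'U+0323', 'U+0302'], 5)

# NFC recombines the sequence whenever a precomposed form exists.
print(unicodedata.normalize("NFC", stacked_nfd) == stacked)  # True
```

Both spellings render identically; normalization (NFC or NFD) is what lets software treat them as the same text.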
Perhaps this is just my ignorance talking, but it can't be that many permutations, can it? Somebody linked to https://en.wikipedia.org/wiki/Zalgo_text, which I doubt anyone would seriously want to enable. There's, what, maybe 3-4 marks typically added to chars in the most complex of cases, mostly for vowels, like Vietnamese. With 4 billion code points to work with, that seems doable. We could just throw in all permutations, regardless of past utility, to accommodate future expansions of acceptable marks. Chinese has, what, 10K chars? It doesn't seem like a big deal for Latin-based chars to have a similar set size when accounting for all mark variations.
> but means all of your documents require substantially more storage than they used to.
Good point! But that comes down to a trade-off analysis between design and space. High 32-bit code point values are meant to be used too, and not shied away from.
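For a rough sense of scale on both the permutation count and the storage cost, here is a back-of-the-envelope Python sketch. It is only an upper bound under assumptions of mine: it counts every ordered stack of up to 4 marks (the depth mentioned above), allows repeats, and ignores which stacks any language actually uses. The function names are hypothetical. One fact worth keeping in mind when reading the result: the Unicode code space runs from U+0000 to U+10FFFF, which is 1,114,112 code points.

```python
import unicodedata

CODE_SPACE = 0x110000  # Unicode code points run from U+0000 to U+10FFFF

# Combining marks known to this Python's Unicode database
# (general categories Mn, Mc, Me). The total depends on the Python version.
marks = [cp for cp in range(CODE_SPACE)
         if unicodedata.category(chr(cp)) in ("Mn", "Mc", "Me")]

# The "Combining Diacritical Marks" block, the one most relevant to Latin text.
latin_marks = [cp for cp in marks if 0x0300 <= cp <= 0x036F]

def stacked_sequences(num_marks, max_depth):
    """Ordered sequences of 1..max_depth marks, repeats allowed (an upper bound)."""
    return sum(num_marks ** k for k in range(1, max_depth + 1))

def utf8_len(cp):
    """Bytes needed to store a single code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))

per_base = stacked_sequences(len(latin_marks), 4)
print(len(marks), len(latin_marks))
print(per_base, per_base > CODE_SPACE)

# Storage side of the trade-off: higher code points cost more bytes each.
for cp in (0x75, 0xFC, 0x1EC7, 0x10FFFF):
    print(f"U+{cp:04X}", utf8_len(cp))
```

Restricting the count to stacks that are actually attested would shrink it enormously, which is exactly the linguist-and-classicist work described in the parent comment.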