|
Well, in theory it's infinite, but in reality of course it's not. We've got about 150K assigned codepoints, leaving us with about 950K unassigned ones. That's a truly massive amount of headroom. To be honest I think this argument is rather too abstract to be of any real use: if it's a theoretical problem that will never occur in reality then all I can say is: <shrug-emoji>. But like I said: I'm not "against" combining marks; purely in principle they're probably better. I'm mostly against two systems co-existing. In reality it's too late to change the world to decomposed (for Latin, Cyrillic, some others) because most text is already pre-composed, so we should go all-in on pre-composed for those. With our 950K unassigned codepoints we've got space for literally thousands of years to come. Also, this is a problem that's inherent in computers: on paper you can write anything, but computers necessarily restrict that creativity. If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used, never mind entirely new characters or marks. Unicode won't add it until it sees usage, so this gives us a bit of a catch-22, with the only option being mucking about with special fonts that use private-use codepoints (and hoping they won't conflict with something else). |
Unicode can't get rid of the many precomposed characters for a huge number of backward-compatibility reasons (including compatibility with ancient mainframe encodings such as EBCDIC, which existed before computer fonts had ligature support), but they've certainly done what they can to suggest that the "normal" forms in this decade should "prefer" the decomposed combinations.
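To make the two co-existing representations concrete, here's a minimal sketch using Python's standard `unicodedata` module. The choice of "é" as the example character is mine, not anything from the thread; it just shows that the precomposed and decomposed spellings are different code point sequences that normalization converts between.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFD splits precomposed characters; NFC recombines them where possible.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(c):04X}" for c in nfd])   # ['U+0065', 'U+0301']
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+00E9']

# Without normalizing first, the two spellings do not compare equal.
print(precomposed == decomposed)          # False
```

That last comparison is the practical cost of having both systems around: anything that compares or searches text has to pick a normalization form first.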
> If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used
This is where emoji as a living language actually shines as a living example: it's certainly possible to encode your mark today as a ZWJ sequence, say «e ZWJ %», though for further disambiguation/intent-marking you might want to consider adding a non-emoji variation selector such as Variation Selector 1 (U+FE00) to mark it as "Basic Latin"-like or "Mathematical Symbol"-like. You can probably get away with prototyping that in a font stack of your choosing using simple ligature tools (no need for private-use encodings). A ZWJ sequence like that in theory doesn't even "need" to ever be standardized in Unicode if you are okay with the visual fallback to something like "e%" in fonts following the standard Unicode fallback (and maybe with a lot of applications being confused by the non-recommended grapheme cluster). That said, because of emoji, the process for filing new proposals for "Recommended ZWJ Sequences" is among the simplest Unicode proposals you can make. It's not quite as much of a Catch-22 on "needs to have seen enough usage in written documents" as some of the other encoding proposals.
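As a rough sketch of what that sequence looks like at the code point level, something like the following Python would build it. The helper names `make_sequence` and `fallback_text` are made up for illustration, and putting the variation selector after the "%" is my assumption; nothing in Unicode defines this particular sequence.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS1 = "\ufe00"   # VARIATION SELECTOR-1

def make_sequence(base: str, mark: str, selector: str = "") -> str:
    """Hypothetical helper: join base and mark with ZWJ, optionally tagged."""
    return base + ZWJ + mark + selector

def fallback_text(seq: str) -> str:
    """Hypothetical helper: what remains if ZWJ and VS1 are simply ignored."""
    return "".join(ch for ch in seq if ch not in (ZWJ, VS1))

seq = make_sequence("e", "%")
seq_vs = make_sequence("e", "%", VS1)

# List the code points in the tagged sequence with their Unicode names.
for ch in seq_vs:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

print(fallback_text(seq_vs))   # e%
```

A font that wants to render the mark would need a ligature rule matching exactly this code point sequence; everything else falls back to showing "e%".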
Of course, all of that is theory, and practice is always weirder and harder than theory. Unicode encoding truly living languages like emoji is a blessing, and it does enable language "creativity" that was missing for a couple of decades in Unicode processes and thinking.