by WorldMaker
987 days ago
The Unicode committees have addressed this for scripts such as Latin, Cyrillic, and others, and stated outright that decomposed forms should be preferred and that the decomposed canonical forms are generally the safest for interoperability and for operations such as collation (sorting) and case folding (mapping text to a single case for comparison). Unicode can't get rid of the many precomposed characters for a huge number of backward-compatibility reasons (including compatibility with ancient mainframe encodings such as EBCDIC, which existed before computer fonts had ligature support), but they've certainly done what they can to suggest that the "normal" forms in this decade should "prefer" the decomposed combinations.
> If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used
This is where emoji as a living language actually shines as a living example: it's certainly possible to encode your mark today as a ZWJ sequence, say «e ZWJ %», though for further disambiguation/intent-marking you might want to consider adding a non-emoji variation selector such as Variation Selector 1 (U+FE00) to mark it as "Basic Latin"-like or "Mathematical Symbol"-like. You can probably get away with prototyping that in a font stack of your choosing using simple ligature tools (no need for private-use encodings). A ZWJ sequence like that in theory doesn't even "need" to ever be standardized in Unicode if you are okay with the visual fallback to something like "e%" in fonts following the Unicode standard fallback (and maybe with a lot of applications being confused by the non-recommended grapheme cluster). That said, because of emoji, the process for filing new proposals for "Recommended ZWJ Sequences" is among the simplest Unicode proposals you can make. It's not quite as much of a Catch-22 on "needs to have seen enough usage in written documents" as some of the other encoding proposals.
Of course, all of that is theory, and practice is always weirder and harder than theory. Unicode encoding truly living languages like emoji is a blessing, and it does enable language "creativity" that was missing for a couple of decades in Unicode processes and thinking.
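For anyone who wants to poke at the precomposed vs. decomposed distinction, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `canonically_equal` is made up for illustration, and "é" is just a convenient example character, not anything specific to the discussion above.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFD splits the precomposed character into base + combining mark;
# NFC goes the other way and recombines where a precomposed form exists.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print([hex(ord(c)) for c in nfd])                  # code points after decomposition
print(precomposed == decomposed)                   # raw comparison of the two spellings
print(canonically_equal(precomposed, decomposed))  # comparison after normalization
```

The point of normalizing before comparing is that the two spellings are different code point sequences, so a plain string comparison treats them as different text even though they render the same.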
|
Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) use precomposed forms. Also, AFAIK just about everyone ignores that recommendation.
This is a classic "reality should adjust to the standard" type of thinking. Previous comments about that: https://news.ycombinator.com/item?id=36984331
I suppose "e ZWJ %" is a bit better than Private Use as it will appear as "e%" if you don't have font support, but the fundamental problem of "won't work unless you spend effort" remains. For a specific niche (math, language study, something else) that's okay, but for "casual" usage: not so much. "Ship font with the document" like PDF and webfonts do is an option, but also has downsides and won't work in a lot of contexts, and still requires extra effort from the author.
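A rough sketch of what "e ZWJ %" looks like at the code point level, again in Python with only the standard library. The `visual_fallback` helper is hypothetical: it only approximates what a renderer without a matching ligature might show by dropping the invisible joiner, and real output depends on the font and shaping engine.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# The sequence discussed above: base letter, joiner, then the mark.
proposed = "e" + ZWJ + "%"


def visual_fallback(text: str) -> str:
    """Hypothetical helper: approximate rendering without ligature support
    by removing the zero-width joiner, which has no visible glyph."""
    return text.replace(ZWJ, "")


print([unicodedata.name(c) for c in proposed])  # names of the three code points
print(len(proposed))                            # number of code points in the sequence
print(visual_fallback(proposed))                # approximate no-font-support rendering

# Normalization has nothing to compose here, so the sequence passes through NFC as-is.
print(unicodedata.normalize("NFC", proposed) == proposed)
```

This also illustrates the cost mentioned above: software that counts code points or compares strings naively sees three code points rather than one "letter", unless it is taught about the sequence.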
I'm not saying it's completely impossible, but it's certainly harder than it used to be, arguably much harder. I could coin a new word right here and now (although my imagination is failing me to provide a humorous example at this moment) and if people like it, it will see usage. In a 1960s HN, where we would have exchanged these things over written letters, it would have been trivial to propose an "e with % on top" too, but now we need to resort to clunky phrases like this (even on a typewriter you can manually amend things, if you really want to).
Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today. Granted, it doesn't see that much use, but I do encounter it in the wild on occasion and some people like it (I personally don't actually, but I don't want to prevent other people from using it).
None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.