| > The Unicode committees have addressed this for languages such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) use precomposed forms. Also AFAIK just about everyone just ignores that recommendation. This is a classic "reality should adjust to the standard" type of thinking. Previous comments about that: https://news.ycombinator.com/item?id=36984331 I suppose "e ZWJ %" is a bit better than Private Use as it will appear as "e%" if you don't have font support, but the fundamental problem of "won't work unless you spend effort" remains. For a specific niche (math, language study, something else) that's okay, but for "casual" usage: not so much. "Ship font with the document" like PDF and webfonts do is an option, but also has downsides and won't work in a lot of contexts, and still requires extra effort from the author. I'm not saying it's completely impossible, but certainly harder than it used to be, arguably much harder. I could coin a new word right here and now (although my imagination is failing me to provide a humorous example at this moment) and if people like it, it will see usage. In a 1960s HN, where we would have exchanged these things over written letters, it would have been trivial to propose an "e with % on top" too, but now we need to resort to clunky phrases like this (even for typewriters you can manually amend things, if you really wanted to). Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today. Granted, it doesn't see that much use, but I do encounter it in the wild on occasion and some people like it (I personally don't actually, but I don't want to prevent other people from using it). None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers. |
It shouldn't matter what's in the wild in documents. That's why we have normalization algorithms and normalization forms. Unicode was built for the ugly reality of backwards compatibility and that you can't control how people in the past wrote. These precomposed characters largely predate Unicode and were a problem before Unicode. Unicode won in part because it met other encodings where they were rather than where they wished they would be. It made sure that mappings from older encodings could be (mostly) one-to-one with respect to code points in the original. It didn't quite achieve that in some cases, but it did for, say, all of EBCDIC.
Unicode was never in a position to fix the past; it had to live with it.
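To make the "one-to-one with older encodings" point concrete, here is a rough sketch of my own (not anything from the Unicode docs). It assumes Python's bundled codecs are a fair stand-in for the official mapping tables, that `cp037` (Python's name for one EBCDIC code page) is representative, and `round_trips` is just a helper name I made up. It checks whether every byte value survives a trip into Unicode and back:

```python
def round_trips(codec: str) -> bool:
    """True if every byte 0x00-0xFF survives bytes -> str -> bytes unchanged."""
    original = bytes(range(256))
    try:
        return original.decode(codec).encode(codec) == original
    except UnicodeError:
        # Some byte has no mapping in this codec, so the table is not total.
        return False


if __name__ == "__main__":
    # latin-1 and an EBCDIC code page versus cp1252, which has unassigned bytes.
    for name in ("latin-1", "cp037", "cp1252"):
        print(name, round_trips(name))
```

The `cp1252` case is the "(mostly)" part: a few byte values there have no assigned character, so a full round trip isn't possible for them.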
> This is a classic "reality should adjust to the standard" type of thinking.
Not really. The Unicode standard defines the normal/canonical forms and provides very well-documented algorithms (including directly in source code in the Unicode committee-maintained/approved ICU libraries) to take everything seen in the wilds of reality and convert it to a normal form. It's not asking reality to adjust to the standard; it is asking developers to adjust to the algorithms for cleanly dealing with the ugly reality.
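For anyone who hasn't used them, this is roughly what that looks like in practice. A minimal sketch using Python's standard `unicodedata` module rather than ICU; `same_text` is a hypothetical helper name of mine:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The raw strings differ, but they are canonically equivalent.
print(precomposed == decomposed)           # False
print(same_text(precomposed, decomposed))  # True

# NFD decomposes, NFC recomposes; either works as long as you pick one.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Whatever form a document in the wild happens to use, the comparison comes out the same once both sides are normalized.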
> Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today.
The well-documented proposal process for getting common and somewhat less common power symbols encoded in Unicode, from start to finish (it succeeded), has been posted to HN several times. It's a committee process. It certainly takes committee time. But it isn't "impossible" to navigate, and the odds are certainly higher than "little chance" if you've got the gumption to document what you want to see encoded and push the proposal through the committee process.
Certainly the Unicode committee picked up a reputation for being hard to work with in the early oughts when the consortium was still fighting the internal battles over UCS-2 being "good enough" and had concerns about opening the "Astral Plane". Now that the astral plane is open and UTF-16 exists, the committee's attitude is considered to be much better, even if its reputation hasn't yet shifted from those bad old days.
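For context on why UCS-2 wasn't "good enough": a fixed 16-bit unit can only address the first 65,536 code points, and UTF-16 reaches the rest by spending two units (a surrogate pair) on each. A small sketch of that arithmetic, with helper names that are my own and an emoji chosen as an arbitrary astral-plane example:

```python
from typing import List, Tuple


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000            # 20 bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_units(text: str) -> List[int]:
    """Return the 16-bit code units Python's own UTF-16 encoder produces."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u23fb"         # a power symbol, inside the 16-bit range
astral_char = "\U0001F600"  # an emoji, outside the 16-bit range

print(len(utf16_units(bmp_char)))     # 1 code unit
print(len(utf16_units(astral_char)))  # 2 code units
print([hex(u) for u in surrogate_pair(ord(astral_char))])
```

The hand-rolled `surrogate_pair` and the built-in encoder should agree, which is a decent sanity check that the arithmetic is right.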
> None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.
Computers do anything we program them to do, and in general people find a way regardless of the restrictions and creative limitations that get programmed in. I've seen symbols drawn in MS Paint and embedded in Word documents because the author couldn't find the symbol they needed or it didn't quite exist. It's hard to apply that kind of creative problem solving in HN's text boxes, but from some viewpoints that is just as much a deficiency in HN's design. It's not an "inherent" problem of computers. When it is a problem, they pay us software developers to fix it. (If we need to fix it by writing a proposal to a standards committee such as the Unicode Consortium, that is in our power and one of our rights as developers. Standards don't just bind in one direction; they also form an agreement of cooperation in the other.)