| Unicode was originally designed to fit in 16 bits, and this is memorialized in Java APIs that make it easy to mess up. A Unicode character does not specify the glyph to draw. Han unification is the best-known, but not the only, source of this problem. Conversely, the glyph does not specify the Unicode character. Precomposed versus combining characters are one source of this problem. The result is that a name can be entered into a database and then not be found by a search. This feature has also been a source of security holes. See https://appcheck-ng.com/unicode-normalization-vulnerabilitie... for an explanation of how. You would think that you could avoid this by banning control and combining characters, and not lose anything. Indeed, at one point the authors of Go (who included the inventors of UTF-8) thought this. But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters. There are also lots and lots of invisible characters. These have been used to "fingerprint" text. (Each person gets a different invisible signature. The forwarded email includes the signature.) That's an interesting feature, but it complicates matching text documents even more. Need I go on? When I see Unicode, I know that there lie dragons that programmers don't necessarily expect. |
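For anyone who wants to poke at the points above, here is a minimal Python sketch. The helper names (`same_text`, `utf16_units`) and the sample strings are made up for illustration. It only covers three of the issues mentioned: precomposed vs combining spellings, the 16-bit code unit count that Java's `String.length()` reports, and a script where combining marks cannot be normalized away.

```python
import unicodedata

# Two spellings of the same visible word: precomposed U+00E9
# versus "e" followed by the combining acute accent U+0301.
precomposed = "caf\u00e9"
combining = "cafe\u0301"


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def utf16_units(s: str) -> int:
    """Count 16-bit code units, which is what Java's String.length() counts."""
    return len(s.encode("utf-16-le")) // 2


# A raw comparison fails even though both strings render the same.
raw_equal = precomposed == combining

# One code point outside the 16-bit range takes two UTF-16 code units.
emoji = "\U0001F600"

# Devanagari KA + VOWEL SIGN I: there is no precomposed form,
# so NFC leaves the combining mark in place.
ki = "\u0915\u093f"
ki_unchanged = unicodedata.normalize("NFC", ki) == ki
ki_has_mark = unicodedata.category(ki[1]).startswith("M")

print(raw_equal, same_text(precomposed, combining))
print(len(emoji), utf16_units(emoji))
print(ki_unchanged, ki_has_mark)
```

Normalizing on both write and search is one common way to avoid the unfindable-name problem, but it is only a sketch of the idea: it does nothing about lookalike glyphs or invisible characters, and NFC vs NFKC is a separate decision with its own trade-offs.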
Another one is that a standard designed to support all languages has a feature necessary for supporting some languages.
Those aren't inconsistencies, so do feel free to go on.