| Combining characters have their issues (https://en.wikipedia.org/wiki/Zalgo_text), but making string reversal trickier isn’t one of them. “Reversing” is an extremely atypical thing to do with text; I think only programming exercises and palindrome searchers do it. Why would your data structure make that easy to do? For Unicode, a “design from scratch” approach would remove duplicate legacy code points. Why have “é” both as a single code point and as “e” plus a combining character? It also wouldn’t have any of the deprecated characters (https://en.wikipedia.org/wiki/Unicode_character_property#Dep...). I would also remove the few special flag code points (https://home.unicode.org/the-past-and-future-of-flag-emoji/). If “design from scratch” also means “drop the goal of encompassing old character encodings”, more code points could probably go. Why are DOS box characters in Unicode while the Atari, PET, etc. ones aren’t, for example? Finally, I would look into making it easier to retrieve the character class from a code point (the “these code points are digits, these are combining marks, etc.” tables are a bit of a wart, and getting rid of them could be useful in small embedded devices). I doubt a solution exists there that is future-proof against extensions of Unicode and doesn’t blow up memory use, though, and I am not sure any embedded device too small to host those tables could actually use that info. |
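To make the “é” duplication and the character-class tables concrete, here is a minimal Go sketch using only the standard `unicode` package. The `classify` helper is hypothetical (made up for illustration), and its four coarse classes are an assumption, not Unicode's full set of general categories.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper: it reports a coarse class for a code
// point by consulting the range tables shipped in the standard library.
func classify(r rune) string {
	switch {
	case unicode.IsDigit(r):
		return "digit"
	case unicode.IsMark(r):
		return "mark"
	case unicode.IsLetter(r):
		return "letter"
	default:
		return "other"
	}
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	decomposed := "e\u0301" // e followed by U+0301 COMBINING ACUTE ACCENT

	// Both render as "é", yet they are different code point sequences.
	fmt.Println(precomposed == decomposed) // false

	for _, r := range decomposed {
		fmt.Printf("%U %s\n", r, classify(r))
	}

	// Digits are not limited to ASCII: U+0663 is ARABIC-INDIC DIGIT THREE.
	fmt.Println(classify('\u0663'))
}
```

Each of those predicates is backed by a lookup table, which is the memory cost the comment is worried about on small devices.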
Reversing a string is merely a symptom of the problem. There are many cases where you need to operate on graphemes instead of code points: for example, deleting the previous grapheme in a text editor when the user presses backspace/delete. I think most programmers assume they're dealing with graphemes when they're actually dealing with code points. See, for example, the rune type in Go, which holds a single code point rather than a grapheme.