Hacker News
by ts4z 1596 days ago
I know you're kidding, but I want to note that UTF-256 isn't enough. There's an Arabic ligature (U+FDFA) whose compatibility decomposition is 18 code points. That was already in Unicode 20 years ago. You can probably do something even crazier with the family emoji. These make "single characters" that do not have precomposed forms.
1 comments

Also, if you want O(1) indexing by grapheme cluster you can get that with less memory overhead by precomputing a lookup table of the location in the string where you can find every k-th grapheme cluster, for some constant k >= 1. (This requires a single O(n) pass through the string to build the index, but you were always going to have to make at least one such pass through the string for other reasons.)
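
A rough sketch of that lookup table in Python, in case it helps. The names `cluster_starts` and `GraphemeIndex` are made up for illustration, and the boundary rule is a deliberate simplification (combining marks and ZWJ sequences only), not the full UAX #29 algorithm; a real implementation would swap in a proper segmenter. Lookup walks forward at most k - 1 clusters from the nearest checkpoint, so it is constant-time only to the extent that cluster length is bounded.

```python
import unicodedata
from itertools import islice

ZWJ = "\u200d"
# Assumption: treat all mark categories as "extends the previous cluster".
# This approximates UAX #29 and misses cases such as emoji skin-tone modifiers.
EXTEND_CATEGORIES = {"Mn", "Me", "Mc"}


def cluster_starts(s, start=0):
    """Yield code point offsets where an (approximate) cluster begins.

    `start` must itself be a cluster boundary.
    """
    prev_zwj = False
    for i in range(start, len(s)):
        ch = s[i]
        extends = ch == ZWJ or unicodedata.category(ch) in EXTEND_CATEGORIES
        if i == start or not (extends or prev_zwj):
            yield i
        prev_zwj = ch == ZWJ


class GraphemeIndex:
    """Stores the offset of every k-th cluster, built in one O(n) pass."""

    def __init__(self, s, k=4):
        self.s = s
        self.k = k
        self.checkpoints = []
        self.count = 0
        for n, offset in enumerate(cluster_starts(s)):
            if n % k == 0:
                self.checkpoints.append(offset)
            self.count = n + 1

    def __len__(self):
        return self.count

    def __getitem__(self, n):
        if not 0 <= n < self.count:
            raise IndexError(n)
        # Jump to the nearest checkpoint at or before cluster n,
        # then skip the remaining n % k clusters by scanning forward.
        it = cluster_starts(self.s, self.checkpoints[n // self.k])
        begin = next(islice(it, n % self.k, None))
        end = next(it, len(self.s))
        return self.s[begin:end]


# "e" + combining acute, "a", a ZWJ family emoji, "b", "c"
text = "e\u0301a\U0001F468\u200d\U0001F469\u200d\U0001F467bc"
index = GraphemeIndex(text, k=2)
family = index[2]
```

Larger k means a smaller table (about n / k offsets) at the cost of a longer forward scan per lookup; k = 1 degenerates to storing every cluster offset.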