| Rust is probably one of the languages which does this crap best, and that's thanks to static typing and deciding to not decide: 1. it has proper, validated Unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal) 2. it has proper bytes, entirely separate from strings 3. it has "the OS layer is a giant pile of shit" OsString, because file paths might be random bags of bytes (UNIX) or random bags of 16-bit values (and possibly some other hare-brained scheme on other platforms, but I don't believe Rust supports other OsStrings currently) 4. and it has nul-terminated bag o'bytes CString. For the latter two, conversion to a "proper" language string is explicitly known to be lossy, and the developer has to decide what to do in that case for their application. |
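For anyone who hasn't used these types, here is a minimal sketch of the four kinds of string from the quote, using only the standard library. The helper names `os_to_display` and `bytes_to_string` are made up for illustration; the point is that every lossy or fallible step shows up in the types or the method names.

```rust
use std::ffi::{CString, OsString};

/// Convert an OS-level string to a `String`, replacing anything that is not
/// valid Unicode with U+FFFD. The loss is explicit in the method name.
/// (Hypothetical helper, not a std function.)
fn os_to_display(raw: &OsString) -> String {
    raw.to_string_lossy().into_owned()
}

/// Decode arbitrary bytes as UTF-8, returning `None` instead of guessing.
/// (Hypothetical helper, not a std function.)
fn bytes_to_string(raw: Vec<u8>) -> Option<String> {
    String::from_utf8(raw).ok()
}

fn main() {
    // 1. Validated Unicode text.
    let text: String = String::from("naïve");
    // 2. Plain bytes, with no claim about encoding (0xff never occurs in UTF-8).
    let bytes: Vec<u8> = vec![0x66, 0x6f, 0xff];
    // 3. Whatever the OS hands back for paths, arguments, and env vars.
    let os: OsString = OsString::from("report.txt");
    // 4. NUL-terminated bytes for C APIs; interior NULs are rejected up front.
    let c = CString::new("hello").expect("no interior NUL");

    println!("{text}");
    println!("{:?}", bytes_to_string(bytes));
    println!("{}", os_to_display(&os));
    println!("{:?}", c.to_str());
    println!("{}", CString::new("a\0b").is_err());
}
```

Whether to reject, replace, or pass through the bad input is left to the caller, which is the "deciding to not decide" part.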
Grapheme clusters are overrated in their importance for processing. The list of times you want to iterate over grapheme clusters:
1. You want to figure out where to position the cursor when you hit left or right.
2. You want to reverse a string. (When was the last time you wanted to do that?)
The list of times when you want to iterate over Unicode codepoints:
1. When you're implementing collation, grapheme cluster searching, case modification, normalization, line breaking, word breaking, or any other Unicode algorithm.
2. When you're trying to break text into separate RFC 2047 encoded-words.
3. When you're trying to select the fonts to display a Unicode string.
4. When you're trying to convert between charsets.
Cases where neither is appropriate:
1. When you want to break text into separate lines on the screen.
2. When you want to implement basic hashing/equality checks.
(I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently").
Grapheme clusters are relatively expensive to compute, and their utility is very circumscribed. Iterating over Unicode codepoints is much more useful and foundational, and yet still very cheap.
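To make the distinction concrete, here is a small sketch in Rust using only the standard library. `reverse_code_points` is a name I made up for illustration. `str::chars()` iterates over codepoints by decoding UTF-8 as it goes, with no Unicode tables involved. Grapheme segmentation is not in the standard library at all, and as far as I know people usually pull in a crate such as `unicode-segmentation` for it. The reversal case is the one place in the lists above where going codepoint by codepoint visibly goes wrong:

```rust
/// Reverse a string one codepoint at a time (deliberately not grapheme-aware).
/// Hypothetical helper for illustration.
fn reverse_code_points(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster,
    // two codepoints, three UTF-8 bytes.
    let s = "e\u{301}";
    println!("bytes: {}", s.len());
    println!("codepoints: {}", s.chars().count());

    // Reversing by codepoint moves the combining accent in front of the "e",
    // so it no longer attaches to that letter.
    let reversed = reverse_code_points(s);
    println!("{:?}", reversed);
}
```

Everything else in the codepoint list (normalization, case mapping, charset conversion) consumes exactly what `chars()` yields, so the cheap iterator is the one you end up reaching for most of the time.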