> 1. it has proper, validated unicode strings (though the stdlib is not grapheme-aware so manipulating these strings is not ideal)

Grapheme clusters are overrated in their importance for processing.

The list of times you want to iterate over grapheme clusters:

1. You want to figure out where to position the cursor when you hit left or right.
2. You want to reverse a string. (When was the last time you wanted to do that?)

The list of times when you want to iterate over Unicode codepoints:

1. When you're implementing collation, grapheme cluster searching, case modification, normalization, line breaking, word breaking, or any other Unicode algorithm.
2. When you're trying to break text into separate RFC 2047 encoded-words.
3. When you're trying to display the fonts for a Unicode string.
4. When you're trying to convert between charsets.

Cases where neither is appropriate:

1. When you want to break text to separate lines on the screen.
2. When you want to implement basic hashing/equality checks.

(I'm not sure where "cut the string down to 5 characters because we're out of display room" falls in this list. I suspect the actual answer is "wrong question, think about the problem differently".)

Grapheme clusters are relatively expensive to compute, and their utility is very circumscribed. Iterating over Unicode codepoints is much more useful and foundational, and yet still very cheap.
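To make the string-reversal case concrete, here is a minimal Python sketch. The helper `approx_clusters` is hypothetical and only approximates grapheme clusters by attaching combining marks to the preceding code point; it is not the full UAX #29 segmentation algorithm and will mishandle things like emoji ZWJ sequences or Hangul jamo.

```python
import unicodedata

def approx_clusters(text):
    """Group each code point with the combining marks that follow it.

    Assumption: this is a rough stand-in for grapheme clusters,
    not an implementation of UAX #29.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class means "attach to the previous cluster".
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT, then 'x'

# Reversing by code point moves the accent onto the 'x'.
by_code_point = s[::-1]

# Reversing by (approximate) cluster keeps the accent on the 'e'.
by_cluster = "".join(reversed(approx_clusters(s)))

print(ascii(by_code_point))
print(ascii(by_cluster))
```

Iterating over code points is a plain loop over the string, while even this crude clustering needs a Unicode property lookup per code point, which hints at the cost difference mentioned above.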
> 1. You want to figure out where to position the cursor when you hit left or right.
> 2. You want to reverse a string. (When was the last time you wanted to do that?)
You missed the big one:
3. You want to determine the logical (and often visual) length of a string.
Sure, there are some languages where logical length is less meaningful as a concept, but there are many, many languages in which it's a useful concept, and it can only be derived easily by iterating over grapheme clusters.
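As a rough illustration of why these lengths differ, the Python sketch below compares byte length, code point length, and an approximate "logical" length for one string. The function `approx_logical_length` is hypothetical: it just skips code points with a nonzero canonical combining class, which is an assumption that works for simple accented Latin text but is not real grapheme cluster segmentation.

```python
import unicodedata

def approx_logical_length(text):
    """Count code points that are not combining marks.

    Assumption: a crude approximation of grapheme cluster count,
    not the UAX #29 algorithm.
    """
    return sum(1 for ch in text if not unicodedata.combining(ch))

word = "cafe\u0301"  # "café" written with a combining acute accent

byte_length = len(word.encode("utf-8"))       # UTF-8 code units
code_point_length = len(word)                 # Unicode code points
logical_length = approx_logical_length(word)  # what a reader would count

print(byte_length, code_point_length, logical_length)
```

A reader sees four characters, yet neither the byte count nor the code point count gives that number for this string.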