|
There are multiple ways of counting "length" of a string. Number of UTF-8 bytes, number of UTF-16 code units, number of codepoints, number of grapheme clusters. These are all distinct yet valid concepts of "length." For the purpose of allocating buffers, I can see the obvious use in knowing number of bytes, UTF-16 code units, or the number of codepoints. I also see the use in being able to iterate through grapheme clusters, for instance for rendering a fragment of text, or for parsing. Perhaps someone can shed light on a compelling use case for knowing the number of grapheme clusters in a particular string, because I haven't been able to think of one. I'm not sure about calculating password lengths: if the point is entropy, the number of bytes seems good enough to me! The password field bug is possibly compelling, but I don't think it's obvious what a password field should do. Should it represent keystrokes? Codepoints? Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font rendering? (Similarly, perhaps someone could explain why they think reversing a string should be a sensible operation. That this is hard to do is something I occasionally hear echoing around the internet. The best I've heard is that you can reuse the default forward lexicographic ordering on reversed strings for a use I've forgotten.) |
If you have a limit on the length of a field, it helps to tell the user what it is in a way they understand. For non-technical users, bytes (and the embedded issue of encoding) and code points are both pretty esoteric, but number of symbols is less so. OTOH, SMS has strict data and encoding limits, and people managed with that; also provisioning byte storage for grapheme limited fields is hard: some graphemes use a ton of code points, family emoji and zalgo text are clear examples.