| https://pastebin.com/raw/D7p7mRLK My comment in a pastebin. HN doesn't like unicode. You need this crate to deal with it in Rust, it's not part of the base libraries: https://crates.io/crates/unicode-segmentation The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator. By the way, Python, the language caused so much hurt with their 2.x->3.x transition promising better unicode support in return for this pain couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries included bit. >>> test = " " >>> [char for char in test] ['', '\u200d', '', '\u200d', '', '\u200d', ''] >>> len(test) 7 In JavaScript REPL (nodejs): > let test = " " undefined > [...new Intl.Segmenter().segment(test)][0].segment; ' ' > [...new Intl.Segmenter().segment(test)].length; 1 Works as it should. In python you would need a third party library. Swift is truly the nicest of programming languages as far as strings are concerned. It just works as it always should have been. let test = " " for char in test { print(char)
}print(test.count) output : 1 [Execution complete with exit code 0] I, as a non-Apple user, feel quite the Apple envy whenever I think about swift. It's such a nice language, but there's little ecosystem outside of Apple UIs. But man, no using third party libraries, or working with a wrapper segmenter class or iterator. Just use the base string literals as is. It. Just. Works. |
I understand how iterating through a string by grapheme clusters is convenient for some applications. But it’s far from obvious to me that doing so should be the language’s default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should rust statically link that library into every app that uses it?)
Generally there are 3 ways to iterate a string: by UTF8 bytes (or ucs2 code points like Java/js/c#), by Unicode codepoint or by grapheme clusters. UTF8 encoding comes up all the time when encoding / decoding strings - like, to json or when sending content over http. Codepoints are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces - like when building a terminal.
Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.