by moosingin3space 2716 days ago
In Rust, you're supposed to use `unicode-segmentation`[1] if you need to split on logical characters (grapheme clusters in the Unicode standard). Otherwise, the `.bytes` iterator emits raw bytes, and `.chars` emits Unicode scalar values (code points decoded from the UTF-8).

Basically, string indexing is a lot harder than it seems at first glance, depending on what you want.
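
To make the bytes-versus-chars difference concrete, here is a minimal sketch using only the standard library. The sample string and the helper name `byte_and_char_counts` are made up for illustration; it deliberately does not cover grapheme clusters, which need the `unicode-segmentation` crate mentioned above.

```rust
/// Returns (byte count, char count) for a string slice.
/// Hypothetical helper, only here for illustration.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.bytes().count(), s.chars().count())
}

fn main() {
    // "\u{e9}" is the precomposed "é", which takes 2 bytes in UTF-8.
    let s = "h\u{e9}llo";
    let (bytes, chars) = byte_and_char_counts(s);
    println!("{} bytes, {} chars", bytes, chars);

    // Slicing uses byte offsets. `get` returns None instead of panicking
    // when an offset falls inside a multi-byte sequence.
    assert_eq!(s.get(0..1), Some("h"));
    assert_eq!(s.get(0..2), None); // offset 2 is in the middle of "é"
    assert_eq!(s.get(0..3), Some("h\u{e9}"));
}
```

The same string gives different answers depending on which unit you index by, which is why `&str` has no plain integer indexing at all.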

1 comments

One nitpick: `.chars`[1] gives you an iterator[2] of `char`s[3], each of which is always a 4-byte representation of a valid Unicode scalar value. This means that `"asdf".chars().collect::<Vec<char>>()` will take up more memory (16 bytes of elements) than `"asdf"` and `"asdf".chars().as_str()` (4 bytes of UTF-8). `.chars()` will never give you an incomplete codepoint, but it will give you incomplete characters, as you could have many c̶̼̟̏ó̷̘̉n̴̖̞̏̇t̸̡̃ĭ̸̻̬n̴̯͉̂͑ṵ̴̑a̷̛̫̳ẗ̸͕́i̷̱̫̓̋ǫ̸̑ǹ̶̼̅s̸̩̾̌ (combining marks) to represent what visually is a single char.

[1]: https://doc.rust-lang.org/std/string/struct.String.html#meth...

[2]: https://doc.rust-lang.org/std/str/struct.Chars.html

[3]: https://doc.rust-lang.org/std/primitive.char.html
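
A rough sketch of both points, standard library only. The helper `vec_char_bytes` is a made-up name, and it counts only the element storage of the `Vec<char>`, not the `Vec` header or any spare capacity.

```rust
use std::mem::size_of;

/// Bytes used by the elements of a Vec<char> built from `s`.
/// Hypothetical helper, only here for illustration.
fn vec_char_bytes(s: &str) -> usize {
    let v: Vec<char> = s.chars().collect();
    v.len() * size_of::<char>()
}

fn main() {
    // A `char` is always 4 bytes, while the UTF-8 in a &str is variable width.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!("asdf".len(), 4);
    assert_eq!(vec_char_bytes("asdf"), 16);
    // `Chars::as_str` borrows the original UTF-8 data, so nothing grows.
    assert_eq!("asdf".chars().as_str().len(), 4);

    // "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one "é",
    // but `.chars()` yields two complete code points.
    let decomposed = "e\u{301}";
    assert_eq!(decomposed.chars().count(), 2);
    assert_eq!(decomposed.len(), 3); // 1 byte for "e" + 2 bytes for U+0301
}
```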

> visually are a single char

IIRC that's what grapheme clusters are for.