by guitarbill 1286 days ago
Unicode/text is complicated, and there's a lot of terminology. Describing Rust strings as "a series of grapheme clusters" is misleading: `chars()` iterates over code points (Rust `char`s), not grapheme clusters.

As the docs point out, `&str` and `String` are simply types that either borrow or own some memory (i.e. bytes), and the types/operations guarantee those bytes are valid UTF-8, i.e. a sequence of encoded Unicode code points (loosely "characters"; strictly speaking a Rust `char` is a Unicode scalar value). A code point takes one to four bytes when encoded as UTF-8.
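To make the bytes vs. code points distinction concrete, here's a minimal std-only sketch. The helper name `byte_and_char_counts` is made up for illustration, and the sample string is just one character from each UTF-8 length class:

```rust
/// Hypothetical helper: returns (length in bytes, number of `char`s).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // 'a' (U+0061), 'é' (U+00E9), '€' (U+20AC), '😀' (U+1F600)
    let owned: String = String::from("a\u{e9}\u{20ac}\u{1f600}"); // owns its bytes
    let borrowed: &str = &owned; // borrows the same bytes

    // Each code point encodes to between one and four bytes.
    for c in borrowed.chars() {
        println!("{:?} U+{:04X} -> {} byte(s)", c, c as u32, c.len_utf8());
    }

    let (bytes, chars) = byte_and_char_counts(borrowed);
    println!("{} bytes, {} chars", bytes, chars);

    // Bytes that are not valid UTF-8 are rejected when building a String.
    assert!(String::from_utf8(vec![0xff, 0xfe]).is_err());
}
```

Note that `len()` counts bytes, not characters, which is why it's O(1) while `chars().count()` has to walk the string.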

Grapheme clusters are more complicated. Roughly speaking, they are a collection of code points that better match what humans expect (and depend on the language/script), e.g. `ü` can actually be two code points, `u` + a combining diaeresis (U+0308), and splitting after `u` would be nonsensical. AFAIK, Rust's standard library doesn't really provide a way to deal with grapheme clusters? EDIT: it used to, but it got deprecated and removed [0]
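A rough std-only sketch of the `ü` case, comparing the precomposed form (U+00FC) with `u` followed by the combining diaeresis (U+0308). The helper `code_points` is a made-up name for illustration:

```rust
/// Hypothetical helper: collects the code points (`char`s) of a string.
fn code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let precomposed = "\u{fc}"; // ü as a single code point
    let decomposed = "u\u{308}"; // u + COMBINING DIAERESIS

    // Both usually render the same, but they differ in code points and bytes.
    println!("{}: {:?}, {} bytes", precomposed, code_points(precomposed), precomposed.len());
    println!("{}: {:?}, {} bytes", decomposed, code_points(decomposed), decomposed.len());

    // Plain string comparison works on bytes, so the two forms are not equal.
    assert_ne!(precomposed, decomposed);

    // Cutting after the first `char` splits the grapheme cluster:
    // what's left is a bare "u" with the diaeresis dropped.
    let first: String = decomposed.chars().take(1).collect();
    assert_eq!(first, "u");
}
```

For real grapheme-aware iteration you'd have to reach outside std; the `unicode-segmentation` crate is the one I usually see recommended, but I've left it out here to keep the sketch dependency-free.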

So TL;DR: 1-4 bytes => 1 character, 2+ characters => maybe 1 grapheme cluster. Hope that helps either you, or someone else reading this.

[0] https://github.com/rust-lang/rust/pull/24428