by burntsushi
2716 days ago
> How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.

It's exactly the opposite of what you expect. Getting the nth codepoint is often (not always) semantically incorrect since a codepoint isn't necessarily one character. Multiple codepoints might combine to form one character. (In Unicode, these are called grapheme clusters.)

Byte offsets are used a ton because you might often have the index to a position in the string from some routine, like, say, a search[1].

I've been working on text related things in both Rust and Go for several years. Both languages got this part of their strings exactly right given that their representation in memory is always a sequence of bytes.

[1] - https://doc.rust-lang.org/std/primitive.str.html#method.find
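To make the byte-offset point concrete, here is a minimal sketch using only the standard library. The helper name `after_match` is made up for illustration and does not come from the thread; it just shows a byte offset from `str::find` being fed straight back into a slice, and a single visible character that is made of two codepoints.

```rust
// Hypothetical helper (not from the thread): return the text that follows the
// first occurrence of `needle`, using the byte offset reported by `str::find`.
fn after_match<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    // `find` returns a byte offset that always lies on a char boundary.
    let start = haystack.find(needle)?;
    // Slicing by byte offset is O(1); no need to count codepoints.
    Some(&haystack[start + needle.len()..])
}

fn main() {
    // Non-ASCII text before the match does not matter: the offset is in bytes.
    let s = "na\u{ef}ve caf\u{e9}: ok";
    assert_eq!(after_match(s, ": "), Some("ok"));

    // 'e' followed by U+0301 (combining acute accent) renders as one character
    // but is two codepoints and three bytes in UTF-8.
    let e_acute = "e\u{301}";
    assert_eq!(e_acute.chars().count(), 2);
    assert_eq!(e_acute.len(), 3);
    // Taking "the first codepoint" silently drops the accent.
    assert_eq!(e_acute.chars().nth(0), Some('e'));
}
```

Splitting on grapheme clusters proper needs a segmentation library; the standard library only exposes bytes and codepoints, so the sketch stops there.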
The reason being that the behavior of [] for strings varies widely in different languages, and so this is something that's best made explicit, both to force the author of the code to consider whether their assumptions are valid and reasonable for what they're trying to do, and to give additional context to anyone else reading the code.
As it is, I suspect a common class of bugs for Rust will be with people assuming that [] slices codepoints, because it seems to work that way for ASCII.
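A small sketch of the failure mode described here, assuming plain `&str` slicing with byte ranges. The helper `first_two` is hypothetical and only exists to show the non-panicking `str::get` variant next to `[]`.

```rust
// Hypothetical helper (not from the thread): take the first two *bytes* of a
// string as a slice, returning None when byte 2 is not a char boundary.
fn first_two(s: &str) -> Option<&str> {
    s.get(0..2)
}

fn main() {
    // With ASCII, byte ranges and codepoint ranges coincide, so the
    // "[] slices codepoints" assumption appears to hold.
    let ascii = "hello";
    assert_eq!(&ascii[0..2], "he");
    assert_eq!(first_two(ascii), Some("he"));

    // U+00E9 (e with acute) takes two bytes in UTF-8, occupying bytes 1..3.
    let non_ascii = "h\u{e9}llo";
    assert!(!non_ascii.is_char_boundary(2));
    // `get` reports the bad range instead of panicking.
    assert_eq!(first_two(non_ascii), None);
    assert_eq!(non_ascii.get(0..3), Some("h\u{e9}"));

    // `&non_ascii[0..2]` would panic at runtime here, because the range
    // ends in the middle of a codepoint.
}
```

So the mistaken assumption tends to surface as a panic on the first non-ASCII input rather than as silently wrong output, at least for ranges that land inside a codepoint.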