Hacker News
by da_chicken 2716 days ago
The obvious follow-up question would be: so why is slicing a string a byte-wise operation and not a character-wise operation? If a string is an array of characters, why does it let me refer to individual bytes without explicitly casting it to a byte array? How often, comparatively, do you want the nth byte compared to the nth character? I would suspect that's pretty rare.
3 comments

> How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.

It's exactly the opposite of what you expect. Getting the nth codepoint is often (not always) semantically incorrect since a codepoint isn't necessarily one character. Multiple codepoints might combine to form one character. (In Unicode, these are called grapheme clusters.)
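
To make that concrete, here's a minimal sketch using only the standard library. The helper name `counts` is made up for this example. std has no grapheme iterator, so this only contrasts bytes with codepoints; actual grapheme segmentation needs an external crate.

```rust
/// Returns (length in bytes, number of Unicode scalar values) for `s`.
/// `counts` is a hypothetical helper name used only for illustration.
fn counts(s: &str) -> (usize, usize) {
    // `len` is the UTF-8 byte length; `chars` iterates scalar values.
    (s.len(), s.chars().count())
}
```

`counts("e\u{301}")` is `(3, 2)`: an "e" followed by a combining acute accent is two codepoints and three bytes, yet it renders as the single character "é". The precomposed form `"\u{e9}"` gives `(2, 1)` for what looks like the same text.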

Byte offsets are used a ton because you might often have the index to a position in the string from some routine, like, say, a search[1].
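
A rough sketch of that pattern, with a helper name (`rest_from`) I made up for the example: the offset that `find` hands back is a byte offset, and it can go straight into a slice.

```rust
/// Returns the part of `haystack` that starts at the first match of `needle`.
/// `rest_from` is a hypothetical helper name used only for illustration.
fn rest_from<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    // `find` returns a byte offset, which always lies on a char boundary.
    let start = haystack.find(needle)?;
    // Slicing by that byte offset is O(1) and cannot split a character.
    Some(&haystack[start..])
}
```

For example, `rest_from("naïve café", "café")` is `Some("café")`. The match starts at byte 7 even though it is the 7th character (index 6), because "ï" takes two bytes.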

I've been working on text related things in both Rust and Go for several years. Both languages got this part of their strings exactly right given that their representation in memory is always a sequence of bytes.

[1] - https://doc.rust-lang.org/std/primitive.str.html#method.find

I still think that using the common [] operator for this is a mistake. Strings shouldn't offer [] at all, and instead should provide methods like codepoints(), bytes(), grapheme_clusters() etc for indexing, slicing, and iterating.

The reason being that the behavior of [] for string varies widely in different languages, and so this is something that's best made explicit, both to force the author of the code to consider whether their assumptions are valid and reasonable for what they're trying to do, and to give additional context to anyone else reading the code.

As it is, I suspect a common class of bugs for Rust will be with people assuming that [] slices codepoints, because it seems to work that way for ASCII.
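
A small sketch of how that assumption breaks, using the non-panicking `str::get` so the failure shows up as `None` instead of a panic. The helper name `first_bytes` is made up for the example.

```rust
/// Tries to take the first `n` bytes of `s` as a string slice.
/// Returns None (instead of panicking like `&s[..n]` would) when `n`
/// is out of range or falls in the middle of a multi-byte character.
/// `first_bytes` is a hypothetical helper name used only for illustration.
fn first_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}
```

`first_bytes("hello", 2)` is `Some("he")`, which looks like codepoint slicing. `first_bytes("héllo", 2)` is `None`, because byte 2 is in the middle of "é"; the equivalent `&"héllo"[..2]` would panic.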

I'm quite thankful that Rust has succinct notation for slicing strings. Do note that `string[n]` is not supported, so you'll stumble over an inconsistency in your mental model quite quickly if you think slicing is by codepoint.

The lack of direct indexing is a good point. But strings aren't sliced on byte boundaries all that often either - it's far more common to use higher-level APIs like split(), that deal with offsets under the hood, so that sugar mostly ends up being used in the implementation of such APIs. And, really, would something like s.slice_u8(x, y) be that unwieldy over s[x..y]?

How often do you actually want the nth character as opposed to the nth grapheme?

There is pretty much no case where indexing by character actually makes sense because it is almost always incorrect and it is always inefficient.

Indexing by byte is rarely useful, but it has its uses, because it can be done correctly and efficiently: you can find the next or previous character boundary by scanning at most four bytes for one that is not a UTF-8 continuation byte (that is, one whose top two bits are not 10). If you want to do something like get a &str that would fit in an n-byte buffer, byte indices let you do that efficiently and correctly.
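
Here's roughly what the n-byte buffer case could look like. This is a sketch with a made-up helper name (`truncate_to_bytes`), relying on `str::is_char_boundary` from the standard library to do the boundary check.

```rust
/// Longest prefix of `s` that fits in `max_bytes` bytes without
/// splitting a character.
/// `truncate_to_bytes` is a hypothetical helper name used only for illustration.
fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // A UTF-8 character is at most 4 bytes long, so this backs up at most
    // 3 times. Offset 0 is always a boundary, so the loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

`truncate_to_bytes("héllo", 2)` is `"h"`, since cutting at byte 2 would split "é", and `truncate_to_bytes("héllo", 3)` is `"hé"`. The cost does not depend on the length of the string.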

As stated below, indexing by byte is an O(1) operation, whereas indexing by character is an O(n) operation.

> If a string is an array of characters

It is not; it is an array (technically a vector) of bytes.

Who cares if it's O(1) if it causes a panic? What good is high performance if it doesn't complete or isn't safe?

At the very least, shouldn't there be an O(n) method to do character-wise slicing?

Panics are safe. You expect the “I don’t have a bug” case to be fast.

You can, but it depends on what you mean by “character”, as that’s not a concept in Unicode. Every kind of thing you could mean has a method, specific to it, since they’re different things.
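
As one example, if "character" means a Unicode scalar value, an O(n) slice can be sketched on top of `char_indices`. The helper name `char_slice` and its signature are made up here; this is not a std method.

```rust
/// Slice of `s` covering scalar values `start..end`, counted in `char`s.
/// Returns None if the range is reversed or out of bounds. Runs in O(n).
/// `char_slice` is a hypothetical helper name used only for illustration.
fn char_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of the k-th char. The extra `s.len()` entry lets
    // k == number of chars map to the end of the string.
    let byte_at = |k: usize| {
        s.char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(s.len()))
            .nth(k)
    };
    let from = byte_at(start)?;
    let to = byte_at(end)?;
    Some(&s[from..to])
}
```

`char_slice("héllo", 1, 3)` is `Some("él")`. It walks the string from the start on every call, which is the O(n) cost being discussed, and it still counts scalar values, not graphemes.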

(char in Rust is a Unicode scalar value, and you can collect into a Vec<char> and then slice it, as an example of one of those things. Slicing that Vec is still O(1), after an O(n) pass to build it, at the cost of using up to four times the memory.)
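
A minimal sketch of the Vec<char> route, with a made-up helper name (`to_chars`):

```rust
/// Collects `s` into a Vec<char> so it can be indexed by scalar value.
/// Building the Vec is O(n); each `char` takes 4 bytes of storage.
/// `to_chars` is a hypothetical helper name used only for illustration.
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

With `let v = to_chars("héllo");`, `v[1]` is `'é'` and `&v[1..3]` is the slice `['é', 'l']`, both without scanning. The trade-off is the memory: five chars occupy 20 bytes here, against 6 bytes for the UTF-8 string.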