Hacker News new | ask | show | jobs
by steveklabnik 2716 days ago
Every string is checked. But UTF8 is a multi-byte encoding, and slicing works per bytes, so you if you slice in the middle of a multi-byte character, you may get nonsense. The error happens because of this checking, not in spite of it.

String always assumes full UTF-8. You could make an AsciiString type if you wanted, but it's not provided by the standard library.

2 comments

The obvious follow up question would be: so why is slicing a string a byte-wise operation and not a character-wise operation? If a string is an array of characters, why does it let me refer to individual bytes without explicitly casting it to a byte array? How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.
> How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.

It's exactly the opposite of what you expect. Getting the nth codepoint is often (not always) semantically incorrect since a codepoint isn't necessarily one character. Multiple codepoints might combine to form one character. (In Unicode, these are called grapheme clusters.)

Byte offsets are used a ton because you might often have the index to a position in the string from some routine, like, say, a search[1].

I've been working on text related things in both Rust and Go for several years. Both languages got this part of their strings exactly right given that their representation in memory is always a sequence of bytes.

[1] - https://doc.rust-lang.org/std/primitive.str.html#method.find

I still think that using the common [] operator for this is a mistake. Strings shouldn't offer [] at all, and instead should provide methods like codepoints(), bytes(), grapheme_clusters() etc for indexing, slicing, and iterating.

The reason being that the behavior of [] for string varies widely in different languages, and so this is something that's best made explicit, both to force the author of the code to consider whether their assumptions are valid and reasonable for what they're trying to do, and to give additional context to anyone else reading the code.

As it is, I suspect a common class of bugs for Rust will be with people assuming that [] slices codepoints, because it seems to work that way for ASCII.

I'm quite thankful that Rust has succinct notation for slicing strings. Do note that `string[n]` is not supported, so you'll stumble over an inconsistency in your mental model quite quickly if you think slicing is by codepoint.
The lack of direct indexing is a good point. But strings aren't sliced on byte boundaries all that often either - it's far more common to use higher-level APIs like split(), that deal with offsets under the hood, so that sugar mostly ends up being used in the implementation of such APIs. And, really, would something like s.slice_u8(x, y) be that unwieldy over s[x..y]?
How often do you actually want the nth character as opposed to the nth grapheme?

There is pretty much no case where indexing by character actually makes sense because it is almost always incorrect and it is always inefficient.

Indexing by byte is rarely useful, but it does have some usefulness since it can be used correctly and efficiently since you can easily find the next or previous character by searching a maximum of four bytes for the a byte that has a MSB of 0. If you want to do something like get a &str that would fit in a n-byte buffer, then byte indices will let you do that efficiently and correctly.

As stated below, indexing is an O(1) operation, and that is a O(n) operation.

> If a string is an array of characters

It is not, it is an array (technically vector) of bytes.

Who cares if it's O(1) if it causes a panic? What good is high performance if it doesn't complete or isn't safe?

At the very least, shouldn't there be an O(n) method to do character-wise slicing?

panics are safe. You expect the “I don’t have a bug” case to be fast.

You can, but it depends on what you mean by “character”, as that’s not a concept in Unicode. Every kind of thing you could mean has a method, specific to it, since they’re different things.

(char in Rust is a Unicode scalar value, and you can collect into a Vec<char> and then slice it, as an example of one of those things. And that’s still O(1) at the cost of using up to four times the memory.)

Why not this?

  fn main() {
      let a = "ab早".as_bytes();
      let a = &a[..3];
      println!("Hello, world!");
  }
It's not clear to me what you're suggesting; is it that String shouldn't have supported indexing in the first place? That code does work, but you have a &[u8] not a &str.
But neither AsciiString. It has a as_str method, but it's still a kludge.

This example was basically a suggestion to throw0u1t: if they want to cut in the middle of the utf-sequence for whatever reason, they can [edit:] do it without extra crates.

What I don't understand is why slices are indexed in bytes and not in objects. If String has an ability to check that we're cutting in the middle of the character sequence, why doesn't it provide an ability to take 3 fully formed characters.

I think the rust designers want to keep the implicit contract that indexing into a string is fast and O(1).

If you want to find the one millionth codepoint of a UTF8-encoded string, you have to more or less (1) visit every byte of the string.

If, on the other hand, you want to find the codepoint that covers the millionth byte, on the other hand, you have to read at most four bytes (read the millionth byte, and there are three cases:

- it’s a full codepoint. If so, you‘re done.

- it is the first byte of a multi-byte codepoint. If so, read forwards in the string for up to 3 continuation characters.

- it is a continuation character. If so, search backwards in the string for the first byte, then, if necessary, read forwards to find more continuation characters.

So, that is O(1)

(1) you can skip continuation characters, but these typically are rare.

> What I don't understand is why slices are indexed in bytes and not in objects.

Slicing is an O(1) operation, and that would be an O(n) operation.

It does: `s.chars().take(3)`. It just does it with iterators rather than with indexes because that better communicates the performance characteristics.
I think he's suggesting that slicing on strings should be by character, and if you want to slice on bytes, you should explicitly ask to treat the string as a byte array. It makes more sense semantically, and it's safe.
Slicing on characters is a linear time operation and indexing is meant to be cheap.
That seems like taking it too far. It's like using pointer arithmetic to index a linked list on the assumption that the nodes happen to be allocated contiguously in memory. I mean, I guess the thinking is, indexing a Unicode string isn't cheap, but indexing strings used to be cheap once upon a time, when strings were encoded in fixed one-byte-per-character representations, so let's pretend that's still the case and panic if it doesn't work out.... That's weirdly antithetical to Rust's purported focus on safety.

Also, you can get the same performance from an operation that returns a byte array instead of a string. If that kind of performance is what you want, then a Unicode string is simply not the right type to use.

Indexing a Unicode string is cheap... if you have a byte index. If you want to count out some fixed number of codepoints, then of course you've just moved the cost to calculating the corresponding byte index. But counting codepoints is almost always the wrong thing to do anyway [1]. In practice, it's more common to obtain indices by inspecting the string itself, e.g. searching for a substring or regex match. In that case, it's faster for the search to just return a byte index; there's no benefit to having it return a codepoint index, and then having to do an O(n) lookup when you try to use the index. And byte indices obtained that way will always be valid character boundaries, so you can use [] without worrying about panics.

You suggest just using a byte array instead, but then you'd lose the guarantee that what you're working with is valid Unicode. Contrary to your assertion, it is useful to have a type that provides that guarantee, yet which can still be operated on efficiently.

[1] https://manishearth.github.io/blog/2017/01/14/stop-ascribing...

Safety is about memory safety. Immediately exiting your program is about as memory safe as it gets.
Panics are not unsafe. Panic exists in Rust because they are safe. If you don't want a panic on index, just don't index.

Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway, because it is an abstraction of text that doesn't provide support to the notion that a "character" is more fundamental than a word or paragraph, etc. Rust's string slicing exists solely to make ASCII text easy to handle. If your text is not ASCII, then you shouldn't be slicing it at all. Thus the panic.