by richardwhiuk 2716 days ago
Slicing on characters is a linear time operation and indexing is meant to be cheap.

That seems like taking it too far. It's like using pointer arithmetic to index a linked list on the assumption that the nodes happen to be allocated contiguously in memory. I mean, I guess the thinking is, indexing a Unicode string isn't cheap, but indexing strings used to be cheap once upon a time, when strings were encoded in fixed one-byte-per-character representations, so let's pretend that's still the case and panic if it doesn't work out... That's weirdly antithetical to Rust's purported focus on safety.

Also, you can get the same performance from an operation that returns a byte array instead of a string. If that kind of performance is what you want, then a Unicode string is simply not the right type to use.

Indexing a Unicode string is cheap... if you have a byte index. If you want to count out some fixed number of codepoints, then of course you've just moved the cost to calculating the corresponding byte index. But counting codepoints is almost always the wrong thing to do anyway [1]. In practice, it's more common to obtain indices by inspecting the string itself, e.g. searching for a substring or regex match. In that case, it's faster for the search to just return a byte index; there's no benefit to having it return a codepoint index, and then having to do an O(n) lookup when you try to use the index. And byte indices obtained that way will always be valid character boundaries, so you can use [] without worrying about panics.
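
To make that concrete, here is a minimal sketch. The helper names `after_marker` and `byte_offset_of_char` are made up for illustration; `str::find` and `str::char_indices` are the standard library methods. The first function slices at a byte offset returned by a search, which always lands on a character boundary. The second shows where the O(n) cost goes if you insist on counting codepoints.

```rust
/// Returns the text after the first occurrence of `marker`, if any.
/// `str::find` yields a byte offset that sits on a char boundary,
/// so the slice below does not panic.
fn after_marker<'a>(text: &'a str, marker: &str) -> Option<&'a str> {
    let start = text.find(marker)?;
    Some(&text[start + marker.len()..])
}

/// Byte offset of the `n`-th code point (0-based), found by an O(n) scan.
/// Returns None if the string has `n` or fewer code points.
fn byte_offset_of_char(text: &str, n: usize) -> Option<usize> {
    text.char_indices().nth(n).map(|(offset, _)| offset)
}
```

The slice in `after_marker` is O(1) once the search has run; `byte_offset_of_char` has to walk the string every time it is called.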

You suggest just using a byte array instead, but then you'd lose the guarantee that what you're working with is valid Unicode. Contrary to your assertion, it is useful to have a type that provides that guarantee, yet which can still be operated on efficiently.
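
A rough sketch of the trade-off, with hypothetical helper names (`first_bytes`, `is_valid_utf8`); `str::as_bytes` and `std::str::from_utf8` are the real standard library calls. Slicing the bytes never panics on a boundary, but what comes back may no longer be valid UTF-8, so anything that needs text has to re-validate it.

```rust
/// Takes up to `n` bytes from the front of `text`.
/// Never panics on a char boundary, but the result is only bytes.
fn first_bytes(text: &str, n: usize) -> &[u8] {
    let end = n.min(text.len());
    &text.as_bytes()[..end]
}

/// The O(n) validation that holding a `&str` lets you skip.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

For example, taking one byte of a string that starts with a two-byte character leaves a lone lead byte, which fails validation.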

[1] https://manishearth.github.io/blog/2017/01/14/stop-ascribing...

Safety is about memory safety. Immediately exiting your program is about as memory safe as it gets.
Panics are not unsafe. Panics exist in Rust because they are safe. If you don't want a panic on index, just don't index.
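
For what "just don't index" can look like, here is a small sketch using `str::get`, which returns an `Option` instead of panicking. The wrapper name `prefix` is only for illustration.

```rust
/// Non-panicking version of `&text[..end]`.
/// Returns None if `end` is out of bounds or falls inside a
/// multi-byte character.
fn prefix(text: &str, end: usize) -> Option<&str> {
    text.get(..end)
}
```

With a string like "héllo", `prefix(text, 2)` returns `None` because byte 2 is in the middle of the two-byte "é", whereas `&text[..2]` would panic.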

Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway, because it is an abstraction of text that doesn't provide support to the notion that a "character" is more fundamental than a word or paragraph, etc. Rust's string slicing exists solely to make ASCII text easy to handle. If your text is not ASCII, then you shouldn't be slicing it at all. Thus the panic.

>Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway

If that's true, isn't it the job of a type system to help avoid such nonsensical operations? If "slice" only makes sense for byte arrays and ASCII strings, it could be provided on those types without being defined on UTF-8 strings.

>Panics are not unsafe. Panics exist in Rust because they are safe.

That's "safe" by a very limited definition of safety. It's one step up from undefined behavior, granted, but it's not a very high standard. In practice, in most programs, you'd want to ensure that such a panic would never happen, and personally I think the language's unhelpfulness in that regard is a wart.

>If that's true, isn't it the job of a type system to help avoid such nonsensical operations?

It's not strictly true, because there are situations where you want to slice UTF-8. For instance, if you already know where the code point boundaries are for newlines. But if you know that, then you've already run something like a regex with worse than O(1) behavior, and you certainly wouldn't want string slicing to do redundant work.

>That's "safe" by a very limited definition of safety

That's the definition of safe that is used. Safety in the context of Rust means memory safety. (Division can panic, btw.) If you don't see why undefined behavior is so much worse than a panic, then do some research on it. If you want programs that never fail, you need a comprehensive plan that takes into account things like hardware failure. A programming language can't do that.
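
As an aside on the division point, a tiny sketch: integer division by zero panics in Rust, and the standard library offers `checked_div` for callers who would rather get an `Option`. The function name `average` is just an illustrative wrapper.

```rust
/// Integer division that returns None instead of panicking.
/// `checked_div` yields None when `count` is zero or when the
/// division overflows (i32::MIN / -1).
fn average(total: i32, count: i32) -> Option<i32> {
    total.checked_div(count)
}
```

It is the same trade as `str::get` versus `[]`: the panicking operator is the short spelling, and the fallible version is there when you want to handle the failure yourself.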

I think that's too extreme. There are many legitimate reasons to slice non-ASCII text - for example, to split it on newlines.
That's not trivial, and different languages even vary in how they handle newline characters. https://stackoverflow.com/questions/44995851/how-do-i-check-...
You can still split non-ASCII text on ASCII newlines, and quite often that's exactly what needs to be done.
And usually, you don't want it to cost O(n) on top of whatever parser you ran to find those newlines.
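
A minimal sketch of that, assuming only "\n" counts as a line break; the function name `lines_by_byte_scan` is made up. It relies on a property of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, so every b'\n' found by a raw byte scan is a real newline and both sides of it are character boundaries.

```rust
/// Splits `text` on ASCII '\n' by scanning bytes.
/// Each slice below is O(1) and cannot panic, because an ASCII byte
/// in valid UTF-8 is always a complete character on its own.
fn lines_by_byte_scan(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    for (i, byte) in text.bytes().enumerate() {
        if byte == b'\n' {
            lines.push(&text[start..i]);
            start = i + 1;
        }
    }
    // Whatever follows the last newline (possibly empty).
    lines.push(&text[start..]);
    lines
}
```

This should give the same pieces as `text.split('\n')`; the point is only that the scan already produces valid byte indices, so the slicing adds no extra pass over the string.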