Hacker News
by weirdwitch 3026 days ago
No thanks. Rust made the right call by distinguishing between owned (String) and borrowed (str) strings. Even C++ is moving in that direction now, by finally adding string_views (https://en.cppreference.com/w/cpp/string/basic_string_view)
4 comments

Why is one capitalized and one lowercase?

I kind of agree with the sibling that the naming is odd. Why not call them String and Slice or String and MutableString? Using different "spellings" of the same word seems like it only encourages confusion.

> Why is one capitalized and one lowercase?

Because one is a pretty bog-standard struct: https://doc.rust-lang.org/src/alloc/string.rs.html#294-296 while the other is a primitive type: https://doc.rust-lang.org/std/#primitives

Lowercase (i8, u64) or anonymity (`[]`, &, *) means you're dealing with something fundamental to the language. Not just with special compiler/language support (like Result) but way below that, something the language doesn't itself express but has intrinsic knowledge of.

The lowercase one is a language primitive, and so is lowercased like all language primitives.

The uppercase one is a standard library type, and so is uppercased like all library types.

std::string::String is a struct, and str is a type defined by the language.

There's a summary of the differences at [0].

[0]: http://www.ameyalokare.com/rust/2017/10/12/rust-str-vs-Strin...

str and String? That naming is horrible. string_view is a much better choice
1. string_view makes for much more verbose code when it's the more common version.

2. string_view makes no sense when many str are not actually views into strings but either static data or "cast" bytes.

3. string_view makes no sense when str is the basic builtin type.

Renaming String to something like StrBuf might have made sense, but it's not like it would have been any clearer, people get confused by Path/PathBuf all the same if not more so.

It's certainly confusing at first, but the terse naming of "str" is appreciated once you've internalized the difference. "str", in most projects I've seen, appears more often in source than "String".
I completely agree.

Path/PathBuf have better names, but the distinction still isn't immediately obvious.

String/str is the cause of so much confusion.

I often joke that str/StrBuf is my one wish for a Rust 2.0, but 2.0 will never happen.

This confusion is why we reorganized the book to talk about String and &str very early on, and use them to teach ownership and borrowing.

It took me like 2 seconds to learn it and I haven't been confused since. So... maybe I'm just a grade A genius.
Could you explain to a non-Rust user why this is a good thing? I'm not familiar with the terms "owned" and "borrowed" in terms of strings. I take it this refers to strings instantiated by a piece of code and managed by a user or vendor code vs strings passed around to be read (pass by copy)?
> I'm not familiar with the terms "owned" and "borrowed" in terms of strings.

A string (general) is fundamentally a bunch of bytes in memory. In Rust, that's implemented as a contiguous buffer of UTF8 code units composing valid unicode data.

Now because Rust aims to be a systems-oriented language, these bytes must have one[0] thing which is fundamentally responsible for them, that's the owner. If the owner goes away, so do the bytes. "Owned" qualifies the owner of those bytes. In Rust, it's String: String maintains a "strong" reference to a bunch of bytes in memory which form a valid string, and if the String disappears so does the data associated with it (it's deallocated).

Borrowed by comparison is something which holds a "weak" reference to the same buffer: it knows the bytes are there, but when the "borrowing" structure is destroyed nothing happens to the data it refers to, because it was just borrowing it. That's what `&str` is.

Note that `&str` can borrow from a String (that's a common case), but it can also borrow from static data in the binary (all string literals are in that case) or from just a random bytes buffer (using str::from_utf8[1]).
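
To make the three cases concrete, here's a minimal sketch. The helper name `byte_len` is made up for the example; `String::from` and `std::str::from_utf8` are the real standard library calls.

```rust
// Hypothetical helper: accepts any borrowed string, wherever the bytes live.
fn byte_len(s: &str) -> usize {
    s.len()
}

fn main() {
    // Owned: this String is responsible for a heap buffer and frees it on drop.
    let owned: String = String::from("hello");

    // 1. Borrowed from a String.
    let from_string: &str = &owned;
    // 2. Borrowed from static data in the binary (a string literal).
    let from_literal: &'static str = "hello";
    // 3. Borrowed from a plain byte buffer, after UTF-8 validation.
    let bytes: [u8; 5] = [104, 101, 108, 108, 111];
    let from_bytes: &str = std::str::from_utf8(&bytes).expect("bytes are valid UTF-8");

    assert_eq!(from_string, from_literal);
    assert_eq!(from_literal, from_bytes);
    assert_eq!(byte_len(from_string), 5);
}
```

All three have the same type, so `byte_len` doesn't need to care where the bytes came from.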

> Could you explain to a non-Rust user why this is a good thing?

It doesn't matter for high-level languages like Java or Python[2] but it matters a lot for lower-level languages like C, C++ or Rust, because the owner of a piece of data is whoever's supposed to deallocate it (and whoever's allowed to reallocate it to expand it). When there's no difference between owner and borrower (e.g. C's char *) the complete onus of tracking who's responsible for what is on the developer, and failure to do so generates memory unsafety (dangling pointers, double-free, use-after-free, …). And developers being humans, they mess up regularly.

Making a very clear distinction between owned and borrowed types firstly helps the developer: they know they must free through the owner but mustn't — and normally can't — free through the borrower, that's what C++ is adding; and secondly can — with some additional constraints on the developer — have the language manage all that on its own, the latter being what Rust does (C++ will do the freeing part but you can still have extant borrows and so it's not memory-safe, just significantly more helpful than C).
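
A rough sketch of what "the language manages it" looks like in practice. The function names `inspect` and `consume` are invented for illustration; the point is only that the signature says who ends up responsible for the buffer.

```rust
// Hypothetical function names, for illustration only.

// Borrows: the caller keeps ownership and stays responsible for freeing.
fn inspect(s: &str) -> usize {
    s.len()
}

// Takes ownership: the buffer is freed when `s` goes out of scope here.
fn consume(s: String) -> usize {
    s.len()
}

fn main() {
    let owner = String::from("owned data");

    let n = inspect(&owner); // lend it out; `owner` is still usable afterwards
    let m = consume(owner); // hand over ownership; `owner` is moved

    // inspect(&owner); // would not compile: `owner` was moved above
    assert_eq!(n, m);
}
```

Uncommenting the last call is rejected at compile time, which is the use-after-free case C leaves to the developer.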

[0] unless you're in a case where it's unclear who should be responsible and you just go "everyone!" and use reference-counting to just punt

[1] https://doc.rust-lang.org/std/str/fn.from_utf8.html

[2] generally, it does matter in the problem of substrings and whether substrings copy or point to the base data, the former is more expensive (you allocate on each substring operation) but the latter can maintain gargantuan amounts of data "live" and prevent their collection

Thanks, I figured it was something of this sort, but wanted to confirm. I find it interesting the distinction even exists in Rust, whereas in C, authors just tell you to not alter the data, or use a cleanup function they've provided, etc.
That's not the problem. There absolutely should be two different types (and I don't even care that the names are so poor). But half the rust APIs take one type and the other half take the other (even when the string isn't manipulated or stored in any way, shape, or form). Some interfaces are only implemented for string and others only for &str.

Deciding between two distinct types &str and &string (not mut &string) for your function's interface is nonsense. It makes no sense to have to _decide_ between which two views of a string that you can read-but-not-manipulate you want to use, and it makes zero sense that they can't unify the types with some simple compiler magic. A constant reference to a string should automatically decompose into a view of that string and that should be that. [edit: as in that view shouldn't be a separate type]

Additionally, that dereferencing a string returns a pointer... that makes no sense. That's the kind of nonsense we ran away from in the C++ world.

strings are the reason I regret not adopting rust back when as a user of a pre-1.0 language I could have joined in efforts to lobby against this insanity.

---

As a sidenote, string_view is so late in coming to the C++ world that it's not even funny. Having a separate std::string with an "implementation-defined" in-memory representation in a world of C strings (char *) is inane beyond belief. (Yes, nulls in strings would still be a problem. But why do your strings have nulls in the first place? That data should probably be a vector of strings or a [vector|array] of uint8_t (even if just typedef'd to unsigned char) and C++ strings should have been mandated utf8, contiguous, and null-terminated.) You should be able to compose a zero-copy, read-only, non-owned string from a character array and decompose automatically to it. And don't get me started on the fact that C++ doesn't have sprintf because of the obsession with sticking to the overly verbose and way too complicated streaming operators. Developers end up using C strings with sprintf to format text and then copy it back to a std::string just to work around that stupidity.

Anything implemented for &str is automatically implemented for String, because String implements Deref<Target=str>.

Most useful "String" methods are actually &str methods that you get access to through that deref trait.
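
A small sketch of that, assuming nothing beyond the standard library. `first_word` is a made-up helper; `trim`, `starts_with`, `to_uppercase` and `split_whitespace` are all defined on str, not on String.

```rust
// Hypothetical helper written against &str only.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

fn main() {
    let owned: String = String::from("  hello world  ");

    // trim, starts_with and to_uppercase are defined on str;
    // String reaches them through Deref<Target = str>.
    assert_eq!(owned.trim(), "hello world");
    assert!(owned.trim().starts_with("hello"));
    assert_eq!(owned.trim().to_uppercase(), "HELLO WORLD");

    // A function taking &str accepts a borrowed String as well.
    assert_eq!(first_word(&owned), "hello");
}
```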

Dereferencing a String doesn't return a raw pointer, I'm not sure where you got that idea.

Yes, anything implemented for &str is automatically implemented for String... except some API are stupidly implemented for &string instead. And you can't pattern match strings properly (think some `for in`) without first explicitly converting to &str.
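
For readers following along, the pattern-matching case presumably looks something like this sketch. `classify` is a hypothetical function, and it takes &String on purpose to show the explicit conversion; `String::as_str` is the real method.

```rust
// Hypothetical function: classify a command held in an owned String.
fn classify(cmd: &String) -> &'static str {
    // String literals are &str patterns, so borrow the String as &str first.
    match cmd.as_str() {
        "start" => "starting",
        "stop" => "stopping",
        _ => "unknown",
    }
}

fn main() {
    assert_eq!(classify(&String::from("start")), "starting");
    assert_eq!(classify(&String::from("stop")), "stopping");
    assert_eq!(classify(&String::from("dance")), "unknown");
}
```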

Dereferencing a string does not return a raw pointer, that was exaggeration on my part. But a string is a container, so *string returns.... &str? But string.deref() returns str?

Don't get me wrong, I'm fully invested in the language [0], [1], [2], but it's got a lot of warts that could have been avoided by thinking bigger picture. So many APIs are restricted by thinking easy instead of big pre-1.0. Like str being hardcoded into APIs that should have been generic (FromStr vs From<&str>, .parse() vs .into()), shipping 1.0 without async/await, and the whole mess with strings.

0: https://github.com/rust-lang/rust/issues?utf8=%E2%9C%93&q=is...

1: https://github.com/rust-lang/rfcs/issues?utf8=%E2%9C%93&q=is...

2: https://crates.io/search?q=neosmart

> But a string is a container

In the same way unique_ptr is a container.

> so string.deref() returns.... &str?

Yes? &str::deref() also returns &str, Vec::deref() returns &[], Box<T>::deref() returns &T.

That's literally how Deref is defined, Deref<Target=T>::deref() returns &T.

*String returns str.
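
Spelled out as a sketch, with made-up helper names `via_method` and `via_operator` (they take &String deliberately, to make the types visible):

```rust
use std::ops::Deref;

// Hypothetical helpers; they take &String on purpose to show the types involved.
fn via_method(s: &String) -> &str {
    // Deref::deref has the signature fn(&Self) -> &Self::Target, so this is &str.
    s.deref()
}

fn via_operator(s: &String) -> &str {
    // *s is the String, **s is the unsized str behind it, & re-borrows it.
    &**s
}

fn main() {
    let owned = String::from("foo");
    assert_eq!(via_method(&owned), "foo");
    assert_eq!(via_operator(&owned), "foo");

    // Same pattern for Box<T>: deref() gives &T, * gives T.
    let boxed: Box<i32> = Box::new(7);
    let r: &i32 = boxed.deref();
    assert_eq!(*r, 7);
    assert_eq!(*boxed, 7);
}
```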

> (FromStr vs From<&str>, .parse() vs .into())

These are not equivalent. From/Into are non-failing conversions, FromStr can fail.

What you're looking for is TryFrom/TryInto which are still not done 2 years into the RFC: https://github.com/sfackler/rfcs/blob/try-from/text/0000-try...
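
To illustrate the fallible/infallible split, a minimal sketch. `Percent` and `ParsePercentError` are hypothetical types invented for this example; `FromStr`, `str::parse` and `Into` are the real standard library pieces.

```rust
use std::str::FromStr;

// Hypothetical type for illustration: a percentage between 0 and 100.
#[derive(Debug, PartialEq)]
struct Percent(u8);

#[derive(Debug, PartialEq)]
struct ParsePercentError;

impl FromStr for Percent {
    type Err = ParsePercentError;

    // Parsing can fail, so the result is a Result rather than a bare Percent.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let n = s.parse::<u8>().map_err(|_| ParsePercentError)?;
        if n <= 100 {
            Ok(Percent(n))
        } else {
            Err(ParsePercentError)
        }
    }
}

fn main() {
    // From/Into: every &str can become a String, no error case needed.
    let owned: String = "42".into();
    assert_eq!(owned, "42");

    // FromStr/parse: not every &str is a valid Percent.
    assert_eq!("42".parse::<Percent>(), Ok(Percent(42)));
    assert_eq!("250".parse::<Percent>(), Err(ParsePercentError));
    assert_eq!("lots".parse::<Percent>(), Err(ParsePercentError));
}
```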

> * String returns str.

Typed that out too fast, yes, that's my problem. * String is one thing but String.deref() is another. But * is the dereference operator. Operator overloading ftw ;)

> What you're looking for is TryFrom/TryInto which are still not done 2 years into the RFC: https://github.com/sfackler/rfcs/blob/try-from/text/0000-try....

Sorry, yes, I actually opened an issue with my suggestions regarding that one with particular focus on the fallible vs infallible nature: https://github.com/rust-lang/rfcs/issues/2143

> Typed that out too fast, yes, that's my problem. * String is one thing but String.deref() is another. But * is the dereference operator.

They're the same thing, Deref::deref() is just the operation which underlies the dereferencing operator.
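
A sketch with a user-defined type shows the two spellings landing in the same place. `Name` and `both_ways` are hypothetical, made up for this example.

```rust
use std::ops::Deref;

// Hypothetical wrapper type that owns a String and derefs to str.
struct Name(String);

impl Deref for Name {
    type Target = str;

    fn deref(&self) -> &str {
        &self.0
    }
}

// Hypothetical helper: returns both spellings of the same dereference.
fn both_ways(name: &Name) -> (&str, &str) {
    let with_operator: &str = &**name;
    let with_method: &str = <Name as Deref>::deref(name);
    (with_operator, with_method)
}

fn main() {
    let name = Name(String::from("ferris"));
    let (a, b) = both_ways(&name);
    assert_eq!(a, b);

    // Method calls auto-deref too, so str methods are reachable on Name.
    assert!(name.starts_with("fer"));
}
```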

Either way I don't see what's problematic about a string buffer deref'ing to a string.

> except some API are stupidly implemented for &string instead.

Which ones are those? I can't think of any off the top of my head, though brains are fallible!

Hi Steve! Sorry, I didn't mean to imply in the core API. I don't think (though I too could be wrong) that &string is used anywhere that AsRef<OsStr> isn't also available.
Fundamentally, String and &str communicate very different things. String has ownership, &str does not. This is not a reconcilable difference.

The closest you could get is automatically allocating &strs into Strings, but then you're introducing silent allocation, which has a host of its own problems.

> A constant reference to a string should automatically decompose into a view of that string and that should be that.

This is exactly what happens, through Deref coercions.
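
Roughly like this, as a sketch (`count_words` is a made-up function that only asks for a view):

```rust
// Hypothetical function that only needs a read-only view of the text.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    let owned = String::from("a constant reference decomposes");

    // An explicit &String is coerced to &str at the call site.
    let by_ref: &String = &owned;
    assert_eq!(count_words(by_ref), 4);

    // Same thing without the intermediate binding, and with a literal.
    assert_eq!(count_words(&owned), 4);
    assert_eq!(count_words("string literal"), 2);
}
```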

I mean that &string should not be a type distinct from &str.

I absolutely don't think &str should silently allocate for converting to String. But I think &str should zero-allocation convert to &string (and bypass dealloc, too).

Sorry for misunderstanding you!

That would introduce special cases into the type system, adding complexity. String is a library type, so you can take a reference to it like any other type. We'd have to move String into the language for that to happen, and currently, the language itself knows nothing about allocation, so then we'd have to put allocation into the language, which would then change our story on embedded significantly... and then it'd be one of the only types that you couldn't take a reference to for some reason, which would affect generic APIs, etc...

Well, that's actually the root problem which I avoided discussing until now: why does String have a str special case but no other data type does? Same with Path and PathBuf. Why does there need to be a distinct data type for a non-owned view into an object? Why is that not just part of the language in the first place? C++ needs string vs string_view because it has no borrow checker, but rust could (theoretically) implement this without the need for two different types.
> why does String have a str special case but no other data type does?

Vec's "special case" is [].

And it's not that String "has" a special case, it's that str is a special fundamental case of the language in the same way i8 or () is and String exists to make it easier to work with (otherwise you'd have to deal with Box<str> and efficiently working with that would require going back and forth to Vec, except you'd have removed Vec since it'd be a special case of [] and where do you store capacities at this point?)
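
To see why the capacity matters, here's a sketch of what working with Box<str> alone looks like. The helper `append` is hypothetical; `into_boxed_str`, `String::from(Box<str>)`, `with_capacity` and `capacity` are real.

```rust
// Hypothetical helper: appending to a Box<str> means a round trip through String.
fn append(boxed: Box<str>, extra: &str) -> Box<str> {
    let mut buf = String::from(boxed); // back to a growable buffer
    buf.push_str(extra);
    buf.into_boxed_str() // drop any spare capacity again
}

fn main() {
    // String tracks a capacity separately from its length, so it can grow in place.
    let mut s = String::with_capacity(16);
    s.push_str("abc");
    assert_eq!(s.len(), 3);
    assert!(s.capacity() >= 16);

    // Box<str> owns exactly its bytes: pointer plus length, no capacity.
    let boxed: Box<str> = s.into_boxed_str();
    let longer = append(boxed, "def");
    assert_eq!(&*longer, "abcdef");
}
```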

> C++ needs string vs string_view because it has no borrow checker, but rust could (theoretically) implement this without the need for two different types.

C++ has char* which conceptually underlies both string and string_view. Rust shoves char* in the unsafe corner but needs something to replace it, something which not only means "a bag of bytes" ([u8] does that just fine) but text as in "actual proper utf8-encoded unicode text". That's str.
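
The [u8] versus str difference can be sketched like so (`as_text` is a made-up helper around the real `std::str::from_utf8`):

```rust
// Hypothetical helper: view bytes as text only if they are valid UTF-8.
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // Any str can be viewed as bytes for free...
    let text: &str = "héllo";
    let bytes: &[u8] = text.as_bytes();
    assert_eq!(bytes.len(), 6); // 'é' takes two bytes in UTF-8
    assert_eq!(text.chars().count(), 5);

    // ...but not every byte sequence is a str.
    assert_eq!(as_text(bytes), Some("héllo"));
    assert_eq!(as_text(&[0xff, 0xfe]), None);
}
```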

So, we almost made str a library type, but there were downsides and not a lot of upsides. https://github.com/rust-lang/rust/pull/19612

> no other data type does?

You could argue that slices are a primitive type to arrays (also a primitive type) and vectors (a library type).

> Why does there need to be a distinct data type for a non-owned view into an object?

The difference between owned and non-owned types is fundamental; how would you propose distinguishing them if not as part of the type? Both concepts are part of the language, but like any language, you can use its fundamental bits to build better abstractions.

> That's not the problem. There absolutely should be two different types (and I don't even care that the names are so poor). But half the rust APIs take one type and the other take another (even when the string isn't manipulated or stored in any way, shape, or form). Some interfaces are only implemented for string and others only for &str.

Things which are implemented for Strings are those which actually require it. String derefs to &str so you can call any &str method on String, any trait implemented by &str is basically implemented by String, and if you need to pass &str and have a String you just &v.

> Deciding between two distinct types &str and &string (not mut &string) for your function's interface is nonsense.

There are very very few reasons to ever ask for an &String but then what, should the language somehow forbid regular references to a perfectly standard type?

> It makes no sense to have to _decide_ between which two views of a string that you can read-but-not-manipulate you want to use

&String is not a view of anything, it's a regular reference to a string living somewhere in memory.

> A constant reference to a string should automatically decompose into a view of that string and that should be that.

It does that if a function asks for an &str and you &s where s:String. All Rust doesn't do is remove &String from the language, in the same way string is a valid C++ construct.

> Additionally, that dereferencing a string returns a pointer... that makes no sense.

Dereferencing a String doesn't return a pointer. String is a pointer, so you can deref' it to get the str behind it (which is not a pointer, it's the actual unsized string data).

> Deciding between two distinct types &str and &string (not mut &string) for your function's interface is nonsense.

There's really no decision to be made. If you don't want to mutate the argument, use &str, you can still call the function with an &String. If you need mutation, take ownership with String or a mutable ref with &mut String.
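
As a sketch, the three signatures side by side (all function names here are invented for the example):

```rust
// Read only: take &str. Callers can pass &String or a literal.
fn is_greeting(s: &str) -> bool {
    s.starts_with("hello")
}

// Mutate in place: take &mut String.
fn add_bang(s: &mut String) {
    s.push('!');
}

// Need to keep or consume the buffer: take String by value.
fn into_shout(mut s: String) -> String {
    s.make_ascii_uppercase();
    s
}

fn main() {
    let mut msg = String::from("hello world");
    assert!(is_greeting(&msg));

    add_bang(&mut msg);
    assert_eq!(msg, "hello world!");

    let loud = into_shout(msg); // msg is moved here
    assert_eq!(loud, "HELLO WORLD!");
}
```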

> Additionally, that dereferencing a string returns a pointer...

You can't deref a String, you can only deref a reference (&String), not an object. &String derefs to &str. String.str_method() where str_method takes &str works because it'll auto-ref String -> &String, and then deref to &str.

IMO you're making a mountain out of a molehill, it makes a lot of sense once you use it for any time at all.

> You can't deref a String

You can absolutely deref a String. Not necessarily usefully (or to the satisfaction of the compiler) as it yields an unsized `str` (exactly the same as deref'ing an &str) but you can certainly do it.

    let a = "foo".to_owned();
    let b = &*a;
works perfectly fine.

Incidentally you can also deref' a Vec, that yields a slice (actual sequence, not the commonly seen &[])
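
A sketch of the Vec case next to the String one. `total` is a hypothetical helper; `std::mem::size_of_val` is used only to show that the thing behind the reference is the data itself.

```rust
// Hypothetical helper written against a borrowed slice.
fn total(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

fn main() {
    let v: Vec<i32> = vec![1, 2, 3];

    // *v has the unsized type [i32]; it has to be re-borrowed to be stored.
    let s: &[i32] = &*v;
    assert_eq!(s.len(), 3);
    assert_eq!(total(&v), 6);

    // size_of_val reports the size of the data itself: three 4-byte integers.
    assert_eq!(std::mem::size_of_val(&*v), 12);

    // Same for the String example: the str behind "foo" is three bytes.
    let a = "foo".to_owned();
    assert_eq!(std::mem::size_of_val(&*a), 3);
}
```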