by bluescarni 559 days ago
> So apparently, move does not prevent generation of a copy, but the empty string instead of expected text “Dave” is very interesting. Apparently, after termination of show after the move, the object is invalidated. This does not affect the Person object, but only the string object. Recognize that I speak about a factual behavior on the hardware. I think we have undefined behavior here. And no compilation error.

There is a lot wrong in this paragraph:

- a "copy" was not generated, at least not in the sense that the actual content of the string was copied anywhere;

- there's no undefined behaviour here and no invalidation of the string. Standard library types are required to be left in an unspecified but valid state after move. "Valid" here means that you can go on and inspect the state of the string after move, so you can query whether it is empty or not, count the number of characters, etc. etc. "Unspecified" means that the implementation gets to decide what is the status of the string after move. For long enough strings, typical implementation strategy is to set the moved-from string in an empty state.
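A minimal sketch of the "valid but unspecified" point, assuming a hypothetical helper `move_and_inspect` and a hypothetical `MoveReport` struct (neither is from the article). Querying the moved-from string is well defined; only the value that comes back is left to the implementation, and assigning to the string gives it a known state again.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical record of what was observed around a move.
struct MoveReport {
    std::string moved_to;      // receives the original text
    std::size_t size_after;    // size of the source right after the move
    std::string source_reset;  // the source after being assigned a fresh value
};

MoveReport move_and_inspect(std::string source) {
    MoveReport report;
    report.moved_to = std::move(source);  // move assignment runs here

    // Well-defined call on a moved-from string; the result is unspecified
    // (commonly 0, but the standard does not promise that).
    report.size_after = source.size();

    // Assignment puts the moved-from string back into a known state.
    source = "reset";
    report.source_reset = source;
    return report;
}
```

The only thing code should not do is assume a particular value for `size_after`.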


> at least not in the sense that the actual content of the string was copied anywhere

...unless it's a short string within the limits of the small-string-optimization capacity.
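One way to see the small-string case, as a rough sketch: the hypothetical helper `stored_inline` below checks whether a string's character buffer lies inside the `std::string` object itself. Whether a given short string is stored inline, and up to what length, depends on the standard library in use, so the result for short strings is an observation rather than a guarantee.

```cpp
#include <functional>
#include <string>

// Reports whether the character buffer of `s` lives inside the std::string
// object itself (small-string optimization) rather than in a separate
// allocation. Hypothetical helper for illustration only.
bool stored_inline(const std::string& s) {
    const char* object_begin = reinterpret_cast<const char*>(&s);
    const char* object_end = object_begin + sizeof(std::string);
    // std::less gives a total order even for pointers into unrelated objects.
    std::less<const char*> before;
    return !before(s.data(), object_begin) && before(s.data(), object_end);
}
```

When `stored_inline` returns true, moving the string has to copy the characters, because there is no separate buffer to hand over.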

I think what confuses many people is that a C++ move assignment can still copy a significant number of bytes, since it's just a flat copy plus 'giving up' ownership of the pointed-to data in the source object.

For a POD struct, 'move assignment' and 'copy assignment' are identical in terms of cost.
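A small sketch of the POD case, using a hypothetical `Point3` struct: the type traits confirm that its move constructor is trivial, and the source object is bit-for-bit unchanged after being "moved".

```cpp
#include <type_traits>
#include <utility>

// Hypothetical plain-old-data type.
struct Point3 {
    double x, y, z;
};

static_assert(std::is_trivially_copyable<Point3>::value,
              "Point3 is copied byte by byte");
static_assert(std::is_trivially_move_constructible<Point3>::value,
              "moving Point3 is the same byte-by-byte copy");

// "Moves" a Point3 into a local and returns the source as it looks afterwards.
Point3 source_after_move(Point3 source) {
    Point3 destination = std::move(source);  // plain copy of three doubles
    (void)destination;
    return source;  // still holds its original values
}
```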

The same is true of Rust. I have no idea why the author decided to print addresses only for C++ and not for Rust.

  // (1)
  struct Person {
      name: String,
      age: u8,
  }
  
  fn show(person: Person) {
      println!("Person record is at address  {:p}", &person);
      println!("{} is {} years old", person.name, person.age);
  }
  
  fn main() {
      let p = Person { name: "Dave".to_string(), age: 42 }; // (2)
      println!("Person record is at address  {:p}", &p);
      show(p); // (3)
  }

Its output is:

  Person record is at address  0x7ffcfb2b4e40
  Person record is at address  0x7ffcfb2b4ec0
  Dave is 42 years old

I feel like that's a pedantic detail. True, yes, but irrelevant. You may as well also point out that the return address is going to be copied to the instruction pointer when the constructor returns.

It's a real semantic difference, not a pedantic detail: It means that there is a practical reason that the moved-from object could be non-empty.

A few standard library types do guarantee that the moved-from object is empty (e.g., the smart pointer types).
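For instance, `std::unique_ptr` is one of the types with that guarantee: after a move, the source compares equal to `nullptr`. A minimal sketch (the function name is made up for illustration):

```cpp
#include <memory>
#include <utility>

// Returns true if the source pointer is null after being moved from and the
// destination now owns the original object.
bool source_is_null_after_move() {
    std::unique_ptr<int> source(new int(42));
    std::unique_ptr<int> destination = std::move(source);
    return source == nullptr && destination != nullptr && *destination == 42;
}
```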

For some others (basically, all containers except string), it is not explicitly stated that this is the case, but it is hard to imagine an implementation that doesn't (due to time complexity and iterator invalidation rules). Arguably, this represents a bigger risk than string's behaviour, but it's still interesting.

>It's a real semantic difference, not a pedantic detail

What's the semantic difference? Of course moving a class will involve some amount of copying. How could it be any other way? If you have something like struct { int a[1000]; }, how are you supposed to move the contents of the struct without copying anything? What, you take a pair of really tiny scissors and cut a teeny tiny piece of the RAM, then glue the capacitors somewhere else?

> how are you supposed to move the contents of the struct without copying anything?

By taking the physical page this one struct resides in and mapping it into the virtual address space a second time. This approach is usually used in kernel-level development, but there has been a lot of research since the seventies on how to use it in runtimes for high-level programming languages.

Now, it does involve copying the address of this struct from one place to another, that I concede.

Sure. At the cost of needing >=4K per object, since otherwise "moving" an object involves also moving the other objects sharing the same page.

I think it's a worthwhile distinction to bring up because it highlights a common misconception people have about strings and vectors. A string value is not the string content itself, just a small struct containing a pointer and other metadata. If we're talking about the in-depth semantics of a language then it's important to point out that this struct is the string, and the array of UTF-8 characters it points to is not. C++ obfuscates this distinction because of how it automatically deep copies vectors and strings for you in many cases.

> then it's important to point out that this struct is the string, and the array of UTF-8 characters it points to is not.

So then under this model, what’s the difference between a string and a string_view?

> So then under this model, what’s the difference between a string and a string_view?

string_view doesn't do any deep copying.

...one is a string and one is a string view?

I'm not sure what you're getting at. They're both small structs holding pointers to char data, they just operate on that data differently.

Exactly, thinking about things in terms of their implementations is usually not a good way to actually understand what that thing is. By arguing that std::string is just the struct itself, which consists of who knows what... you fail to appreciate the actual semantics of std::string and how those semantics are really what defines the std::string.

std::string_view also has implementation details that in principle could be similar to std::string, it's a pointer with a size, but the semantics of std::string_view are very different from the semantics of std::string.

And that's the crux of the issue, it's better to understand classes in terms of their semantics, how they operate, rather than their implementations. Implementations can change, and two very separate things can have the same or very similar implementations.

A std::string is not just some pointers and some record-keeping data; a std::string is best understood as a class used to own and manage a sequence of characters, with the various operations that one would expect for such management. A std::string_view is a non-owning, read-only variation of such a class that operates on an existing sequence of characters.

How these are implemented and their structural details is not really what's important, it's how someone is expected to use them and what can be done with them that counts.
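The owning versus non-owning distinction can be sketched in a few lines (C++17, function names invented for illustration). Copying a `std::string` duplicates the characters, while copying a `std::string_view` only copies a pointer and a length, so both views keep looking at the owner's characters.

```cpp
#include <string>
#include <string_view>

// Copying a std::string duplicates the characters: the copy is independent.
bool string_copy_is_independent() {
    std::string original = "Dave";
    std::string copy = original;
    copy[0] = 'W';
    return original == "Dave" && copy == "Wave";
}

// Copying a std::string_view copies only a pointer and a length: both views
// refer to the characters owned by `owner`.
bool view_copy_shares_characters() {
    std::string owner = "Dave";
    std::string_view first = owner;
    std::string_view second = first;
    owner[0] = 'W';  // in-place write, no reallocation, so the views stay valid
    return first == "Wave" && second == "Wave" && first.data() == second.data();
}
```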

What I'm getting at is that the description “the array is not the string” isn’t very elucidating for someone who doesn’t understand the nuances of ownership/lifetime and move semantics (the topic of the article).

“C++ obfuscates this distinction because of how it automatically deep copies vectors and strings”

It does this because it has to, to guarantee its interface invariants. That “array” (if there is one) really is the string. Just because there might be an indirection doesn’t change that.

> they just operate on that data differently.

Well, they operate on the memory “array” of the char data differently (well, in the latter case, not at all).

Also a nitpick: std::string unlike String in Rust or other languages is not married to an encoding. And C++ managed to fuck that one up even more so recently.

It should be, but it's very much not in the real world at least as far as I've seen.

Using std::move for anything other than "unique ownership without pointers" really messes things up. People put std::move everywhere expecting performance gains, just like we used to put "&" everywhere expecting performance gains. It's a bit of cargo cultism that can be nicely dispelled by realizing std::move is just std::copy with a compiler-defined constructor invocation potentially run to determine the old value. With that phrasing, it's hard to hallucinate performance gains that might come automatically.

> std::move is just std::copy with a compiler-defined constructor invocation potentially run to determine the old value

I have no idea what that means.

std::move is a cast to an rvalue reference. That can potentially trigger a specific overloaded function to be selected and possibly, ultimately, a move constructor or assignment operator to be called.

For an explicit move to be profitable, an expression would have otherwise chosen a copy constructor for a type with an expensive copy constructor and a cheap move constructor.
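A short sketch of "std::move is a cast", assuming a hypothetical overload pair `which_overload`. The cast only changes which overload is selected; on its own it does not touch the object.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical overload pair: one for lvalues, one for rvalues.
const char* which_overload(const std::string&) { return "copy"; }
const char* which_overload(std::string&&) { return "move"; }

// std::move applied to an lvalue std::string yields std::string&&.
static_assert(
    std::is_same<decltype(std::move(std::declval<std::string&>())),
                 std::string&&>::value,
    "std::move is a cast to an rvalue reference");

// The cast alone runs no constructor, so the string keeps its value.
std::string unchanged_by_cast() {
    std::string text = "still here";
    std::string&& ref = std::move(text);  // binds a reference, nothing is moved
    (void)ref;
    return text;
}
```

A move only happens if the selected overload actually constructs or assigns from the rvalue reference it receives.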

std::copy is a range algorithm, not sure what's the relevance.

Yes, typed too fast. I meant the explicit copy constructor. Luckily, HN will hide my garbage text quickly enough. Thanks for the correction!

In fact, using std::move everywhere can actually make your performance worse!

https://devblogs.microsoft.com/oldnewthing/20231124-00/?p=10...
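One well-known case of this (offered as an illustration, not as a summary of the linked post) is `return std::move(local);`. Returning a local by name is eligible for copy elision, while wrapping it in std::move forces a move construction. The hypothetical `Tracker` type below counts constructor calls to make that visible.

```cpp
#include <utility>

// Hypothetical type that counts how often it is copied or moved.
struct Tracker {
    static int moves;
    static int copies;
    Tracker() = default;
    Tracker(const Tracker&) { ++copies; }
    Tracker(Tracker&&) noexcept { ++moves; }
};
int Tracker::moves = 0;
int Tracker::copies = 0;

Tracker make_plain() {
    Tracker t;
    return t;  // eligible for copy elision (NRVO); at worst one move
}

Tracker make_with_move() {
    Tracker t;
    return std::move(t);  // not a plain name any more, so elision is not allowed
}
```

Compilers typically elide the construction in `make_plain` entirely, so the explicit std::move costs one extra move constructor call rather than saving anything.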

The real gem of the article is the interlude. E.g., reaching back to C days and pointing out that "It's either copy, or pointer". Once someone has that mental model solidly in hand, all the syntax sugar in the world cannot harm you.

Also "It was an ergonomic advancement." hides a lot of the overwrought syntax sugar in C++ that causes it to be such a weird language if you come from elsewhere. But still an excellent insight into the state of affairs.

I think the "Apparently" language makes it seem like this is some kind of accident that nobody would know about, when really the author was probably just being a creative writer, and the example was fundamental to the post.

You can think of a C++ move as a shallow copy that takes ownership of all objects originally owned by the source.

I mean it'll copy 3 pointers worth of data in all cases. It's just that for short strings, those 3 pointers worth of data contain the text of the string.

There is a lot wrong, but your analysis misses the elephant: the function takes a copy and so a copy must be generated. std::move will move if possible, but in this case a move isn't possible and so a copy will be made.

Move is allowed to not move because in generic code you don't want to have to check whether a move is possible for the type in question.

In the case of the example, there is a move, and std::move works in the example.

The function, show, doesn't take a copy, it takes a Person object. Persons can be copy constructed or move constructed (both constructors are implicit, since there are no user-defined constructors). std::move returns an r-value reference to main's p, so Person's implicit move constructor is called, and show's p argument is move constructed from main's p. The reported address changes because moving creates a new object in C++, but the moved-to object may take ownership of the heap allocated memory and other resources from the moved-from object.

In this case, the moved-to Person takes ownership of the heap allocation from the moved-from Person's string member and sets the moved-from Person's string member to an empty string. Without std::move, show's p is copy constructed, including its string member.
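A rough reconstruction of that scenario, for readers without the article at hand. The exact code in the article may differ; `Outcome` and `call_with_move` are names invented here, and `show` returns the text instead of printing it so the result can be checked.

```cpp
#include <string>
#include <utility>

struct Person {
    std::string name;
    int age;
};

// Takes its parameter by value: the caller decides whether that parameter is
// copy constructed (from an lvalue) or move constructed (from an rvalue).
std::string show(Person person) {
    return person.name + " is " + std::to_string(person.age) + " years old";
}

// Hypothetical record of what the caller sees.
struct Outcome {
    std::string shown;    // text produced inside show()
    Person source_after;  // the caller's object after the move
};

Outcome call_with_move() {
    Person p{"Dave", 42};
    std::string shown = show(std::move(p));  // implicit move constructor runs
    // p is still a valid object: p.age is unchanged, p.name holds an
    // unspecified (often empty) value.
    return Outcome{shown, p};
}
```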

C++ making the most inscrutable semantic possible, speedrun any %.

> "Unspecified" means that the implementation gets to decide what is the status of the string after move. For long enough strings, typical implementation strategy is to set the moved-from string in an empty state.

Thusly, what happens in code that accesses the string after the move is UB.

In the implementation of C++ the article uses the string was just empty. But for all we know it may still contain a 1:1 copy of the original or 20 copies or a gobbledygook of bytes.

Any code that relies on the string being something (even empty) may behave differently if it isn't. That's the very definition of UB.

"A typical implementation strategy" is meaningless for someone writing code against a language specification.

You're then writing code against a specific compiler/std lib and that's fine. But let's be honest about it.

That's not what UB means. "This will behave differently on different implementations" is implementation defined behavior. Compilers are not allowed to assume that implementation defined behavior never occurs or reject your program if they can prove that it happens.

Undefined behavior is a stronger statement and says that if the behavior occurs then the entire program is simply not valid. This allows the compiler to make vastly more aggressive changes to your program.

There is nothing in the standard or definition of C++ that states that undefined behavior renders a program invalid.

On the contrary the actual C++ standard explicitly states that permissible undefined behavior includes, and I quote "behaving during translation or program execution in a documented manner characteristic of the environment".

It's also worth noting that numerous well known and used C++ libraries explicitly make use of undefined behavior, including boost, Folly, Qt. Furthermore, as weird and ironic as this sounds, implementing cryptographic libraries is not possible without undefined behavior.

"valid program" is not really a term that is used in the standard (I only count one normative usage). What the standard does say is:

"A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input. However, if any such execution contains an undefined operation, this document places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation)."

I.e. a program that contains UB is undefined.

Of course, as you observe, an implementation can go beyond the standard and extend the abstract machine to give defined semantics to those undefined operations.

That's still different from implementation defined behaviour, where a conforming implementation must give defined semantics.

> Thusly, what happens in code that accesses the string after the move is UB.

No, it is unspecified behaviour, which is not the same thing as undefined behaviour.

> In the implementation of C++ the article uses the string was just empty. But for all we know it may still contain a 1:1 copy of the original or 20 copies or a gobbledygook of bytes.

Yes, and if you want to make sure that the string is empty before you do something else with it, you just use a clear() (which will be optimised away by the compiler anyway).

Or, if you prefer, you can assign another string to it, or anything else really.

> Any code that relies on the string being something (even empty) may behave differently if it isn't. That's the very definition of UB.

No it is not.

> "A typical implementation strategy" is meaningless for someone writing code against a language specification.

Then don't rely on that specific implementation detail and make sure that the string is in the state you want or, even better, don't touch the moved-from string ever again.
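A sketch of the "put it in the state you want" advice, using a hypothetical `collect` function that reuses one buffer across iterations. The `clear()` at the top of the loop means the code never depends on what the previous move left behind.

```cpp
#include <string>
#include <utility>
#include <vector>

// Builds "<word>!" for every input word, reusing a single buffer variable.
std::vector<std::string> collect(const std::vector<std::string>& words) {
    std::vector<std::string> out;
    std::string buffer;
    for (const std::string& w : words) {
        buffer.clear();  // known state, whatever the last move left in buffer
        buffer += w;
        buffer += '!';
        out.push_back(std::move(buffer));  // buffer is now valid but unspecified
    }
    return out;
}
```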