CoW strings with atomic reference counting were definitely the wrong choice for a multi-core universe. The performance penalty is way too high. If you need those semantics there are other ways to get them.
Is a single atomic increment really that expensive?
I mean we are not even talking about a full memory barrier here, just the atomic increment's implied acquire and release on the single variable. Other operations not dependent on a subsequent read could still be re-ordered in both directions.
Also keep in mind that the alternative was copying the whole string, which means both a heap allocation (often pretty expensive, even with per-core heaps) and the actual copying. Unless a platform has a terrible implementation of atomic increment, or you have a std::string that is frequently copied on multiple cores (so that there is meaningful contention), I would have expected the copying implementation to be slower. But I'm not super familiar with the timings of these things, so I certainly could be mistaken.
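To make the trade-off concrete, here is a minimal sketch of where the atomic increment sits in a CoW string. CowString, Rep, set and use_count are names I made up for illustration; this is not the libstdc++ implementation, and the memory orderings are just one reasonable choice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical illustration type, NOT any real library's string.
class CowString {
    struct Rep {
        std::atomic<std::size_t> refs{1};  // shared by every copy
        std::string data;                  // the characters themselves
        explicit Rep(std::string s) : data(std::move(s)) {}
    };
    Rep* rep_;

    // Drop one reference; the last owner frees the shared block.
    void release() {
        if (rep_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete rep_;
    }

public:
    explicit CowString(std::string s) : rep_(new Rep(std::move(s))) {}

    // Copy: no allocation and no character copy, only an atomic increment.
    // This write is what touches the cache line other cores may be holding.
    CowString(const CowString& o) : rep_(o.rep_) {
        rep_->refs.fetch_add(1, std::memory_order_relaxed);
    }

    CowString& operator=(const CowString& o) {
        o.rep_->refs.fetch_add(1, std::memory_order_relaxed);  // safe on self-assignment
        release();
        rep_ = o.rep_;
        return *this;
    }

    ~CowString() { release(); }

    const char* c_str() const { return rep_->data.c_str(); }
    std::size_t use_count() const { return rep_->refs.load(std::memory_order_acquire); }

    // Write: if the block is shared, pay for the deferred allocation and copy now.
    void set(std::size_t i, char c) {
        assert(i < rep_->data.size());
        if (rep_->refs.load(std::memory_order_acquire) != 1) {
            Rep* fresh = new Rep(rep_->data);  // O(n) deep copy
            release();
            rep_ = fresh;
        }
        rep_->data[i] = c;
    }
};
```

The copy constructor is the cheap path being argued about; set() is where the cost that copying would have paid up front shows up later.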
My understanding was that the change was more about being able to set proper bounds on some operations, ensuring .c_str() is O(1), and not O(n) sometimes, and similarly with string writes, etc.
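A small probe of the kind of guarantee I mean, as a sketch only. The function name is hypothetical, and the result depends on your standard library: I would expect true on an implementation that follows the C++11 rules, whereas a CoW string that was still sharing its buffer would have to unshare inside a[0], which is O(n) and moves the buffer.

```cpp
#include <string>

// Hypothetical probe: does a plain element write leave the buffer where it was?
bool write_leaves_buffer_in_place() {
    std::string a(64, 'x');           // long enough to avoid the short-string case
    std::string b = a;                // a CoW string would share a's buffer here
    const char* before = a.data();
    a[0] = 'y';                       // non-const access: a CoW string must unshare first
    return before == a.data() && b[0] == 'x';
}
```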
Copying short strings does not necessarily involve an allocation in implementations using short string optimization. Shooting down the cache line in a remote CPU that happens to have used a frequently-used string recently is absurdly expensive by comparison.
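If you want to check whether your own standard library keeps short strings inline, a rough sketch is to test whether the character buffer lives inside the string object itself. stored_inline is a name I made up, and the inline threshold is implementation-specific, so treat the result as an observation about your toolchain rather than a guarantee.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: true if s keeps its characters inside the object
// itself (short string optimization), false if they live on the heap.
bool stored_inline(const std::string& s) {
    const auto obj = reinterpret_cast<std::uintptr_t>(&s);
    const auto buf = reinterpret_cast<std::uintptr_t>(s.data());
    return buf >= obj && buf < obj + sizeof(std::string);
}
```

With the mainstream current implementations I would expect stored_inline to be true for something like "hi" and false for a 256-character string, which means copying the former never touches the allocator.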
The COW and short string optimizations are not mutually exclusive. If we assume short string optimization is implemented both before and after, then we are back to comparing the atomic increment to allocation. And different allocation approaches can make the cost of heap allocation differ quite substantially. I'd fully expect that some allocation approaches are cheaper than the cache line invalidation from atomic increment, but some others that tend to involve a lot of pointer chasing can be rather costly.
Certainly plenty of widely copied strings are short strings, so a COW implementation that lacks the short-string optimization could very easily be a bad bottleneck for multi-core compute.
You have accurately described the GNU CoW string :-)
My impression through the fog of history is that what happened was a really clever GNU person with little foresight and no access to an SMP system implemented std::string with CoW. Its performance in practice was so poor that the standard committee intentionally changed the standard to make it an illegal implementation, thereby eradicating the GNU CoW string. There was no higher principled logic.
Also, the point of that was to improve multithreading of string: I think this very idea is problematic. I've written at this point hundreds of thousands of lines of C++, and the number of times where strings are really, by design, supposed to be shared across threads can honestly be counted on the fingers of one hand, just like e.g. the justification for using Arc over Rc in Rust. 99% of string handling is done as some GUI work on the main thread or as part of some task processing done in some network thread, which stays in that thread.
Clearly there's a frontier where the cost situation begins to favor the CoW approach, and I think authors should consciously choose whether they want a CoW string or not based on their use-case, but that goes against the idea of std::string as a jack-of-all-trades. Personally I don't really like std::string as a concept. It overlaps with too many other concepts. Is it just vector<char>, or std::unique_ptr<char> with SSO? The latter is nice in cases where you want std::string to adopt or release existing memory. Or do you want something like absl::Cord, which is like the old GNU CoW string but with even more stunts under the hood?