by bjz_ 4293 days ago
> Someone at Yandex recently did a presentation about Rust[1] in which they pointed to a bit of (completely idiomatic!) C++11 code that caused undefined behavior.

It would be great to have at least that segment of the talk translated. Sounds like a good example.


The example:

  std::string get_url() { 
      return "http://yandex.ru";
  }

  string_view get_scheme_from_url(string_view url) {
      unsigned colon = url.find(':');
      return url.substr(0, colon);
  }

  int main() {
      auto scheme = get_scheme_from_url(get_url());
      std::cout << scheme << "\n";
      return 0;
  }
Can you say what the problem is here? What is a string_view?
A non-owning pointer into string memory owned by someone else (effectively a reference into some string). AIUI, the problem is the temporary string returned by get_url() is deallocated immediately after the get_scheme_from_url call, meaning that the string_view reference `scheme` is left dangling.
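To make the lifetime issue concrete, here is a minimal sketch of one way to avoid the dangling view: give the string a name so that it outlives the view. This assumes C++17's std::string_view (the snippet above uses an unqualified string_view, presumably a pre-standard equivalent), and scheme_of_owned_url is a hypothetical helper that is not part of the talk.

```cpp
#include <string>
#include <string_view>

std::string get_url() {
    return "http://yandex.ru";
}

// Returns a view into `url`; the result is only valid while the viewed
// buffer is still alive.
std::string_view get_scheme_from_url(std::string_view url) {
    // If there is no ':', find() returns npos and substr() clamps the
    // count to the end of the view, so the whole input is returned.
    const auto colon = url.find(':');
    return url.substr(0, colon);
}

// Hypothetical helper: the named `url` owns the buffer, so the view is
// still valid when it is copied into the returned std::string.
std::string scheme_of_owned_url() {
    const std::string url = get_url();  // owner lives until end of scope
    const std::string_view scheme = get_scheme_from_url(url);
    return std::string(scheme);         // copy out before `url` is destroyed
}
```

The only change from the original is that the std::string is bound to a named variable before the view is taken; the view itself must still not escape that scope.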
The string_view pattern is a pretty bad idea and useless with a decent compiler.
What do you mean by useless?

If I have a string like "foo bar baz" and I want the second word, should I copy out that data into a whole new string? That seems rather inefficient.

(How is a compiler going to optimise that away?)
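For reference, a rough sketch of the "second word" case with a view, assuming C++17's std::string_view and words separated by single spaces. second_word is a made-up name for illustration.

```cpp
#include <string_view>

// Returns the second space-separated word of `text`, or an empty view if
// there is none. No characters are copied; the result points into the
// buffer that `text` refers to.
std::string_view second_word(std::string_view text) {
    const auto first_space = text.find(' ');
    if (first_space == std::string_view::npos) {
        return {};
    }
    text.remove_prefix(first_space + 1);    // drop the first word and the space
    return text.substr(0, text.find(' '));  // up to the next space, or the end
}
```

The caller can still write std::string(second_word(s)) when it needs an owned copy, so the function itself does not have to decide.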

For small strings, a copy is not only faster but more multithreading friendly.

Keep in mind that on a 64-bit architecture a view is at least 16 bytes large and that small strings can be copied to the stack resulting in better locality and reduced memory usage.

Last but not least, with copy elision, your temporaries might not even exist in the first place.

Example:

    std::string data;
    // ...
    auto str = data.substr(2, 3);
    // pretty sure str will be optimized away
    if (str[0] == 'a')
I don't think copy elision[1,2] means what you think it means. It simply allows the compiler to avoid, e.g., allocating a new string when returning a string, or allocating a new string to store the result of a temporary. That is, copy elision allows

   std::string str = data.substr(2, 3);
   return str;
to only allocate one new string (for the return value of substr), instead of two. There's no way the compiler can get out of constructing at least one std::string for the return value, especially if there's any form of dynamic substr'ing (e.g. parsing a CSV file with columns that aren't all the same width).
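A small sketch that counts what actually happens, using a hypothetical instrumented type Tracked as a stand-in for std::string (assuming C++17 for the inline static counters and guaranteed elision of the first initialization):

```cpp
// Hypothetical instrumented type standing in for std::string.
struct Tracked {
    static inline int constructed = 0;      // fresh objects built
    static inline int copied_or_moved = 0;  // copy or move constructions
    Tracked() { ++constructed; }
    Tracked(const Tracked&) { ++copied_or_moved; }
    Tracked(Tracked&&) noexcept { ++copied_or_moved; }
};

// Stand-in for data.substr(2, 3): it always has to build one new object.
Tracked make_substring() {
    return Tracked{};
}

// Mirrors `std::string str = data.substr(2, 3); return str;`
Tracked forward_substring() {
    Tracked str = make_substring();  // elided: `str` is initialized directly
    return str;                      // NRVO usually elides this; otherwise it is a move
}
```

After `Tracked t = forward_substring();` the constructed counter is 1: elision removes the extra copies, but the one object that substr has to produce is still built.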

Sharing is only multithreading unfriendly if there's modification happening, and modification of textual (i.e. Unicode) data is bad practice and hard to get right, since all Unicode encodings are variable width (yes, even UTF-32, it is a variable width encoding of visible characters).

Furthermore, a string_view is strictly better than a string for many applications, since a string_view can always be copied into a string by the caller if necessary (i.e. each function can choose to return the most sensible/most performant thing, which is a string_view if it's just a substring of one of the arguments).

The only sensible argument against string_view in C++ I know is: it's easy to get dangling references. Which is correct, but that's a general problem with C++ itself, not with the idea of string views (Rust has a perfectly safe version in the form of &str, which cannot become dangling like in C++).

> Keep in mind that on a 64-bit architecture a view is at least 16 bytes large and that small strings can be copied to the stack resulting in better locality and reduced memory usage.

No, a string_view points into memory that already exists, there's no increased memory usage; a small string copied on to the stack will be part of the string struct, which is at least 3 * 8 = 24 bytes: a pointer, the length and the capacity. Also, a memcpy out of the original string is always going to be more expensive than just getting the pointer/length (or pair of pointers) for a string_view, since the memcpy has to do this anyway.
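A quick way to check both points on a given implementation, as a sketch assuming C++17. Footprint and measure are hypothetical names, and the exact sizes depend on the standard library in use.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical report type for comparing a view with an owning copy.
struct Footprint {
    std::size_t view_bytes;    // sizeof(std::string_view)
    std::size_t string_bytes;  // sizeof(std::string), excluding any heap buffer
    bool view_aliases_source;  // view points into the source buffer
    bool copy_aliases_source;  // copy points into the source buffer
};

// Takes the same substring both ways. Requires pos <= source.size();
// otherwise both substr calls throw std::out_of_range.
Footprint measure(const std::string& source, std::size_t pos, std::size_t len) {
    const std::string_view view = std::string_view(source).substr(pos, len);
    const std::string copy = source.substr(pos, len);
    return {sizeof view, sizeof copy,
            view.data() == source.data() + pos,
            copy.data() == source.data() + pos};
}
```

On the mainstream 64-bit standard libraries I know of, view_bytes is 16 and string_bytes is 24 or 32; the view reuses the source buffer while the copy has its own storage.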

[1]: http://en.wikipedia.org/wiki/Copy_elision

[2]: http://definedbehavior.blogspot.com/2011/08/value-semantics-...

Yeah my example for copy elision sucked, but that doesn't mean it cannot play in favor when you work by value.

> Sharing is only multithreading unfriendly if there's modification happening, modification of textual (i.e. Unicode) data is bad practice and hard to get right

Read-only access to data indeed scales "infinitely" on modern architectures.

> No, a string_view points into memory that already exists,

Yes. Right. How do you store that? You need at least one pointer and an int, or two pointers. That's 16 bytes. Memcpy for a couple of bytes is very quick when it's stack to stack thanks to page locality.

Also, if you are using pointers you will have aliasing issues which will have an impact on performance. If you work by values you allow the compiler to optimize things better.

For small strings, string views are just dumb, and "most of the time" strings are very small.

To give a better example of why working with a string view is both a bad idea and dangerous, it's as if you said "I don't want to copy this vector, therefore I will work on iterators". That's obviously a bad idea.

The slides on Slideshare are surprisingly easy to follow (and were a good introduction for me): http://www.slideshare.net/yandex/rust-c
See slide 42 of STL's recent talk for what I guess will be a similar example: https://github.com/CppCon/CppCon2014/tree/master/Presentatio...
Reproduced here:

  const regex r(R"(meow(\d+)\.txt)");
  smatch m;
  if (regex_match(dir_iter->path().filename().string(), m, r)) {
      DoSomethingWith(m[1]);
  }
- What's wrong with this code?

  - Haqrsvarq orunivbe va P++11
  - Pbzcvyre reebe va P++14
  - .fgevat() ergheaf n grzcbenel fgq::fgevat
  - z[1] pbagnvaf vgrengbef gb n qrfgeblrq grzcbenel
(http://rot13.com/ 'd if you want to guess.)
Seems like this was fixed in C++14 by adding a deleted std::string&& overload, so passing a temporary no longer compiles.

http://en.cppreference.com/w/cpp/regex/regex_match
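With or without that overload, the safe pattern is to match against a named string that outlives the smatch. A minimal sketch, where meow_number is a hypothetical helper and the filename is passed in instead of coming from a directory iterator:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: extracts the digits from names like "meow42.txt".
// Returns an empty string when the name does not match.
std::string meow_number(const std::string& filename) {
    static const std::regex r(R"(meow(\d+)\.txt)");
    std::smatch m;
    // `filename` is an lvalue that outlives `m`, so the iterators stored
    // in `m` stay valid for as long as we read from it.
    if (std::regex_match(filename, m, r)) {
        return m[1].str();  // copy the capture out while `filename` is alive
    }
    return {};
}
```

In the original snippet the equivalent fix would be to store the result of .string() in a local std::string before calling regex_match.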

The underlying problem is still there, fixing a few of the worst cases in the standard library is helpful but only up to a point. (E.g. anyone with a custom function that does something in a similar vein needs to remember to do the same.)
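For a custom function, "doing the same" could look roughly like the sketch below, assuming C++17. scheme_of and the can_call_scheme_of detection helper are hypothetical names; the deleted overload mirrors what the standard library does for regex_match.

```cpp
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

// Hypothetical function that returns a view into its argument.
inline std::string_view scheme_of(const std::string& url) {
    return std::string_view(url).substr(0, url.find(':'));
}
// Reject temporaries at compile time: the view would outlive them.
std::string_view scheme_of(const std::string&&) = delete;

// Detection helper: true if `scheme_of(std::declval<T>())` compiles.
template <typename T, typename = void>
struct can_call_scheme_of : std::false_type {};
template <typename T>
struct can_call_scheme_of<
    T, std::void_t<decltype(scheme_of(std::declval<T>()))>>
    : std::true_type {};

static_assert(can_call_scheme_of<std::string&>::value,
              "named strings are accepted");
static_assert(can_call_scheme_of<const std::string&>::value,
              "const named strings are accepted");
static_assert(!can_call_scheme_of<std::string>::value,
              "temporaries are rejected");
```

One cost of this approach is that a string literal argument is rejected too, since it would first be converted to a temporary std::string. And nothing forces the author of such a function to add the deleted overload in the first place.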