by imron 1465 days ago
It's the 'too many string types' that helps.

With C++, if you have char*'s (because you don't need to own the memory) and you pass one to a function that takes a const std::string& (because it also doesn't want to own the memory), then there will still be an implicit conversion to a temporary std::string (involving an allocation) despite neither the caller nor the callee needing to own any memory.

With Rust, if you have a &str (because you don't need to own the memory) and you pass it to any function that takes a String (or even the unidiomatic &String), then you will get a compile error. There won't be any implicit conversion of types and therefore no implicit allocation. If you really want to pass it, you need to explicitly convert it, making the cost of the allocation explicit.
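A minimal sketch of that difference, using only the standard library. The function names byte_len and into_shouted are made up for illustration.

```rust
// Borrowing parameter: accepts any string slice and never allocates.
fn byte_len(s: &str) -> usize {
    s.len()
}

// Owning parameter: the caller has to hand over a String.
fn into_shouted(mut s: String) -> String {
    s.make_ascii_uppercase();
    s
}

fn main() {
    let borrowed: &str = "omnibox";

    // Fine: a &str goes straight into a &str parameter.
    println!("{}", byte_len(borrowed));

    // into_shouted(borrowed); // does not compile: expected `String`, found `&str`

    // The allocation has to be spelled out at the call site.
    let owned = into_shouted(borrowed.to_owned());
    println!("{}", owned);
}
```

The commented-out call is the compile error described above; the to_owned() call is where the cost becomes visible in the source.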

Rust's "too many strings" model says "there are many different ways in which you can use string-like objects, each with their own performance tradeoffs. Know which one you want to use in your code or I won't compile".

5 comments

This discussion is making me wonder if windows-rs [1], the crate with official Rust bindings for all Windows APIs, is doing something that's not idiomatic Rust. Specifically, for any Windows API function that takes a UTF-16 string as a parameter, the signature for that parameter is something like "impl IntoParam<PCWSTR>". The crate then implements that trait for String and &str, so you can pass a normal Rust UTF-8 string (even a string literal), and it'll be automatically converted to a freshly-allocated, null-terminated UTF-16 string (which gets freed after the function call). That seems like it could lead to the same thoughtless inefficiency as in the story about the Chrome omnibox.

[1]: https://github.com/microsoft/windows-rs
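To illustrate the concern in general terms, here is a rough sketch of how a conversion trait on a parameter can hide an allocation behind an ordinary-looking call. IntoWide and set_title are hypothetical names invented for this example; this is not the windows-rs API, only the same shape of design.

```rust
// Hypothetical stand-in for a conversion trait; NOT the windows-rs API.
trait IntoWide {
    fn into_wide(self) -> Vec<u16>;
}

impl IntoWide for &str {
    fn into_wide(self) -> Vec<u16> {
        // Allocates a fresh, NUL-terminated UTF-16 buffer on every call.
        self.encode_utf16().chain(std::iter::once(0)).collect()
    }
}

// An already-converted buffer passes through without another allocation.
impl IntoWide for Vec<u16> {
    fn into_wide(self) -> Vec<u16> {
        self
    }
}

// Hypothetical wrapper standing in for a generated binding.
// Returns the number of UTF-16 code units that would be handed to the OS.
fn set_title(title: impl IntoWide) -> usize {
    let wide = title.into_wide();
    // A real binding would pass wide.as_ptr() to the OS here.
    wide.len()
}

fn main() {
    // Looks free at the call site, but converts and allocates each time.
    println!("{}", set_title("hello"));
}
```

Nothing at the call site of set_title("hello") hints at the conversion, which is the same kind of invisibility as the C++ temporary std::string.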

Well, that will be necessary until Windows gets UTF-8 APIs. Probably not soon. Until then there are various optimizations you can do, like caching the UTF-16 conversion alongside the UTF-8 string (good for calling OS APIs frequently with long-lived strings), allocating temporary UTF-16 conversions on the stack (good for infrequent calls with strings up to a certain size), or storing raw UTF-16 strings as opaque bytes in Rust memory (good for providing strings back to the OS that you got from the OS).
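A rough sketch of the first two of those optimizations, standard library only. CachedWide and encode_wide_into are hypothetical names, and the 64-unit buffer size is an arbitrary assumption. The third option (keeping OS-provided UTF-16 as opaque data) is not shown.

```rust
/// Keeps the UTF-16 form next to the UTF-8 original (hypothetical type).
struct CachedWide {
    utf8: String,
    wide: Vec<u16>, // NUL-terminated, computed once
}

impl CachedWide {
    fn new(utf8: String) -> Self {
        let wide = utf8.encode_utf16().chain(std::iter::once(0)).collect();
        CachedWide { utf8, wide }
    }

    fn as_str(&self) -> &str {
        &self.utf8
    }

    fn as_wide(&self) -> &[u16] {
        &self.wide
    }
}

/// Encodes `s` plus a NUL terminator into a caller-provided buffer.
/// Returns the number of code units written (excluding the NUL),
/// or None if the buffer is too small.
fn encode_wide_into(s: &str, buf: &mut [u16]) -> Option<usize> {
    let mut len = 0;
    for unit in s.encode_utf16() {
        // Always leave room for the terminator.
        if len + 1 >= buf.len() {
            return None;
        }
        buf[len] = unit;
        len += 1;
    }
    if len >= buf.len() {
        return None;
    }
    buf[len] = 0;
    Some(len)
}

fn main() {
    // Long-lived string: convert once, reuse for every OS call.
    let title = CachedWide::new(String::from("long-lived title"));
    println!("{} -> {} units", title.as_str(), title.as_wide().len());

    // Short, infrequent string: convert into a stack buffer.
    let mut stack_buf = [0u16; 64];
    match encode_wide_into("short", &mut stack_buf) {
        Some(n) => println!("fits on the stack: {} units", n),
        None => println!("too long, fall back to a heap allocation"),
    }
}
```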

You should try to avoid calling OS APIs in general and cache the results as much as possible. Who knows what the performance characteristics are of an API that has to serve 7 layers of historical OSes simultaneously. Unless you're directly interfacing with the kernel you shouldn't expect much. Omnibox-like layered calls between your app and the OS are a worst-case scenario regardless of conversions.

winapi does support UTF-8 on recent versions:

https://docs.microsoft.com/en-us/windows/apps/design/globali...

Very interesting, I wasn't aware. After glancing over that doc, it looks like they smuggle UTF-8 in through the -A variant Windows APIs [1] by explicitly setting the CP_UTF8 codepage in an application manifest. I wonder if this actually uses UTF-8 internally to service the API call, or if it just converts strings to wide form and calls the -W variant on the Windows side instead of making you do it on the app side. If the latter, it may be better to avoid this feature so you don't close the door on potential optimizations like the ones I mentioned above.

[1]: Windows has two variants of many API calls with either -A or -W suffix, where the -A suffix is for strings formatted as 1-byte ASCII (or a specified codepage) and the -W suffix is for strings formatted as 2-byte UTF-16 (kinda). Example: DlgDirListA / DlgDirListW, https://docs.microsoft.com/en-us/windows/win32/api/winuser/n...

That might hide it from the caller, but the function that receives that IntoParam type will still need to explicitly call the conversion function.
Yes, and all those receiving functions are auto-generated as part of windows-rs.
It would most likely suffer from similar problems when interacting with the C and C++ APIs in the rest of Chrome though (e.g. what to do if you have a Rust String, but the other side wants a const ref to a C++ std::string).
Use a CxxString: https://cxx.rs/binding/cxxstring.html

At some point there will need to be an allocation when crossing the Rust -> C++ boundary because Rust strings are not null-terminated.

The difference Rust makes is that unlike C++, it is always explicit when the allocation occurs.
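A small sketch of the null-termination point using the standard library's CString. This models a C-style const char* boundary rather than the cxx crate's CxxString, and to_c_string is a hypothetical helper name.

```rust
use std::ffi::{CString, NulError};

// Hypothetical helper: copies `s` into a new NUL-terminated buffer.
// The copy (and its allocation) is visible in the source.
fn to_c_string(s: &str) -> Result<CString, NulError> {
    CString::new(s)
}

fn main() {
    let s: &str = "hello";

    // A &str is a pointer plus a length; there is no trailing NUL in its bytes.
    assert_eq!(s.as_bytes(), b"hello");

    // The converted form owns a separate buffer with the terminator appended.
    let c = to_c_string(s).expect("no interior NUL bytes");
    assert_eq!(c.as_bytes_with_nul(), b"hello\0");

    // Strings with an interior NUL cannot be represented and are rejected.
    assert!(to_c_string("he\0llo").is_err());
}
```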

Unfortunately the restrictions mentioned on that page make it quite a pain to use in practice.
My minor, unpolished grievance with Rust's approach is that you have to do this for all kinds of types (e.g., Path vs PathBuf). It's tedious to have to write these pairs all the time, along with all of the trait implementations and so on. It almost feels like it would be nice if the type system could allow us to write `String` or `PathBuf` and automatically generate the corresponding `str` or `Path` types.
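For anyone unfamiliar with the pairing being described, here is a short sketch with the standard library's Path and PathBuf. The function names are made up for illustration.

```rust
use std::path::{Path, PathBuf};

// Borrowed form: inspects a path that someone else owns.
fn has_rs_extension(p: &Path) -> bool {
    p.extension().and_then(|ext| ext.to_str()) == Some("rs")
}

// Owned form: builds and returns a new path.
fn with_backup_extension(p: &Path) -> PathBuf {
    let mut owned = p.to_path_buf(); // explicit allocation
    owned.set_extension("bak");
    owned
}

fn main() {
    let owned = PathBuf::from("src/main.rs");

    // &PathBuf derefs to &Path, just as &String derefs to &str.
    println!("{}", has_rs_extension(&owned));
    println!("{}", with_backup_extension(&owned).display());
}
```

The standard library ships these pairs ready-made; the tedium mentioned above shows up when you need the same owned/borrowed split for your own types.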
> With C++, if you have char*'s (because you don't need to own the memory)

If you are using C strings in C++ you are either doing something incredibly low level or don't care about performance at all. C strings require strlen calls or something equivalent for basic operations and you can easily run into code with exploding runtime if you aren't extremely careful.

> If you are using C strings in C++ you are either doing something incredibly low level or don't care about performance at all.

…or interoperating with C code?

But the temporary copy only happens going from const char * to std::string, so the C code would have to be calling C++ code.

std::string to const char * doesn’t (usually?) require copying.

psst... std::string_view
A step in the right direction if you have a compiler with c++17 support.

Note: Chrome only started allowing c++17 features in Dec 2021 [0], and whether std::string_view is allowed to be used is still 'to be determined'.

0: https://chromium.googlesource.com/chromium/src/+/HEAD/styleg...

A variant of what is becoming string_view in the standard has existed within Google's codebase(s) for years. I don't recall using it much in Chromium when I worked in there, but it's all over Google3 and is now in absl (Google's open sourcing of some of its base c++ components).

Chromium has "string_piece": https://chromium.googlesource.com/chromium/src/base/+/refs/h... which is at least 9 years old (was moved from elsewhere in the repo into base/ then)

The point is not that c++ can't do this (I also have code that does this dating back over 10 years), it's that despite having code to do string_view/string_piece, Chromium was still performing 25,000 allocations per keystroke in its Omnibox because c++ has other common ways to represent "constant string owned by someone else", and there are hidden performance issues that will trip up even experienced programmers when mixing these ways incorrectly.

Despite having better options available (either in the standard library or custom code), the less optimal ways still get used.

Rust had the benefit of learning from c++'s mistakes and separated the concepts of owned vs unowned strings into separate types, with explicit conversions required whenever an allocation would occur. This was baked into the language from the beginning, and so you don't get a mix of different types in signatures to convey the concept of pointing to a slice of a string owned by someone else; you just have &str.
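As a minimal sketch of "a slice of a string owned by someone else" (first_word is a made-up example function):

```rust
// Returns a view into the caller's string; nothing is copied or allocated.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

fn main() {
    let owner = String::from("chromium omnibox");
    let word = first_word(&owner);

    // The slice points into the owner's buffer rather than at a copy.
    assert_eq!(word.as_ptr(), owner.as_ptr());
    println!("{}", word);
}
```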

Even if you get fancy with your interface and do things like AsRef<str>, there's still no concern about implicit or hidden allocations. Any time you need to own the memory (either for yourself or to pass into another function) you need to do so explicitly and you end up with a different type (much to the chagrin and confusion of newcomers to the language).
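A quick sketch of the AsRef<str> case; word_count and keep are hypothetical names.

```rust
// Accepts &str, String, &String, ... and only ever borrows the contents.
fn word_count<S: AsRef<str>>(s: S) -> usize {
    s.as_ref().split_whitespace().count()
}

// Owning still has to be spelled out, and the result is a different type.
fn keep<S: AsRef<str>>(s: S) -> String {
    s.as_ref().to_owned() // the one explicit allocation
}

fn main() {
    let owned = String::from("owned by someone else");

    println!("{}", word_count("a string literal"));
    println!("{}", word_count(&owned));
    println!("{}", keep(&owned));
}
```

The generic signature widens what callers can pass in, but the only allocation is the to_owned() you can see in keep.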

C++ is trying to correct its mistakes also, but not everyone can use those latest features and even if they can, the mistakes still have to be left in for compatibility reasons.

Ah yes, the one that still just wraps a raw pointer in the end

https://github.com/isocpp/CppCoreGuidelines/issues/1038

And then someone will convert a std::string_view to a const char* and things will explode...