| Been doing some IPC experiments recently following the 3tilley post[0], because there just isn't enough definitive information (even if it's a snapshot in time) out there. Shared memory is crazy fast, and I'm surprised that there aren't more things that take advantage of it. Super odd that gRPC doesn't do shared memory, and basically never plans to?[1]. All that said, the constructive criticism I can offer for this post is that in mass-consumption announcements like this one for your project, you should: - RPC throughput (with the usual caveats/disclaimers)
- Comparison (ideally graphed) to an alternative approach (ex. domain sockets)
- Your best/most concise & expressive usage snippet 100ns is great to know, but I would really like to know how much RPC/s this translates to without doing the math, or seeing it with realistic de-serialization on the other end. [0]: https://3tilley.github.io/posts/simple-ipc-ping-pong/ [1]: https://github.com/grpc/grpc/issues/19959 |
1. Unless you're using either fixed sized or specially allocated structures, you end up paying for serialization anyhow (zero copy is actually one copy).
2. There's no way to reference count the shared memory - if a reader crashes, it holds on to the memory it was reading. You can get around this with some form of watchdog process, or by other schemes with a side channel, but it's not "easy".
3. Similar to 2, if a writer crashes, it will leave behind junk in whatever filesystem you are using to hold the shared memory.
4. There's other separate questions around how to manage the shared memory segments you are using (one big ring buffer? a segment per message?), and how to communicate between processes that different segments are in use and that new messages are available for subscribers. Doable, but also not simple.
It's a tough pill to swallow - you're taking on a lot of complexity in exchange for that low latency. If you can do so, it's better to put things in the same process space if you can - you can use smart pointers and a queue and go just as fast, with less complexity. Anything CUDA will want to be single process, anyhow, (ignoring cuda IPC, anyhow). The number of places where you need (a) ultra low latency (b) high bandwidth/message size (c) can't put everything in the same process (d) are using data structures suited to shared memory and finally (e) are okay with taking on a bunch of complexity just isn't that high. (It's totally possible I'm missing a Linux feature that makes things easy, though).
I plan on integrating iceoryx into a message passing framework I'm working on now (users will ask for SHM), but honestly either "shared pointers and a queue" or "TCP/UDS" are usually better fits.