

by ygoldfeld 799 days ago

It's so hard to communicate this stuff in writing! There are several angles of potential interest; I wish I could simply chat in-person with anyone curious, you know? Of course that is impossible. (I'll do my best here at HN and the Flow-IPC Discussions board at GitHub.)

I hope the above 2 links get the job done in communicating the key points. There is certainly no shortage of documentation! Still:

If you'll indulge me, I do want to share how this project got started and became open-source. I actually do suspect this might help one get a feeling of what this thing is, and is not.

My name is Yuri Goldfeld. I have worked at Akamai since 2005 (with a break for startup shenanigans, and VMware, in the middle). I designed or co-designed Flow-IPC and wrote about 75% of it (by lines of code ignoring comments); my colleague Eddy Chan wrote the rest, including the bulk of the SHM-jemalloc component (which is really cool IMO).

Akamai in certain core parts is a C++/Linux shop, with dogged scrutiny of latency. Every millisecond along the request path is scrutinized. A few years ago I was asked to do a couple things:

- Determine the best serializer to use, in general, but especially for IPC protocols. The answer there was easy IMO: Cap'n Proto.

- Split-up a certain important C++ service into several parts, for various reasons, without adding latency to the request path.

The latter task meant, among other things, communicating large amounts of user data from server application to server application. capnp-encoded structures (sometimes big - but not necessarily) would also need to be transmitted; as would FDs.

The technical answers to these challenges are not necessarily rocket science. FDs can be transmitted via Unix domain socket as "ancillary data"; the POSIX `sendmsg()` API is hairy but usable. Small messages can be transmitted via Unix domain socket, or pipe, or POSIX MQ (etc.). It would not be okay to transmit large blobs of data via those transports: too much copying into and out of kernel buffers is involved, which would add major latency. So we'd have to use shared memory (SHM). Certainly a hairy technology... but again, doable. And as for capnp - well - you "just" code a `MessageBuilder` implementation that allocates segments in SHM instead of regular heap like `capnp::MallocMessageBuilder` does.
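For readers who have not seen the `sendmsg()` ancillary-data dance, here is a minimal sketch of passing one FD across a Unix domain socket with `SCM_RIGHTS`. It is not Flow-IPC code; the function names (`send_fd`, `recv_fd`, `demo_fd_passing`) are made up for illustration, and the demo stays inside one process (via `socketpair()`) purely to keep it short.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: send one FD over a connected Unix domain socket.
bool send_fd(int sock, int fd_to_send) {
  char payload = 'F';  // portable practice: send at least 1 byte of regular data with the FD
  iovec iov{&payload, 1};

  alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);

  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;  // "the payload of this control message is FDs"
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

  return ::sendmsg(sock, &msg, 0) == 1;
}

// Hypothetical helper: receive one FD; returns -1 on failure.
int recv_fd(int sock) {
  char payload = 0;
  iovec iov{&payload, 1};

  alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);

  if (::recvmsg(sock, &msg, 0) != 1) return -1;
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS) return -1;

  int fd = -1;
  std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
  return fd;  // a new descriptor referring to the same open file description
}

// Pass the read end of a pipe across a socketpair, then read through the received copy.
std::string demo_fd_passing() {
  int sv[2];
  int pipefd[2];
  if (::socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return "";
  if (::pipe(pipefd) != 0) { ::close(sv[0]); ::close(sv[1]); return ""; }

  std::string result;
  if (send_fd(sv[0], pipefd[0])) {
    const int received = recv_fd(sv[1]);
    if (received >= 0) {
      char buf[8] = {};
      if (::write(pipefd[1], "hello", 5) == 5) {
        const ssize_t n = ::read(received, buf, sizeof(buf));
        if (n > 0) result.assign(buf, static_cast<std::size_t>(n));
      }
      ::close(received);
    }
  }
  ::close(pipefd[0]); ::close(pipefd[1]);
  ::close(sv[0]); ::close(sv[1]);
  return result;
}
```

Calling `demo_fd_passing()` should hand back the bytes written into the pipe, read through the descriptor that arrived over the socket. Even in this small form, the control-buffer sizing and `CMSG_*` macros show why the API gets called hairy.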

Thing is, I noticed that various parts of the company had similar needs. I've observed some variation of each of the aforementioned tasks custom-implemented - again, and again, and again. None of these implementations could really be reused anywhere else. Most of them ran into the same problems - none of which is that big a deal on its own, but together (and across projects) it more than adds up. To coders it's annoying. And to the business, it's expensive!

Plus, at least one thing actually proved to be technically quite hard. Sharing (via SHM) a native C++ structure involving STL containers and/or raw pointers: downright tough to achieve in a general way. At least with Boost.interprocess (https://www.boost.org/doc/libs/1_84_0/doc/html/interprocess....) - which is really quite thoughtful - one can accomplish a lot... but even then, there are key limitations, in terms of safety and ease of use/reusability. (I'm being a bit vague here... trying to keep the length under control.)
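To make the raw-pointer problem concrete: two processes usually map the same SHM segment at different virtual addresses, so an absolute pointer stored inside the segment is garbage to the other side. The usual workaround is a self-relative ("offset" or "fancy") pointer. The following is a toy sketch of that idea only, with hypothetical names (`OffsetPtr`, `Node`, `demo_offset_ptr`); it is neither Boost.interprocess's `offset_ptr` nor anything from Flow-IPC, and it simulates "a second mapping" by copying the bytes to another buffer.

```cpp
#include <cstdint>
#include <cstring>
#include <new>

// Toy self-relative pointer: stores the distance from its own address to the target,
// so it remains meaningful wherever the enclosing block of memory ends up mapped.
// Limitation of this sketch: copying an OffsetPtr by itself (rather than the whole
// block it lives in) would break it; real implementations re-base on copy.
template <typename T>
class OffsetPtr {
 public:
  void set(T* target) {
    offset_ = target
        ? reinterpret_cast<std::intptr_t>(target) - reinterpret_cast<std::intptr_t>(this)
        : 0;
  }
  T* get() const {
    return offset_
        ? reinterpret_cast<T*>(reinterpret_cast<std::intptr_t>(this) + offset_)
        : nullptr;
  }

 private:
  std::intptr_t offset_ = 0;  // 0 encodes null in this sketch (no self-pointing)
};

struct Node {
  int value = 0;
  OffsetPtr<Node> next;
};

int demo_offset_ptr() {
  // seg_a plays the role of the segment as mapped by process A, seg_b by process B.
  alignas(Node) unsigned char seg_a[2 * sizeof(Node)];
  alignas(Node) unsigned char seg_b[2 * sizeof(Node)];

  Node* a0 = new (seg_a) Node;
  Node* a1 = new (seg_a + sizeof(Node)) Node;
  a0->value = 1;
  a1->value = 41;
  a0->next.set(a1);

  // Same bytes, different base address. A raw Node* stored in a0 would still point into seg_a.
  std::memcpy(seg_b, seg_a, sizeof(seg_a));

  Node* b0 = reinterpret_cast<Node*>(seg_b);
  Node* b1 = b0->next.get();
  const bool resolves_inside_b =
      reinterpret_cast<unsigned char*>(b1) == seg_b + sizeof(Node);
  return resolves_inside_b ? b0->value + b1->value : -1;
}
```

STL-compliant containers can only live in SHM if they route every internal pointer through a type like this (via their allocator's `pointer` typedef), which is part of why plain `std::` containers with the default allocator do not "just work" across processes.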

So, I decided to not just design/code an "IPC thing" for that original key C++ service I was being asked to split... but rather one that could be used as a general toolkit, for any C++ applications. Originally we named it Akamai-IPC, then renamed it Flow-IPC.

As a result of that origin story, Flow-IPC is... hmmm... meat-and-potatoes, pragmatic. It is not a "framework." It does not replace or compete with gRPC. (It can, instead, speed RPC frameworks up by providing the zero-copy transmission substrate.) I hope that it is neither niche nor high-maintenance.

To wit: If you merely want to send some binary-blob messages and/or FDs, it'll do that - and make it easier by letting you set-up a single session between the 2 processes, instead of making you worry about socket names and cleanup. (But, that's optional! If you simply want to set up a Unix domain socket yourself, you can.) If you want to add structured messaging, it supports Cap'n Proto - as noted - and right out of the box it'll be zero-copy end-to-end. That is, it'll do all the SHM stuff without a single `shm_open()` or `mmap()` or `ftruncate()` on your part. And if you want to customize how that all works, those layers and concepts are formally available to you. (No need to modify Flow-IPC yourself: just implement certain concepts and plug them in, at compile-time.)
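For context, this is roughly the manual POSIX SHM boilerplate being referred to. It is a sketch under assumptions: the object name and the function name `demo_manual_shm` are invented here, the "reader" is simulated in the same process by opening the object a second time, and on older glibc versions linking may require `-lrt`.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>
#include <string>

std::string demo_manual_shm() {
  // Hypothetical name; both sides must agree on it and avoid collisions.
  const std::string name = "/shm_sketch_" + std::to_string(::getpid());
  const std::size_t size = 4096;

  // "Writer" side: create the named object, size it, map it.
  const int wfd = ::shm_open(name.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
  if (wfd < 0) return "";
  if (::ftruncate(wfd, static_cast<off_t>(size)) != 0) {
    ::close(wfd);
    ::shm_unlink(name.c_str());
    return "";
  }
  void* w = ::mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, wfd, 0);
  ::close(wfd);  // the mapping stays valid after the descriptor is closed
  if (w == MAP_FAILED) {
    ::shm_unlink(name.c_str());
    return "";
  }
  std::strcpy(static_cast<char*>(w), "hello via SHM");

  // "Reader" side (normally another process): open by name, map read-only.
  std::string result;
  const int rfd = ::shm_open(name.c_str(), O_RDONLY, 0);
  if (rfd >= 0) {
    void* r = ::mmap(nullptr, size, PROT_READ, MAP_SHARED, rfd, 0);
    ::close(rfd);
    if (r != MAP_FAILED) {
      result = static_cast<const char*>(r);  // no copy through kernel buffers
      ::munmap(r, size);
    }
  }

  ::munmap(w, size);
  // Forget this (or crash before reaching it) and the object lingers in /dev/shm.
  ::shm_unlink(name.c_str());
  return result;
}
```

Naming, sizing, and especially cleanup on crash are the chores a session-style layer is meant to take off your hands.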

Lastly, for those who want to work with native C++ data directly in SHM, it'll simplify setup/cleanup considerably compared to what's typical. For the original Akamai service in question, we needed to use SHM as intensively as one typically uses the regular heap. So in particular Boost.interprocess's built-in 2 SHM-allocation algorithms were not sufficient. We needed something more industrial-strength. So we adapted jemalloc (https://jemalloc.net/) to work in SHM, and worked that into Flow-IPC as a standard available feature. (jemalloc powers FreeBSD and big parts of Meta.) So jemalloc's anti-fragmentation algorithms, thread caching - all that stuff - will work for our SHM allocations.

Having accepted this basic plan - develop a reusable IPC library that handled the above oft-repeated needs - Eddy Chan joined and especially heavily contributed on the jemalloc aspects. A couple years later we had it ready for internal Akamai use. All throughout we kept it general - not Akamai-specific (and certainly not specific to that original C++ service that started it all off) - and personally I felt it was a very natural candidate for open-source.

To my delight, once I announced it internally, the immediate reaction from higher-up was, "you should open-source it." Not only that, we were given the resources and goodwill to actually do it. I have learned that it's not easy to make something like this presentable publicly, even having developed it with that in mind. (BTW it is about 69k lines of code, 92k lines of comments, excluding the Manual.)

So, that's what happened. We wrote a thing useful for various teams internally at Akamai - and then Akamai decided we should share it with the world. That's how open-source thrives, we figured.

On a personal level, of course it would be gratifying if others found it useful and/or themselves contributed. What a cool feeling that would be! After working with exemplary open-source stuff like capnp, it'd be amazing to offer even a fraction of that usefulness. But, we don't gain from "market share." It really is just there to be useful. So we hope it is!

2 comments

robobully 799 days ago

That's an impressive read, thank you and congrats on the release! I think that nowadays the development and adoption of performant IPC mechanisms is unfairly low, it's good to have such tech opensourced.

My question is, how does Flow-IPC compare to projects like Mojo IPC (from Chromium) and Eclipse iceoryx? At first glance they all pursue similar goals and pay much less attention to complex allocation management, yet managing to perform well enough.


ygoldfeld 799 days ago

Appreciate your time! And, naturally, this was the question I expected to pop up once I was able to work through everything required internally here at Akamai to actually put this guy out in public. Wouldn't it be sad :-( if the same thing already existed, and we just hadn't noticed it?

In tactical terms, back when this all started, of course we looked around for something to use; after all why write a whole thing, if we could use something? We didn't write a serializer, for example, since a kick-butt one (capnp - and FlatBuffers also seems fine) already existed. Back then, though, nothing really jumped out. So looking back, it may have simply been a race; a few people/groups out there saw this niche and started developing things. I see iceoryx in particular has one identical plank, which is workable/general end-to-end zero-copy via SHM; and it was released a couple years before, hence has a super nice presentation I hugely appreciate: many well-documented examples in particular. Whereas for us, providing that will take some more effort. (That said, we did not skimp on documentation: everything is documented meticulously, and there is a hopefully-reader-friendly Manual as well.)

When it came down to the core abilities we needed, it was like this:

1. We wanted to be able to share arbitrary combinations of C++ native structures, and not just PODs (plain-old-data types). Meaning, things with pointers needed to work; and things with STL-compliant containers needed to work. Boost.interprocess initially looked like it got that job done... but not enough for our use-case at least. When it came down to it, with Boost.interprocess:

- Allocation from a SHM-segment had to be done using a built-in Boost-written heap-allocation algorithm (they provided two of them, and you can plug in your own... as long as all the control structures lived inside SHM).

- The shared data structure had to live entirely within one SHM-segment (mmap()ed area).

But, we needed some heavy-duty allocation - the Boost ones are not that. Plugging in a commercial-grade one - like jemalloc - was an option, but that was itself quite a project, especially since the control structures have to live in SHM for it to work. jemalloc is the most advanced thing available, but it kept control structures as globals, so plopping those into SHM meant changing jemalloc (a lot... Eddy actually did pursue this during the design phase). Plus, having both sides of the conversation reading and writing in one shared SHM-segment was not great due to safety concerns.

And, whatever allocation would be used - with Boost.interprocess's straightforward assumptions - had to be constrained to one mmap()ed area (SHM-segment). jemalloc (for example; substitute tcmalloc or any other heap-provider as desired) would want to mmap() new segments at will. Boost.interprocess doesn't work in that advanced way.
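To illustrate the "one segment, control structures inside the segment" constraint, here is a deliberately naive sketch: a bump allocator whose only state (a cursor) lives at the start of a single `mmap()`ed shared region, wrapped in a standard-conforming allocator so a `std::vector` can draw from it. All names (`SegmentHeader`, `SegmentAllocator`, `demo_vector_in_segment`) are hypothetical; this is not Boost.interprocess, jemalloc, or Flow-IPC code, and it never frees or grows, which is exactly the kind of limitation described above.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <new>
#include <vector>

// Control block placed at the start of the segment, so allocator state lives in the region.
struct SegmentHeader {
  std::size_t capacity;  // total bytes in the segment
  std::size_t used;      // bytes handed out so far, including this header
};

inline SegmentHeader* create_segment(std::size_t bytes) {
  void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return nullptr;
  return new (p) SegmentHeader{bytes, sizeof(SegmentHeader)};
}

inline void* segment_allocate(SegmentHeader* seg, std::size_t bytes, std::size_t align) {
  // The segment base is page-aligned, so aligning the offset aligns the address.
  const std::size_t start = (seg->used + align - 1) / align * align;
  if (start + bytes > seg->capacity) throw std::bad_alloc();  // cannot grow past one segment
  seg->used = start + bytes;
  return reinterpret_cast<unsigned char*>(seg) + start;
}

template <typename T>
struct SegmentAllocator {
  using value_type = T;
  SegmentHeader* seg;

  explicit SegmentAllocator(SegmentHeader* s) noexcept : seg(s) {}
  template <typename U>
  SegmentAllocator(const SegmentAllocator<U>& other) noexcept : seg(other.seg) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(segment_allocate(seg, n * sizeof(T), alignof(T)));
  }
  void deallocate(T*, std::size_t) noexcept {}  // bump allocator: memory is never reused
};

template <typename T, typename U>
bool operator==(const SegmentAllocator<T>& a, const SegmentAllocator<U>& b) noexcept {
  return a.seg == b.seg;
}
template <typename T, typename U>
bool operator!=(const SegmentAllocator<T>& a, const SegmentAllocator<U>& b) noexcept {
  return !(a == b);
}

bool demo_vector_in_segment() {
  const std::size_t kBytes = 1 << 16;
  SegmentHeader* seg = create_segment(kBytes);
  if (!seg) return false;

  bool ok = false;
  {
    SegmentAllocator<int> alloc(seg);
    std::vector<int, SegmentAllocator<int>> v(alloc);
    for (int i = 0; i < 100; ++i) v.push_back(i);

    // The vector's element storage should sit inside the shared region.
    auto* base = reinterpret_cast<unsigned char*>(seg);
    auto* data = reinterpret_cast<unsigned char*>(v.data());
    ok = data >= base && data + v.size() * sizeof(int) <= base + kBytes && v[99] == 99;
  }
  ::munmap(seg, kBytes);
  return ok;
}
```

Note that the vector in this sketch still stores raw pointers internally, so another process could only use it if it mapped the region at the same address. A production-grade heap also wants to free, coalesce, cache per thread, and map more memory on demand, which is where the single-segment model stops being enough.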

2. We wanted to send capnp-encoded messages (and, more generally, just "stuff" - linear buffers) with end-to-end zero-copy, meaning capnp-segments would need to be allocated in SHM. I spoke with Kenton Varda (Cap'n Proto overlord) very recently; he too felt this straightforward desire of not piping-over copies of capnp-encoded items. Various Akamai teams implemented and reimplemented this by hand, for specific use cases, but as I said earlier, it wasn't reusable in a general way (not for our specific use-case for that original big C++ service that I was tasked with splitting-up).

Other niceties were desirable too - not worrying about IPC-resource names/conflicts/..., ensuring SHM cleanup straightforwardly on exit or crash - but they were more tangential (albeit extremely useful) things that came about once we decided to handle the core (1) and (2) in reusable fashion.

At that point, nothing seemed to be around that would just give us those fairly intuitive things. I am not saying these are necessary for every IPC use-case... but they never hurt at the very least, and having those readily available gives one a feeling of power and freedom.

Now, as to the actual question: How does it compare to those? I am not going to lie (because lying is bad): It'll take me a few days to understand the ins and outs of Mojo IPC and iceoryx, so any impression I give here is going to be preliminary and surface-level. To that point, I expect the correct/true answer to your question will be a matter of diving into each API and simply seeing which one seems best to the particular potential user. For Flow-IPC, this Manual page here should be a pretty decent overview of what's available with code snippets: https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...

That said, my preliminary impression is:

(cont.)


ygoldfeld 799 days ago

Versus iceoryx (the C++ version, not the Rust-oriented iceoryx2):

TL;DR: So far, it looks super-sweet (as well as mature, already supporting macOS for example). However, it is more of an investment to use than Flow-IPC, with a central daemon and a special event-loop model. It also doesn't want to do #1 described above (no pointers, no using existing STL-compliant container types).

This guy seems really cool, and it directly addresses at least the major part of need #2 above. You can transmit buffers with near-zero latency, and it'll do the SHM stuff for you. (For capnp specifically one would then implement the required SHM-allocating capnp::MessageBuilder, and off we go. Flow-IPC does give you this part out-of-the-box, granted.) Looking over the examples and overview, it seems like integrating it into an event loop might involve some pretty serious learning of iceoryx's event-loop model + subscribe/publish. There is also a central daemon that needs to run.

Flow-IPC, to me, seems to have a lower-learning-curve/lower-maintenance approach to this. There's no central daemon or any equivalent of it. For each asynchronous thing (a transport::Channel, for example, which has receive-oriented methods), you can use one of 2 supplied APIs. The sync_io-style API will let you plug into anything select()/poll()/epoll()-oriented (and has a syntactic-sugar hook for boost.asio loops). If you've got an event loop, it'll be easy to plug Flow-IPC ops right into it - no background threads added thereby. Or, use the async-I/O-style API; then it'll create background threads as needed and call your callback (e.g., on message receipt) from there, leaving it to you to handle it there or by posting the "true" handling onto one of your own threads.
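The general shape of "plug into your own poll()-style loop, no extra threads" can be sketched as follows. To be clear, this does not use Flow-IPC's actual sync_io API: `WatchedFd`, `run_once`, and `demo_poll_loop` are hypothetical names, and a plain pipe stands in for whatever descriptor a library would ask you to watch. The point is only the division of labor: the library names an FD and a handler, and your loop does the waiting.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical: "please tell me when this FD is readable" plus what to run then.
struct WatchedFd {
  int fd;
  std::function<void()> on_readable;
};

// One pass of a single-threaded loop: block until something is readable (or timeout),
// then run the matching handlers synchronously on the calling thread.
int run_once(const std::vector<WatchedFd>& watched, int timeout_ms) {
  std::vector<pollfd> pfds;
  pfds.reserve(watched.size());
  for (const auto& w : watched) pfds.push_back({w.fd, POLLIN, 0});

  const int ready = ::poll(pfds.data(), pfds.size(), timeout_ms);
  if (ready <= 0) return ready;  // 0 = timeout, <0 = error

  for (std::size_t i = 0; i < pfds.size(); ++i) {
    if (pfds[i].revents & POLLIN) watched[i].on_readable();
  }
  return ready;
}

std::string demo_poll_loop() {
  int pipefd[2];
  if (::pipe(pipefd) != 0) return "";

  std::string received;
  std::vector<WatchedFd> watched;
  watched.push_back(WatchedFd{pipefd[0], [&] {
    char buf[16];
    const ssize_t n = ::read(pipefd[0], buf, sizeof(buf));
    if (n > 0) received.assign(buf, static_cast<std::size_t>(n));
  }});

  // Stand-in for "the peer sent a message".
  if (::write(pipefd[1], "msg", 3) == 3) run_once(watched, 1000);

  ::close(pipefd[0]);
  ::close(pipefd[1]);
  return received;
}
```

Because the handler runs inside `run_once()`, on the loop's own thread, there is no locking to think about, which is the appeal of this style over callbacks fired from background threads.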

Point being, my impression so far is, using Flow-IPC in this sense is a lower-effort enterprise. It's pretty much just there to plug-in. (I really hope that isn't slander. That's my take so far - as I said, it'll take me a few days to understand these products in-depth.)

Now, in terms of need #1. (I acknowledge, this need is not for every C++ IPC use-case ever. 2 processes collaborating on one native C++ data structure full of SHM-compliant containers and/or pointers =/= done every day. Still, though, if 2 threads in one process can do it easily, why shouldn't they as-easily be able to do it across a process boundary? Right?) If I understand iceoryx's example on this topic (https://iceoryx.io/latest/examples/complexdata/)... I quote: "To implement zero-copy data transfer we use a shared memory approach. This requires that every data structure needs to be entirely contained in the shared memory and must not internally use pointers or references. ... Therefore, most of the STL types cannot be used, but we reimplemented some constructs. This example shows how to send/receive a iox::cxx::vector and how to send/receive a complex data structure containing some of our STL container surrogates."

With Flow-IPC, this does not apply. You can share existing STL-compliant containers, and (if you want) can have raw pointers too. We have tests nesting boost::container string/vector/map guys and our own flow::util::Basic_blob STL-compliant guy and sharing them, no problem. We've provided the necessary allocator and fancy-pointer types. Moreover, with a single line you can do this in jemalloc-allocated SHM; or instead choose a Boost.interprocess-backed single-segment SHM. (Depends on what you desire for safety versus simplicity, internally. I am being a bit vague on that here, but it's in the docs, I promise.) I believe this is a pretty good illustration of Flow-IPC's "thing":

- Meat-and-potatoes, do what you want to do in your daily C++, without a major learning curve...

- ...but without sacrificing essential power...

- ...and extensibly, meaning you can modify its behavior in core ways without requiring a massive amount of learning of how Flow-IPC is built.

Versus Mojo IPC:

I really need to understand it better, before I can really comment. So far, it seems like its equivalent of Flow-IPC's sessions = super cool, building up a network of processes that can all talk to each other once in the network. Flow-IPC's sessions are basic: you want process A and B to speak, you establish a session (during this step, one is designated as the session-server and can therefore accept more sessions from that app or other apps)... then from there, you can make channels (and access SHM arenas, if you are using SHM directly as opposed to letting the zero-copy channels do it invisibly). It also has various-language bindings; Flow-IPC is C++... straight up.

That established, I need to understand it better. It looks like it provides super-fast low-level IPC transports (similar to Flow-IPC's unstructured-layer channels) in platform-agnostic fashion - but does not seem to specifically facilitate end-to-end zero-copy transmission of data structures via SHM. I could be completely wrong here, but it actually looks like one could feasibly plug-in Mojo IPC pipes as Flow-IPC Blob_sender/receiver (and/or Native_handle_sender/receiver) concept impl, into Flow-IPC, and get the end-to-end zero-copy goodness.

At least superficially, so far, Flow-IPC again looks like perhaps a more down-to-earth/readily-pluggable effort. (But, still documented out-the-wazoo!)


elfenpiff 794 days ago

I am one of the maintainers of iceoryx and the creator of iceoryx2, so I wanted to add and complete some more details.

iceoryx/iceoryx2 was intended for safety-critical systems initially but now expands to all other domains. In safety-critical systems that run, for instance, in cars or planes, you do not want to have undefined behavior - but the STL is full of it, so we had to reimplement an STL subset (https://github.com/eclipse-iceoryx/iceoryx/tree/master/iceor...) that does not use the heap or exceptions and does not come with undefined behavior. So you can send vectors or strings via iceoryx, but you have to use our STL implementations.

It also comes with a service-oriented architecture; you can create a service - identified by name - and communicate via publish-subscribe, request-response, and direct events (and in the planning: pipeline or blackboard).

One major thing is iceoryx robustness. In safety-critical systems, we have a term called freedom-of-interference, meaning that a crash in application A does not affect application B. When they communicate via shared memory, for instance, and use a mutex, they could dead-lock each other when one app dies while holding the mutex. This is why we go for lock-free algorithms here that are tested meticulously, and we are also planning a formal verification of those lock-free constructs.
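As a generic illustration of why lock-free structures help here (this is a textbook single-producer/single-consumer ring buffer, not iceoryx's actual implementation; `SpscRing` and `demo_spsc` are names made up for the sketch): neither side ever holds a lock, so a process that dies mid-operation cannot leave the other one blocked on a mutex.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Single-producer / single-consumer ring buffer. Each index is written by exactly one side.
template <typename T, std::size_t N>
class SpscRing {
  static_assert((N & (N - 1)) == 0, "N must be a power of two");
  // For use across processes, the atomics must be genuinely lock-free.
  static_assert(std::atomic<std::size_t>::is_always_lock_free, "need lock-free size_t atomics");

 public:
  // Producer side only.
  bool try_push(const T& v) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == N) return false;  // full
    slots_[head & (N - 1)] = v;
    head_.store(head + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  // Consumer side only.
  std::optional<T> try_pop() {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return std::nullopt;  // empty
    T v = slots_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);  // release the slot
    return v;
  }

 private:
  std::array<T, N> slots_{};
  std::atomic<std::size_t> head_{0};  // next slot to write; written only by the producer
  std::atomic<std::size_t> tail_{0};  // next slot to read; written only by the consumer
};

// Single-threaded walk-through of the API: fill the ring, overflow once, then drain it.
int demo_spsc() {
  SpscRing<int, 4> ring;
  int accepted = 0;
  for (int i = 1; i <= 5; ++i) {
    if (ring.try_push(i)) ++accepted;  // the 5th push is rejected: ring is full
  }
  int sum = 0;
  while (auto v = ring.try_pop()) sum += *v;
  return accepted * 100 + sum;
}
```

If the producer crashes between writing a slot and publishing `head_`, the consumer simply never sees that element; nothing is left locked. Getting from this toy to something verified for safety-critical use is, of course, where the real effort goes.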

iceoryx2 is the next-gen of iceoryx where we refactored the architecture heavily to make it more modular and address all the major pain points.

- no longer requires a central daemon and has decentralized all the management tasks, so you get the same behavior without the daemon

- comes with events that can be either based on an underlying fd-event (slower but can be integrated with OS event-multiplexing), or you can choose the fast semaphore route (it is now up to the user)

Currently, we are also working on language bindings for C, C++, Python, Lua, Swift, C#, etc.


robobully 798 days ago

Thanks for the detailed answer! I really appreciate that.


ygoldfeld 798 days ago

You’re welcome. But I must tell you, at work I asked how my answers are, keep me honest. So a coworker looked at this thread and was just like, “dude just get to the point, no one wants to read all that.” And then explained that in huge detail.

That’s just how I talk. With all the writing I’ve had to do lately - documentation, blog, announcements - it’s been a constant struggle forcing myself to say fewer words, keep it short, keep the eyeballs, come onnnnnnn, edit edit edit!!! And that’s good… it’s how it should be. It’s just totally unnatural to me personally… hehe.

FINALLY there’s a chance to simply talk about it to some humans, so I uh… maybe went a little wild with the verbosity.


bsder 798 days ago

Actually, I'm happy you spent all those words, and I read them all.

I've been looking for an SHM IPC for a very long time and not finding one. It's nice to know that I'm not the only idiot thinking along these lines.

In addition, it's also nice to know that this was hard. I have taken several stabs at doing this, and I always bounced off thinking "It can't be this difficult. I'm screwing up." Seeing that smart people working for a real company had to do major surgery on something like jemalloc is a bit of a validation.

Can't say I'm happy to see this in C++, but I'll take what I can get. :)

Thanks to all the folks who wrote it. And thank you for the long winded explanations otherwise I probably would have ignored it.


OnlyMortal 799 days ago

I’ve spent a lot of time with boost asio and serialisation of objects into a boost variant to send that across the wire. The server visits the variant to process the message. Including boost shared memory for file data.

Both for unix domain sockets and TCP.

There’re plenty of boost examples around so, I’d suggest, you take their examples and work them for your framework.

As I’m sure you’re aware, a clean and easy to read example will make a difference.

It’s great that you’re open source and I hope you get some traction.


ygoldfeld 799 days ago

Indeed, examples from every angle are probably the one deficit of the existing documentation. There are a couple, such as the perf_demo described in the blog post. I’d like to add ones showing integration with

- epoll based event loop

- boost.asio based event loop

(Boost.interprocess and boost.asio are huge inspirations and are both used inside!)

As for traction: it’s tough! Have to get eyeballs; and then have to convey a sense of being worth one’s trust.

Thank you for your time.


OnlyMortal 799 days ago

Integration with boost asio would be of interest to many - myself included. It is the de facto standard for anyone who’s got past Stevens’ Unix Network Programming.

It would gain a level of trust with developers.


ygoldfeld 799 days ago

Roger dodger.

For what it is worth at this time - obviously acting on the following statement will require some level of trust -

It is very much ready to use with boost.asio. (I know that, because I myself use boost.asio religiously. If it were not compatible with it, I'd pretty much have to not use Flow-IPC myself.) Though, it could (fairly easily) gain a number of wrapper classes that would turn our stuff into actual boost.asio I/O objects; then it'd be even more straightforward.

Topic is covered here:

https://flow-ipc.github.io/doc/flow-ipc/versions/main/genera...

There's even the little section entitled, "I'm a boost.asio user. Can't I just give your constructor my io_context, and then you'll place the completion handler directly onto it?"

To summarize, though...

-1- You can have Flow-IPC create background threads as-needed and ping your completion handler (e.g., "message received") from such threads.

-2- You can have it not create any background threads, instead asking you to .async_wait() (via boost.asio, most easily; but also manually with poll() or whatever you want) whenever it needs internally to async-await something. Your own completion handler (e.g., handle just-received message M) shall execute synchronously at only predictable points, in non-blocking fashion.

-3- Direct integration with boost.asio - meaning ipc::transport::Channel (e.g.) would take an io_context/executor/whatever in its ctor, and .async_X(F) would indeed post F onto that io_context/executor/whatever = essentially syntactic sugar = a TODO. (I'd best file an Issue, I just remembered.)

The perf_demo (partially recreated in the blog-post) integrates into a single-threaded boost.asio io_context, using technique #2 above. In the source code snippets in the blog, we avoided anything asynchronous just to keep it focused for the max # of readers (hopefully).


OnlyMortal 791 days ago

Top tip: ensure your ASIO code is not exported from a shared library.

I’ve been hit by Cephfs using some version and my own code using another.

The fixes were simple though.

Edit: as for performance, I’d not focus on that too much. It’ll depend on circumstances the end user has. Myself, I’d measure the interfaces with stack based timings and dump to a JSON file at exit. Graphs under various loads and an a/b comparison.

As an example, on a dedupe system I measured LZO as better for performance than LZ4. HPE rack units with spinning rust disks.
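One way the "stack based timings dumped to JSON" idea could look, as a rough sketch: an RAII timer records elapsed time per named scope into a registry, and a function renders the registry as JSON. The names (`ScopedTimer`, `timing_registry`, `timings_to_json`, `demo_timings`) are invented for this example; it is not thread-safe and does not escape scope names, so treat it as a starting point only.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>

struct TimingStats {
  std::uint64_t calls = 0;
  std::uint64_t total_ns = 0;
};

// Process-wide registry of per-scope totals (not thread-safe in this sketch).
inline std::map<std::string, TimingStats>& timing_registry() {
  static std::map<std::string, TimingStats> registry;
  return registry;
}

// Lives on the stack: starts timing on construction, records on destruction.
class ScopedTimer {
 public:
  explicit ScopedTimer(std::string name)
      : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

  ~ScopedTimer() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    TimingStats& stats = timing_registry()[name_];
    ++stats.calls;
    stats.total_ns += static_cast<std::uint64_t>(ns);
  }

 private:
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

// Render the registry as a JSON object keyed by scope name (names are not escaped here).
inline std::string timings_to_json() {
  std::ostringstream out;
  out << "{";
  bool first = true;
  for (const auto& [name, stats] : timing_registry()) {
    if (!first) out << ",";
    first = false;
    out << "\"" << name << "\":{\"calls\":" << stats.calls
        << ",\"total_ns\":" << stats.total_ns << "}";
  }
  out << "}";
  return out.str();
}

std::string demo_timings() {
  timing_registry().clear();
  for (int i = 0; i < 2; ++i) {
    ScopedTimer timer("send_message");  // would wrap the interface call being measured
  }
  return timings_to_json();
}
```

To get the "at exit" behavior, one option is to register a handler with `std::atexit` that writes `timings_to_json()` to a file; the resulting files from runs under different loads can then be graphed or A/B compared.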

Edit 2: I’ve forwarded your GitHub to my work account. I’ll offer the research to a colleague (Jira backlog) to look at when “someone” wants our new system to be faster. We have a boost asio solution I wrote that works - local unix domain sockets. Hitachi NAS.
