by ygoldfeld 795 days ago

I think that (whether a native struct versus a capnp schema-based struct helps, and how much) is a general question of what kind of serialization is best for a particular use-case. I wouldn't want to litigate that here fully. Personally though I've found capnp-based IPC protocols to be neat and helpful, across versions and protocol changes (where e.g. there are well-defined rules of forward-compatibility; and Flow-IPC gives you niceties including request-response and message-type demultiplexing to a particular handler). [footnote 1 below]

BUT!!! Some algorithms don't require an "IPC protocol" per se, necessarily, but are more like 2+ applications collaborating on a data structure. In that case native structures are for sure superior, or at times even essentially required. (E.g., if you have some custom optimized hash-table -- you're probably not going to want to express it as a capnp structure.)

So, more to the point:

- Flow-IPC 100% supports transmitting/sharing (and constructing, and auto-destroying) native C++ structures. Compared to iceoryx, on this point, it appears to have some extra capabilities, namely full support for structures with pointers/references and/or STL-compliant containers. (This example https://iceoryx.io/latest/examples/complexdata/ and other pages say things like, "To implement zero-copy data transfer we use a shared memory approach. This requires that every data structure needs to be entirely contained in the shared memory and must not internally use pointers or references. The complete list of restrictions can be found...".) Flow-IPC, in this context, means no need to write custom containers sans heap-use, or eliminate pointers in an existing structure. [footnote 2 below]

- Indeed, the capnp framing (only if you choose to use the Flow-IPC capnp-protocol feature in question!) adds processing and thus some computational and RAM-use overhead. For many applications, the 10s of microseconds added there don't matter much -- as long as they are constant regardless of structure size, and as long as they are 10s of microseconds. So a 100usec (modulo processor model of course!) RTT (size-independent) is pretty good still. Of course I would never claim this overhead doesn't matter to anyone, and iceoryx's results here are straight-up admirable.

[footnote 1] The request/response/demultiplexing/etc. niceties added by Flow-IPC's capnp-protocol feature-in-question work well IMO, but one might prefer the sweet RPC-semantics + promise pipelining of capnp-RPC. Kenton V (capnp inventor/owner) and I have spoken recently about using Flow-IPC to zero-copy-ify capnp-RPC. I'm looking into it! (He suspects it is pretty simple/natural, given that we handle the capnp-serialization layer already, and capnp-RPC is built on that.) This wouldn't change Flow-IPC's existing features but rather exercise another way of using them. In a way Flow-IPC provides a simple-but-effective-out-of-the-box schema-based conversation protocol via capnp-serialization, and capnp-RPC would provide an alternate (to that out-of-the-box guy) conversation protocol option. I tried pretty hard to design Flow-IPC in a grounded and layered way, so such work would be natural as opposed to daunting.

[footnote 2] In fact the Flow-IPC capnp-based structured-channel feature (internally) itself uses Flow-IPC's own native-structure-transmission feature in its implementation (eat our own dog-food). Since a capnp serialization = sequence of buffers (a.k.a. segments), for us it is (internally) represented as essentially an STL list<vector<uint8_t>>. So we construct/build one of those in SHM (internally); then only a small SHM-handle is (internally) transmitted over the IPC-transport [footnote 3]; and the receiver then obtains the in-place list<vector<uint8_t>> (essentially) which is then treated as the capnp-encoding it really is. This would all happen (internally) when executing the quite-short example in the blog (https://www.linode.com/blog/open-source/flow-ipc-introductio...). As you can see there, to the Flow-IPC-using developer, it's just -- like -- "create a message with this schema here, call some mutators, send"; and conversely "receive a message expected to have that (same) schema, OK -- got it; call some accessors."

[footnote 3] IPC-transport = Unix domain socket or one of 2 MQ types -- you can choose via template arg (or add your own IPC-transport by implementing a certain pair of simple concepts).

jeffreygoesto 794 days ago

Thank you very much for this excellent explanation! I am one of the fathers of IceOryx and its predecessor. We had to lift component-based embedded development to POSIX systems and are very latency and memory bandwidth sensitive (driver assistance and automated driving on what most people would call small SoCs). There it is easier to require that senders and receivers use the same struct.

What you did with the shm arena and sharing std containers is outright amazing and indeed relaxes the "self contained" constraint nicely.

On QNX (up to 7) we were bitten by each syscall going through procnto, that is why we have chosen lockfree over mq btw.

Being aware of the use case and choosing the right tradeoff is crucial, as you wrote.

elBoberido 793 days ago

Now I'm curious. It seems you are not the father I'm still drinking beer with. This means there is only one person left that fits this attribute :) ... we should meet for some beer with the other father ;)

jeffreygoesto 792 days ago

Got me. Next time I'm in Berlin we'll do... ;) Good job with IceOryx2, guys!

elBoberido 792 days ago

We are waiting with some salt & vinegar crisps ;)

jeffreygoesto 787 days ago

Nice, but please no T-Shirts on train platforms... =;-D

elBoberido 793 days ago

I'm one of the iceoryx maintainers. Great to see some new players in this field. Competition leads to innovation and maybe we can even collaborate in some areas :)

I have not yet looked at the code but you made me curious with the raw pointers. Did you find a way to make this work without serialization or mapping the shm to the same address in all processes?

I will have a closer look at the jemalloc integration since we had something similar in mind with iceoryx2.

ygoldfeld 793 days ago

We are doing it with fancy-pointers (yes, that is the actual technical term in C++ land) and allocators. It’s open-source, so it’s not like there’s any hidden magic, of course: “Just” a matter of working through it.

Using manual mapping (same address values on both sides, as you mentioned) was one idea that a couple people preferred, but I was the one who was against it, and ultimately this was heeded. So that meant:

- Raw pointer T* becomes Allocator<T>::pointer. So if user happens to enjoy using raw pointers directly in their structures, they do need to make that change. But, beats rewriting the whole thing… by a lot.

- container<T> becomes container<T, Allocator<T>>, where `container` was your standard or standard-compliant (uses allocator properly) container of choice. So if user prefers sanity and thus uses containers (including custom ones they developed or third-party STL-compliant ones), they do need to use an allocator template argument in the declaration of the container-typed member.

But, that’s it - no other changes in data structure (which can be nested and combined and …) to make it SHM-sharable.
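To make the two changes above concrete, here is a minimal before/after sketch. `Counting_allocator` is a hypothetical stand-in (it just counts calls and forwards to malloc); it is not Flow-IPC's allocator, and with it the pointer alias happens to resolve to a plain raw pointer. A real SHM-friendly allocator would supply a fancy-pointer type instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a SHM-friendly allocator: counts allocations and
// forwards to malloc/free. A real one would carve memory out of a SHM arena.
static std::size_t g_alloc_count = 0;

template <typename T>
struct Counting_allocator {
  using value_type = T;

  Counting_allocator() = default;
  template <typename U>
  Counting_allocator(const Counting_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    ++g_alloc_count;
    if (void* p = std::malloc(n * sizeof(T))) return static_cast<T*>(p);
    throw std::bad_alloc{};
  }
  void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const Counting_allocator<T>&, const Counting_allocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const Counting_allocator<T>&, const Counting_allocator<U>&) noexcept { return false; }

// Before: the ordinary heap-based structure.
struct Node_before {
  std::vector<int> values;
  Node_before* next = nullptr;
};

// After: the container gains an allocator argument, and the raw pointer
// becomes the allocator's pointer type. Nothing else changes.
template <template <typename> class Alloc>
struct Node_after {
  std::vector<int, Alloc<int>> values;
  typename std::allocator_traits<Alloc<Node_after>>::pointer next = nullptr;
};

using Node = Node_after<Counting_allocator>;

// With this toy allocator the pointer alias is still just Node*.
static_assert(std::is_same<decltype(Node::next), Node*>::value,
              "toy allocator uses raw pointers");
```

Pushing into `Node::values` now goes through the allocator, which is the hook a SHM arena would use.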

We in the library “just” have to provide the SHM-friendly Allocator<T> for user to use. And, since stateful allocators are essentially unusable by mere humans in my subjective opinion (boost.interprocess authors disagree apparently), we use a particular trick to work with an individual SHM arena: the “Activator” API.

So that leaves the mere topic of this SHM-friendly fancy-pointer type, which we provide.

For SHM-classic mode (if you’re cool with one SHM arena = one SHM segment and both sides being able to write to SHM; and boost.interprocess alloc algorithm) -- enabled with a template arg switch when setting up your session object -- that’s just good ol’ offset_ptr.

For SHM-jemalloc (which leverages jemalloc, and hence is multi-segment and cool like that, plus with better segregation/safety between the sides) internally there are multiple SHM-segments, so offset_ptr is insufficient. Hence we wrote a fancy-pointer for the allocator, which encodes the SHM segment ID and offset within the 64 bits. That sounds haxory and hardcore, but it’s not so bad really. BUT! It also needs to be able to point outside SHM (e.g., into stack which is often used when locally building up a structure), so it needs to be able to encode an actually-raw vaddr also. And still use 64 bits, not more. Soooo I used pointer tagging, as not all 64 bits of a vaddr carry information.
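A rough sketch of what such a 64-bit tagged encoding could look like. The bit layout here (top bit as tag, 15 bits of segment ID, 48 bits of offset) is an assumption made for illustration, not Flow-IPC's actual layout, and it assumes a platform where user-space virtual addresses leave the top bit clear (typical on 64-bit Linux).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit encoding: either a raw virtual address (tag bit clear)
// or a (segment ID, offset) pair inside shared memory (tag bit set).
class Tagged_ptr_bits {
 public:
  static constexpr std::uint64_t k_shm_tag = 1ull << 63;
  static constexpr unsigned k_offset_bits = 48;
  static constexpr std::uint64_t k_offset_mask = (1ull << k_offset_bits) - 1;
  static constexpr std::uint64_t k_segment_mask = 0x7FFF;  // 15 bits

  // Encode a pointer outside SHM (e.g., into the stack) as-is.
  static Tagged_ptr_bits from_raw(const void* p) {
    const auto v = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
    assert((v & k_shm_tag) == 0);  // assumption: top bit unused by real addresses
    return Tagged_ptr_bits{v};
  }

  // Encode a location inside SHM as segment ID + offset within that segment.
  static Tagged_ptr_bits from_shm(std::uint16_t segment_id, std::uint64_t offset) {
    assert(segment_id <= k_segment_mask && offset <= k_offset_mask);
    return Tagged_ptr_bits{k_shm_tag |
                           (static_cast<std::uint64_t>(segment_id) << k_offset_bits) |
                           offset};
  }

  bool is_shm() const { return (bits_ & k_shm_tag) != 0; }
  std::uint16_t segment_id() const {
    return static_cast<std::uint16_t>((bits_ >> k_offset_bits) & k_segment_mask);
  }
  std::uint64_t offset() const { return bits_ & k_offset_mask; }
  void* raw() const {
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(bits_));
  }

 private:
  explicit Tagged_ptr_bits(std::uint64_t bits) : bits_(bits) {}
  std::uint64_t bits_;
};

static_assert(sizeof(Tagged_ptr_bits) == 8, "must stay within 64 bits");
```

A real fancy-pointer would resolve the (segment ID, offset) pair against a per-process table of mapped segment base addresses when dereferenced; that lookup is omitted here.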

So that’s how it all works internally. But hopefully to the user none of these details is necessary to understand. Use our allocator when declaring container members. Use allocator’s fancy-pointer type alias (or similar alias, we give ya the aliases conveniently hopefully) when declaring a direct pointer member. And specify which SHM-backing technique you want us to internally use - depending on your safety and allocation perf desires (currently available choices are SHM-classic and SHM-jemalloc).

elBoberido 793 days ago

Hehe, we are also using fancy-pointer in some places :)

We started with mapping the shm to the same address but soon noticed that it was not a good idea. It works until some application has already mapped something to the same address. It's good that you did not go that route.

I hoped you had an epiphany and found a nice solution for the raw-pointer problem without the need to change them and we could borrow that idea :) Replacing the raw-pointer with fancy-pointer is indeed much simpler than replacing the whole logic.

Since the raw-pointer needs to be replaced by fancy-pointer, how do you handle STL containers? Is there a way to replace the pointer type or some other magic?

Hehe, we have something called 'relative_ptr' which also tracks the segment ID + offset. It is a struct of two uint64_t though. Later on, we needed to condense it to 64 bit to prevent torn writes in our lock-free queue exchange. We went the same route and encoded the segment ID in the upper 16 bits since only 48 bits are used for addressing. It's kind of funny that other devs also converge to similar solutions. We also have something called 'relocatable_ptr'. This one tracks only the offset to itself and is nice to build relocatable structures which can be memcopied as long as the offset points to a place within the copied memory. It's essentially the 'boost::offset_ptr'.
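For readers unfamiliar with the self-relative idea, a minimal sketch follows. `Self_relative_ptr` is a hypothetical toy, not iceoryx's `relocatable_ptr` or `boost::offset_ptr`. It stores only the distance from itself to the target, so a memcpy of the enclosing structure keeps it valid as long as the target was copied along with it. Using 0 as the null value means it cannot point at itself, a simplification for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Toy self-relative pointer: stores (target address - own address).
template <typename T>
class Self_relative_ptr {
 public:
  void set(T* target) {
    offset_ = target ? reinterpret_cast<const char*>(target) -
                           reinterpret_cast<const char*>(this)
                     : 0;
  }
  T* get() const {
    if (offset_ == 0) return nullptr;  // 0 is reserved for null in this sketch
    const char* self = reinterpret_cast<const char*>(this);
    return reinterpret_cast<T*>(const_cast<char*>(self) + offset_);
  }

 private:
  std::ptrdiff_t offset_ = 0;
};

// A relocatable record: the pointer and its target live in the same block.
struct Record {
  Self_relative_ptr<int> payload_ptr;
  int payload;
};

// Copies a record bytewise and checks that the copy's pointer refers to the
// copy's own payload, while the original still refers to its own.
bool survives_memcpy() {
  Record a;
  a.payload = 42;
  a.payload_ptr.set(&a.payload);

  Record b;
  std::memcpy(&b, &a, sizeof(Record));
  b.payload = 43;

  return b.payload_ptr.get() == &b.payload && *a.payload_ptr.get() == 42 &&
         *b.payload_ptr.get() == 43;
}
```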

Btw, when you use jemalloc, do you free the memory from a different process than the one from which you allocate? We did the same for iceoryx1 but moved to a submission-queue/completion-queue architecture to reduce complexity in the allocator and free the memory in the same process that did the allocation. With iceoryx2 we also plan to be more dynamic and have ideas to implement multiple allocators with different characteristics. Funnily, jemalloc is also on the table for use-cases where fragmentation is not a big problem. Maybe we can create a common library for shm allocating strategies which can be used for both projects.

ygoldfeld 793 days ago

Hi again!

> I hoped you had an epiphany and found a nice solution for the raw-pointer problem without the need to change them and we could borrow that idea :)

Well, almost. But alas, I am unable to perform magic in which a vaddr in process 1 means the same thing in process 2, without forcing it to happen by using that mmap() option. And indeed, I am glad we didn't go down that road -- it would have worked within Akamai due to our kernel team being able to do such custom things for us, avoiding any conflict and so on; but this would be brittle and not effectively open-sourceable.

> Since the raw-pointer needs to be replaced by fancy-pointer, how do you handle STL containers? Is there a way to replace the pointer type or some other magic?

Yes, through the allocator. An allocator is, at its core, three things. 1, what to execute when asked to allocate? 2, what to execute when asked to deallocate? 3, and this is the relevant part here, what is the pointer type? This used to be an alias `pointer` directly in the allocator type, but it's done through traits, modernly. Point being: An allocator type can have the pointer type just be T*; or it can alias it to a fancy-pointer type. Furthermore, to be STL-compliant, a container type must religiously follow this convention and never rely on T* being the pointer type. Now, in practice, some GNU stdc++ containers are bad-boys and don't follow this; they will break; but happily:

- clang's libc++ containers are fine;

- boost.container's are fine (and, of course, implement exactly the required API semantics in general... so you can just use 'em);

- any custom-written containers should be written to be fine; for example see our flow::util::Basic_blob which we use as a nailed-down vector<uint8_t> (with various goodies like predictable allocation size behavior and such) for various purposes. That shows how to write such a container that properly follows STL-compliant allocator behavior. (But again, this is not usually something you have to do: the aforementioned containers are delightful and work. I haven't looked into abseil's.)
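The "pointer type through traits" point can be shown with the standard `std::allocator_traits`. In the sketch below, `Toy_fancy_ptr`, `Plain_allocator` and `Fancy_allocator` are hypothetical names made up for illustration; the fancy pointer is deliberately minimal and not sufficient to drive a real container.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical, deliberately minimal fancy-pointer: just a distinct type
// wrapping a raw pointer. A real one needs the full pointer interface.
template <typename T>
struct Toy_fancy_ptr {
  T* raw = nullptr;
  explicit operator bool() const { return raw != nullptr; }
};

// Allocator that declares no pointer alias: traits fall back to T*.
template <typename T>
struct Plain_allocator {
  using value_type = T;
  T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
};

// Allocator that aliases `pointer` to the fancy-pointer type.
template <typename T>
struct Fancy_allocator {
  using value_type = T;
  using pointer = Toy_fancy_ptr<T>;
  pointer allocate(std::size_t n) { return pointer{std::allocator<T>{}.allocate(n)}; }
  void deallocate(pointer p, std::size_t n) { std::allocator<T>{}.deallocate(p.raw, n); }
};

static_assert(std::is_same<std::allocator_traits<Plain_allocator<int>>::pointer,
                           int*>::value,
              "no alias: traits default to T*");
static_assert(std::is_same<std::allocator_traits<Fancy_allocator<int>>::pointer,
                           Toy_fancy_ptr<int>>::value,
              "alias present: traits report the fancy-pointer");

// What a compliant container does: never spell T*, always ask the traits.
template <typename Alloc>
bool allocate_through_traits() {
  using Traits = std::allocator_traits<Alloc>;
  Alloc alloc;
  typename Traits::pointer p = Traits::allocate(alloc, 1);
  const bool ok = static_cast<bool>(p);
  Traits::deallocate(alloc, p, 1);
  return ok;
}
```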

So that's how. Granted, subtleties don't stop there. After all, there isn't just "one" SHM arena, the way there is just one general heap. So how to specify which SHM-arena to be allocating-in? One, can use a stateful allocator. But that's pain. Two, can use the activator trick we used. It's quite convenient in the end.
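One way to picture the activator idea, as opposed to a stateful allocator, is a stateless allocator that consults a thread-local "current arena" set by an RAII guard. This is only a guess at the general shape; `Toy_arena`, `Arena_activator` and `Arena_allocator` are hypothetical names and not Flow-IPC's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical arena: counts allocations and forwards to malloc/free.
struct Toy_arena {
  std::size_t allocations = 0;
  void* allocate(std::size_t bytes) {
    ++allocations;
    if (void* p = std::malloc(bytes)) return p;
    throw std::bad_alloc{};
  }
  void deallocate(void* p) noexcept { std::free(p); }
};

// The arena currently active on this thread (none by default).
inline Toy_arena*& current_arena() {
  thread_local Toy_arena* arena = nullptr;
  return arena;
}

// RAII guard: activates an arena for the current scope, restores the
// previously active one on exit.
class Arena_activator {
 public:
  explicit Arena_activator(Toy_arena& arena) : previous_(current_arena()) {
    current_arena() = &arena;
  }
  ~Arena_activator() { current_arena() = previous_; }
  Arena_activator(const Arena_activator&) = delete;
  Arena_activator& operator=(const Arena_activator&) = delete;

 private:
  Toy_arena* previous_;
};

// Stateless allocator: holds no arena pointer, asks the activator's slot.
template <typename T>
struct Arena_allocator {
  using value_type = T;
  Arena_allocator() = default;
  template <typename U>
  Arena_allocator(const Arena_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    assert(current_arena() && "no arena activated");
    return static_cast<T*>(current_arena()->allocate(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t) noexcept {
    assert(current_arena() && "no arena activated");
    current_arena()->deallocate(p);
  }
};

template <typename T, typename U>
bool operator==(const Arena_allocator<T>&, const Arena_allocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const Arena_allocator<T>&, const Arena_allocator<U>&) noexcept { return false; }

using Int_vec = std::vector<int, Arena_allocator<int>>;

// Usage sketch: builds a vector while `arena` is active and reports how many
// allocations landed in that arena. The vector is destroyed before the guard.
std::size_t fill_in_arena(Toy_arena& arena) {
  Arena_activator activate(arena);
  Int_vec v;
  v.push_back(1);
  return arena.allocations;
}
```

The container type stays the same regardless of which arena is used, which is the convenience compared with threading an allocator object through every constructor.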

> Btw, when you use jemalloc, do you free the memory from a different process than from which you allocate?

No; this was counter to the safety requirements we wanted to keep to, with SHM-jemalloc. We by default don't even turn on writability into a SHM-arena by any process except the one that creates/manages the arena - can't deallocate without writing. Hence there is some internal, async IPC that occurs for borrower-processes: once a shared_ptr<T> group pointing into SHM reaches ref-count 0, behind the scenes (and asynchronously, since deallocating need not happen at any particular time and shouldn't block user threads), it will indicate to the lending-process this fact. Then once all such borrower-processes have done this, and the same has occurred with the original shared_ptr<T> in the lender-process (which allocated in the first place), the deallocation occurs back in the lender-process.

If one chooses to use SHM-classic (which -- I feel compelled to keep restating for some reason, not sure why -- is a compile-time switch for the session or structure, but not some sort of global decision), then it's all simplicity itself (and very quick -- atomic-int-quick). offset_ptr, internally-stored ref-count of owner-processes; once it reaches 0 then whichever process/piece of code caused it, will itself deallocate it.
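The "atomic-int-quick" scheme can be sketched as a counter stored next to the object, where whoever drops the count to zero is the one that deallocates. The names below are hypothetical, and the sketch assumes a lock-free 32-bit atomic (checked at compile time), since only lock-free atomics are expected to work when the memory is shared between processes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical header stored in shared memory in front of each object.
struct Block_header {
  std::atomic<std::uint32_t> owner_count{0};
};

static_assert(std::atomic<std::uint32_t>::is_always_lock_free,
              "cross-process use assumes a lock-free atomic");

// A process (or shared_ptr group) registers itself as an owner.
inline void add_owner(Block_header& header) {
  header.owner_count.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if the caller was the last owner and must deallocate.
inline bool drop_owner(Block_header& header) {
  return header.owner_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```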

The idea of its design is that one could plug-in still more SHM-providers instead of SHM-jemalloc or SHM-classic. It should all keep working through the magic of concepts (not formal C++20 ones... it's C++17).

---

Somewhere above you mentioned collaboration. I claim/hope that Flow-IPC is designed in a pragmatic/no-frills way (tried to vaguely imitate boost.interprocess that way) that exposes whichever layer you want to use, publicly. So, to give an example relating to what we are discussing here:

Suppose someone wants to use iceoryx's badass lock-free mega-fast one-microsecond transmission. But, they'd like to use our SHM-jemalloc dealio to transmit a map<string, vector<Crazy_ass_struct_with_more_pointers_why_not>>. I completely assure you I can do the following tomorrow if I wanted:

- Install iceoryx and get it to essentially work, in that I can transmit little constant-size blobs with it at least. Got my mega-fast transmission going.

- Install Flow-IPC and get it working. Got my SHM-magic going.

- In no more than 1 hour I will write a program that uses just the SHM-magic part of Flow-IPC -- none of its actual IPC-transmission itself per se (which I claim itself is pretty good -- but it ain't lock-free custom awesomeness suitable for real-time automobile parts or what-not) -- but uses iceoryx's blob-transmission.

It would just need to ->construct<T>() with Flow-IPC (this gets a shared_ptr<T>); then ->lend_object<T>() (this gets a tiny blob containing an opaque SHM-handle); then use iceoryx to transmit the tiny blob (I would imagine this is the easiest possible thing to do using iceoryx); on the receiver call Flow-IPC ->borrow_object<T>(). This gets the shared_ptr<T> -- just like the original. And that's it. It'll get deallocated once both shared_ptr<T> groups in both processes have reached ref-count 0. A cross-process shared_ptr<T> if you will. (And it is by the way just a shared_ptr<T>: not some custom type monstrosity. It does have a custom deleter, naturally, but as we know that's not a compile-time decision.)
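The lifetime rule described here (the object goes away only after both shared_ptr<T> groups hit zero, and both are plain shared_ptr<T> with a custom deleter) can be modeled in a single process. `Toy_session` and its `construct`/`lend`/`borrow` members are hypothetical and only mirror the steps named above; they are not Flow-IPC's signatures, and the handle simply wraps an address instead of a real SHM handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Single-process toy model of lend/borrow. Must outlive every shared_ptr
// it hands out, since their deleters call back into it.
template <typename T>
class Toy_session {
 public:
  struct Handle { std::uintptr_t opaque; };  // the "tiny blob" to transmit

  Toy_session() = default;
  Toy_session(const Toy_session&) = delete;
  Toy_session& operator=(const Toy_session&) = delete;

  // Lender side: create the object; the returned group counts as one owner.
  std::shared_ptr<T> construct(T value) {
    T* obj = new T(std::move(value));
    owner_groups_[obj] = 1;
    return std::shared_ptr<T>(obj, [this](T* p) { release(p); });
  }

  // Lender side: register one more owner group and emit a handle for it.
  Handle lend(const std::shared_ptr<T>& p) {
    ++owner_groups_.at(p.get());
    return Handle{reinterpret_cast<std::uintptr_t>(p.get())};
  }

  // Borrower side: turn the handle back into an ordinary shared_ptr<T>.
  std::shared_ptr<T> borrow(Handle h) {
    T* obj = reinterpret_cast<T*>(h.opaque);
    return std::shared_ptr<T>(obj, [this](T* p) { release(p); });
  }

  std::size_t live_objects() const { return owner_groups_.size(); }

 private:
  // Called when one shared_ptr group reaches ref-count 0; the object is
  // destroyed only when every group has done so.
  void release(T* obj) {
    auto it = owner_groups_.find(obj);
    if (it != owner_groups_.end() && --it->second == 0) {
      delete obj;
      owner_groups_.erase(it);
    }
  }

  std::map<T*, int> owner_groups_;
};
```

Both `construct` and `borrow` return the same static type, `std::shared_ptr<T>`; the deleter is type-erased inside the control block, which is the point about it not being a compile-time decision.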

So yes, believe it or not, I was not trying to out-compete you all here. There is zero doubt you're very good at what you do. The most natural use cases for the two overlap but are hardly the same. Live and let live, I say.

elBoberido 792 days ago

Don't worry. It's great to have other projects in this field, exploring different routes and you created a great piece of software. The best thing, it's all open source after all :)

Reading your response, it is almost as if you had been at our coffee chats. Quite a few of your ideas are also either already implemented in iceoryx2 or on our todo list. It seems we just put our focus on different things. Here and there you also added the cherry on top. This motivates us to improve some areas we have neglected over the last few years. We can learn from each other and improve our projects thanks to the beauty of open source.

Keep up the good work
