by compudj 2924 days ago
(Disclaimer: I am the patch author, Mathieu Desnoyers.) Just as a clarification, the idea originates from Google (I give full credit to Paul Turner and Andrew Hunter for it). However, the extra 3 years of work required to get it upstream were done by me at EfficiOS.

Hi Mathieu. Yep, I was going to say that the patch set has evolved rather notably. Good job.

Did you manage to push the required changes to glibc or do you maintain your own user space rseq lib?

-ss

I'm currently discussing with the glibc maintainers the best approach to integrate this into the Linux userspace ecosystem. So far, the discussions are heading in a direction where glibc would own the __rseq_abi TLS symbol and register it for every thread. I can then maintain an rseq library consisting of helper header files that contain the common rseq operations for all supported architectures.

I am concerned about providing a librseq that handles rseq registration for early adopters though, because I don't want projects to eventually end up conflicting with future glibc versions. Once we have settled how glibc will expose the symbol and register it, I will try to provide a helper library which exposes this symbol and allows explicit rseq registration in a way that won't conflict with future glibc versions.
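
For readers unfamiliar with what "registration" involves, here is a minimal sketch of a thread registering its own TLS area through the raw rseq system call. It is an illustration under stated assumptions, not the librseq or glibc implementation: the struct is a hand-written mirror of the original layout in uapi/linux/rseq.h (real code should include that header), and the `example_` names and the signature value are made up for this example. The sketch also shows where the ownership concern comes from: the kernel accepts one registration per thread, so a second owner using a different area is expected to be refused.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hand-written mirror of the original struct rseq layout (assumption for
 * this sketch; real code should use <linux/rseq.h>). */
struct example_rseq_area {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;   /* address of the active critical section descriptor */
    uint32_t flags;
} __attribute__((aligned(32)));

/* Illustrative signature; the process picks one 32-bit value and keeps it. */
#define EXAMPLE_RSEQ_SIG 0x53053053u

enum example_rseq_status {
    EXAMPLE_RSEQ_OK,          /* this thread is now registered */
    EXAMPLE_RSEQ_ALREADY,     /* EBUSY: this same area was registered before */
    EXAMPLE_RSEQ_CONFLICT,    /* EINVAL: e.g. another owner registered a different area */
    EXAMPLE_RSEQ_UNSUPPORTED, /* kernel or headers without rseq */
    EXAMPLE_RSEQ_ERROR        /* anything else (sandbox policy, ...) */
};

/* One area per thread; cpu_id starts out as "uninitialized" (-1). */
static __thread struct example_rseq_area example_area = {
    .cpu_id = (uint32_t)-1,
};

enum example_rseq_status example_rseq_register_current_thread(void)
{
#ifdef __NR_rseq
    long rc = syscall(__NR_rseq, &example_area,
                      (uint32_t)sizeof(example_area), 0, EXAMPLE_RSEQ_SIG);
    if (rc == 0)
        return EXAMPLE_RSEQ_OK;
    switch (errno) {
    case EBUSY:
        return EXAMPLE_RSEQ_ALREADY;
    case EINVAL:
        return EXAMPLE_RSEQ_CONFLICT;
    case ENOSYS:
        return EXAMPLE_RSEQ_UNSUPPORTED;
    default:
        return EXAMPLE_RSEQ_ERROR;
    }
#else
    return EXAMPLE_RSEQ_UNSUPPORTED;
#endif
}
```

A caller would invoke `example_rseq_register_current_thread()` once per thread and fall back to a non-rseq code path on any status other than `EXAMPLE_RSEQ_OK`. If the C library has already registered its own area for the thread, the conflict status is the likely outcome.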

> I am concerned about providing a librseq that handles rseq registration for early adopters though

Sounds very reasonable.

So at this point, as far as I understand it, FB and Google carry in-house rseq kernel and user space patches. Right? Are they on board with the mainline rseq? Will FB support rseq in jemalloc any time soon?

-ss

I've been in touch with FB. They are interested in using rseq for jemalloc. They have provided prototypes of jemalloc based on rseq, along with benchmarks helping me make the case for rseq mainlining.

I don't know whether Google will ever want to swap from their in house rseq implementation to the upstream Linux rseq, use both ABIs for a transition period, or simply keep using their own in-house rseq.

Thank you for persevering! Could you elaborate a bit on what had to be done to get it upstream?

Sure, before getting it upstream, I had to:

- Gather a list of desiderata, ensuring we take into account a complete list of use-cases targeted by everyone active in the rseq discussions. This is crucially important to ensure discussions don't spin in circles going back and forth between different requirements,

- Redesign the uapi/linux/rseq.h ABI, making sure a single TLS store is needed to enter a rseq critical section, without requiring any extra registers as ABI. I have introduced the "rseq_cs" structure as critical section descriptor to do this,

- Optimize arm32 and x86 rseq critical sections for speed, by creating my own benchmark programs,

- Rewrite the kernel rseq implementation a few times so that it follows the kernel coding style and satisfies everyone who cares about it,

- Present 2 talks about rseq at Linux Plumbers Conference,

- Go through various rounds of in person, email, and IRC discussions with Paul Turner, Peter Zijlstra, Andy Lutomirski, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Ben Maurer, Linus Torvalds, and many others. Those were very constructive discussions bringing up everyone's concerns with respect to this new system call,

- Extend the rseq selftests, adding new testing strategies such as delay loops between "steps" of the critical section, thus increasing the likelihood of generating preemption races,

- Figure out nasty races only happening on NUMA systems after about a full day of stress-testing,

- Provide a solution to the debugger single-stepping "lack of progress" problem, which occurs when an rseq critical section simply retries on abort (each single step aborts the critical section, so the retry loop never completes). The solution is basically the cpu_opv system call I plan to propose for 4.19. Meanwhile, without cpu_opv, rseq can still be used in ways that guarantee forward progress, but the abort code needs to use a partitioning strategy rather than a simple retry (e.g. going to a different memory pool in case of abort for a memory allocator),

- Harden the rseq mechanism for security, by adding a "signature" word before the abort label,

- Implement prototypes of lttng-ust and liburcu which use rseq, gathering benchmarks to validate the approach,

- Write rseq and cpu_opv man pages.
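
To make the "rseq_cs" descriptor and the abort signature from the list above more concrete, here is a small userspace model. It is a simplified sketch, not kernel code: the struct mirrors struct rseq_cs from uapi/linux/rseq.h, while the `model_` names, the byte array standing in for program text, and the three-way result are assumptions made for this example. Entering a critical section amounts to a single store of the descriptor's address into the thread's rseq area; the model covers what happens afterwards, when the thread is preempted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hand-written mirror of struct rseq_cs (real code: <linux/rseq.h>). */
struct model_rseq_cs {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;            /* first instruction of the critical section */
    uint64_t post_commit_offset;  /* length up to and including the commit */
    uint64_t abort_ip;            /* where to resume if the section is aborted */
} __attribute__((aligned(32)));

enum model_fixup {
    MODEL_NO_FIXUP,  /* ip was outside the critical section */
    MODEL_ABORT,     /* ip was inside: execution moves to abort_ip */
    MODEL_REJECT     /* signature mismatch: descriptor is not trusted */
};

/* True for start_ip <= ip < start_ip + post_commit_offset. Unsigned
 * wraparound makes an ip below start_ip fail the comparison as well. */
bool model_ip_in_cs(const struct model_rseq_cs *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}

/* The 4 bytes right before the abort target must match the signature the
 * thread registered. `text` stands in for the process's code, indexed by
 * address. */
bool model_abort_is_signed(const uint8_t *text, size_t text_len,
                           uint64_t abort_ip, uint32_t sig)
{
    uint32_t found;

    if (abort_ip < sizeof(found) || abort_ip > text_len)
        return false;
    memcpy(&found, text + abort_ip - sizeof(found), sizeof(found));
    return found == sig;
}

/* Model of the decision taken when a thread with an active descriptor is
 * preempted: validate the signature, then redirect ip if it was inside. */
enum model_fixup model_fixup_on_preempt(const struct model_rseq_cs *cs,
                                        const uint8_t *text, size_t text_len,
                                        uint32_t sig, uint64_t *ip)
{
    if (cs == NULL)
        return MODEL_NO_FIXUP;
    if (!model_abort_is_signed(text, text_len, cs->abort_ip, sig))
        return MODEL_REJECT;
    if (!model_ip_in_cs(cs, *ip))
        return MODEL_NO_FIXUP;
    *ip = cs->abort_ip;
    return MODEL_ABORT;
}
```

The signature check is what makes the mechanism harder to abuse: a corrupted descriptor cannot redirect execution to an arbitrary address, only to one that is preceded by the expected 32-bit word.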

And these are just the items that were "forward progress" in the rseq adventure. I'm leaving out all the attempts at making things more generic that had to be thrown away.
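
The partitioning strategy mentioned in the list (going to a different memory pool on abort instead of retrying) can be sketched as a toy model. Everything here is hypothetical and for illustration only: there is no real rseq critical section, the `aborted` flag simply injects the abort that a single-stepping debugger would cause on every attempt, and the pool types and names are invented.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_POOL_SLOTS 4

enum model_source {
    MODEL_FROM_CPU_POOL,     /* fast path committed */
    MODEL_FROM_SHARED_POOL,  /* fast path aborted or full, fallback used */
    MODEL_OUT_OF_MEMORY
};

/* Per-CPU pool: only ever touched inside the (modelled) critical section. */
struct model_cpu_pool {
    size_t used;
};

/* Shared pool: safe from any CPU because it uses an atomic counter. */
struct model_shared_pool {
    atomic_size_t used;
};

/* Stand-in for one critical section attempt. Returns false when the
 * attempt is aborted, in which case nothing has been committed. */
static bool model_cs_bump(struct model_cpu_pool *p, bool aborted, size_t *slot)
{
    if (aborted)
        return false;
    *slot = p->used++;  /* the single commit store */
    return true;
}

enum model_source model_alloc(struct model_cpu_pool *cpu,
                              struct model_shared_pool *shared,
                              bool aborted, size_t *slot)
{
    /* One attempt on the fast path, never a retry loop. */
    if (cpu->used < MODEL_POOL_SLOTS && model_cs_bump(cpu, aborted, slot))
        return MODEL_FROM_CPU_POOL;

    /* Abort or exhausted pool: switch to a path that cannot be aborted,
     * so the caller makes progress even if every attempt aborts. */
    size_t s = atomic_fetch_add(&shared->used, 1);
    if (s >= MODEL_POOL_SLOTS) {
        atomic_fetch_sub(&shared->used, 1);
        return MODEL_OUT_OF_MEMORY;
    }
    *slot = s;
    return MODEL_FROM_SHARED_POOL;
}
```

The design point is that the abort handler never jumps back to the start of the critical section, so a thread that is aborted on every attempt still finishes its allocation through the shared pool.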

Thanks for this detailed, if a bit overwhelmingly long, list. When I saw the first Paul Turner techtalk on rseq (it was called something else - LPC - if I remember correctly), it seemed so simple, so obvious: "just read this memory address and if there was an interruption, we have to retry".

But then of course real life is a lot more complex than slideware.

cpu_opv is new to me (no time for LWN these days), but it looks simple, elegant and sort of obvious (again). Which makes me wonder why no one has thought of it yet. (But of course this is probably my ignorance speaking.)

Thanks for pushing the limits!