Hacker News new | ask | show | jobs
by 1amzave 4026 days ago
Yes, this essentially just a form of RCU. Note however that RCU isn't just for the kernel anymore though: http://urcu.so/
2 comments

Concurrency Kit (http://concurrencykit.org) also has ck_epoch. Performance comparison to the URCU implementations available at http://concurrencykit.org/presentations/ebr.pdf
Userspace RCU algorithms are experimental, quite brittle, and/or have little in common with RCU. They're certainly not usable as a production-ready scalable synchronization solution.
What makes you say this?

RCU refers to a wide array of reclamation implementations and some URCU implementations do indeed have a lot in common with kernel-space RCU (which also has a myriad of implementations).

Speaking of liburcu in specific, the "brittleness" is a function of your workload and your selection of the URCU implementation. The various implementations have different trade-offs, and it is definitely possible to livelock write-side or degrade read-side if you make the wrong design decisions. For example, there are major differences between signal-based URCU and QSBR URCU. No reclamation mechanism (whether one backed by RCU or something like hazard pointers) is a silver bullet and each will suck in fantastic ways with the right workload.

As far as userspace RCU algorithms in general being "experimental", plenty of people are using RCU or RCU-like mechanisms in user-space for production systems as a scalable synchronization solution. For example, the Concurrency Kit library (http://concurrencykit.org) has an RCU-like system called ck_epoch, and it sees at least billions of transactions in production a day without fail, and even supports freestanding environments.

I haven't actually used it myself and am certainly willing to believe it's immature, but how do you mean it has "little in common with RCU"? Comparing personnel between that library and this set of people --> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.... there's certainly at least some authorship in common...
RCU basically works by having a critical section only within which a reader may access the data structure. To know when to GC the old copy of the data, you wait until all readers leave the critical section. If you're in the kernel, this can be very efficient: you disable all interrupts within the CS, so you know a thread is never preempted while in it. This means that no matter how many threads there are, only those currently running on the available cores (say, there are 8 of them) can be in the CS, so no matter how many threads there are, you only need to wait for 8, and you know that that waiting period is going to be short because there are no interrupts.

So the efficiency of the algorithms stems from disabling interrupts within the reader's critical section, which can only be done in kernel mode (otherwise you could do really bad things). A similar algorithm in userspace will need to find another mechanism to wait for all readers to exit the CS, which may or may not be efficient. They may share then name "RCU", but not its desirable qualities.

There are, AFAIK, several different algorithms calling themselves userspace RCU. One userspace RCU algorithm I've seen essentially protects the deallocation with a read-write, which isn't known for terrific scalability, but I don't know, they may have come up with something better.

Garbage collection is the Achille's heel of all nonblocking data structures. As far as I know, there is no algorithm more efficient than a general purpose GC just yet, which is both generally applicable and efficient.

Common authorship, BTW, says little. Most concurrent algorithms out there were invented by a handful of people.

1) Efficiency of algorithms stems from disabling interrupts

The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive. In addition to that, note that a deferral interface is available so writers basically never have to block until it is actually safe to free memory. URCU and ck_epoch all share those desirable qualities.

2) "RCU algorithm I've seen ... protects with a read-write"

The synchronize portion of RCU and RCU-like systems is almost always heavy-weight. Applications that adopt URCU can typically afford this.

3) "no algorithm more efficient than a general purpose GC"

All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable. However, they're still applicable for plenty of workloads. We are starting to see more GC mechanisms adopt SMR-like techniques so that may change.

4) "Common authorship, BTW, says little"

Not in this case.

If you would like to learn more about the performance trade-off with the various mechanisms, I suggest http://www.rdrop.com/users/paulmck/RCU/hart_ipdps06.pdf and the URCU paper as a start. I've some additional numbers on something similar at http://concurrencykit.org/presentations/ebr.pdf

Paul also has a good introduction up at https://queue.acm.org/detail.cfm?id=2488549

> The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive.

That's the advantage of any copy-on-write concurrency mechanism. The "magic" of RCU (unless you define any copy-on-write algorithm as RCU) is the indifference of the algorithm complexity to the number of threads involved -- only the number of cores. I am not aware of userspace algorithms that display this property, but I may be wrong about that.

> All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable.

But that's where things get tricky. Magnitudes more efficient how? Total throughput? Maximum latency? I can certainly believe the latter but I doubt the former. Hazard-pointer techniques basically are how modern GCs work, with the difference of when the scans are triggered.

The magic of RCU definitely is the safety guarantee on read-side with little to no overhead and an easy to approach API. Traditional copy-on-write systems do not provide this guarantee as there would be synchronization and / or blocking on the fast path (barriers / atomic operations / locks).

Cores versus threads is a detail of the execution environment which may affect RCU implementation but not the actual magic. For example, a lot of work was necessary to get RCU to scale to many hundreds of cores in the kernel (http://lwn.net/Articles/305782/ is one example).

For ck_epoch, we had to implement a proxy collection-like pattern for workloads involving thousands of concurrent events that required longer-lived hazardous references (thanks to John Esmet for this work)...this will hit upstream as ck_epc and is sitting in a pull request on the github page (http://github.com/concurrencykit/ck).

re:Efficiency, magnitudes more efficient on read-side in all ways (including throughput and latency) assuming that delete frequency is not too heavy and that you can afford write-side blocking. Consider that the read-side has few to none atomic operations on the fast path, and if so, is typically on exclusively written memory and it does not block in any fashion. This means readers scale and are very fast. You might enjoy the paper I linked to which exposes some of the performance characteristics (it isn't exhaustive, it doesn't explore real-world implications on memory usage as update frequency increases, for example, but it's a great start). Paul also has some refreshed performance analysis in his ACM Queue article. If you couple this with specialized concurrently readable data structures (like http://backtrace.io/blog/blog/2015/03/13/workload-specializa...), you get super sexy performance for readers.

You might also be interested in "passive reader-writer locks", which provide similar costs on the fast path to RCU for TSO architectures while still providing read-write lock semantics...but at the cost of exceptionally more expensive write-side synchronization (and user-space support requires some kernel modifications).