| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by pron 4029 days ago

RCU basically works by having a critical section only within which a reader may access the data structure. To know when to GC the old copy of the data, you wait until all readers leave the critical section. If you're in the kernel, this can be very efficient: you disable all interrupts within the CS, so you know a thread is never preempted while in it. This means that no matter how many threads there are, only those currently running on the available cores (say, there are 8 of them) can be in the CS, so no matter how many threads there are, you only need to wait for 8, and you know that that waiting period is going to be short because there are no interrupts.

So the efficiency of the algorithms stems from disabling interrupts within the reader's critical section, which can only be done in kernel mode (otherwise you could do really bad things). A similar algorithm in userspace will need to find another mechanism to wait for all readers to exit the CS, which may or may not be efficient. They may share then name "RCU", but not its desirable qualities.

There are, AFAIK, several different algorithms calling themselves userspace RCU. One userspace RCU algorithm I've seen essentially protects the deallocation with a read-write, which isn't known for terrific scalability, but I don't know, they may have come up with something better.

Garbage collection is the Achille's heel of all nonblocking data structures. As far as I know, there is no algorithm more efficient than a general purpose GC just yet, which is both generally applicable and efficient.

Common authorship, BTW, says little. Most concurrent algorithms out there were invented by a handful of people.

1 comments

sbahra 4029 days ago

1) Efficiency of algorithms stems from disabling interrupts

The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive. In addition to that, note that a deferral interface is available so writers basically never have to block until it is actually safe to free memory. URCU and ck_epoch all share those desirable qualities.

2) "RCU algorithm I've seen ... protects with a read-write"

The synchronize portion of RCU and RCU-like systems is almost always heavy-weight. Applications that adopt URCU can typically afford this.

3) "no algorithm more efficient than a general purpose GC"

All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable. However, they're still applicable for plenty of workloads. We are starting to see more GC mechanisms adopt SMR-like techniques so that may change.

4) "Common authorship, BTW, says little"

Not in this case.

If you would like to learn more about the performance trade-off with the various mechanisms, I suggest http://www.rdrop.com/users/paulmck/RCU/hart_ipdps06.pdf and the URCU paper as a start. I've some additional numbers on something similar at http://concurrencykit.org/presentations/ebr.pdf

Paul also has a good introduction up at https://queue.acm.org/detail.cfm?id=2488549

link

pron 4029 days ago

> The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive.

That's the advantage of any copy-on-write concurrency mechanism. The "magic" of RCU (unless you define any copy-on-write algorithm as RCU) is the indifference of the algorithm complexity to the number of threads involved -- only the number of cores. I am not aware of userspace algorithms that display this property, but I may be wrong about that.

> All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable.

But that's where things get tricky. Magnitudes more efficient how? Total throughput? Maximum latency? I can certainly believe the latter but I doubt the former. Hazard-pointer techniques basically are how modern GCs work, with the difference of when the scans are triggered.

link

sbahra 4029 days ago

The magic of RCU definitely is the safety guarantee on read-side with little to no overhead and an easy to approach API. Traditional copy-on-write systems do not provide this guarantee as there would be synchronization and / or blocking on the fast path (barriers / atomic operations / locks).

Cores versus threads is a detail of the execution environment which may affect RCU implementation but not the actual magic. For example, a lot of work was necessary to get RCU to scale to many hundreds of cores in the kernel (http://lwn.net/Articles/305782/ is one example).

For ck_epoch, we had to implement a proxy collection-like pattern for workloads involving thousands of concurrent events that required longer-lived hazardous references (thanks to John Esmet for this work)...this will hit upstream as ck_epc and is sitting in a pull request on the github page (http://github.com/concurrencykit/ck).

re:Efficiency, magnitudes more efficient on read-side in all ways (including throughput and latency) assuming that delete frequency is not too heavy and that you can afford write-side blocking. Consider that the read-side has few to none atomic operations on the fast path, and if so, is typically on exclusively written memory and it does not block in any fashion. This means readers scale and are very fast. You might enjoy the paper I linked to which exposes some of the performance characteristics (it isn't exhaustive, it doesn't explore real-world implications on memory usage as update frequency increases, for example, but it's a great start). Paul also has some refreshed performance analysis in his ACM Queue article. If you couple this with specialized concurrently readable data structures (like http://backtrace.io/blog/blog/2015/03/13/workload-specializa...), you get super sexy performance for readers.

You might also be interested in "passive reader-writer locks", which provide similar costs on the fast path to RCU for TSO architectures while still providing read-write lock semantics...but at the cost of exceptionally more expensive write-side synchronization (and user-space support requires some kernel modifications).

link

pron 4029 days ago

> re:Efficiency, magnitudes more efficient on read-side in all ways (including throughput and latency)

More efficient than what? RCU with a general-purpose GC? I thought that's what we were talking about.

In our in-memory database (written in Java), we use what you call RCU (I had no idea any copy-on-write without any synchronization on the read side is called RCU) where we can for read-mostly data, but the problem is that in some data structures, nodes may be referenced by two references (or more), so since we don't have DCAS, we can't do the atomic replacement of the node (in some circumstances we can forgo the transactional update of on one of the references, so we can do RCU). Another problem is that some writes must block reads to ensure transaction isolation. Instead, to reduce the cost of reads updating a read-write lock, we use optimistic locking that also ensures a reader requires no memory writes, but may retry due to a concurrent writer.

link

sbahra 4028 days ago

More efficient than using garbage collection, if you're working with a language where you can operate in an unmanaged mode (barring FFI). I do expect this gap to start closing based on some literature some friends have shared with me...

With most managed languages I've played with, I haven't found a compelling reason to use RCU-like mechanisms (except if it involves some FFI boundary or embedded language in unmanaged system).

What you're doing sounds like a seqlock pattern (see ck_sequence) or generation counter, not RCU. Fundamental to all RCU implementations is "grace period detection" (detection of the point in which it is safe to free memory) occurring without any read-side delays. With RCU, a reader never has to retry due to RCU itself.

RCU is really special in an unmanaged language, but I'm not sure it's that interesting for developers working in a managed language (unless they happen to be solving safe memory reclamation).

link

pron 4028 days ago

The optimistic locking I mentioned is what we use when RCU is inappropriate (due to multiple pointers).

But just to get the definitions straight: what you call RCU is a copy-on-write combined with a GC algorithm. What would you call a copy-on-write (with no synchronization on the reader's end whatsoever) that uses a general-purpose GC?

Also, I don't understand how hazard pointer GC can be more efficient than a general-purpose GC, given that modern GCs work based on the same principle, only much more refined (e.g. HotSpot creates a stack map containing the location of each reference in every thread's stack instead of a hazard pointer list). Of course compacting collectors (usually young-generation) don't do any reclamation work at all, instead they only work to copy the live objects, so it all comes down to the question what do you have more, live objects or dead objects? But in any case, simple hazard pointer mechanisms seem like a crude, rather old, non-compacting GC technique. I don't see how they can beat a good general-purpose GC.

link