|
RCU basically works by having a critical section only within which a reader may access the data structure. To know when to GC the old copy of the data, you wait until all readers leave the critical section. If you're in the kernel, this can be very efficient: you disable all interrupts within the CS, so you know a thread is never preempted while in it. This means that no matter how many threads there are, only those currently running on the available cores (say, there are 8 of them) can be in the CS, so no matter how many threads there are, you only need to wait for 8, and you know that that waiting period is going to be short because there are no interrupts. So the efficiency of the algorithms stems from disabling interrupts within the reader's critical section, which can only be done in kernel mode (otherwise you could do really bad things). A similar algorithm in userspace will need to find another mechanism to wait for all readers to exit the CS, which may or may not be efficient. They may share then name "RCU", but not its desirable qualities. There are, AFAIK, several different algorithms calling themselves userspace RCU. One userspace RCU algorithm I've seen essentially protects the deallocation with a read-write, which isn't known for terrific scalability, but I don't know, they may have come up with something better. Garbage collection is the Achille's heel of all nonblocking data structures. As far as I know, there is no algorithm more efficient than a general purpose GC just yet, which is both generally applicable and efficient. Common authorship, BTW, says little. Most concurrent algorithms out there were invented by a handful of people. |
The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive. In addition to that, note that a deferral interface is available so writers basically never have to block until it is actually safe to free memory. URCU and ck_epoch all share those desirable qualities.
2) "RCU algorithm I've seen ... protects with a read-write"
The synchronize portion of RCU and RCU-like systems is almost always heavy-weight. Applications that adopt URCU can typically afford this.
3) "no algorithm more efficient than a general purpose GC"
All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable. However, they're still applicable for plenty of workloads. We are starting to see more GC mechanisms adopt SMR-like techniques so that may change.
4) "Common authorship, BTW, says little"
Not in this case.
If you would like to learn more about the performance trade-off with the various mechanisms, I suggest http://www.rdrop.com/users/paulmck/RCU/hart_ipdps06.pdf and the URCU paper as a start. I've some additional numbers on something similar at http://concurrencykit.org/presentations/ebr.pdf
Paul also has a good introduction up at https://queue.acm.org/detail.cfm?id=2488549