| > Presumably the semantics of the "lock these two cachelines" instruction trap when one of them is paged out, rather than deferring the trap until you write to one of the cachelines you've locked. Right, I didn't think of that. > I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful. I was talking about disabling interrupts and not lock-freedom in general or CAS-2. I agree that CAS-2 would make those things simpler if it was also fast enough. Sorry about the confusing context switch. > HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives. Can you define what "some domains" is? I think it's important to be specific. > I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging? "Contended-and-racy" means that if you took the lock away, you'd have a data race. It means that if you used a transaction then the race detection that aborts a transaction would conclude that there had been a race and abort. Contended-and-racy is exactly the case where TM is not a speed-up, pretty much by definition. If contended-but-not-racy critical sections were common, then it would make sense to pay some base cost for executing in a TM critical section since it would increase concurrency. But most contended critical sections are also racy. TM won't give you a speed-up in a racy section, so you end up paying all of the cost without getting any of the benefit. > I think most of the desire for lock-freedom comes less from fast-path cycle reduction as much as protection from the scheduler or some very slow process. Yeah, and also deadlock avoidance. And that lock-free algorithms are often faster in the uncontended and contended-and-racy cases. I'm a big fan of lock-free algorithms based on conventional word CAS. But that's different than TM or 2-cache-line-CAS. Lock-free algorithms based on conventional word CAS tend to be faster than locks in the uncontended case. TM, and probably 2-CAS, is slower than locks. It's easy to justify handling uncommon scheduling pathologies, or using exotic approaches for avoiding deadlock, if it also gives you a speed-up. Not so much if it's slow like TM. > The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one even if the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write-buffer). It's true that the uncontended cost of locks is dominated by the uncontended cost of CAS. But this cost is much smaller than the unconteded cost of HTM, and I suspect that the unconteded cost of CAS is also lower than the uncontended cost of 2-cache-line-CAS. If the uncontended cost of 2-cache-line-CAS is more than 2x more than the uncontended cost of CAS, then probably using a lock is better. |