|
|
|
|
|
by pizlonator
3657 days ago
|
|
Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped. It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in kernel, you probably don't want this. As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: uncontended critical section or contended critical section that has a race. It also prevents you from doing effects outside of memory (it's just guaranteed to revert to locks in that case, so you pay all of the overhead of HTM and all of the overhead of locks). Interesting about the difficulty of locking two cache lines. To me the important thing to keep in mind when talking about concurrency alternatives to the status quo is: nobody has proved that contended-but-not-racy critical sections are common. According to my data, they are unicorns. This means that from a performance standpoint, any proposed concurrency protocol that is different from locking must justify how it plans to beat locks on the two things that they are super good at, which are uncontended critical sections (with locks you pay one or two CASes on cache lines you probably already own) and contended-and-racy critical sections (with locks you pay for some CASes but the lock will make threads contend as quietly as possible to allow the lock owner to proceed quickly). If a proposal is worse than locks on the two most common kinds of critical sections, then that's silly. AFAICT, all of these trasactions-or-better-CAS approaches would only help in the case of contended-but-not-racy critical sections, which are too rare to matter. The two cache line thing definitely sounds like it will be more expensive than locking in the uncontended case, and I don't see how it will help in the contended-but-racy case. EDIT: I said "contended-but-racy" when I meant "contended-but-not-racy" in one place, above. Fixed. |
|
Presumably the semantics of the "lock these two cachelines" instruction trap when one of the cachelines you're trying to lock isn't present, rather than when you perform an operation on that cacheline subsequent to the locking.
> It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in kernel, you probably don't want this.
I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.
To be clear, I don't think this instruction is a good idea; I just don't think it's obviously bad.
> As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: uncontended critical section or contended critical section that has a race. It also prevents you from doing effects outside of memory (it's just guaranteed to revert to locks in that case, so you pay all of the overhead of HTM and all of the overhead of locks).
HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.
I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging? Certainly that will never be cheap. I think most of the desire for lock-freedom comes less from fast-path cycle reduction as much as protection from the scheduler or some very slow process. There's also plenty of situations in which data structures are touched mostly by one thread, but periodically need contention management. The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one even if the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write-buffer).