| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by davidtgoldblatt 3657 days ago

I assume interrupts would be disabled or deferred during this section, so you get stronger guarantees on worst-case behavior; hence the bound on the number of cycles a thread is allowed to execute for.

On one of the less common unixes (Sun's maybe? I don't totally remember), you could set a bit in some thread-specific part of memory to indicate that you were uninterruptible. If the OS saw that bit was set, it would set a "you need to yield" bit in the same part of memory and then allow the thread to continue (and penalize threads that abused the feature).

As a practical matter, I suspect that the scheme proposed wouldn't have advantages over a careful use of HTM.

I also suspect that the "lock two cachelines" thing is harder to implement than it sounds. The way Intel implements atomic operations that span cachelines now involves waiting for a interconnect quiescence period (taking O(microseconds)). What's proposed here is strictly more general.

1 comments

pizlonator 3657 days ago

Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped. It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in kernel, you probably don't want this.

As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: uncontended critical section or contended critical section that has a race. It also prevents you from doing effects outside of memory (it's just guaranteed to revert to locks in that case, so you pay all of the overhead of HTM and all of the overhead of locks).

Interesting about the difficulty of locking two cache lines. To me the important thing to keep in mind when talking about concurrency alternatives to the status quo is: nobody has proved that contended-but-not-racy critical sections are common. According to my data, they are unicorns. This means that from a performance standpoint, any proposed concurrency protocol that is different from locking must justify how it plans to beat locks on the two things that they are super good at, which are uncontended critical sections (with locks you pay one or two CASes on cache lines you probably already own) and contended-and-racy critical sections (with locks you pay for some CASes but the lock will make threads contend as quietly as possible to allow the lock owner to proceed quickly). If a proposal is worse than locks on the two most common kinds of critical sections, then that's silly. AFAICT, all of these trasactions-or-better-CAS approaches would only help in the case of contended-but-not-racy critical sections, which are too rare to matter. The two cache line thing definitely sounds like it will be more expensive than locking in the uncontended case, and I don't see how it will help in the contended-but-racy case.

EDIT: I said "contended-but-racy" when I meant "contended-but-not-racy" in one place, above. Fixed.

link

davidtgoldblatt 3657 days ago

> Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped.

Presumably the semantics of the "lock these two cachelines" instruction trap when one of the cachelines you're trying to lock isn't present, rather than when you perform an operation on that cacheline subsequent to the locking.

> It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in kernel, you probably don't want this.

I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.

To be clear, I don't think this instruction is a good idea; I just don't think it's obviously bad.

> As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: uncontended critical section or contended critical section that has a race. It also prevents you from doing effects outside of memory (it's just guaranteed to revert to locks in that case, so you pay all of the overhead of HTM and all of the overhead of locks).

HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.

I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging? Certainly that will never be cheap. I think most of the desire for lock-freedom comes less from fast-path cycle reduction as much as protection from the scheduler or some very slow process. There's also plenty of situations in which data structures are touched mostly by one thread, but periodically need contention management. The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one even if the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write-buffer).

link

pizlonator 3657 days ago

> Presumably the semantics of the "lock these two cachelines" instruction trap when one of them is paged out, rather than deferring the trap until you write to one of the cachelines you've locked.

Right, I didn't think of that.

> I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.

I was talking about disabling interrupts and not lock-freedom in general or CAS-2. I agree that CAS-2 would make those things simpler if it was also fast enough. Sorry about the confusing context switch.

> HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.

Can you define what "some domains" is? I think it's important to be specific.

> I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging?

"Contended-and-racy" means that if you took the lock away, you'd have a data race. It means that if you used a transaction then the race detection that aborts a transaction would conclude that there had been a race and abort. Contended-and-racy is exactly the case where TM is not a speed-up, pretty much by definition.

If contended-but-not-racy critical sections were common, then it would make sense to pay some base cost for executing in a TM critical section since it would increase concurrency. But most contended critical sections are also racy. TM won't give you a speed-up in a racy section, so you end up paying all of the cost without getting any of the benefit.

> I think most of the desire for lock-freedom comes less from fast-path cycle reduction as much as protection from the scheduler or some very slow process.

Yeah, and also deadlock avoidance. And that lock-free algorithms are often faster in the uncontended and contended-and-racy cases. I'm a big fan of lock-free algorithms based on conventional word CAS.

But that's different than TM or 2-cache-line-CAS. Lock-free algorithms based on conventional word CAS tend to be faster than locks in the uncontended case. TM, and probably 2-CAS, is slower than locks. It's easy to justify handling uncommon scheduling pathologies, or using exotic approaches for avoiding deadlock, if it also gives you a speed-up. Not so much if it's slow like TM.

> The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one even if the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write-buffer).

It's true that the uncontended cost of locks is dominated by the uncontended cost of CAS. But this cost is much smaller than the unconteded cost of HTM, and I suspect that the unconteded cost of CAS is also lower than the uncontended cost of 2-cache-line-CAS. If the uncontended cost of 2-cache-line-CAS is more than 2x more than the uncontended cost of CAS, then probably using a lock is better.

link

gpderetta 3657 days ago

Transactions compose and TM is potentially easier to use.

Also some algorithms necessarily require locking multiple mutexes (think for example of a wait-for-any operation). HTM in principle would allow implementing them more efficiently.

link

pizlonator 3657 days ago

Transactions don't compose with I/O or any OS side-effects that the TM system isn't in the loop on, while locks compose fine with those things. One module that uses locks is free to call another module that uses locks so long as the relationship is one-way like user->kernel, so locks compose also, just with different constraints. In short, transactions don't compose with I/O while locks don't compose with cyclic relationships. So, the statement that "transactions compose" is false in both directions: it's false because transactions don't compose in general, and it's false because it implies that locks don't, while they in fact do.

The only empirical evidence about transactions being easier to use is actually contrary: even with the availability of HTM, it's not used nearly as widely as locks, because if you enable HTM for every lock then everything runs a lot slower (or doesn't run at all).

link

gpderetta 3657 days ago

Under the proposed C++ TM, synchronised blocks can do i/o. They are not transactional, but do compose.

If you have two locked data structures and you want an operation to manipulate both atomically you need to expose the locks. And if you want the operation itself to compose with others... well get hairy.

TM is widely used I believe by least Haskel and Closure programmers, even with non hardware accelerated versions because of the ease of use.

HTM is another story, it's availability is still relatively restricted. Time will tell.

link

pizlonator 3657 days ago

Closure is indeed TM-based, but that's not an A/B test of ease-of-use. I don't think there exists evidence to suggest that transactions are easier to use.

I think you are confused about the meaning of composability. The synchronized/atomic TM proposal for C++ is roughly single-global-lock. This means that you can still deadlock if you use synchronized to protect IPC with another process, which then does an IPC call back to you, which you service on another thread that also uses synchronized. Whenever these transactional proposals fall back on locking, they immediately bring in the pitfalls of locking.

The C++ TM proposal is a joke. There's no good reason to use concurrency if you won't get a speed-up, and the current design of synchronized is sure to make it more difficult to get speed-ups.

link