XACT: Lock-Free Multi-CAS for C++/x64 Built on TSX | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	XACT: Lock-Free Multi-CAS for C++/x64 Built on TSX (github.com)
	56 points by scivey7 3466 days ago

6 comments

andikleen2 3465 days ago

He's assuming that retrying forever is a valid retry strategy, which it is not. For example if a page fault was needed to satisfy one of the memory access it would never finish.

See https://software.intel.com/en-us/articles/tsx-anti-patterns-... and https://software.intel.com/en-us/blogs/2013/06/23/tsx-fallba... for more details/

To make his code work he likely would need a global fallback lock (or a real STM) and guarantee that every change of the touched memory uses those too (which would be hard)

So I'm afraid the library is fairly broken.

scivey7 3465 days ago

I'm aware of the issue with non-terminating transactions, though I wasn't aware of the role played by page faults -- thanks for adding that detail.

Looking back over the readme, I can see how the loops used in the examples are a little misleading.

This is mostly a documentation issue: the core XACT code doesn't use infinite retry loops, and actually does not retry transactions at all. As with std::atomic, the goal is to provide a basic primitive and leave retry / backoff / etc. up to the user. This is especially important with lock-based fallbacks, as I can't pick one perfect lock to fit everyone's workload.

I ended up dropping retries because I ran into so many never-ending transactions in my early experiments with TSX. That was also my motivation for limiting the transactions to as few locations as possible.

I'm just now starting to reexamine this and add some configurable retry logic back in -- e.g. the retry policy here is used in some test code: https://github.com/scivey/xact/blob/master/include/xact/atom...

As to the difficulty of protecting any memory touched in a transaction under a locking scheme: that kind of problem is exactly why XACT is focused on CAS-like operations on relatively limited sets of memory addresses.

Can you elaborate on the global lock? What's the motivation there?

andikleen2 3465 days ago

Practically all valid fallback schemes require putting the lock (or something else like a sequence counter for a STM) into the read set of the transaction to properly synchronize between transactions and non transactions. Since you hide the transaction in your library it's not possible to do that with your current API. It would be very hard to construct a fallback path that is not racy.

(See Anti pattern #4 in the link above)

A global lock is usually the simplest fall back path, and the performance can be good enough because it's just a slow path. Of course it's always possible to do something more complex.

scivey7 3465 days ago

Agreed that the basic "store to 8 locations" API would need tweaking to allow locking.

Re: adding a counter into the read set, I think the new generalized API here will support that out of the box: https://github.com/scivey/xact/blob/master/docs/api/generali...

Thoughts?

andikleen2 3465 days ago

Yes with a read primitive it could be done in theory. It will be just quite awkward to use however as every caller has to do all that: define a lock, pass it always in, make sure the check for "lock is free" is correct etc.

Your unit tests don't seem to do it right.

It would probably be easier to hide the lock in your library, and enforce all other access to follow the right protocol using some ADTs. But then you just have a simple hardware TM accelerated STM.

FWIW the sweet spots for nice to use TM APIs are currently either lock elision, or compiler assisted TM (like __transaction* in gcc), or higher level libraries.

scivey7 3465 days ago

Your feedback has been very helpful. Do you mind if I ask you for more advice down the line?

hendzen 3465 days ago

For anyone not aware, the parent commenter, Andi Kleene, is an expert on TSX. When I last seriously looked in to TSX, around 2013, he was maintaining a fork of glibc with support for TSX-optimized pthread primitives and had written most of the high quality blog posts and information about TSX available online.

loeg 3465 days ago

It's really unfortunate semantics that a page fault condition during a transaction doesn't actually raise the fault. Is there a downside I'm not seeing to raising the fault and then aborting the transaction? (That way, retry would succeed.)

andikleen2 3465 days ago

This would be only useful for "good" page faults that fault something in, but not for "bad" ones (like NULL pointer). If a bad page fault was executed it would allow transactions to crash the program, which wouldn't be very atomic.

The transaction mechanism doesn't know in advance if it's a good or a bad page fault.

You would need to tell the operating system kernel that the page fault happened in a transaction, and let it ignore it if it was a bad page fault. That would be much more complicated than current TSX.

Also there are other cases were retries will not succeed, page fault was just an example. Another common case is the dynamic linker when a library function is first executed.

loeg 3465 days ago

> If a bad page fault was executed it would allow transactions to crash the program, which wouldn't be very atomic.

It would allow bad page faults to crash the program, i.e., ordinary behavior. No? Why do programs need this protection for HTM transactions?

> You would need to tell the operating system kernel that the page fault happened in a transaction, and let it ignore it if it was a bad page fault.

It wouldn't ignore it. It would fault the thread and probably tear down the process, as usual. No?

> Also there are other cases were retries will not succeed, page fault was just an example. Another common case is the dynamic linker when a library function is first executed.

That would be an abort due to excessive memory use?

Thanks! I'm not as familiar with this stuff as I would like to be.

bonzini 3465 days ago

A bad page fault might arise just from reading a partially-updated data structure. For example you could write two locations in one thread and read them in another. If the read side assumes that "location 1 nonzero" implies "location 2 nonzero", and then dereferences location 2, an inconsistent read would cause such a bad page fault. The only correct way to handle this is to abort the transaction.

0x0 3465 days ago

The last time I read about TSX it was a story about how Intel pushed a microcode update to disable TSX because it was flawed. Has this been fixed in newer CPUs? Is there a risk of TSX being flawed on CPUs in the wild (for example, if you're missing the latest microcode updates?)

http://www.anandtech.com/show/8376/intel-disables-tsx-instru...

htns 3465 days ago

I wonder what exactly was the bug? As far as I can google it's not been made public. https://www-ssl.intel.com/content/dam/www/public/us/en/docum... (errata for the CPU I have, and I just tested and the TSX instructions aren't disabled on it btw. I probably enabled TSX in BIOS and forgot about it) mentions minor issues with string instructions' interaction with TSX, rdrand having a chance of hanging if called within a transaction, but the bug blamed for the disabling of TSX (HSW136) is just described as "unpredictable system behavior" "under a complex set of internal timing conditions and system events".

greglindahl 3465 days ago

If you have a more-recent-than-2014 kernel, BIOS, or stepping, the feature bit ought to be accurate.

So sure, there are some systems in the wild that are broken, but probably not that many.

0x0 3465 days ago

Let's say you're deploying to a random cloud VM that may or may not have the latest microcode/BIOS. How do you know if TSX is safe to use? Can it be determined in software by looking at CPUID values? (If so, do all TSX-using libraries/compilers insert such checks?)

The risk of subtle locking bugs in multi threaded applications due to CPU bugs makes me want to shy away from the entire feature.

wtallis 3465 days ago

CPUID values would be sufficient. TSX should be correct on Haswell-EX (Xeon E7), Broadwell except for the tablet SoCs (Core M), and all Skaylake, Kaby Lake and newer.

greglindahl 3464 days ago

Note that most Linux distros put the latest microcode updates into all of their kernels for any supported version. That means that an updated box with an "old" distro is still going to be OK.

0x0 3464 days ago

Does that work under a hypervisor/xen/VM/whatever? Can you apply a microcode update only within a given VM?

loeg 3465 days ago

TSX was broken on Haswell CPUs. I don't know which specific newer microarchitecture fixes TSX. Microcode updates have disabled TSX on Haswell for a long time.

bonzini 3465 days ago

Haswell Xeon E7 do not have the bug and they do enable TSC.

DSingularity 3465 days ago

So, I looked through the readme and at the example code. I didn't dig into the implementation code.

How do you deal with group size limitations? My understanding is that the hardware transactional support makes no forward progress guarantees specifically because it's bound by what it can monitor in the cache. So if the group size is too large, then transactions can keep failing. Hopefully I am not missunderstsnding this. If this is correct it means libraries of this nature have to take a position with regards to group size limits.

So what is your approach?

scivey7 3465 days ago

You're correct: the limits on transaction size are unknown. That's mentioned in the documentation here https://github.com/scivey/xact/blob/master/docs/api/n_way.md

TSX is a black box in many ways, and I think we can expect its behavior to change over time and across implementations.

I'm not enforcing an arbitrary limit on transaction size because the primary goal is just to expose a simple C++ API to fundamental primitives. The TSX intrinsics are much more difficult to work with, and assembly is painful.

If that seems like a cop-out, consider that DCAS is effectively a transaction size of two. TSX appears to handle this trivially. Yet DCAS is already a very powerful operation, and is useful in itself.

As the docs emphasize, the goal is not general transactions but extended versions of the small atomic operations already in common use.

In terms of safety and opinionatedness, I think of XACT like a library of locking primitives: pthread_spinlock_t is very useful, but it will not stop you from introducing deadlocks. Likewise, I won't stop you from attempting transactions that are too large to succceed on current hardware. Ultimately, I expect anyone using this to test and benchmark their own code on their own machines.

Beyond a certain size, transactions will be less and less valuable even if they can be successfully completed: if you're attempting 64-way CAS, benchmarks are probably going to guide you toward traditional locking anyway.

zvrba 3465 days ago

What is the motivation behind this? Multi-CAS is used as a basic building block for lock-free data structures to emulate more complicated transactional operations. But when you already have TSX, why would you use multi-CAS to emulate them? It's better to modify the algorithm and express the transactions directly using TSX.

scivey7 3465 days ago

In an ideal world yes, but TSX has some significant limitations. andikleen2 has mentioned some of those in his comments.

TSX is somewhat unpredictable as a general tool, and there are difficulties with e.g. knowing which transactions are even feasible. Generic "complicated transactional operations" also make lock-based fallbacks very difficult and expensive, which andikleen2 also touched on.

After experimenting with more general use of TSX, I very quickly came not to trust it. So the real motivation here is to tame TSX's unpredictability by using it in a very controlled way.

TSX simply isn't suitable yet for complicated transactions, but just providing hardware-level support for multi-CAS is already a big deal.

bcatanzaro 3465 days ago

Are there any performance benchmarks to show when this kind of approach is useful over less exotic solutions?

bonzini 3465 days ago

We used transactional memory in QEMU to emulate load-locked/store-conditional instructions, and it had much better performance than instrumenting each store manually (20%, I think).

brudgers 3466 days ago

If it meets the guidelines, this might make a good 'Show HN'. Show HN guidelines: https://news.ycombinator.com/showhn.html