He's assuming that retrying forever is a valid retry strategy, which it is not. For example if a page fault was needed to satisfy one of the memory access it would never finish.
To make his code work he likely would need a global fallback lock (or a real STM) and guarantee that every change of the touched memory uses those too (which would be hard)
I'm aware of the issue with non-terminating transactions, though I wasn't aware of the role played by page faults -- thanks for adding that detail.
Looking back over the readme, I can see how the loops used in the examples are a little misleading.
This is mostly a documentation issue: the core XACT code doesn't use infinite retry loops, and actually does not retry transactions at all. As with std::atomic, the goal is to provide a basic primitive and leave retry / backoff / etc. up to the user. This is especially important with lock-based fallbacks, as I can't pick one perfect lock to fit everyone's workload.
I ended up dropping retries because I ran into so many never-ending transactions in my early experiments with TSX. That was also my motivation for limiting the transactions to as few locations as possible.
As to the difficulty of protecting any memory touched in a transaction under a locking scheme: that kind of problem is exactly why XACT is focused on CAS-like operations on relatively limited sets of memory addresses.
Can you elaborate on the global lock? What's the motivation there?
Practically all valid fallback schemes require putting the lock (or something else like a sequence counter for a STM) into the read set of the transaction to properly synchronize between transactions and non transactions. Since you hide the transaction in your library it's not possible to do that with your current API. It would be very hard to construct a fallback path that is not racy.
(See Anti pattern #4 in the link above)
A global lock is usually the simplest fall back path, and the performance can be good enough because it's just a slow path. Of course it's always possible to do something more complex.
Yes with a read primitive it could be done in theory. It will be just quite awkward to use however as every caller has to do all that: define a lock, pass it always in, make sure the check for "lock is free" is correct etc.
Your unit tests don't seem to do it right.
It would probably be easier to hide the lock in your library, and enforce all other access to follow the right protocol using some ADTs. But then you just have a simple hardware TM accelerated STM.
FWIW the sweet spots for nice to use TM APIs are currently either lock elision, or compiler assisted TM (like __transaction* in gcc), or higher level libraries.
For anyone not aware, the parent commenter, Andi Kleene, is an expert on TSX. When I last seriously looked in to TSX, around 2013, he was maintaining a fork of glibc with support for TSX-optimized pthread primitives and had written most of the high quality blog posts and information about TSX available online.
It's really unfortunate semantics that a page fault condition during a transaction doesn't actually raise the fault. Is there a downside I'm not seeing to raising the fault and then aborting the transaction? (That way, retry would succeed.)
This would be only useful for "good" page faults that fault something in, but not for "bad" ones (like NULL pointer). If a bad page fault was executed it would allow transactions to crash the program, which wouldn't be very atomic.
The transaction mechanism doesn't know in advance if it's a good or a bad page fault.
You would need to tell the operating system kernel that the page fault happened in a transaction, and let it ignore it if it was a bad page fault. That would be much more complicated than current TSX.
Also there are other cases were retries will not succeed, page fault was just an example. Another common case is the dynamic linker when a library function is first executed.
> If a bad page fault was executed it would allow transactions to crash the program, which wouldn't be very atomic.
It would allow bad page faults to crash the program, i.e., ordinary behavior. No? Why do programs need this protection for HTM transactions?
> You would need to tell the operating system kernel that the page fault happened in a transaction, and let it ignore it if it was a bad page fault.
It wouldn't ignore it. It would fault the thread and probably tear down the process, as usual. No?
> Also there are other cases were retries will not succeed, page fault was just an example. Another common case is the dynamic linker when a library function is first executed.
That would be an abort due to excessive memory use?
Thanks! I'm not as familiar with this stuff as I would like to be.
A bad page fault might arise just from reading a partially-updated data structure. For example you could write two locations in one thread and read them in another. If the read side assumes that "location 1 nonzero" implies "location 2 nonzero", and then dereferences location 2, an inconsistent read would cause such a bad page fault. The only correct way to handle this is to abort the transaction.
The last time I read about TSX it was a story about how Intel pushed a microcode update to disable TSX because it was flawed. Has this been fixed in newer CPUs? Is there a risk of TSX being flawed on CPUs in the wild (for example, if you're missing the latest microcode updates?)
I wonder what exactly was the bug? As far as I can google it's not been made public. https://www-ssl.intel.com/content/dam/www/public/us/en/docum... (errata for the CPU I have, and I just tested and the TSX instructions aren't disabled on it btw. I probably enabled TSX in BIOS and forgot about it) mentions minor issues with string instructions' interaction with TSX, rdrand having a chance of hanging if called within a transaction, but the bug blamed for the disabling of TSX (HSW136) is just described as "unpredictable system behavior" "under a complex set of internal timing conditions and system events".
Let's say you're deploying to a random cloud VM that may or may not have the latest microcode/BIOS. How do you know if TSX is safe to use? Can it be determined in software by looking at CPUID values? (If so, do all TSX-using libraries/compilers insert such checks?)
The risk of subtle locking bugs in multi threaded applications due to CPU bugs makes me want to shy away from the entire feature.
CPUID values would be sufficient. TSX should be correct on Haswell-EX (Xeon E7), Broadwell except for the tablet SoCs (Core M), and all Skaylake, Kaby Lake and newer.
Note that most Linux distros put the latest microcode updates into all of their kernels for any supported version. That means that an updated box with an "old" distro is still going to be OK.
TSX was broken on Haswell CPUs. I don't know which specific newer microarchitecture fixes TSX. Microcode updates have disabled TSX on Haswell for a long time.
So, I looked through the readme and at the example code. I didn't dig into the implementation code.
How do you deal with group size limitations? My understanding is that the hardware transactional support makes no forward progress guarantees specifically because it's bound by what it can monitor in the cache. So if the group size is too large, then transactions can keep failing. Hopefully I am not missunderstsnding this. If this is correct it means libraries of this nature have to take a position with regards to group size limits.
TSX is a black box in many ways, and I think we can expect its behavior to change over time and across implementations.
I'm not enforcing an arbitrary limit on transaction size because the primary goal is just to expose a simple C++ API to fundamental primitives. The TSX intrinsics are much more difficult to work with, and assembly is painful.
If that seems like a cop-out, consider that DCAS is effectively a transaction size of two. TSX appears to handle this trivially. Yet DCAS is already a very powerful operation, and is useful in itself.
As the docs emphasize, the goal is not general transactions but extended versions of the small atomic operations already in common use.
In terms of safety and opinionatedness, I think of XACT like a library of locking primitives: pthread_spinlock_t is very useful, but it will not stop you from introducing deadlocks. Likewise, I won't stop you from attempting transactions that are too large to succceed on current hardware. Ultimately, I expect anyone using this to test and benchmark their own code on their own machines.
Beyond a certain size, transactions will be less and less valuable even if they can be successfully completed: if you're attempting 64-way CAS, benchmarks are probably going to guide you toward traditional locking anyway.
What is the motivation behind this? Multi-CAS is used as a basic building block for lock-free data structures to emulate more complicated transactional operations. But when you already have TSX, why would you use multi-CAS to emulate them? It's better to modify the algorithm and express the transactions directly using TSX.
In an ideal world yes, but TSX has some significant limitations. andikleen2 has mentioned some of those in his comments.
TSX is somewhat unpredictable as a general tool, and there are difficulties with e.g. knowing which transactions are even feasible. Generic "complicated transactional operations" also make lock-based fallbacks very difficult and expensive, which andikleen2 also touched on.
After experimenting with more general use of TSX, I very quickly came not to trust it. So the real motivation here is to tame TSX's unpredictability by using it in a very controlled way.
TSX simply isn't suitable yet for complicated transactions, but just providing hardware-level support for multi-CAS is already a big deal.
We used transactional memory in QEMU to emulate load-locked/store-conditional instructions, and it had much better performance than instrumenting each store manually (20%, I think).
See https://software.intel.com/en-us/articles/tsx-anti-patterns-... and https://software.intel.com/en-us/blogs/2013/06/23/tsx-fallba... for more details/
To make his code work he likely would need a global fallback lock (or a real STM) and guarantee that every change of the touched memory uses those too (which would be hard)
So I'm afraid the library is fairly broken.