
by andikleen2 3465 days ago

He's assuming that retrying forever is a valid retry strategy, which it is not. For example, if a page fault were needed to satisfy one of the memory accesses, the transaction would never finish.
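A minimal sketch of what a bounded retry policy could look like, as opposed to looping forever. The bit positions are the documented RTM abort-status bits (they mirror `_XABORT_RETRY`, `_XABORT_CONFLICT` and `_XABORT_CAPACITY` in `<immintrin.h>`); the constant names, `should_retry`, and the attempt limit are hypothetical and not part of XACT. The retry bit is treated only as a hint.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of the RTM abort-status bits, so the policy can be
// compiled and tested without TSX hardware.
constexpr std::uint32_t kAbortRetry    = 1u << 1;  // hardware hints a retry may succeed
constexpr std::uint32_t kAbortConflict = 1u << 2;  // data conflict with another thread
constexpr std::uint32_t kAbortCapacity = 1u << 3;  // transactional buffer overflowed

// Hypothetical policy: decide whether to try the transaction again
// or to give up and take a non-transactional fallback path.
inline bool should_retry(std::uint32_t abort_status,
                         int attempts_so_far,
                         int max_attempts) {
  assert(max_attempts >= 0);
  // Never loop forever: the attempt budget is a hard limit.
  if (attempts_so_far >= max_attempts) return false;
  // A capacity abort will most likely repeat with the same working set.
  if (abort_status & kAbortCapacity) return false;
  // Without the retry hint (for example after a fault inside the
  // transaction), another attempt is not expected to make progress.
  return (abort_status & kAbortRetry) != 0;
}
```

A caller would feed the value returned by a failed `_xbegin()` into `should_retry` and switch to its fallback path as soon as it returns `false`.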

See https://software.intel.com/en-us/articles/tsx-anti-patterns-... and https://software.intel.com/en-us/blogs/2013/06/23/tsx-fallba... for more details.

To make his code work, he would likely need a global fallback lock (or a real STM) and a guarantee that every change to the touched memory uses it too (which would be hard).

So I'm afraid the library is fairly broken.

3 comments

scivey7 3465 days ago

I'm aware of the issue with non-terminating transactions, though I wasn't aware of the role played by page faults -- thanks for adding that detail.

Looking back over the readme, I can see how the loops used in the examples are a little misleading.

This is mostly a documentation issue: the core XACT code doesn't use infinite retry loops, and actually does not retry transactions at all. As with std::atomic, the goal is to provide a basic primitive and leave retry / backoff / etc. up to the user. This is especially important with lock-based fallbacks, as I can't pick one perfect lock to fit everyone's workload.

I ended up dropping retries because I ran into so many never-ending transactions in my early experiments with TSX. That was also my motivation for limiting the transactions to as few locations as possible.

I'm just now starting to reexamine this and add some configurable retry logic back in -- e.g. the retry policy here is used in some test code: https://github.com/scivey/xact/blob/master/include/xact/atom...

As to the difficulty of protecting any memory touched in a transaction under a locking scheme: that kind of problem is exactly why XACT is focused on CAS-like operations on relatively limited sets of memory addresses.

Can you elaborate on the global lock? What's the motivation there?


andikleen2 3465 days ago

Practically all valid fallback schemes require putting the lock (or something else, like a sequence counter for an STM) into the read set of the transaction to properly synchronize between transactions and non-transactional code. Since you hide the transaction in your library, it's not possible to do that with your current API. It would be very hard to construct a fallback path that is not racy.

(See anti-pattern #4 in the link above.)

A global lock is usually the simplest fallback path, and the performance can be good enough because it's just a slow path. Of course it's always possible to do something more complex.
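A minimal sketch of the global-lock scheme, assuming x86 with GCC or Clang. `_xbegin`, `_xend` and `_xabort` are the RTM intrinsics from `<immintrin.h>`; `g_fallback_held`, `cpu_has_rtm`, `try_elided` and `run_atomically` are hypothetical names, and none of this is XACT's API. The point is that the transaction reads the lock word, so a thread taking the slow path aborts every in-flight transaction.

```cpp
#include <atomic>
#include <cassert>
#include <cpuid.h>      // __get_cpuid_count (GCC/Clang)
#include <immintrin.h>  // _xbegin, _xend, _xabort

// Hypothetical global fallback lock: a simple spinlock flag.
static std::atomic<bool> g_fallback_held{false};

// CPUID leaf 7, subleaf 0, EBX bit 11 reports RTM support.
static bool cpu_has_rtm() {
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
  return (ebx & (1u << 11)) != 0;
}

// One transactional attempt. Returns false if the transaction aborted.
__attribute__((target("rtm")))
static bool try_elided(void (*body)(void*), void* arg) {
  unsigned status = _xbegin();
  if (status != _XBEGIN_STARTED) return false;  // aborted, effects rolled back
  // Reading the lock puts it into the read set: if another thread
  // takes the lock later, its write aborts this transaction.
  if (g_fallback_held.load(std::memory_order_relaxed)) _xabort(0xff);
  body(arg);
  _xend();
  return true;
}

// Try a few transactions, then fall back to the global lock.
void run_atomically(void (*body)(void*), void* arg, int max_attempts = 3) {
  assert(max_attempts >= 0);
  static const bool has_rtm = cpu_has_rtm();
  for (int i = 0; has_rtm && i < max_attempts; ++i) {
    // Wait until the lock is free so the attempt is not doomed from the start.
    while (g_fallback_held.load(std::memory_order_acquire)) {}
    if (try_elided(body, arg)) return;
  }
  // Slow path: run the same body under the lock (busy-wait for brevity).
  while (g_fallback_held.exchange(true, std::memory_order_acquire)) {}
  body(arg);
  g_fallback_held.store(false, std::memory_order_release);
}
```

This only works if every access to the protected data goes through `run_atomically`, which is the protocol problem raised earlier in the thread.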


scivey7 3465 days ago

Agreed that the basic "store to 8 locations" API would need tweaking to allow locking.

Re: adding a counter into the read set, I think the new generalized API here will support that out of the box: https://github.com/scivey/xact/blob/master/docs/api/generali...

Thoughts?


andikleen2 3465 days ago

Yes, with a read primitive it could be done in theory. It would just be quite awkward to use, however, as every caller has to do all of that: define a lock, always pass it in, make sure the check for "lock is free" is correct, etc.

Your unit tests don't seem to do it right.

It would probably be easier to hide the lock in your library, and enforce all other access to follow the right protocol using some ADTs. But then you just have a simple hardware TM accelerated STM.

FWIW the sweet spots for nice to use TM APIs are currently either lock elision, or compiler assisted TM (like __transaction* in gcc), or higher level libraries.


scivey7 3465 days ago

Your feedback has been very helpful. Do you mind if I ask you for more advice down the line?


hendzen 3465 days ago

For anyone not aware, the parent commenter, Andi Kleen, is an expert on TSX. When I last seriously looked into TSX, around 2013, he was maintaining a fork of glibc with support for TSX-optimized pthread primitives and had written most of the high quality blog posts and information about TSX available online.


loeg 3465 days ago

It's really unfortunate semantics that a page fault condition during a transaction doesn't actually raise the fault. Is there a downside I'm not seeing to raising the fault and then aborting the transaction? (That way, retry would succeed.)


andikleen2 3465 days ago

This would be only useful for "good" page faults that fault something in, but not for "bad" ones (like NULL pointer). If a bad page fault was executed it would allow transactions to crash the program, which wouldn't be very atomic.

The transaction mechanism doesn't know in advance if it's a good or a bad page fault.

You would need to tell the operating system kernel that the page fault happened in a transaction, and let it ignore it if it was a bad page fault. That would be much more complicated than current TSX.

Also there are other cases where retries will not succeed, page fault was just an example. Another common case is the dynamic linker when a library function is first executed.


loeg 3465 days ago

> If a bad page fault was executed it would allow transactions to crash the program, which wouldn't be very atomic.

It would allow bad page faults to crash the program, i.e., ordinary behavior. No? Why do programs need this protection for HTM transactions?

> You would need to tell the operating system kernel that the page fault happened in a transaction, and let it ignore it if it was a bad page fault.

It wouldn't ignore it. It would fault the thread and probably tear down the process, as usual. No?

> Also there are other cases where retries will not succeed, page fault was just an example. Another common case is the dynamic linker when a library function is first executed.

That would be an abort due to excessive memory use?

Thanks! I'm not as familiar with this stuff as I would like to be.


bonzini 3465 days ago

A bad page fault might arise just from reading a partially-updated data structure. For example you could write two locations in one thread and read them in another. If the read side assumes that "location 1 nonzero" implies "location 2 nonzero", and then dereferences location 2, an inconsistent read would cause such a bad page fault. The only correct way to handle this is to abort the transaction.
