Hacker News new | ask | show | jobs
by jqpabc123 1816 days ago
If thread 2 copies done into a register before thread 1 executes, it may keep using that register for the entire loop, never noticing that thread 1 later modifies done.

Alternative solution: Forget all the "atomic" semantics and simply avoid "optimization" of global variables. Access to any global variable should always occur direct from memory. Sure, this will be less than optimal in some cases but such is the price of using globals. Their use should be discouraged anyway.

In other words, make "atomic" the sensible and logical default with globals. Assignment is an "atomic" operation, just don't circumvent it by using a local copy as an "optimization".

3 comments

This problem isn't specific to global variables; it happens with all shared mutable state. I would assume that the author only used global variables because that lets them keep the working examples as short as possible, and minimize irrelevant details.

And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Compared to using a register, accessing main memory typically takes something on the order of 100 times as long. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.

> And yes, you can put a full memory fence around every access to a variable that is shared across threads. But doing so would just destroy the performance of your program. Given that we're talking about concerns that are specific to a relatively low-level approach to parallelism, I think it's safe to assume that performance is the whole point, so that would be an unacceptable tradeoff.

Indeed.

Just a reminder to everyone: your pthreads_mutex_lock() and pthreads_mutex_unlock() functions already contain the appropriate compiler / cache memory barriers in the correct locations.

This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.

This is the stuff of cutting edge research right now: its a niche subject. Your typical programmer _SHOULD_ just stick a typical pthread_mutex_t onto an otherwise single-threaded data-structure and call it. Locks work. They're not "the best", but "the best" is constantly being researched / developed right now. I'm pretty sure that any new lockfree data-structure with decent performance is pretty much an instant PH.D thesis material.

-----------

Anyway, the reason why "single-threaded data-structure behind a mutex" works is because your data-structure still keeps all of its performance benefits (from sticking to L1 cache, or letting the compiler "manually cache" data to registers when appropriate), and then you only lose performance when associated with the lock() or unlock() calls (which will innately have memory barriers to publish the results)

That's 2 memory barriers (one barrier for lock() and one barrier for unlock()). The thing about lock-free algorithms is that they __might__ get you down to __1__ memory barrier per operation if you're a really, really good programmer. But its not exactly easy. (Or: they might still have 2 memory barriers but the lockfree aspect of "always forward progress" and/or deadlock free might be easier to prove)

Writing a low-performance but otherwise correct lock free algorithm isn't actually that hard. Writing a lock free algorithm that beats your typical mutex + data-structure however, is devilishly hard.

> This "Memory Model" discussion is only for people who want to build faster systems: for people searching for a "better spinlock", or for writing lock-free algorithms / lock-free data structures.

Actually, most practitioner code has bugs from their implicit assumptions that shared variable writes are visible or ordered the way they think they are.

But the practitioner doesn't need to know the memory model (aside from "memory models are complicated").

To solve that problem, the practitioner only needs to know that "mutex.lock()" and "mutex.unlock()" orders reads/writes in a clearly defined manner. If the practitioner is wondering about the difference between load-acquire and load-relaxed, they've probably gone too deep.

> To solve that problem, the practitioner only needs to know that "mutex.lock()"

This is true, but they do not know that. If you do not give some kind of substantiation, they will shrug it off and go back to "nah this thing doesn't need a mutex", like with a polling variable (contrived example).

Can you explain what you mean by a "polling variable" needing a mutex? Usually polling is done using atomic instructions instead of a mutex. Are you referring to condition variables?
Removing optimizations on "global" variables will leave the bug in Singleton objects (which are very similar to global variables, but the compiler doesn't know that they're global)

---------

"Volatile" is close but not good enough semantically to describe what we want. That's why these new Atomic-variables are being declared with seqcst semantics (be it in Java, C++, C, or whatever you program in).

That's the thing: we need a new class of variables that wasn't known 20 years ago. Variables that follow the sequential-consistency requirements, for this use case.

---------

Note: on ARM systems, even if the compiler doesn't mess up, the L1 cache could mess you up. ARM has multiple load/store semantics available. If you have relaxed (default) semantics on a load, it may be on a "stale" value from DDR4.

That is to say: your loop may load a value into L1 cache, then your core will read the variable over and over from L1 cache (not realizing that L3 cache has been updated to a new value). Not only does your compiler need to know to "not store the value in a register", the memory-subsystem also needs to know to read the data from L3 cache over-and-over again (never using L1 cache).

Rearranging loads/stores on x86 is not allowed in this manner. But ARM is more "Relaxed" than x86. If reading the "Freshest" value is important, you need to have the appropriate memory barriers on ARM (or PowerPC).

which are very similar to global variables, but the compiler doesn't know that they're global

Since as you say, they are very similar, wouldn't it be reasonable to assume for access purposes that they are effectively global?

Lets do this example in Java (but it should be simple enough that C#, Python, Javascript and other programmers would understand it).

    public void myFunction(FooObject o){
        o.doSomething();
    }
How does the compiler know if "FooObject o" is a singleton or not? That's the thing about the "Singleton" pattern, you have an effective "global-ish" variable, but all of your code is written with normal pass-the-object style.

EDIT: If you're not aware, the way this works is that you call myFunction(getTheSingleton());, where "getTheSingleton()" fetches the Singleton object. myFunction() has no way of "knowing" its actually interacting with the singleton. This is a useful pattern, because you can create unit-tests over the "global" variable by simply mocking out the Singleton for a mock object (or maybe an object with preset state for better unit testing). Among other benefits (but also similar downsides to using a global variable: difficult to reason because you have this "shared state" being used all over the place)

Is there anything special about singletons here?

In Java, there's no real difference between a singleton and any other object. A singleton is an object that just happens to have a single instance. Practically speaking, they're typically used as a clever design pattern to "work around" Java's lack of language-level support for global variables, so there's that. But I think that that fact might not be relevant to the issue at hand?

The more basic issue is, if you have two different threads concurrently executing `myFunction`, what happens when they're both operating on the same instance of `FooObject`?

> Is there anything special about singletons here?

No, aside from the fact that the root commenter clearly understands the issue with global variables, but not necessarily singletons.

I'm trying to use the singleton concept as a "teaching bridge" moment, as the Singleton is clearly "like a global variable" in terms of the data-race, but generalizes to any object in your code.

The commenter I'm replying seems to think that global-variables are the only kind of variable where this problem occurs. He's wrong. All objects and all variables have this problem.

> Access to any global variable should always occur direct from memory.

What if your function takes a pointer that might be pointing to a global variable? Does that mean that all accesses through a pointer are now excempt from optimization unless the compiler can prove that the pointer will never point to a global variable?

The great thing about simple memory models is they work really well until you think about it. ;-)
What if your function takes a pointer that might be pointing to an "atomic" variable?

Pointers can be used to circumvent most safety measures. If you obscure the access, you should assume responsibility for the result.

At least in C++, atomicity is part of the type system, and you would have to explicitly reinterpret cast it away.