| Though that's 10 pages and short for a paper, I'll take a shot at a simpler explanation. Compilers are expected to optimize your code, and a primary way to optimize code is by rearranging your statements. Consider the following:
int i=0;
i++;
sleep(1); // Yeah, sleep isn't a proper memory barrier.
i++;
sleep(1); // But in my experience, beginners understand sleep. So shoot me.
i++;
sleep(1);
i++;
sleep(1);
i++;
sleep(1);
The compiler will often "rearrange" these ++ statements to all happen on the same line, ultimately as follows:
int i=0; i++; i++; i++; i++; i++;
sleep(1);
sleep(1);
sleep(1);
sleep(1);
sleep(1);
Then:
int i=5;
sleep(5);
Simple enough. Except... what if Thread#2 had the following:
while(i<2); // Infinite loop waiting for i to equal 2
foo();
Then in Thread#3:
while(i<3); // Infinite loop waiting for i to become 3
bar();
Then in Thread#4:
while(i<4); // Infinite loop waiting for i to become 4
baz();
Then in Thread#5:
while(i<5); // Infinite loop waiting for i to become 5
foobar();
As we can see here, "i" is a synchronization variable, but we only know that if we know how the other threads work. Once the optimizer has run, i no longer steps from 1 to 2 to 3 to 4 to 5, so your threads no longer synchronize and the code gains a race condition (all threads might execute at once, since i starts off as 5).

-----------

For better or worse, modern programmers must think about the messages passed between threads. After all, semaphores are often i++ and i-- statements at the lowest level (maybe with a touch of atomic_swap or maybe a lock-statement depending on your architecture). Modern code must note when a variable is important to inter-thread synchronization, to selectively disable the compiler's optimizer (funny enough, the same note is also needed to strongly order the L1 cache, as well as the out-of-order core of modern processors).

As such, proper threading requires a top-to-bottom language-level memory model: the "knowledge" that the i++ statements cannot be optimized / combined across the sleep statements.

---------

This is no longer an issue on modern platforms. Today, we have C++11's memory model, which strictly defines where and when optimizations can occur, with "seq_cst" memory ordering. There is also a faster, but slightly harder to understand, memory model of acquire and release. This acquire / release memory model is useful on more relaxed systems like ARM / POWER9. Your mutex_lock() and mutex_unlock() statements will have these memory barriers, which tell the compiler, CPU, and L1 cache to order the code in ways the programmer expects. No optimization is allowed to move your protected code "over" the mutex_lock() or mutex_unlock() statements, thanks to the memory model.

But back in 2004, before the memory model was formalized, it was impossible to write a truly portable posix-thread implementation. (Fortunately, compilers at the time recognized the issue and solved it in their own ways. Windows had its InterlockedExchange calls, GCC had its own memory model. But the details were non-standard and non-portable.) |
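For anyone who wants to see the "fixed" version of the toy example, here is a minimal sketch using C++11's std::atomic. It is only an illustration under my own assumptions: the names run_stepped_counter, payload, and seen are made up for this sketch, the sleep() calls are dropped, and the waiters record a value instead of calling foo() / bar() / baz() / foobar(). The point is that once i is declared atomic, the compiler may no longer fold the five increments into "i = 5", and whatever was written before the k-th increment is visible to the thread that waited for i to reach k.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical demo function; all names here are illustrative only.
// Returns what waiter threads 2..5 observed in payload[2..5].
std::vector<int> run_stepped_counter() {
    constexpr int kSteps = 5;
    std::atomic<int> i{0};         // the synchronization variable
    int payload[kSteps + 1] = {};  // plain (non-atomic) data published via i
    int seen[kSteps + 1] = {};     // seen[k] is written only by waiter k

    std::vector<std::thread> waiters;
    for (int k = 2; k <= kSteps; ++k) {
        waiters.emplace_back([&, k] {
            // Same idea as while(i<k); from the comment above, but the
            // load is atomic (seq_cst by default), so it is re-read on
            // every iteration and acts as an acquire operation.
            while (i.load() < k) {
                std::this_thread::yield();
            }
            seen[k] = payload[k];  // ordered after the k-th increment
        });
    }

    for (int k = 1; k <= kSteps; ++k) {
        payload[k] = k * 10;  // written before the k-th increment
        i.fetch_add(1);       // atomic i++ (seq_cst by default)
    }

    for (auto& t : waiters) {
        t.join();
    }
    assert(i.load() == kSteps);
    return std::vector<int>(seen + 2, seen + kSteps + 1);
}
```

Calling run_stepped_counter() returns {20, 30, 40, 50}: each waiter is guaranteed to see the payload written before "its" increment. The same guarantee should hold with the cheaper orderings, i.fetch_add(1, std::memory_order_release) on the writer side and i.load(std::memory_order_acquire) on the waiter side, which is the acquire / release model mentioned above.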