Without appropriate memory barriers, you can end up with an inconsistent view of memory, because CPUs execute out of order and buffer stores, and compilers reorder memory accesses as well.
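To make the barrier point concrete, here is a minimal sketch of handing a plain value from one thread to another with a release store and an acquire load in C++. The names `payload`, `ready` and `run_handoff` are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

int run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;  // 1. write the data
        // 2. release: the write above may not be reordered after this store
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&seen] {
        // 3. acquire: the read below may not be reordered before this load
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // sees 42 because of the release/acquire pairing
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both operations on `ready` were relaxed instead, nothing would order the write to `payload` before the read in the consumer, and the program would have a data race.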
There are significant footguns around low-level concurrency primitives.
The real problem is resource management. If you have a threaded GC then you can just read, and the act of reading will keep the value you read alive. If you don't have a GC, however, you probably use reference counts, and now you may need to compose an atomic read of a pointer with an atomic reference count increment, but by the time you do the second step the object you're referencing may already have been destroyed.
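A rough sketch of that race, using a hypothetical intrusively counted `Node` and a shared `slot` (all names here are invented for illustration). `acquire_racy` shows the broken two-step pattern. `acquire_locked` is one simple fix, which works by taking a lock around both steps, so it gives up the lock-free read.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical intrusively reference-counted object.
struct Node {
    std::atomic<int> refs{1};  // the shared slot itself holds one reference
    int value;
    explicit Node(int v) : value(v) {}
};

void drop_ref(Node* n) {
    // Whoever drops the last reference destroys the object.
    if (n->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete n;
}

std::atomic<Node*> slot{nullptr};
std::mutex slot_mutex;

// BROKEN: two atomic steps do not compose into one atomic operation.
Node* acquire_racy() {
    Node* n = slot.load(std::memory_order_acquire);           // step 1
    // A writer can swap the slot and drop the last reference right here.
    if (n) n->refs.fetch_add(1, std::memory_order_relaxed);   // step 2: may touch freed memory
    return n;
}

// One simple fix: do both steps under the lock that writers also take.
Node* acquire_locked() {
    std::lock_guard<std::mutex> guard(slot_mutex);
    Node* n = slot.load(std::memory_order_relaxed);
    if (n) n->refs.fetch_add(1, std::memory_order_relaxed);
    return n;
}

void replace(Node* fresh) {
    Node* old;
    {
        std::lock_guard<std::mutex> guard(slot_mutex);
        old = slot.exchange(fresh, std::memory_order_relaxed);
    }
    if (old) drop_ref(old);  // drop the slot's own reference outside the lock
}
```

A reader that got its pointer from `acquire_locked` keeps the old node alive across a later `replace`, and calls `drop_ref` when done.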
> As long as there are no writes during those reads... everything should be stable, no?
Yes, but how do you ensure there are no writes during those reads?
You have to protect the reads against concurrent writes.
The simplest way is to use a mutex, but that doesn't support concurrent reads.
The next way, which is fairly common, is to use a read-write lock. That allows reads that are concurrent with each other, but only one write at a time, and no concurrency between reads and writes.
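As a small illustration of the read-write lock pattern, here is a sketch using C++17's `std::shared_mutex`. The `Config` class and its methods are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical key/value store guarded by a read-write lock.
class Config {
public:
    std::string get(const std::string& key) const {
        // Shared lock: any number of readers may hold it at once.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = values_.find(key);
        return it == values_.end() ? std::string() : it->second;
    }

    void set(const std::string& key, std::string value) {
        // Exclusive lock: waits for current readers, excludes new ones.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        values_[key] = std::move(value);
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, std::string> values_;
};
```

Even `get` has to modify the lock's internal state to register itself as a reader.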
A standard read-write lock is not particularly fast for concurrent reads. The lock-for-read operation is required so that reads prevent a concurrent write from starting, and wait for a write already started to finish. That's not fast on a multi-core system because every reader has to write to the shared lock state, which forces cache line bouncing between cores.
That is, unless particularly fancy types of read-write locks optimised for mostly reading are used, such as rwlock-per-core, and those are slow for writes. They are somewhat fast for reads on architectures with fast atomic operations, but not as fast as possible.
Read-write locks also come in different flavours, depending on whether you want new reads to be blocked and queued when there's a write blocked waiting for current reads to finish. Fairness is an issue. This can get complicated, and bugs in libc rwlocks are not unheard of because of the complication.
A seqlock can be used, which has fast reads when there are no writes. Reads are fast on a multi-core system because there's no cache line bouncing between cores when there are only reads. But if there is a high rate of writes in one thread, it can block all reads continuously by causing them to livelock in retry loops. This is called spinning. More commonly, writes slow down seqlocked reads by a large factor in some scenarios without blocking them completely: the readers just waste a lot of CPU time and run slowly, out of proportion to the amount of blocking you would expect to be necessary.
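For reference, a minimal single-writer seqlock sketch in C++. It assumes exactly one writer thread, and the names (`SeqLockPair`, `torn_reads_observed`) are invented here. The payload fields are relaxed atomics so that the racing reads are not undefined behaviour in the C++ memory model.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

struct Pair { std::int64_t x, y; };

// Sketch only: supports ONE writer thread and any number of readers.
class SeqLockPair {
public:
    void write(std::int64_t x, std::int64_t y) {
        std::uint32_t s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);         // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // keep data stores after the odd store
        x_.store(x, std::memory_order_relaxed);
        y_.store(y, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);         // even: write finished
    }

    Pair read() const {
        for (;;) {  // this retry loop is where readers spin under heavy writes
            std::uint32_t s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1) continue;                              // writer active, retry
            std::int64_t x = x_.load(std::memory_order_relaxed);
            std::int64_t y = y_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            std::uint32_t s2 = seq_.load(std::memory_order_relaxed);
            if (s1 == s2) return Pair{x, y};                   // no write overlapped the reads
        }
    }

private:
    std::atomic<std::uint32_t> seq_{0};
    std::atomic<std::int64_t> x_{0}, y_{0};
};

// Demo: the writer keeps y == 2 * x; count snapshots that break the invariant.
int torn_reads_observed(int writes) {
    SeqLockPair cell;
    std::thread writer([&cell, writes] {
        for (int i = 1; i <= writes; ++i) cell.write(i, 2 * static_cast<std::int64_t>(i));
    });
    int torn = 0;
    for (;;) {
        Pair p = cell.read();
        if (p.y != 2 * p.x) ++torn;
        if (p.x == writes) break;  // saw the final value
    }
    writer.join();
    return torn;
}
```

Readers never write to shared memory, which is why there is no cache line bouncing when there are only reads.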
That's rather like an over-contended spinlock. Spinlocks, which come in mutex and read-write flavours, should rarely be used in threaded code outside a kernel, because they spin as described above, out of proportion to the amount of blocking that's really required, and the effect is much worse in pre-empted userspace threads than in a non-preemptible kernel.
It's possible to reduce seqlock and spinlock CPU spinning by transitioning to a different type of lock after some number of spins, sometimes a dynamically estimated number. This makes them behave better in userspace threaded code outside a kernel. But now your lock is rather complicated, and still not consistently fast at reads.
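A rough sketch of the spin-then-block idea, with a fixed spin budget rather than a dynamically estimated one. `SpinThenBlockLock`, `parallel_count` and the default of 100 spins are arbitrary choices for illustration, and `yield()` stands in for a CPU pause instruction to stay portable.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock: try a bounded number of times, then block in the OS.
class SpinThenBlockLock {
public:
    explicit SpinThenBlockLock(int max_spins = 100) : max_spins_(max_spins) {}

    void lock() {
        for (int i = 0; i < max_spins_; ++i) {
            if (mutex_.try_lock()) return;  // got it while spinning
            std::this_thread::yield();      // portable stand-in for a pause instruction
        }
        mutex_.lock();                      // spin budget used up: sleep until available
    }

    void unlock() { mutex_.unlock(); }

private:
    std::mutex mutex_;
    int max_spins_;
};

// Demo: several threads increment a plain counter under the lock.
long parallel_count(int threads, int iterations) {
    SpinThenBlockLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<SpinThenBlockLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

This bounds the CPU time wasted when the lock holder has been pre-empted, at the price of extra logic in the lock path.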
An approach which works really well is RCU. Concurrent reads can be very simple and never spin because there's no loop. There's no multi-core cache line bouncing. Writes are more complicated, and the reads have to adhere to certain patterns because the kind of concurrency allowed between reads and writes is different in RCU than with locks. It works best if your program has some kind of top-level event loop that is returned to often, to provide the "quiescent states" RCU requires. But there are other ways to do it if there's no top level; they just require reads to do a bit more work than almost nothing.
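To give a feel for the shape of it, here is a toy quiescent-state sketch in C++. It is not how the Linux kernel or liburcu implement RCU: it assumes a single writer and a fixed number of reader threads, it uses sequentially consistent atomics for simplicity, and every name in it is invented for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>
#include <thread>
#include <vector>

struct Config { int a; int b; };  // the writer keeps b == 2 * a

constexpr int kReaders = 2;       // assumption: fixed set of reader threads
constexpr std::uint64_t kOffline = std::numeric_limits<std::uint64_t>::max();

std::atomic<Config*> current{nullptr};
std::atomic<std::uint64_t> global_epoch{1};
std::atomic<std::uint64_t> reader_epoch[kReaders];

// Reader: call at the top of the event loop, while holding no pointer
// obtained from `current`.
void quiescent_state(int id) {
    reader_epoch[id].store(global_epoch.load());
}

// Writer (single writer assumed): publish a new version, wait until every
// reader has passed a quiescent state, then free the old version.
void publish(Config* fresh) {
    Config* old = current.exchange(fresh);
    std::uint64_t target = global_epoch.fetch_add(1) + 1;
    for (int i = 0; i < kReaders; ++i) {
        while (reader_epoch[i].load() < target) std::this_thread::yield();
    }
    delete old;  // no reader can still hold `old`
}

// Demo: readers check the invariant while the writer publishes new versions.
int run_rcu_demo(int updates) {
    for (auto& e : reader_epoch) e.store(0);
    current.store(new Config{0, 0});
    std::atomic<bool> stop{false};
    std::atomic<int> violations{0};
    std::vector<std::thread> readers;
    for (int id = 0; id < kReaders; ++id) {
        readers.emplace_back([id, &stop, &violations] {
            while (!stop.load()) {
                quiescent_state(id);  // top of the "event loop"
                // The read itself: one load, no lock, no retry loop. Acquire is
                // stronger than the dependency ordering that is strictly needed.
                Config* c = current.load(std::memory_order_acquire);
                if (c->b != 2 * c->a) violations.fetch_add(1);
            }
            reader_epoch[id].store(kOffline);  // a finished reader never blocks the writer
        });
    }
    for (int i = 1; i <= updates; ++i) publish(new Config{i, 2 * i});
    stop.store(true);
    for (auto& t : readers) t.join();
    delete current.exchange(nullptr);
    return violations.load();
}
```

In this toy version each loop iteration still performs one store in `quiescent_state`, but only to that reader's own epoch slot. Real implementations make the read side cheaper still, and the writer here simply blocks until all readers have checked in.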
Even RCU requires a little something in reads though, to get correct data. This is a data-dependency memory barrier. These barrier operations require zero instructions on nearly all CPUs because of how memory systems are designed, but they are famously not free on the DEC Alpha, which shows that the barrier is not a "no operation"; it just happens to be a side effect that is usually baked in. Even with zero instructions on nearly all CPUs, they limit which code optimisations the compiler is allowed to do, so they have a non-zero average overhead, though it is very small in practice.