| False sharing is a bit of a niche optimization compared to the architectural optimization of synchronizing less and doing more work in larger chunks. All the same, it is pretty easy. It has been a while, don't take this as gospel - A cache line is 64 bytes. Cache line boundaries will be every 64 bytes. Make sure your atomics don't fall on two cache lines. If a pointer is 8 bytes and 4 bytes are on 60-64 with the other 4 bytes on 65-68, accessing that atomic will introduce false sharing because it will access two cache lines. This is generally not an optimization that will need to be worried about. If you have a hotly contested atomic variable though, you might benefit from aligning it to not overlap two cache lines. You can do this by allocating extra memory and just putting at an address evenly divisible by 64 (a 64 bit atomic would then use the first 8 bytes of that cache line). Optimizations fall in to two camps. The first is architecture, which comes from experience of what you will need up front. The second camp is what you find out from optimization after the fact. > - Are there other best practices like cache padding/alignment/locality that improve performance? This is pretty much it as far as I can remember. 128 bit atomics will only work when aligned to 128 bit boundaries interestingly, while 64 bit and below are fine, you just have the possibility of false sharing. The simple recipe for locality is not really about concurrency - use arrays and loop through memory sequentially so that it can be prefetched. > - What is an alignment directive and how does one use it? There is an alignment keyword in C++ now. I have not used it yet but there are good explanations out there I'm sure. It will come down to automatically doing what I described in a manual way earlier. |
It's about having multiple CPUs actively use the same cache line. This is what happens when you try to make your data independent to remove locks, and you end up with extremely high contention with others when accessing your data, because the cache line cannot be in exclusive state in each core, so you spend your time flushing 64 bytes at once for each 4-byte write.
The typical case is this:
And have each thread perform counter[thread_id]++; Without noticing this is packing 16 threads on the same cache line. It can make your 16 threads work at roughly the same speed as a single one. And sometimes worse.The solution here is to identify all the data that are changed together within a same thread, and pack them together in a struct which you arrange in arrays:
Then you can safely access that stuff without false sharing (provided the base of array is itself aligned).That's where "pahole" is extremely useful: seeing the amount of free space in the structure often encourages you to add more data there for free.