- "A high-performance implementation would of course use padding or special alignment directives to avoid false sharing."
- "Cache alignment and padding often improves performance by reducing false sharing."
- "Software can use the alignment directives available in many compilers to avoid false sharing, and adding such directives is a common step in tuning parallel software."
There are essentially just two or three lines in the book that I could find that refer to optimizing cache use through padding/alignment, and that was after knowing what to search for, although I wouldn't be surprised if I missed something. In my experience, the discourse stops at the surface level, which makes the topic appear obvious or trivial. But there are many follow-up questions that naturally arise for me:
- What are the trade-offs of cache-padding shared data? Why does it degrade performance for certain problems?
- What is a good rule of thumb for when to prioritize cache padding/alignment over cache locality?
- Are there other best practices like cache padding/alignment/locality that improve performance?
- What is an alignment directive and how does one use it?
I agree with what you've said; I was merely pointing out that I wish parallel programming resources delved more into this subject, as I feel that it's a practical and common issue. I'm sure it's not trivial and requires a fair bit of expertise, but that's why one would reach for a book like this after all.
All the same, it is pretty easy. It has been a while, so don't take this as gospel: a cache line is 64 bytes on most current x86-64 CPUs, so cache line boundaries fall at every address divisible by 64. Make sure your atomics don't straddle two cache lines.
If a pointer is 8 bytes and 4 of them sit at offsets 60-63 with the other 4 at offsets 64-67, accessing that atomic touches two cache lines, so it can suffer false sharing with neighbors on either one.
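To make the arithmetic concrete, here is a minimal sketch of a straddle check. The names kCacheLine and straddles_cache_line are made up for illustration, and the 64-byte line size is an assumption that holds on typical x86-64 parts but not everywhere.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: true if an object of `size` bytes starting at address
// `addr` covers bytes on more than one cache line. It compares the line index
// of the first byte with the line index of the last byte.
constexpr bool straddles_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 &&
           (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// An 8-byte value at offset 60 covers bytes 60..67, i.e. lines 0 and 1.
static_assert(straddles_cache_line(60, 8), "60..67 crosses the boundary at 64");
// At offset 64 it covers bytes 64..71, entirely inside line 1.
static_assert(!straddles_cache_line(64, 8), "64..71 stays inside one line");
// At offset 56 it covers bytes 56..63, entirely inside line 0.
static_assert(!straddles_cache_line(56, 8), "56..63 stays inside one line");
```

In real code you would pass reinterpret_cast<std::uintptr_t>(&obj) and sizeof(obj) to check a live object.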
This is generally not an optimization you need to worry about. If you have a hotly contested atomic variable, though, you might benefit from aligning it so that it does not overlap two cache lines. You can do this by allocating extra memory and placing the variable at an address evenly divisible by 64 (a 64-bit atomic would then use the first 8 bytes of that cache line).
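The manual approach described above might look roughly like the following. This is only a sketch: AlignedCounter, make_aligned_counter, and destroy_aligned_counter are invented names, and the 64-byte line size is assumed rather than queried from the hardware.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

constexpr std::size_t kCacheLine = 64;  // assumed line size
using Counter = std::atomic<std::uint64_t>;

// Hypothetical handle: keeps the raw malloc pointer so it can be freed later.
struct AlignedCounter {
    void* raw;       // start of the over-sized allocation
    Counter* value;  // 64-byte-aligned location inside that allocation
};

inline AlignedCounter make_aligned_counter() {
    // Reserve one full line for the counter plus up to 63 bytes of slack,
    // so an aligned 64-byte slot always fits inside the allocation.
    void* raw = std::malloc(2 * kCacheLine - 1);
    if (raw == nullptr) throw std::bad_alloc();

    // Round the address up to the next multiple of 64.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
    std::uintptr_t aligned =
        (addr + kCacheLine - 1) & ~(static_cast<std::uintptr_t>(kCacheLine) - 1);

    // Construct the atomic in place at the aligned address.
    Counter* value = new (reinterpret_cast<void*>(aligned)) Counter(0);
    return {raw, value};
}

inline void destroy_aligned_counter(AlignedCounter& c) {
    c.value->~Counter();  // trivial for this type, shown for completeness
    std::free(c.raw);
    c.raw = nullptr;
    c.value = nullptr;
}
```

Because the counter gets a whole line to itself, nothing else from this allocation lands on the same line. The cost is memory: roughly 127 bytes are spent to hold 8.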
Optimizations fall into two camps. The first is architecture, which comes from experience of what you will need up front. The second is what you find out by profiling and optimizing after the fact.
> - Are there other best practices like cache padding/alignment/locality that improve performance?
This is pretty much it as far as I can remember. Interestingly, 128-bit atomics will only work when aligned to 128-bit (16-byte) boundaries, while 64-bit and smaller atomics are fine; you just have the possibility of false sharing.
The simple recipe for locality is not really about concurrency - use arrays and loop through memory sequentially so that it can be prefetched.
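As a small illustration of that recipe, the sketch below sums the same row-major matrix twice. The function names sum_sequential and sum_strided are invented for this example. Both return the same total; the only difference is the order in which memory is touched, and any speed difference depends on matrix size and hardware, so measure before relying on it.

```cpp
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major in one contiguous vector.
// The inner loop visits adjacent elements, so memory is read sequentially.
inline long long sum_sequential(const std::vector<int>& m,
                                std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same result, but the inner loop jumps `cols` elements per step, so
// consecutive reads land far apart once the matrix is large.
inline long long sum_strided(const std::vector<int>& m,
                             std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```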
> - What is an alignment directive and how does one use it?
There is an alignment keyword in C++ now (alignas, added in C++11). I have not used it yet, but I'm sure there are good explanations out there. It comes down to automatically doing what I described doing manually earlier.
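For reference, a minimal sketch of what that looks like with alignas. PaddedCounter and PerThreadCounters are made-up names, and the 64-byte figure is again an assumption about the target CPU.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// alignas forces every PaddedCounter to start on a 64-byte boundary.
// sizeof is rounded up to a multiple of the alignment, so each counter
// also occupies a full line and two of them never share one.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine, "starts on a line boundary");
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded out to a full line");

// Hypothetical use: two counters updated by different threads.
// Each one sits on its own cache line.
struct PerThreadCounters {
    PaddedCounter a;
    PaddedCounter b;
};
```

Two caveats: plain new is only guaranteed to honor over-aligned types like this from C++17 onward, and C++17 also defines std::hardware_destructive_interference_size in <new> as a portable stand-in for the hard-coded 64, though compiler support for it has varied.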