Hacker News new | ask | show | jobs
by CyberDildonics 1918 days ago
Where is this a problem for you? The broad goal to minimize cache misses is the same as concurrency in general - minimize synchronization. This means understanding dependencies and doing work on each core in chunks that are as large as possible for the amount of latency you can tolerate between synchronizing.
1 comments

- "A high-performance implementation would of course use padding or special alignment directives to avoid false sharing."

- "Cache alignment and padding often improves performance by reducing false sharing."

- "Software can use the alignment directives available in many compilers to avoid false sharing, and adding such directives is a common step in tuning parallel software."

There's just essentially two or three lines in the book that I could find that refer to optimizing cache use through padding/alignment, and that was after knowing what to search for. Although I wouldn't be surprised if I missed something.

In my experience, the discourse stops at the surface level, which makes the topic appear like it's obvious or trivial. But there are many follow-up questions that naturally arise for me:

- What are the trade-offs of cache-padding shared data? Why does it degrade performance for certain problems?

- What is a good rule-of-thumb for when to prioritize cache padding/alignment over cache locality?

- Are there other best practices like cache padding/alignment/locality that improve performance?

- What is an alignment directive and how does one use it?

I agree with what you've said, I was merely pointing out that I wish parallel programming resources delved more into this subject, as I feel that it's a practical and common issue. I'm sure it's not trivial and requires a fair bit of expertise, but that's why one would reach for a book like this after all.

False sharing is a bit of a niche optimization compared to the architectural optimization of synchronizing less and doing more work in larger chunks.

All the same, it is pretty easy. It has been a while, don't take this as gospel - A cache line is 64 bytes. Cache line boundaries will be every 64 bytes. Make sure your atomics don't fall on two cache lines.

If a pointer is 8 bytes and 4 bytes are on 60-64 with the other 4 bytes on 65-68, accessing that atomic will introduce false sharing because it will access two cache lines.

This is generally not an optimization that will need to be worried about. If you have a hotly contested atomic variable though, you might benefit from aligning it to not overlap two cache lines. You can do this by allocating extra memory and just putting at an address evenly divisible by 64 (a 64 bit atomic would then use the first 8 bytes of that cache line).

Optimizations fall in to two camps. The first is architecture, which comes from experience of what you will need up front. The second camp is what you find out from optimization after the fact.

> - Are there other best practices like cache padding/alignment/locality that improve performance?

This is pretty much it as far as I can remember. 128 bit atomics will only work when aligned to 128 bit boundaries interestingly, while 64 bit and below are fine, you just have the possibility of false sharing.

The simple recipe for locality is not really about concurrency - use arrays and loop through memory sequentially so that it can be prefetched.

> - What is an alignment directive and how does one use it?

There is an alignment keyword in C++ now. I have not used it yet but there are good explanations out there I'm sure. It will come down to automatically doing what I described in a manual way earlier.

False sharing has nothing to do with misaligned atomics.

It's about having multiple CPUs actively use the same cache line. This is what happens when you try to make your data independent to remove locks, and you end up with extremely high contention with others when accessing your data, because the cache line cannot be in exclusive state in each core, so you spend your time flushing 64 bytes at once for each 4-byte write.

The typical case is this:

   uint32_t counter[MAX_THREADS];
And have each thread perform counter[thread_id]++; Without noticing this is packing 16 threads on the same cache line. It can make your 16 threads work at roughly the same speed as a single one. And sometimes worse.

The solution here is to identify all the data that are changed together within a same thread, and pack them together in a struct which you arrange in arrays:

   struct local_stuff {
       uint32_t counter;
       uint32_t max;
       ...
       __attribute__((aligned(64)));
   } per_thread_stuff[MAX_THREADS] __attribute__((aligned(64)));
Then you can safely access that stuff without false sharing (provided the base of array is itself aligned).

That's where "pahole" is extremely useful: seeing the amount of free space in the structure often encourages you to add more data there for free.

Thanks for the thorough explanation!