by wtarreau 1918 days ago

False sharing has nothing to do with misaligned atomics.

It's about having multiple CPUs actively use (and write to) the same cache line. This is what happens when you make your data independent per thread to remove locks, and you still end up with extremely high contention, because your data shares a cache line with other threads' data. A cache line cannot be in exclusive state in more than one core at a time, so you spend your time bouncing a whole 64-byte line between cores for each 4-byte write.

The typical case is this:

   uint32_t counter[MAX_THREADS];

and have each thread perform counter[thread_id]++, without noticing that this packs the counters of 16 threads into the same 64-byte cache line. It can make your 16 threads work at roughly the same speed as a single one, and sometimes worse.
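To make the "16 threads on the same cache line" arithmetic concrete, here is a minimal sketch. It assumes a 64-byte cache line and an array base aligned to it; CACHE_LINE, MAX_THREADS and the two helper functions are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64  /* assumed line size; common on x86-64, not universal */
#define MAX_THREADS 16  /* hypothetical value, for illustration only */

static_assert(CACHE_LINE % sizeof(uint32_t) == 0,
              "a line holds a whole number of counters");

/* How many uint32_t counters fit in one cache line. */
static size_t counters_per_line(void)
{
    return CACHE_LINE / sizeof(uint32_t);
}

/* Which cache line (counted from the array base) holds counter[i],
 * assuming the base of the array is aligned to CACHE_LINE. */
static size_t line_of_counter(size_t i)
{
    return (i * sizeof(uint32_t)) / CACHE_LINE;
}
```

Under these assumptions counter[0] through counter[15] all map to line 0, so every increment by one thread invalidates the line that the 15 others are writing to.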

The solution here is to identify all the data that are modified together by the same thread, and pack them together in a struct, which you then arrange in an array:

   /* aligning the type pads sizeof() to a multiple of 64 and aligns the array */
   struct local_stuff {
       uint32_t counter;
       uint32_t max;
       ...
   } __attribute__((aligned(64))) per_thread_stuff[MAX_THREADS];

Then you can safely access that stuff without false sharing (provided the base of the array is itself aligned).
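A way to check that the layout really does what you expect is to assert it at compile time. This is only a sketch: it assumes GCC or Clang (for the attribute), a 64-byte cache line, and a made-up MAX_THREADS value; slot_starts_on_line() and bump() are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64  /* assumed line size */
#define MAX_THREADS 16  /* hypothetical value, for illustration only */

struct local_stuff {
    uint32_t counter;
    uint32_t max;
} __attribute__((aligned(CACHE_LINE)));  /* GCC/Clang extension */

/* Static storage honours the type's alignment, so the base is aligned too. */
static struct local_stuff per_thread_stuff[MAX_THREADS];

/* Each element must fill exactly one line, otherwise neighbours overlap. */
static_assert(sizeof(struct local_stuff) == CACHE_LINE,
              "one element per cache line");
static_assert(_Alignof(struct local_stuff) == CACHE_LINE,
              "elements start on a line boundary");

/* Returns 1 if element i starts exactly on a cache line boundary. */
static int slot_starts_on_line(size_t i)
{
    return ((uintptr_t)&per_thread_stuff[i] % CACHE_LINE) == 0;
}

/* What each thread would do: touch only its own element, hence its own line. */
static void bump(size_t thread_id)
{
    per_thread_stuff[thread_id].counter++;
}
```

Note that the guarantee on the base only holds for objects the compiler places itself. If you allocate the array dynamically, you need an aligned allocator such as C11 aligned_alloc() to get the same effect.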

That's where "pahole" is extremely useful: seeing the amount of free space in the structure often encourages you to add more data there for free.
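pahole gets this from the debug info (something like pahole -C local_stuff on a binary built with -g), but you can approximate the "free space" figure for one struct from within the program. A rough sketch, assuming GCC or Clang and a 64-byte line; tail_padding() is a made-up helper and the struct only contains the two fields named above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size */

struct local_stuff {
    uint32_t counter;
    uint32_t max;
} __attribute__((aligned(CACHE_LINE)));  /* GCC/Clang extension */

/* Bytes left unused after the last field ("max" here), i.e. room for more
 * per-thread data that would cost no extra cache line. */
static size_t tail_padding(void)
{
    size_t used = offsetof(struct local_stuff, max) + sizeof(uint32_t);
    return sizeof(struct local_stuff) - used;
}
```

With only these two 4-byte fields, 56 of the 64 bytes are padding, which is the kind of number that encourages you to move more of the thread's hot data into the struct.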