HN Mirror

by BazookaMusic 1101 days ago

I might be wrong on this explanation, but the reason why it was faster might have been the following:

During execution, the data you touched lived in one of two places: the CPU caches or RAM. With all the threads running on one socket, anything served from the cache was a fast cache access, and anything else was a slower load from memory. Frequently loaded or stored locations tend to stay in the cache.

In the NUMA setup, you have more cache in total (one set per socket), which means more locations are likely to be cached somewhere. However, when a core on one socket accesses a location that sits in another socket's cache, the access has to go over the interconnect between the sockets.

With an unfortunate memory layout, a large percentage of the accesses can end up going over the interconnect, which is slower than a local cache access. The values also keep bouncing between the caches, which forces subsequent accesses to use the interconnect too.
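
To make the cost a bit more concrete, here is a toy model of the average access time when some fraction of accesses has to cross the interconnect. The function name and both latency values are made up for illustration and are not measurements of any real machine; only the weighted-average arithmetic is the point.

```python
def average_access_ns(remote_fraction, local_ns, remote_ns):
    """Expected cost of one access when a fraction of accesses is remote."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    # Weighted average of the local and the remote (interconnect) cost.
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Hypothetical latencies, chosen only to show the shape of the effect.
LOCAL_NS = 10.0
REMOTE_NS = 100.0

if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5):
        cost = average_access_ns(fraction, LOCAL_NS, REMOTE_NS)
        print(f"remote fraction {fraction:.1f}: about {cost:.1f} ns per access")
```

Under these assumed numbers, even a modest share of remote accesses dominates the average, which would be consistent with the one-socket run coming out faster.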

Besides using just one socket, another way to avoid this is for the program's designer to treat the NUMA nodes as separate processing units and design around that. Each node should process its own data, and the nodes should share only small amounts of data for synchronization/communication. Then the caches will be much less affected.
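
A rough sketch of that design, assuming Linux: one worker process per NUMA node, pinned to that node's CPUs, building and processing its own slice of the data and sending back a single number. The helper names (`parse_cpulist`, `numa_nodes`, `split_range`, `worker`) and the sum-of-squares workload are hypothetical. The node layout is read from sysfs and the pinning uses `os.sched_setaffinity`, both of which are Linux-only, so the code falls back to a single unpinned group elsewhere.

```python
import glob
import os
from multiprocessing import Process, Queue


def parse_cpulist(text):
    """Turn a Linux cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes():
    """Return one CPU list per NUMA node, or a single list if none are found."""
    nodes = []
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*/cpulist")):
        with open(path) as f:
            cpus = parse_cpulist(f.read())
        if cpus:  # skip memory-only nodes that have no CPUs
            nodes.append(cpus)
    return nodes or [list(range(os.cpu_count() or 1))]


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def worker(cpus, start, stop, results):
    """Pin to one node, build private data there, report one small result."""
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, cpus)
        except OSError:
            pass  # these CPUs are not available to us, e.g. inside a container
    # Allocate after pinning so the pages are first touched from this node.
    local = list(range(start, stop))
    results.put(sum(x * x for x in local))  # the only data that is shared


if __name__ == "__main__":
    nodes = numa_nodes()
    results = Queue()
    procs = [
        Process(target=worker, args=(cpus, start, stop, results))
        for cpus, (start, stop) in zip(nodes, split_range(100_000, len(nodes)))
    ]
    for p in procs:
        p.start()
    total = sum(results.get() for _ in procs)
    for p in procs:
        p.join()
    print(total)
```

The only traffic between the nodes here is one integer per worker at the end. Each worker builds its data after pinning because Linux by default tends to place newly touched pages on the node of the CPU that touches them.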

1 comment

malkia 1101 days ago

That's a pretty reasonable explanation, and one day I should sit down and write some artificial test/bench to get more details.
