by mtdewcmu 4762 days ago
I'm having a little trouble making sense of this:

"For example, bigtable benefits from cache sharing and would prefer 100 % remote accesses to 50% remote. Search-frontend prefers spreading the threads to multiple caches to reduce cache contention and thus also prefers 100 % remote accesses to 50% remote."

Let me see if I've got this straight:

* bigtable benefits from scheduling related threads on the same cpu so they can share a cache, I'm guessing because multiple threads work on the same data simultaneously

* search benefits from having its threads spread over many cpus, probably because the threads are unrelated to each other and not sharing data, so they like to have their own caches

I'm not sure I understand how this relates to NUMA, or why remote accesses are ever a good thing. Maybe it requires a more sophisticated understanding of computer architecture than what I have.

2 comments

It's not that remote accesses are good, it's that trying to avoid them can harm cache usage elsewhere. If the author at High Scalability will allow me another quibble, I'd say that memory locality is still king. It's just that we have to be very careful about trying to improve it: if you improve locality in one place (say, by inducing local accesses from a socket to main memory), you may end up harming it somewhere else (more total accesses to main memory, because the cache is now thrashing).

The NUMA bit comes in where you said "scheduling related threads on the same cpu" and "threads spread over many cpus". If you schedule related threads on the same socket (cpu), you're more likely to get local accesses. If your threads share data, then that's two good things: local memory accesses, and good cache usage. But if your threads use different data, the local memory accesses may not matter, because the threads interfere with each other in the cache and you may end up with many more cache misses.

A simpler way to think about it: shorter access to main memory does not help you if you end up doing many more total accesses.
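
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The latencies and miss counts are made up for illustration (they are not from the article or the underlying research), and `memory_time_ns` is just a name I picked. The only point is that total time is roughly misses times latency, so a lower per-access latency can lose to a lower miss count.

```python
# Hypothetical numbers, chosen only to illustrate the trade-off.
LOCAL_NS = 100    # assumed latency of a local main-memory access
REMOTE_NS = 150   # assumed latency of a remote (cross-socket) access


def memory_time_ns(misses, remote_fraction):
    """Total time spent on main-memory accesses for a given number of cache misses."""
    avg_latency = (1 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS
    return misses * avg_latency


# Threads packed on one socket: every access is local, but the shared cache thrashes.
packed = memory_time_ns(misses=2_000_000, remote_fraction=0.0)

# Threads spread across sockets: every access is remote, but there are far fewer misses.
spread = memory_time_ns(misses=1_000_000, remote_fraction=1.0)

print(f"packed: {packed / 1e6:.0f} ms, spread: {spread / 1e6:.0f} ms")
```

With these assumed numbers the all-remote placement still comes out ahead, because halving the misses outweighs the 50% latency penalty. Different numbers would flip the result, which is why it depends on the workload.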

Do the bigtable performance characteristics look kind of like cache-line ping-ponging? My intuition for scenario 3 outperforming scenario 2 (100% remote vs 50% local + 50% remote) is that there are more mutations of data, and therefore more interconnect traffic is required to maintain coherency across sockets.

I'm not familiar with this research, but it's possible that sequential accesses to memory would lead to prefetching, in which case going half-local half-remote could actually lead to a slowdown versus all-remote. Another hypothesis is that the memory ends up having to be migrated from one CPU's cache to the other's, then back. It's better if it's always in the remote cache than if it's getting flipped between the two.
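
The ping-pong hypothesis can be sketched as a toy model in Python. This is only an illustration under a simplifying assumption of mine: a write from a socket that doesn't currently hold the line forces the line to move to that socket's cache. Real coherence protocols are more involved, and `ownership_transfers` is a made-up helper, not something from the research.

```python
def ownership_transfers(writers, initial_owner):
    """Count how often a cache line moves between sockets' caches.

    Toy model (an assumption, not a real protocol): a write by a socket that
    does not currently hold the line migrates the line to that socket.
    """
    owner = initial_owner
    transfers = 0
    for socket in writers:
        if socket != owner:
            transfers += 1
            owner = socket
    return transfers


# All writes come from socket 1 while the line starts on socket 0:
# one migration, then the line stays put.
all_one_side = ownership_transfers([1] * 8, initial_owner=0)

# Writes alternate between the two sockets: the line bounces on nearly every write.
alternating = ownership_transfers([0, 1] * 4, initial_owner=0)

print(all_one_side, alternating)
```

In this model the one-sided pattern pays for a single transfer, while the alternating pattern pays for one on almost every write.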

I'm pretty sure it goes without saying that 100% local is always better, assuming you're not trading anything else away (like accessible CPU on other nodes).

Ah. So a remote access may be coming directly from the cache of a different CPU? That's something I didn't consider, and definitely adds another wrinkle.

I sense that the article is saying things in confusing ways, perhaps because that's the way computer architects speak (it always struck me as counterintuitive and confusing to measure a cache by its miss rate rather than its hit rate) or maybe it's this article.