by feffe 1872 days ago
The memory latency hiding also works with 2-way SMT. I worked on networking software that did per-packet session lookups in large hash tables. On a Sandy Bridge core, SMT gave this application 40% better performance, which is higher than the figures usually mentioned. So for memory-bound applications (as in bound by cache misses), SMT is a boon.
2 comments

I have a graph for this:

https://github.com/raitechnology/raikv/blob/master/graph/mt_...

The CPU in this case is a Threadripper 3970X: 32 cores, 64 SMT threads.

My experience is this: when the L3 cache is effective, memory latency hiding via prefetch works well across SMT threads. If a hash table lookup requires a chain walk, the SMT latency hiding is less effective, because the calculated prefetch location is not where the key is actually found. As the load increased, I couldn't get prefetching multiple slots to be as effective as prefetching a single slot.
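
To make the single-slot prefetch idea concrete, here is a minimal sketch in C. It is not the raikv code; the table layout, `TABLE_SLOTS`, `BATCH`, and all function names are hypothetical, and `__builtin_prefetch` assumes GCC or Clang. Each batch first computes and prefetches the home slot of every key, then probes. The `extra_probes` counter counts the steps past the home slot, which are exactly the accesses the prefetch did not cover.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 1024u   /* hypothetical size, must be a power of two */
#define BATCH 8u            /* hypothetical number of lookups in flight */

typedef struct {
    uint64_t key;           /* 0 is reserved to mark an empty slot */
    uint64_t value;
} slot_t;

/* splitmix64-style finalizer, used here only as a stand-in hash */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

static size_t home_slot(uint64_t key)
{
    return (size_t)(mix(key) & (TABLE_SLOTS - 1));
}

/* Linear-probing insert; returns 1 on success, 0 if the table is full. */
int table_insert(slot_t *t, uint64_t key, uint64_t value)
{
    size_t i = home_slot(key);
    for (size_t n = 0; n < TABLE_SLOTS; n++) {
        if (t[i].key == 0 || t[i].key == key) {
            t[i].key = key;
            t[i].value = value;
            return 1;
        }
        i = (i + 1) & (TABLE_SLOTS - 1);
    }
    return 0;
}

/* Looks up n nonzero keys; out[i] gets the value, or 0 if the key is absent.
 * Returns the number of keys found. *extra_probes counts steps taken past
 * the prefetched home slot. */
size_t table_lookup_batch(const slot_t *t, const uint64_t *keys, size_t n,
                          uint64_t *out, size_t *extra_probes)
{
    size_t found = 0;
    *extra_probes = 0;
    for (size_t base = 0; base < n; base += BATCH) {
        size_t m = (n - base < BATCH) ? n - base : BATCH;
        size_t home[BATCH];

        /* Pass 1: hash every key and prefetch only its home slot. */
        for (size_t j = 0; j < m; j++) {
            home[j] = home_slot(keys[base + j]);
            __builtin_prefetch(&t[home[j]], 0, 1);
        }

        /* Pass 2: probe. Walking past the home slot touches memory
         * that pass 1 did not prefetch. */
        for (size_t j = 0; j < m; j++) {
            size_t i = home[j];
            out[base + j] = 0;
            for (size_t p = 0; p < TABLE_SLOTS; p++) {
                if (t[i].key == keys[base + j]) {
                    out[base + j] = t[i].value;
                    found++;
                    break;
                }
                if (t[i].key == 0)
                    break;              /* empty slot: key is absent */
                i = (i + 1) & (TABLE_SLOTS - 1);
                (*extra_probes)++;
            }
        }
    }
    return found;
}
```

With linear probing the next slot is often in the same cache line, so this toy understates the cost; with chained buckets or a table much larger than L3, each extra probe is more likely to be a fresh miss.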

I tested this some years ago on a raytracer, and got a tad over 50% more speed when enabling HT compared to disabling it.

As you say, the ray tracer had a lot of cache misses, interspersed with a fair bit of calculation. I'm guessing this is close to the ideal workload, as far as non-synthetic benchmarks go.