Hacker News
by bayindirh 693 days ago
Current processors contain one FPU per core. When you have a couple of FPU-heavy programs on a system, SMT makes sense, because it lets you keep the FPU busy while other, lighter threads play in the sand elsewhere.

OTOH, when you run an HPC node, everything you run wants the FPU, the vector units and the kitchen sink in that general area. Enabling SMT makes the queue longer, and in reality nothing is processed any faster.

So, as a result, SMT makes sense sometimes and is detrimental to performance at other times. Benchmarking, profiling and system tuning are the key. We generally disable SMT on our systems because it lowers performance when the node is fully utilized (which is most of the time).
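
For anyone who wants to check where their own node stands before benchmarking, here is a minimal sketch. It assumes a Linux kernel recent enough to expose the SMT sysfs files under `/sys/devices/system/cpu/smt`; the helper name `read_smt_state` is made up for illustration and is not part of any library.

```python
from pathlib import Path

# Assumption: Linux exposes SMT state in this sysfs directory.
# Older kernels and other operating systems will not have it.
SMT_DIR = Path("/sys/devices/system/cpu/smt")


def read_smt_state(smt_dir=SMT_DIR):
    """Return (active, control) or None if the sysfs files are missing.

    active  -- True if sibling threads are currently online
    control -- the raw control string, e.g. "on" or "off"
    """
    active_file = Path(smt_dir) / "active"
    control_file = Path(smt_dir) / "control"
    if not active_file.exists() or not control_file.exists():
        return None
    active = active_file.read_text().strip() == "1"
    control = control_file.read_text().strip()
    return active, control


if __name__ == "__main__":
    state = read_smt_state()
    if state is None:
        print("SMT sysfs interface not available on this system")
    else:
        active, control = state
        print(f"SMT active: {active}, control: {control}")
```

Run it once with SMT on and once with it off, next to the same fully loaded benchmark, and compare wall-clock times. The script only reports state; it does not change anything.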

1 comment

I'm not really sure why you say "one FPU per core". Are you talking about the programmer-visible model? All current mainstream processors have multiple parallel FP functional units, with multiple FP/SIMD instructions in flight. AFAIK, with SMT enabled the in-flight instructions come from both threads (that is, they are tracked like all other uops). I'm also not sure why you say that enabling SMT "makes the queue longer". Do you mean the pipeline? Or just that the threads are likely to conflict?

Yes, but well-optimized, math-heavy software will already max out the superscalarity of the FPU, i.e. one CPU thread can already schedule multiple FPU-heavy instructions at the same time. If you run such software twice on the same FPU, you will only gain overhead. I guess by queue he meant the processor's internal work queue; the processor pipeline is only half of the picture. Processors keep a small data-dependency graph of the micro-instructions they have to perform, which is used to implement the machine-code instructions that are currently in flight.

> I guess by queue he meant the processor internal work queue...

Yes, I meant the internal one. Also, when you enable SMT, a small tag is added in front of every instruction, noting which logical core of a given physical core owns that instruction. So instead of tagging every instruction with a core-ID, you add a longer tag in the form of core-ID/logical_core-ID.

This extra tagging also makes instructions bigger, so the queue can hold fewer instructions, adding fuel to the already chaotic and choked FPU logistics.

As a result, if you're saturating your FPU(s), SMT can't save you. In fact, it can make you slower.
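
To put rough numbers on the saturation argument, here is a toy model, not a simulation of any real core. It assumes each thread can supply a fixed number of independent FP ops per cycle and the core can execute at most `fp_units` of them; the function names and parameters are invented for illustration. It ignores the extra overhead and contention mentioned above, so it can only show SMT failing to help, not SMT making things slower.

```python
def fp_throughput(fp_units, per_thread_ilp, threads):
    """FP ops executed per cycle in the toy model.

    fp_units       -- max FP ops the core can execute per cycle (assumed)
    per_thread_ilp -- independent FP ops one thread can supply per cycle
    threads        -- hardware threads sharing the core (1 = SMT off)
    """
    demand = per_thread_ilp * threads
    # The core cannot execute more than its FP units allow.
    return min(fp_units, demand)


def smt_speedup(fp_units, per_thread_ilp):
    """Throughput with two threads divided by throughput with one."""
    one = fp_throughput(fp_units, per_thread_ilp, threads=1)
    two = fp_throughput(fp_units, per_thread_ilp, threads=2)
    return two / one


if __name__ == "__main__":
    # Latency-bound thread that leaves FP units idle: SMT helps.
    print(smt_speedup(fp_units=2, per_thread_ilp=0.5))  # 2.0
    # Well-optimized kernel that already fills both units: no gain.
    print(smt_speedup(fp_units=2, per_thread_ilp=2))    # 1.0
```

In this model the second thread only helps while a single thread leaves FP issue slots empty. Once one thread fills them, the speedup is pinned at 1.0, and any real-world overhead pushes it below that.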