by bayindirh 694 days ago
Rendering is one of the workloads that has run either the same or slower with SMT since the beginning. This is because rendering is already math heavy, and your FPU is always active, esp. the dividers (division is typically among the most expensive arithmetic operations for a processor).

SMT shines while waiting for I/O or doing some simple integer stuff. If both your threads can saturate the FPU, SMT is generally slower because of the extra tagging added to the data inside the CPU to note what belongs where.


But the way you make rendering embarrassingly parallel is the way you make web servers parallel; treat the system as a large number of discrete tasks with deadlines you work toward and avoid letting them interact with each other as much as possible.

You don’t worry about how long it takes to render one frame of a digital movie, you worry about how many CPU hours it takes to render five minutes of the movie.

Yes, however in an SMT-enabled processor, there is one physical FPU per two logical cores. The FPU is already busy with the other thread's work, so the threads in that SMT-enabled core take turns using the FPU, which creates a bottleneck.

As a result, you get no speed boost in the best case, and lose some time in the worst case.

Since SMT doesn't magically increase the number of FPUs available in a processor, if what you're doing is math heavy, SMT just doesn't help. The same is true for scientific simulation. I observed the same effect, and verified that saturating the FPU with a single thread does indeed make SMT moot.
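To put rough numbers on the argument above, here is a toy throughput model. It is only a sketch: it assumes a single FPU shared by two hardware threads, and the `fpu_share` and `smt_overhead` parameters are made up for illustration, not measured on any real CPU.

```python
def smt_speedup(fpu_share: float, smt_overhead: float = 0.0) -> float:
    """Throughput of two SMT threads relative to one thread on the same core.

    fpu_share:    fraction of its cycles one thread keeps the shared FPU busy.
    smt_overhead: assumed fractional cost of sharing the core (hypothetical).
    """
    if not 0.0 < fpu_share <= 1.0:
        raise ValueError("fpu_share must be in (0, 1]")
    if not 0.0 <= smt_overhead < 1.0:
        raise ValueError("smt_overhead must be in [0, 1)")
    # The FPU can absorb at most 1/fpu_share threads' worth of work,
    # and two hardware threads can never deliver more than 2x.
    ideal = min(2.0, 1.0 / fpu_share)
    return ideal * (1.0 - smt_overhead)


if __name__ == "__main__":
    for share in (0.25, 0.5, 0.8, 1.0):
        print(share, smt_speedup(share, smt_overhead=0.05))
```

With `fpu_share=1.0` the model gives no gain at all, and any nonzero overhead pushes the result below 1.0, which is the "same or slower" case described above.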

If you have contention around a single part of the CPU then yes, SMT will not help you. The single FPU was an issue on the first Niagara processor as well, but it still had great throughput per socket unless all processes were fighting for the FPU.

If, however, you have multiple FPUs on your processor, then it might be useful to enable SMT. As usual, it pays to tune hardware to the workload you have. For integer-heavy workloads, you might prefer SMT (there are options for up to 8 threads per physical core out there) up to the point either cache misses or backend exhaustion happens.

Current processors contain one FPU per core. When you have a couple of FPU-heavy programs in a system, SMT makes sense, because it allows you to keep the FPU busy while other, lighter threads play in the sand elsewhere.

OTOH, when you run an HPC node, everything you run wants the FPU, the vector units and the kitchen sink in that general area. Enabling SMT makes the queue longer, and nothing is processed faster in reality.

So, as a result, SMT makes sense sometimes, and is detrimental to performance at other times. Benchmarking, profiling and system tuning are key. We generally disable SMT in our systems because it lowers performance when the node is fully utilized (which is most of the time).
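If you want to benchmark with and without sibling threads in play, a first step is finding out which logical CPUs share a core. A minimal sketch, assuming a Linux box that exposes `/sys/devices/system/cpu/smt/active` and per-CPU `topology/thread_siblings_list` files; the helper names are mine, not from any library.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def parse_cpu_list(text: str) -> List[int]:
    """Expand a kernel CPU list such as '0,4' or '0-1,8-9' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def smt_active(sysfs_root: str = "/sys/devices/system/cpu") -> Optional[bool]:
    """Return True/False from smt/active, or None if the file is not readable."""
    try:
        return (Path(sysfs_root) / "smt" / "active").read_text().strip() == "1"
    except OSError:
        return None


def one_cpu_per_core(sibling_lists: Iterable[str]) -> List[int]:
    """Pick the lowest-numbered logical CPU from each sibling group."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists if s.strip()})


if __name__ == "__main__":
    root = Path("/sys/devices/system/cpu")
    lists = [p.read_text() for p in root.glob("cpu[0-9]*/topology/thread_siblings_list")]
    print("SMT active:", smt_active())
    print("one logical CPU per core:", one_cpu_per_core(lists))
```

Pinning the benchmark to the CPUs returned by `one_cpu_per_core` (for example with `taskset` or `os.sched_setaffinity`) approximates an SMT-off run without rebooting, though it is not identical to disabling SMT in firmware.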

I'm not really sure why you say "one FPU per core". Are you talking about the programmer-visible model? All current/mainstream processors have multiple, parallel FP functional units, with multiple FP/SIMD instructions in flight. AFAIK, the in-flight instructions are from both threads if SMT is enabled (that is, tracked like all other uops). I'm also not sure why you say that enabling SMT "makes the queue longer": do you mean the pipeline? Or just that threads are likely to conflict?
Yes, but well optimized math heavy software will already max out the super-scalarity of the FPU. I.e. one cpu thread can already schedule multiple fpu-heavy instructions at the same time. If you run such software twice on the same fpu you will only gain overhead. I guess by queue he meant the processor internal work queue, the processor pipeline is only half of the picture. Processors have a small data-dependency graph of micro-instructions they have to perform. That is used to implement the machine code instructions that are currently in-flight.
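A toy issue model makes this point concrete. It is a sketch only: the port count, the latency and the greedy issue order are assumptions chosen for illustration and do not describe any real microarchitecture.

```python
from typing import List


def cycles_to_finish(threads: List[int], ports: int = 2, latency: int = 4,
                     ops_per_chain: int = 100) -> int:
    """Cycles until the last FP op has issued.

    threads: number of independent dependency chains in each hardware thread.
    ports:   FP ops that may issue per cycle, shared by all threads (assumed).
    latency: cycles before the next op in the same chain may issue (assumed).
    """
    # Per chain: [ops remaining, earliest cycle the next op may issue].
    chains = [[ops_per_chain, 0] for n in threads for _ in range(n)]
    cycle = 0
    while any(c[0] > 0 for c in chains):
        issued = 0
        for c in chains:
            if issued == ports:
                break
            if c[0] > 0 and c[1] <= cycle:
                c[0] -= 1
                c[1] = cycle + latency
                issued += 1
        cycle += 1
    return cycle


if __name__ == "__main__":
    # Well-optimized code: 8 independent chains already fill both ports.
    print(cycles_to_finish([8]), cycles_to_finish([8, 8]))
    # Latency-bound code: a single chain leaves the ports mostly idle.
    print(cycles_to_finish([1]), cycles_to_finish([1, 1]))
```

In this model a thread with eight independent chains keeps both ports full, so a second such thread doubles the work and doubles the time. Two single-chain threads finish in the same time as one, which is the low-ILP case where SMT pays off.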
If you're waiting for IO, you're likely getting booted off the processor by the OS anyway. SMT is most useful when your code doesn't have enough instruction-level parallelism but is still mostly compute bound.
I believe "I/O" here is referring to data movement between DRAM and registers. Not drives or NICs.
Yes, exactly. One exception can be Infiniband, since it can put the received data to RAM directly, without CPU intervention.
DMA is a much older technology. It's just that at some point you do need the CPU to actually look at it.
Infiniband uses RDMA, which is different from ordinary DMA. The sender's IB card sends the data point to point, and the receiving IB card writes it directly to RAM. The IB driver notifies you that the data has arrived (generally via IB-accelerated MPI), and you directly LOAD your data from the memory location [0].

IOW, your data magically appears in your application's memory, at the correct place. This is what makes Mellanox special, and what led NVIDIA to acquire them.

From the linked document:

Instead of sending the packet for processing to the kernel and copying it into the memory of the user application, the host adapter directly places the packet contents in the application buffer.

[0]: https://docs.redhat.com/en/documentation/red_hat_enterprise_...

Linux has had zero copy network support for 15 years. No magic.
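For reference, one long-standing zero-copy path on the send side is `sendfile`. The sketch below uses Python's `socket.sendfile()` over a local socket pair. It is not RDMA and does not show receive-side placement into an application buffer; the helper name is made up.

```python
import socket
import tempfile


def send_via_sendfile(payload: bytes) -> bytes:
    """Push payload from a file to a socket with socket.sendfile and read it back.

    Keep payload small: sender and receiver run in one thread, so the data
    must fit in the socket buffer.
    """
    sender, receiver = socket.socketpair()  # connected stream sockets
    with sender, receiver, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        f.seek(0)
        # socket.sendfile() uses os.sendfile() where the platform supports it,
        # so file data need not pass through a user-space buffer.
        # Elsewhere it falls back to plain send().
        sent = sender.sendfile(f)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    assert sent == len(payload)
    return b"".join(chunks)


if __name__ == "__main__":
    print(len(send_via_sendfile(b"hello" * 100)))
```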