by shaggie76 692 days ago

A grossly over-simplified argument for SMT that resonated with me was that it could keep a precious ALU busy while a thread stalls on a cache miss.

I gather in the early days the LPDDR used on laptops was slower too, and since cores were scarce this was more valuable there. Lately, though, we often have more cores than we can scale with and the value is harder to appreciate. We even avoid scheduling work on a core shared with an important thread to avoid cache contention, because we know the single-threaded performance will be the bottleneck.
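For anyone who wants to try that kind of placement on Linux, here is a minimal sketch. It assumes the kernel exposes SMT siblings under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list; the helper names (`parse_cpu_list`, `one_cpu_per_core`, `read_siblings`) are made up for illustration. It picks one logical CPU per physical core, so work pinned to that set does not land on two siblings of the same core:

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,8" or "0-1,4" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def one_cpu_per_core(siblings_by_cpu, allowed=None):
    """Keep the lowest-numbered usable logical CPU from each sibling group."""
    chosen = set()
    for group in siblings_by_cpu.values():
        usable = [c for c in group if allowed is None or c in allowed]
        if usable:
            chosen.add(min(usable))
    return sorted(chosen)


def read_siblings(root="/sys/devices/system/cpu"):
    """Linux only: map each logical CPU to its SMT siblings (assumed sysfs layout)."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        cpu_dir = os.path.dirname(os.path.dirname(path))
        cpu = int(os.path.basename(cpu_dir)[3:])  # strip the "cpu" prefix
        with open(path) as f:
            result[cpu] = parse_cpu_list(f.read())
    return result


if __name__ == "__main__":
    siblings = read_siblings()
    if siblings and hasattr(os, "sched_setaffinity"):
        # Stay within the CPUs this process is already allowed to use.
        allowed = os.sched_getaffinity(0)
        target = one_cpu_per_core(siblings, allowed)
        if target:
            os.sched_setaffinity(0, target)
        print(sorted(os.sched_getaffinity(0)))
```

This only restricts where the current process may run; whether it helps is workload-dependent, so it is worth measuring with and without the pinning.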

A while back I was testing Efficient/Performance cores and SMT cores for MT rendering with DirectX 12; on my i7-12700K I found no benefit to either: just using P-cores took about the same time to render a complex scene as P+SMT and P+E+SMT. It's not always a wash, though: on the Xbox Series X we found the same test marginally faster when we scheduled work for SMT too.

7 comments

bayindirh 692 days ago

Rendering is one of the scenarios that has been either the same or slower with SMT since the beginning. This is because rendering is already math heavy and your FPU is always active, especially the dividers (division is always the most expensive operation for processors).

SMT shines while waiting for I/O or doing some simple integer stuff. If both your threads can saturate the FPU, SMT is generally slower because of the extra tagging added to the data inside the CPU to note what belongs where.

hinkley 692 days ago

But the way you make rendering embarrassingly parallel is the way you make web servers parallel; treat the system as a large number of discrete tasks with deadlines you work toward and avoid letting them interact with each other as much as possible.

You don’t worry about how long it takes to render one frame of a digital movie, you worry about how many CPU hours it takes to render five minutes of the movie.

bayindirh 692 days ago

Yes, however in an SMT-enabled processor there is one physical FPU per two logical cores. The FPU is already busy with the other thread's work, so the threads in that SMT-enabled core take turns for their computation in the FPU, creating a bottleneck.

As a result, you get no speed boost in the best case, or lose some time in the worst case.

Since SMT doesn't magically increase the number of FPUs available in a processor, if what you're doing is math heavy, SMT just doesn't help. Same is true for scientific simulation, too. I observed the same effect, and verified that indeed saturating the FPU with a thread makes SMT moot.
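A toy model of that argument, under the one-shared-unit assumption described here (this is not how any real core arbitrates, and `simulate` is a made-up name). Each thread alternates between `work` cycles on the shared unit and `stall` cycles waiting on memory; the function reports how busy the unit was:

```python
def simulate(num_threads, work, stall, cycles=10_000):
    """Return the fraction of cycles the single shared unit was busy.

    Each thread loops forever: `work` cycles on the unit, then `stall`
    cycles waiting on memory. Only one thread may use the unit per cycle.
    """
    remaining_work = [work] * num_threads  # unit cycles left in current burst
    stalled_until = [0] * num_threads      # first cycle each thread is ready
    busy = 0
    next_thread = 0                        # round-robin starting point
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if stalled_until[t] <= cycle:
                busy += 1
                remaining_work[t] -= 1
                if remaining_work[t] == 0:
                    # Burst finished: go wait on memory, then start a new burst.
                    remaining_work[t] = work
                    stalled_until[t] = cycle + 1 + stall
                next_thread = (t + 1) % num_threads
                break
    return busy / cycles


if __name__ == "__main__":
    # Threads that never stall: the unit is already saturated by one thread.
    print(simulate(1, work=1, stall=0), simulate(2, work=1, stall=0))
    # Threads that stall half the time: a second thread fills some of the gaps.
    print(simulate(1, work=2, stall=2), simulate(2, work=2, stall=2))
```

With no stalls, the second thread adds nothing because the unit is already fully used; with stalls, utilization rises. The model ignores cache sharing and any arbitration overhead, so it can only show the "no gain" case, not the "slower" one.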

rbanffy 692 days ago

If you have contention around a single part of the CPU then yes, SMT will not help you. The single FPU was an issue on the first Niagara processor as well, but it still had great throughput per socket unless all processes were fighting for the FPU.

If, however, you have multiple FPUs on your processor, then it might be useful to enable SMT. As usual, it pays to tune hardware to the workload you have. For integer-heavy workloads, you might prefer SMT (there are options for up to 8 threads per physical core out there) up to the point either cache misses or backend exhaustion happens.

bayindirh 692 days ago

Current processors contain one FPU per core. When you have a couple of FPU-heavy programs in a system, SMT makes sense, because it allows you to keep the FPU busy while other lighter threads play in the sand elsewhere.

OTOH, when you run an HPC node, everything you run wants the FPU, vector units and the kitchen sink in that general area. Enabling SMT makes the queue longer, and nothing is processed faster in reality.

So, as a result, SMT makes sense sometimes, and is detrimental to performance at other times. Benchmarking, profiling and system tuning are the key. We generally disable SMT in our systems because it lowers performance when the node is fully utilized (which is most of the time).

markhahn 692 days ago

I'm not really sure why you say "one FPU per core". are you talking about the programmer-visible model? all current/mainstream processors have multiple and parallel FP FUs, with multiple FP/SIMD instructions in flight. afaik, the inflight instrs are from both threads if SMT is enabled (that is, tracked like all other uops). I'm also not sure why you say that enabling SMT "makes the queue longer" - do you mean pipeline? Or just that threads are likely to conflict?

rcxdude 692 days ago

If you're waiting for IO, you're likely getting booted off the processor by the OS anyway. SMT is most useful when your code doesn't have enough instruction-level parallelism but is still mostly compute bound.

corysama 692 days ago

I believe "I/O" here is referring to data movement between DRAM and registers. Not drives or NICs.

bayindirh 692 days ago

Yes, exactly. One exception can be Infiniband, since it can put the received data to RAM directly, without CPU intervention.

pjc50 692 days ago

DMA is a much older technology. It's just that at some point you do need the CPU to actually look at it.

bayindirh 692 days ago

Infiniband uses RDMA, which is different from ordinary DMA. Your IB card sends the data to the client point to point, and the IB card writes it directly to RAM. The IB driver notifies you that the data has arrived (generally via IB-accelerated MPI), and you directly LOAD your data from the memory location [0].

IOW, your data magically appears in your application's memory, at the correct place. This is what makes Mellanox special, and what made NVIDIA acquire them.

From the linked document:

> Instead of sending the packet for processing to the kernel and copying it into the memory of the user application, the host adapter directly places the packet contents in the application buffer.

[0]: https://docs.redhat.com/en/documentation/red_hat_enterprise_...

gonzo 692 days ago

Intel’s hyperthreading is really a write pipe hack.

It’s not so much cache misses as allowing the core to run something else while the write completes.

This is why some code scales poorly and other code achieves near linear speed ups.

gpderetta 692 days ago

Why would the core have to wait for the write to complete?

A core stalls on a write only if the store buffer is full. As hyperthreads share the write buffer, SMT makes store stalls more likely, not less (but still unlikely to be the bottleneck).

hinkley 692 days ago

At this point, especially with backside power, I wonder how much cache stalls on one processor result in less thermal throttling both on that processor and neighboring ones.

Maybe we should just be letting these procs take their little naps?

immibis 692 days ago

This leads, in the extreme, to the idea of a huge array of very simple cores, which I believe is something that has been tried but never really caught on.

orbat 692 days ago

That description reminds me of GreenArrays' (https://www.greenarraychips.com) Forth chips that have 144 cores – although they call them "computers" because they're more independent than regular CPU cores, and eg. each has its own memory and so on. Each "computer" is very simple and small – with a 180nm geometry they can cram 8 of them in 1mm^2, and the chip is fairly energy-efficient.

Programming for these chips is apparently a bit of a nightmare though. Because the "computers" are so simple, even e.g. calculating MD5 turns into a fairly tricky proposition, as you have to spread the algorithm out over multiple computers with very small amounts of memory. Something that would be very simple on a more classic processor turns into a very low-level multithreaded ordeal.

adastra22 692 days ago

Worth noting that the GreenArrays chip is 15 years old. 144 cores was a BIG DEAL back then. I wonder what a similar architecture compiled with a modern process could achieve. 1440 cores? More?

markhahn 692 days ago

those weren't "real" cores. you know what current chip has FUs that it falsely calls "cores"? that's right, Nvidia GPUs. I think that's the answer to your question (pushing 20k).

adastra22 691 days ago

In what way were they not “real” cores? They had their own operating environment completely independent of other cores. GPU execution units on the other hand are SIMD--a single instruction stream.

orbat 690 days ago

How are GA144's nodes / "computers" not real cores? They're fully independent, each has its own memory (RAM and ROM), stacks, and registers, its own I/O ports and GPIO pins (some of them), and so on.

https://www.greenarraychips.com/home/documents/greg/PB003-11...

markhahn 691 days ago

it's a very interesting product - sort of the lovechild of a 1980s bitsliced microcode system optimized for application-specific pipelines (systolic array, etc).

do you know whether it's had any serious design wins? I can easily imagine it being interesting for missile guidance, maybe high-speed trading, password cracking/coin mining. you could certainly do AI and GPUs with it, but I'm not sure it would have an advantage, even given the same number of transistors and memory resources.

orbat 690 days ago

I think they were more meant to minimize power use than be very performant. Honestly no clue how widely the GA144 is used in real-world applications, but I guess that since the company still exists they have to be getting money from somewhere

dwattttt 692 days ago

Calling all TIS-100 fans

makerofthings 692 days ago

Sounds like gpu to me.

pavlov 692 days ago

The Xeon Phi was a “manycore” x86 design with lots of tiny CPU cores, something like the original Pentium, but with the addition of 512-bit SIMD and hyperthreading:

https://en.m.wikipedia.org/wiki/Xeon_Phi

rbanffy 692 days ago

IIRC the first Phi has SMT4 in a round robin fashion similar to the Cell PPUs. To make a core run at full speed, you should schedule 4 threads on it.

p_l 692 days ago

The very, very first Phi still had its ROPs and texture units, being essentially a failed GPU with identifying marks filed off (yes, the first units were Larrabee prototypes with video outputs unpopulated)

ColonelPhantom 692 days ago

GPUs are actually SMT'd to the extreme. For example, Intel's Xe-HPG has 8-wide SMT. Other vendors have even bigger SMT: RDNA2 can have up to 16 threads in flight per core.

loa_in_ 691 days ago

Transputers.

bobmcnamara 692 days ago

>I gather in the early days the LPDDR used on laptops was slower too

Oddly, the latency hasn't improved too much - CAS is often 5-10ns for DDR2/3/4/5. Bus width, transfers per second, queueing, and power per bit transferred and stored have, but if a program depends on something that's not in cache and was poorly predicted, the RAM latency is the issue.
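The arithmetic behind that can be sketched as follows. CAS latency is specified in clock cycles, and DDR transfers twice per clock, so the absolute latency is CL × 2000 / (data rate in MT/s) nanoseconds. The speed bins below are assumed, illustrative retail-style parts, not measurements, and the exact figure depends on the bin you pick:

```python
def cas_latency_ns(cl_cycles, data_rate_mts):
    """First-word CAS latency in nanoseconds for DDR memory.

    DDR transfers twice per clock, so the I/O clock in MHz is half the
    data rate in MT/s, and one clock period is 2000 / data_rate ns.
    """
    return cl_cycles * 2000.0 / data_rate_mts


# Illustrative speed bins (assumed typical parts, not measurements).
examples = [
    ("DDR2-800 CL5", 5, 800),
    ("DDR3-1600 CL9", 9, 1600),
    ("DDR4-3200 CL16", 16, 3200),
    ("DDR5-6000 CL30", 30, 6000),
]

for name, cl, rate in examples:
    print(f"{name}: {cas_latency_ns(cl, rate):.2f} ns")
```

For these bins the data rate rises 7.5x from the first row to the last while the CAS latency in nanoseconds moves only a little, which is the point being made: the cycle count grows roughly as fast as the clock does.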

gary_0 692 days ago

I wonder if instead of having SMT, processors could briefly power off the unused ALUs/FPUs while waiting for something further up the pipeline, and focus on reducing heat and power consumption rather than maximizing utilization.

rcxdude 692 days ago

They basically do: it's pretty common to clock gate inactive parts of the ALU, which reduces their power consumption greatly. Modern processor power usage is very workload-dependent for this reason.

usrusr 692 days ago

I consider SMT a relic left over from the days when CPU design was all about performance per square millimeter. We are in the process of substituting that goal with that of performance per watt, or in the process of slowly realizing that our goals have shifted quite a while ago.

I really don't expect SMT to stay much longer. Even more so with timing visibility crosstalk issues lurking and big/small architectures offering more parallelism per chip area where single thread performance isn't in the spotlight. Or perhaps the marketing challenge of removing a feature that had once been the pride of the company is so big that SMT stays forever.

cherioo 692 days ago

Intel is removing SMT from their next gen mobile processor.

My guess is this will help them improve ST perf. We will see how well it works, and whether AMD will follow.

hinkley 692 days ago

Could you, do they, put the “extra” LUs right next to the parts of the chip with the highest average thermal dissipation to even out the thermal load across the chip?

Or stack them vertically, so the least consistently used parts of the chip are farthest away from the heat sink, delaying throttling.

bcrl 689 days ago

Intel has internal papers that investigated the use of the third dimension and the effect it would have on power consumption and performance. Of course it improves things, but it is very difficult to implement in the real world. The first real use of this technique by Intel is coming soon in the form of backside power delivery.

AMD's 3D V-Cache technology shows that stacking an additional layer of transistors has a significant effect on the thermal limits of a modern CPU. The extra cache is strategically placed over parts of the CPU die that use less power, yet those CPUs still have to run at lower temperatures and power settings compared to their non-V-Cache models. Just because you can build it doesn't mean that it will be a good fit for the mass market.

kijiki 692 days ago

> on the Xbox Series X we found the same test marginally faster when we scheduled work for SMT too.

This makes sense: the Series X has GDDR RAM, and so has substantially worse latency than DDR/LPDDR. SMT can help cover that latency, and the higher GDDR bandwidth helps supply the extra memory bandwidth needed to feed both threads.

immibis 692 days ago

Anecdotally, mkp224o (.onion vanity address miner, supposedly compute-bound with little memory access) runs about 5-10% faster on my 24-core AMD with 48 threads than with 24 threads. However, I haven't tried the same benchmark with SMT disabled in firmware.