by amelius 2927 days ago
> The reason is that two threads of the same program will often end up executing similar instruction streams

Why is that bad?

4 comments

Your processor has a certain number of execution units which actually execute individual instructions: maybe 4 floating point units, maybe 8 arithmetic ones, and maybe 1 that can do vector processing (these numbers are not real, but they're good enough to make the point).

So the idea with SMT is that most of the time, lots of the execution units are unused because the thread a) isn't using them at all (e.g. a process doing encryption won't use the floating point units) and/or b) can't use them all because of how the program's written. For example, if I say 'load a random memory address, then add it to a register, then load another random memory address, then add it, etc.', I'm going to be spending most of my time waiting for memory to be loaded.

SMT basically means that you run another program at the same time, so even if the encryption process can't use the floating point units, maybe there's another process that we can schedule that will.

However, imagine my encryption process can use 6 of the 8 arithmetic units. If I have 2 encryption processes scheduled on the same core, I have demand for 12 when there are only 8. So now I have contention for resources, and I won't see much (if any) of a speedup from using SMT.
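
To put rough numbers on that, here is a toy model in Python. It is nothing like a real scheduler: the unit counts are the made-up ones from above, and `smt_speedup`, `encrypt` and `fp_job` are hypothetical names invented for illustration. It assumes each thread wants a fixed number of units of each kind per cycle, and that co-scheduled threads slow down by whichever unit type is most oversubscribed.

```python
# Toy model only: real cores issue individual instructions, not per-unit quotas.
UNITS = {"alu": 8, "fp": 4}  # made-up unit counts, as in the comment above


def smt_speedup(demand_a, demand_b, units=UNITS):
    """Aggregate speedup of co-scheduling two threads vs. running them back to back.

    Each demand dict maps a unit type to how many of those units the thread
    could keep busy per cycle when it has the core to itself.
    """
    # Slowdown is set by the most oversubscribed unit type (never below 1.0).
    slowdown = max(
        1.0,
        max((demand_a.get(k, 0) + demand_b.get(k, 0)) / n for k, n in units.items()),
    )
    # Two threads' worth of work finishes in `slowdown` time units instead of 2.
    return 2.0 / slowdown


encrypt = {"alu": 6}          # hypothetical integer-heavy thread
fp_job = {"alu": 2, "fp": 4}  # hypothetical float-heavy thread

print(smt_speedup(encrypt, encrypt))  # two similar threads: demand 12 for 8 ALUs
print(smt_speedup(encrypt, fp_job))   # complementary threads fit side by side
```

In this model two encryption threads get about 1.33x in aggregate (8/12 of the ideal 2x), while the encryption thread paired with the FP job gets the full 2x. Real hardware shares more than execution units, so the real number for similar threads can be lower still.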

Other comments mention registers and not execution units: I'm suspicious of this, since modern processors have many physical registers (for Skylake, 250+) which they rename aggressively as part of out-of-order execution. Maybe this is different for the SIMD units.

That said, I haven't looked at this stuff since university so could well be wrong on the execution unit vs register comparison.

The contention would actually not be on a per-EU level, but one level higher up. The reservation station has a bunch (~5-8) of ports and typically multiple EUs are connected to one port. Can't use one port for two different things at the same time.

Here's a simplified block diagram of a Skylake core: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

Thanks! Yeah figured I'd be wrong somewhere in there!

There's also some funkiness around the CPU cache, at least at one stage. If your two HT threads are working on the same data, there was a chance you'd get some great cache performance out of it. However, a hyperthread that hits a cache miss can evict lines the other thread still needs in order to make room for its own data. Under those circumstances, performance takes quite a nose dive as both threads stomp over each other's cached data.
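
A rough way to see the stomping effect is a toy simulation in Python. This is a sketch under loud assumptions: a single fully associative LRU cache shared by both threads, strict one-access-each interleaving, and a hypothetical helper `hit_rate`. Real caches are set-associative and multi-level, so treat the numbers as illustrative only.

```python
from collections import OrderedDict


def hit_rate(streams, capacity, rounds):
    """Hit rate of one shared LRU cache when threads take turns accessing it.

    `streams` holds one sequence of cache-line ids per thread; each thread
    walks its sequence once per round, interleaved access by access.
    """
    cache = OrderedDict()  # keys are line ids, ordered oldest to newest
    hits = total = 0
    for _ in range(rounds):
        for accesses in zip(*streams):  # one access from each thread per step
            for line in accesses:
                total += 1
                if line in cache:
                    hits += 1
                    cache.move_to_end(line)  # mark as most recently used
                else:
                    cache[line] = True
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
    return hits / total


same_data = [range(6), range(6)]          # both threads touch lines 0..5
own_data = [range(6), range(100, 106)]    # each thread has its own 6 lines

print(hit_rate(same_data, capacity=8, rounds=10))
print(hit_rate(own_data, capacity=8, rounds=10))
```

With an 8-line cache, the two threads sharing 6 lines hit almost every time after warm-up, while the two threads with separate 6-line working sets (12 lines total) keep evicting each other and, under pure LRU, never hit at all.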

Hyperthreading can be a real mixed bag for performance, though it's generally good and a lot of engineering effort has gone into making it shine. As ever, it's strongly advisable that people benchmark real-world conditions on a server, and it's worth trying with hyperthreading both on and off.

You are pretty much right. A couple of things to add: even hand-optimized asm code won't be able to use all ports all the time with a single instruction stream; the biggest win for hyperthreading is filling the pipeline bubbles caused by memory loads that miss L1 (there is only so much that OoO scheduling can do on your average load).

So on a RISC architecture, this would happen even for non-similar programs, because the number of instructions is smaller? Or would they just duplicate the processing units?
The units aren't dedicated to esoteric instructions; you have functions like "small alu", "big alu", "fp adder", "address generator and memory load".

RISC will perform about the same and you can hyperthread one just fine.

Different instruction streams use different registers. It’s like sharing a bathroom. I can shower while you brush your teeth. There’s more contention when we’re both trying to shower.

"both are using vector instructions (these registers are shared between the two hyperthreads)"

So I'm guessing GP meant there's going to be contention for those registers, and thus no speedup?

Taking a guess, but since they are running similar streams, they put similar loads on the core at any given time. Competition between the main thread and the hyperthread could hurt performance instead?