| HN Mirror


	by mww09 2927 days ago
	> This suggests a long-term compromise solution where threads within a process can use hyperthreading to share a core, but threads in different processes can't. Given that hyperthreads share L1 cache, this might also be better for performance.

	Intuitively this may sound logical; however, in practice it's often not the case. For many workloads, putting two threads of the same program on a core ends up being worse than co-locating threads from different programs. The reason is that two threads of the same program will often end up executing similar instruction streams. A really good example is when both are using vector instructions (these registers are shared between the two hyperthreads).

2 comments

ajross 2927 days ago

In practice it sometimes is the case, though.

SMT/hyperthreading is complicated. If you have a workload dominated by non-local DRAM fetches, it's a huge win because when the CPU pipeline is stalled on one thread it can still issue instructions from the other.

If you have a workload dominated by L1 cache bandwidth, the opposite is true because the threads compete for the same resource.

On balance, on typical workloads, it's a win. But there are real-world problems for which turning it off is a legitimate performance choice.
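The two cases described above can be put into a rough toy model. This is only a sketch under stated assumptions: `stall_fraction` is a hypothetical parameter for the share of cycles a thread spends waiting on memory, and the core is assumed to issue from whichever thread is ready at no extra cost. Real cores are far messier.

```python
def core_utilization(stall_fraction: float, threads: int = 1) -> float:
    """Fraction of cycles in which the core does useful work.

    Toy assumption: each thread wants to issue in (1 - stall_fraction)
    of all cycles, and the core can serve a total demand of at most 1.0.
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    demand = threads * (1.0 - stall_fraction)
    return min(1.0, demand)


def smt_speedup(stall_fraction: float) -> float:
    """Throughput with two hardware threads relative to one."""
    one = core_utilization(stall_fraction, threads=1)
    two = core_utilization(stall_fraction, threads=2)
    # A thread that is always stalled gains nothing either way.
    return two / one if one > 0 else 1.0


if __name__ == "__main__":
    # Mostly stalled on DRAM: the second thread fills the idle cycles.
    print("70% stalled:", smt_speedup(0.7))
    # Rarely stalled: the two threads mostly compete for the same core.
    print("10% stalled:", smt_speedup(0.1))
```

In this model the gain shrinks toward nothing as the stall fraction approaches zero, which is the "threads compete for the same resource" case.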

adrianratnapala 2927 days ago

> workload dominated by non-local DRAM fetches,

How often is that a polite way of saying "software that is inefficient"?

ajross 2927 days ago

Often. But software is what software does, and a CPU that only worked well on "efficient" code will always fail when compared with one that runs the junk faster than the competition.

Also, to be fair: sometimes a DRAM fetch is just inherent in the problem. Big RAM-resident databases are all about DRAM latency because while sure, it's a lot slower than L1 cache, it's still faster than flash. I mean, memcached is a giant monument in praise of the pipeline stall, and it's hugely successful.

adrianratnapala 2927 days ago

> But software is what software does, and a CPU that only

Indeed. It is arguably rational for Intel to take on the burden in a centralised place rather than expecting every two-bit software shop to do it.

But then the existence of this kind of security issue shows that the added complexity is not always worthwhile. We might be forced to accept that computers which actually behave well are a little bit slower than we thought. But in return they will be simpler and more amenable to software optimisation.

imtringued 2927 days ago

I'm not sure there is a correlation. I can think of many situations in which non-local DRAM fetches are more efficient and I can think of many other situations where the opposite is true.

Trees or hashmaps which use non-local DRAM fetches can be more efficient than a brute-force linear search through a contiguous array, given a sufficiently large number of elements.

At the same time, contiguous arrays can be significantly more efficient than linked lists which use non-local DRAM fetches.
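A back-of-the-envelope cost model makes the crossover visible. Every constant below is an assumed round number, not a measurement (64-byte cache lines, 8-byte elements, about 100 ns per dependent DRAM miss, about 5 ns per prefetched cache line when streaming), and the function names are hypothetical.

```python
import math

# Assumed round numbers for illustration only.
CACHE_LINE_BYTES = 64
ELEMENT_BYTES = 8
MISS_NS = 100.0        # one dependent, non-local DRAM fetch
STREAM_LINE_NS = 5.0   # one cache line of a sequential, prefetched scan


def linear_scan_ns(n: int) -> float:
    """Average successful search in a contiguous array: scan half of it."""
    lines = math.ceil((n / 2) * ELEMENT_BYTES / CACHE_LINE_BYTES)
    return lines * STREAM_LINE_NS


def tree_lookup_ns(n: int) -> float:
    """Balanced binary tree: about log2(n) dependent node visits."""
    if n <= 1:
        return MISS_NS
    return math.ceil(math.log2(n)) * MISS_NS


def linked_list_scan_ns(n: int) -> float:
    """Linked list: every hop to the next node is assumed to miss."""
    return (n / 2) * MISS_NS


if __name__ == "__main__":
    for n in (64, 10**6):
        print(n, linear_scan_ns(n), tree_lookup_ns(n), linked_list_scan_ns(n))
```

Under these assumptions the contiguous scan wins for small `n`, the tree wins for large `n`, and the linked list loses to the contiguous array at both sizes shown.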

yk 2927 days ago

Well, most software is pretty inefficient and profits from HT. However, there are a lot of reasons for that. One is writing in something interpreted because it is faster to develop in and the software does not need to be very efficient in the first place. (Not to say that all Python/JS/etc. is inefficient, just that software that needs to be efficient is precisely the kind where one would consider an unmanaged language.) Additionally, things like web servers or DBs often just don't know which piece of data they need next, simply because they don't know the next query, so they tend to profit from HT even though the software is hardly known for being inefficient.

paulirwin 2927 days ago

FWIW, you mention databases, but even some database workloads can have better performance with HT turned off. I first learned this from a DBA at a former job when I was curious as to why they turned HT off. A member of the SQL Server team back in 2005 ran some experiments and found that you can get a 10% performance improvement in some workloads with HT off [1]. I don't know how much of that is still true today, however, as nearly all of my recent experience is PaaS in the cloud.

[1]: https://blogs.msdn.microsoft.com/slavao/2005/11/12/be-aware-...

naikrovek 2927 days ago

> How often is that a polite way of saying "software that is inefficient"?

One could also say "software written with strong OOP patterns" because those are almost always written to benefit the developer later, rather than the CPU and RAM at runtime.

willvarfar 2927 days ago

There are plenty of problems with poor mechanical sympathy.

To take an extreme example, traversing graphs is notorious. Cray and Sun, iirc, have some fascinating processors with many, many hyperthreads, because all the programs do is wait on DRAM, but luckily there are lots of searches that can be done in parallel.

greglindahl 2927 days ago

Typical workloads? What's that? People run hugely diverse workloads on cpus, and they change over time.

ajross 2927 days ago

Building software, serving web pages, executing database queries, running a DOM layout, managing game logic... I mean, come on. You knew what I meant. Those are all tasks with "medium" cache residency and "occasional" stalls on DRAM. Anything that does a bunch of different things with a big-ish world of data.

Conversely: finding a task that is L1-cache-bound but does not frequently have to stall for memory is much harder. The only ones off the top of my head are streaming tasks like software video decode.

greglindahl 2927 days ago

Oh, you meant typical for you.

One task that is L1 cache bound and does not frequently stall for memory (if you code it up well) is matrix multiply.
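One common reading of "code it up well" is cache blocking (tiling); that reading is an assumption, not something stated in the comment. The sketch below shows only the loop structure. In pure Python lists it will not exhibit the cache behaviour a C or Fortran kernel would, and `matmul_blocked` and the `block` size of 32 are hypothetical choices.

```python
def matmul_blocked(a, b, block=32):
    """Multiply matrices a (n x k) and b (k x m) one tile at a time.

    Working on block x block tiles keeps the operands of the inner
    loops small enough to stay in a nearby cache in a compiled kernel.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Multiply one tile of a by one tile of b into a tile of c.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    print(matmul_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
```

The result is the same for any positive block size; only the order in which memory is touched changes.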

kbenson 2927 days ago

> Oh, you meant typical for you.

I'm pretty sure those are meant to be, and I think are, "typical" for the general purpose CPU in use, and thus the general case.

Both mobile and desktop CPUs will be doing DOM layout, DB queries (whether to SQLite or the registry or just the filesystem), and possibly computing game logic on a regular basis.

greglindahl 2927 days ago

It's becoming popular to want to push machine learning tasks onto edge devices like mobile and desktop CPUs, for example apps that include some machine learning. Some of these machine learning algorithms do a lot of matrix multiplies.

"Typical" is highly varied, and it changes.

Edit: here's an example: Google brings on-device machine learning to mobile with TensorFlow Lite

https://thenextweb.com/artificial-intelligence/2017/11/15/go...

adrianN 2927 days ago

I don't know whether it's still true, but a couple of years ago a majority of the world's CPU cycles were spent sorting things.

smaddox 2927 days ago

That's an interesting claim. Do you remember the source?

amelius 2927 days ago

> The reason is that two threads of the same program will often end up executing similar instruction streams

Why is that bad?

CHY872 2927 days ago

Your processor has a certain number of execution units which can actually execute individual instructions, maybe like 4 floating point units, maybe 8 arithmetic ones, and maybe 1 that can do vector processing (these numbers are not real, but are like, good enough for sake of message).

So the idea with SMT is that most of the time, lots of the execution units are unused because the thread a) isn't using them at all (e.g. a process to do encryption won't use the floating point units) and/or b) can't use them all because of how the program's written (for example, if I say 'load a random memory address, then add it to a register, then load another random memory address, then add it, etc.' I'm going to be spending most of my time waiting for memory to be loaded).

SMT basically means that you run another program at the same time, so even if the encryption process can't use the floating point units, maybe there's another process that we can schedule that will.

However, imagine my encryption process can use 6 of the 8 arithmetic units. If I have 2 encryption processes scheduled on the same core, I have demand for 12 when there are only 8. So now I have contention for resources, and I won't see a speedup from using SMT.
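The arithmetic in the paragraph above can be sketched as a toy model. It reuses the comment's own made-up unit counts (4 floating point, 8 arithmetic, 1 vector); `co_run_speedup` and the per-thread demand dictionaries are hypothetical, and the model assumes both threads slow down equally when a unit type is oversubscribed.

```python
# Made-up unit counts taken from the comment above, not from real hardware.
UNITS = {"alu": 8, "fp": 4, "vec": 1}


def co_run_speedup(thread_a: dict, thread_b: dict, units: dict = UNITS) -> float:
    """Throughput of two co-scheduled threads relative to one running alone.

    Each thread is described by how many units of each kind it could keep
    busy per cycle. If combined demand for any kind exceeds supply, both
    threads are scaled down by the worst oversubscription ratio.
    """
    scale = 1.0
    for kind, available in units.items():
        wanted = thread_a.get(kind, 0) + thread_b.get(kind, 0)
        if wanted > available:
            scale = min(scale, available / wanted)
    return 2.0 * scale


if __name__ == "__main__":
    encryption = {"alu": 6}
    fp_job = {"fp": 4}
    vector_job = {"vec": 1}
    print(co_run_speedup(encryption, fp_job))         # no shared demand
    print(co_run_speedup(encryption, encryption))     # 12 wanted, 8 available
    print(co_run_speedup(vector_job, vector_job))     # 2 wanted, 1 available
```

In this model, two encryption threads top out well short of doubling, and two threads that both want the single vector unit gain nothing at all.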

Other comments mention registers and not execution units: I'm suspicious of this, since modern processors have many registers (for Skylake, 250+) which they remap between aggressively as part of pipelining. Maybe this is different for the SIMD units.

That said, I haven't looked at this stuff since university so could well be wrong on the execution unit vs register comparison.

blattimwind 2927 days ago

The contention would actually not be on a per-EU level, but one level higher up. The reservation station has a bunch (~5-8) of ports and typically multiple EUs are connected to one port. Can't use one port for two different things at the same time.

Here's a simplified block diagram of a Skylake core: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

CHY872 2927 days ago

Thanks! Yeah figured I'd be wrong somewhere in there!

Twirrim 2927 days ago

There's also some funkiness around the CPU cache, at least at one stage. If your two HT threads are working on the same data, there was a chance you'd get some great cache performance out of it. However, when one hyperthread hits a cache miss, it can evict cache lines the other thread still needs in order to make room for its own data. Under those circumstances, performance takes quite a nosedive as the threads stomp over each other.

Hyperthreading can be a real mixed bag for performance, though it's generally good and a lot of engineering effort has gone into making it shine. As ever, it's strongly advisable that people benchmark real-world conditions on a server, and it's worth giving it a shot with hyperthreading turned both on and off.
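For that kind of benchmarking on Linux, a first step is finding out which logical CPUs share a physical core. A minimal sketch, assuming the standard sysfs topology file is present; the helper names `parse_cpu_list` and `smt_siblings` are hypothetical, and on other systems the function simply falls back to the CPU itself.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list:
    """Parse a kernel CPU list such as '0,4' or '0-3,8' into integers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu: int = 0) -> list:
    """Logical CPUs sharing a core with `cpu`, or [cpu] if unknown."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        # Not Linux, or the topology file is unavailable.
        return [cpu]


if __name__ == "__main__":
    print("CPU 0 shares a core with:", smt_siblings(0))
```

With the sibling list in hand, a benchmark can be pinned to both siblings of one core or to two separate cores (for example with `os.sched_setaffinity` on Linux) and the two runs compared.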

gpderetta 2927 days ago

You are pretty much right. A couple of things to add: even hand-optimized asm code won't be able to use all ports all the time with a single instruction stream; the biggest win for hyperthreading is filling the pipeline bubbles caused by memory loads that miss L1 (there is only so much that OoO scheduling can do on your average load).

amelius 2927 days ago

So on a RISC architecture, this would happen even for non-similar programs, because the number of instructions is smaller? Or would they just duplicate the processing units?

Dylan16807 2927 days ago

The units aren't dedicated to esoteric instructions, you have functions like "small alu", "big alu", "fp adder", "address generator and memory load".

RISC will perform about the same and you can hyperthread one fine.

jfoutz 2927 days ago

Different instruction streams use different registers. It’s like sharing a bathroom. I can shower while you brush your teeth. There’s more contention when we’re both trying to shower.

uryga 2927 days ago

"both are using vector instructions (these registers are shared between the two hyperthreads)"

So I'm guessing GP meant there's going to be contention for those registers, and thus no speedup?

xeroaura 2927 days ago

Taking a guess, but since they are running similar streams, they have similar loads at a specific time. Competition between main thread and hyperthread could hurt performance instead?
