Hacker News
by Xcelerate 4310 days ago
Could someone give me a simple explanation of what exactly hyperthreading does? They tout 16 logical cores and 8 physical cores in this new chip. I've read the Wikipedia page on it, but it gets too technical.

I do molecular dynamics simulations with LAMMPS, and I've noticed performance on my laptop is best with 4 cores. Using all 8 "virtual cores" is actually quite a bit slower.

A core is a mostly independent processing unit within a larger package. Some hardware resources (like the memory controller, at least in non-NUMA devices) are shared between all cores, but many are duplicated for each core. Examples of core-local resources are the separate integer, floating point, and sometimes vector execution units (boxes that you can stick some data into and get a result back from some number of cycles later), and some (but not all, depending on the chip) of the layers of cache that sit between each core and main memory.

In hyperthreaded processors, each core can be further split into two "threads". These threads share most of their hardware resources; you can think of them as a thin veneer over a single core. These threads execute simultaneously, making use of whatever resources their partner isn't using at the moment.

Some examples (in each case, assume a single-core processor with 2 hardware threads): imagine you're running a thread, and it needs to access main memory before it can continue. Depending on the chip, this will take hundreds or even thousands of cycles before the thread can continue. Hyperthreading is one way to make use of this time: the other thread can run at full steam while the first is waiting for its results to come back from memory.
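A toy model makes the memory-stall example concrete. The Python sketch below is an illustration under stated assumptions, not a description of any real chip: each thread alternates `compute` busy cycles with a `stall` of waiting on main memory, and the core executes at most one compute cycle per clock, taking it from whichever thread is not stalled. The function names and parameters are hypothetical.

```python
def sequential_cycles(compute: int, stall: int, bursts: int, n_threads: int = 2) -> int:
    """One hardware thread: the toy threads run back to back and the core idles during every stall."""
    return n_threads * bursts * (compute + stall)


def smt_cycles(compute: int, stall: int, bursts: int, n_threads: int = 2) -> int:
    """Idealised SMT (toy assumption): all threads sit on ONE core, which executes one
    compute cycle per clock from whichever thread is not waiting on memory.
    Requires compute >= 1."""
    ready_at = [0] * n_threads            # clock at which each thread may issue again
    compute_left = [compute] * n_threads  # compute cycles left in the current burst
    bursts_left = [bursts] * n_threads
    turn = 0                              # round-robin pointer so no thread starves
    clock = 0
    while any(bursts_left):
        for k in range(n_threads):
            t = (turn + k) % n_threads
            if bursts_left[t] and ready_at[t] <= clock:
                compute_left[t] -= 1
                if compute_left[t] == 0:
                    # Burst finished: the thread now waits `stall` cycles on main memory.
                    bursts_left[t] -= 1
                    compute_left[t] = compute
                    ready_at[t] = clock + 1 + stall
                turn = (t + 1) % n_threads
                break  # only one compute cycle per clock in this toy core
        clock += 1
    # Each thread is finished once its last memory access has come back.
    return max(ready_at)


if __name__ == "__main__":
    for stall in (0, 100):
        seq = sequential_cycles(compute=10, stall=stall, bursts=10)
        smt = smt_cycles(compute=10, stall=stall, bursts=10)
        print(f"stall={stall}: one hardware thread {seq} cycles, two hardware threads {smt} cycles")
```

With `compute=10, bursts=10` and no stalls, both functions report 200 cycles: the core is always busy, so the second hardware thread has nothing to fill. With `stall=100`, running the threads back to back costs 2200 cycles, while the SMT version finishes in about 1190, barely more than the 1100 cycles one thread needs on its own.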

Another positive example: you're running some floating point DSP code (perhaps your music player's equalizer) at the same time as you are compiling a new build of a program. The DSP code will use a mix of integer and floating point resources, while the compiler will probably not need the floating point units at all. Hyperthreading allows the music player to use resources that would otherwise sit idle while the compiler is running. The DSP code will slow the compiler down because it competes for things like integer resources (which are needed for pointer arithmetic, for instance); however, there will still likely be an improvement over normal multitasking on a single hardware thread.

Now, for a negative example: you are running two very demanding threads. These threads are painstakingly programmed to use almost every resource they possibly can at any moment, they very rarely need to stall to access memory, etc. In this case, the two threads will only waste time fighting over the same resources, kicking each other out of cache, etc., and it would ultimately be more efficient to disregard hyperthreading and run the threads sequentially.

Another negative example: you are running two instances of the same program. This will result in good utilization of some resources (such as the instruction cache, because both threads are executing the same code) but practically guarantees contention over the execution units, even if the program isn't that demanding.
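The positive and negative examples above can be illustrated with a second toy model, again a Python sketch with made-up parameters rather than how a real core schedules work. It assumes one integer unit (`"I"`) and one floating point unit (`"F"`), that each hardware thread issues at most one operation per cycle in program order, and that the two threads take turns getting first pick. The workloads are hypothetical strings standing in for a compiler (integer only), a DSP routine (alternating integer and floating point), and a demanding floating point kernel.

```python
def shared_unit_cycles(programs: list[str], units: dict[str, int]) -> int:
    """Cycles for programs sharing one core's execution units (toy SMT model).

    Each program is a string of unit names, e.g. "IFIF". Each cycle, every thread may
    issue its next op if a unit of that kind is still free; threads alternate first pick."""
    pcs = [0] * len(programs)
    clock = 0
    first = 0
    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        free = dict(units)  # units still available this cycle
        for k in range(len(programs)):
            t = (first + k) % len(programs)
            if pcs[t] < len(programs[t]):
                op = programs[t][pcs[t]]
                if free[op] > 0:
                    free[op] -= 1
                    pcs[t] += 1  # in-order: a blocked thread simply waits
        first = (first + 1) % len(programs)
        clock += 1
    return clock


def back_to_back_cycles(programs: list[str]) -> int:
    """One hardware thread: programs run one after another, one op per cycle."""
    return sum(len(p) for p in programs)


UNITS = {"I": 1, "F": 1}  # hypothetical: one integer unit, one floating point unit
compiler = "I" * 100      # hypothetical integer-only workload
dsp = "IF" * 50           # hypothetical mixed integer/floating point workload
md_kernel = "F" * 100     # hypothetical floating-point-bound workload

if __name__ == "__main__":
    print("compiler + DSP:", shared_unit_cycles([compiler, dsp], UNITS),
          "vs", back_to_back_cycles([compiler, dsp]))
    print("two FP kernels:", shared_unit_cycles([md_kernel, md_kernel], UNITS),
          "vs", back_to_back_cycles([md_kernel, md_kernel]))
```

In this model the compiler and DSP pair finishes in 150 cycles instead of 200, because the DSP's floating point operations slip into cycles the compiler leaves the floating point unit unused, while two copies of the floating-point-bound kernel take the same 200 cycles either way. A real core has several units of each kind and schedules out of order, so only the direction of the effect carries over, not the numbers.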

To sum it up, hyperthreading is usually a net positive for desktops, where you have a very heterogeneous (and often not anywhere close to optimally programmed) mix of programs that need to run at once, and usually a net negative for high performance computing programs like your molecular dynamics simulation, where every thread is executing the same extremely demanding program at once.
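For a workload like the LAMMPS runs in the question, one practical consequence is to run one thread or process per physical core rather than per logical CPU. Here is a minimal Linux-only Python sketch for finding such a set of CPUs, assuming the kernel exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` (the logical CPUs that share a core, in comma and range notation); the helper names are hypothetical.

```python
import glob
import os


def parse_cpu_list(text: str) -> list[int]:
    """Parse the Linux "cpulist" format, e.g. "0,4", "0-1" or "0-3,8-11"."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(sibling_lists: list[str]) -> list[int]:
    """Keep the lowest-numbered logical CPU of each physical core.

    Hyperthread siblings report the same sibling list, so taking the minimum of
    each list and de-duplicating leaves exactly one CPU per physical core."""
    return sorted({min(parse_cpu_list(s)) for s in sibling_lists if s.strip()})


def read_sibling_lists() -> list[str]:
    """Read thread_siblings_list for every online logical CPU (assumed sysfs layout)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    cpus = one_cpu_per_core(read_sibling_lists())
    print("one logical CPU per physical core:", cpus)
    if cpus:
        try:
            # Restrict this process (and children it starts later) to those CPUs.
            os.sched_setaffinity(0, cpus)
        except (AttributeError, OSError) as err:  # not Linux, or CPUs not permitted
            print("could not set affinity:", err)
```

On a 4-core, 8-thread laptop like the one in the question, this should print four CPU numbers, one sibling from each core. The same list could instead be passed to `taskset -c` when launching the simulation; whether pinning helps beyond simply running 4 threads is worth measuring on the actual machine.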

EDIT: And to go a bit further and explain what makes GPUs special: they're basically the inverse of a hyperthreaded CPU, great at running a lot of homogeneous threads. Instead of having independent threads sharing the same resources, they have the same logical thread shared across cores that have their own independent execution units (many designs share the instruction pointer among many hardware threads, so each executes the same instruction at any given moment on different inputs).
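The GPU model described in the edit can be sketched as a toy "warp" in Python: every lane applies the same instruction at the same step, and when lanes disagree on a branch, the warp runs both sides with a mask that decides which lanes keep the result. This is a simplification of what many designs do, not the behaviour of any particular GPU; the function and its arguments are hypothetical.

```python
from typing import Callable

Instr = Callable[[float], float]  # one "instruction" applied to a lane's value


def run_warp_branch(
    cond: Callable[[float], bool],
    then_prog: list[Instr],
    else_prog: list[Instr],
    lanes: list[float],
) -> tuple[list[float], int]:
    """Toy SIMT warp executing `if cond(x): then_prog else: else_prog` on every lane.

    All lanes share one instruction pointer, so the warp steps through each side
    in turn; a per-lane mask decides which lanes actually keep the result.
    Returns the new lane values and the number of instruction steps issued."""
    taken = [cond(x) for x in lanes]
    steps = 0
    for prog, active in ((then_prog, taken), (else_prog, [not t for t in taken])):
        if not any(active):
            continue  # no lane needs this side, so the warp skips it entirely
        for instr in prog:
            # Same instruction for every lane; inactive lanes keep their old value.
            lanes = [instr(x) if a else x for x, a in zip(lanes, active)]
            steps += 1
    return lanes, steps


if __name__ == "__main__":
    negate = [lambda x: -x]
    double_plus_one = [lambda x: x * 2, lambda x: x + 1]
    print(run_warp_branch(lambda x: x < 0, negate, double_plus_one, [-1.0, 2.0, -3.0, 4.0]))
    print(run_warp_branch(lambda x: x < 0, negate, double_plus_one, [1.0, 2.0]))
```

The step count shows why divergence is costly: in the first call the lanes disagree, so the warp pays for both the one-instruction and the two-instruction side (3 steps) even though each lane needed only one of them, while in the second call every lane takes the same side and only 2 steps are issued.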

That is a great explanation; I'll be saving that one for the next time someone asks me! Thanks!

This is a great explanation, thanks.

What CPU? Some of the earlier hyperthreaded systems were notably less effective than the current stuff.

The simple explanation is that you have a core with its set of execution resources. Instead of using those resources to serve just a single execution context, the processor has two execution contexts, which run independently of each other while sharing the resources. This can potentially result in large gains when you have a workload that often leaves execution stalled waiting on RAM, though less than you might guess, because there are overheads and because modern processors are already able to extract a fair amount of parallelism out of a single thread.

It works out less well for software that sees non-trivial overhead when running more threads, or when more threads increase cache pressure too much.

A Haswell core can execute four instructions per cycle, but sometimes a thread doesn't have four instructions that are ready to execute because they're waiting for something (like a cache miss). In that case, SMT (simultaneous multithreading, the generic name for what Intel calls Hyper-Threading) allows the processor to use that idle capacity to execute instructions from a different thread.

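The issue-slot picture in this comment can be put into numbers with a small Python sketch. It assumes a core that issues up to four instructions per cycle (the Haswell figure quoted above) and summarises each thread by a hypothetical `ilp` value, the number of its instructions that are ready in a typical cycle. Real instruction streams vary from cycle to cycle, so treat this as an illustration only.

```python
import math

WIDTH = 4  # issue slots per cycle, the "four instructions per cycle" figure quoted for Haswell


def alone_cycles(work: int, ilp: int) -> int:
    """Cycles for one thread running alone; it fills at most `ilp` (>= 1) of the WIDTH slots per cycle."""
    return math.ceil(work / min(ilp, WIDTH))


def smt_issue_cycles(work_a: int, ilp_a: int, work_b: int, ilp_b: int) -> int:
    """Two threads share the WIDTH issue slots each cycle; the thread that picks
    first alternates, and the other one gets whatever slots are left over."""
    left = [work_a, work_b]
    ilp = [ilp_a, ilp_b]
    first = 0
    clock = 0
    while left[0] > 0 or left[1] > 0:
        slots = WIDTH
        for t in (first, 1 - first):
            issued = min(ilp[t], slots, left[t])
            left[t] -= issued
            slots -= issued
        first = 1 - first
        clock += 1
    return clock


if __name__ == "__main__":
    for ilp in (1, 2, 4):
        back_to_back = 2 * alone_cycles(100, ilp)
        shared = smt_issue_cycles(100, ilp, 100, ilp)
        print(f"ilp={ilp}: back to back {back_to_back} cycles, SMT {shared} cycles")
```

Two threads with `ilp=2` finish together in 50 cycles, half the 100 cycles they need back to back, because each fills the slots the other leaves empty. Two threads that can already fill all four slots (`ilp=4`) take 50 cycles either way, which matches the point that SMT only helps when there is idle capacity to use.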
Hyperthreading is basically a way to emulate multiple cores while sharing the more rarely used units (like floating point) between them. This way, a normal application can use the multiple cores and actually run in parallel most of the time. You save a lot of silicon area, but when both threads try to execute the same rare instruction at the same time, they can't run in parallel.

The problem with molecular simulation is that it's almost entirely composed of floating point instructions. Thus, hyperthreading can't run them in parallel at all.

If you want the academic answer, here's a survey of the literature that I wrote on this topic back in 1996 (Section 3 is where it starts to talk about hardware):

http://oirase.annexia.org/multithreading.ps

Yeah, I suppose that for your case, HT does nothing at best.

The more CPU-bound the tasks are, and the more similar they are to each other, the less difference HT is going to make.

It made sense for single-core P4s and Atoms, but for an 8-core processor, the efficacy of HT is debatable.