Hacker News
by gpderetta 2363 days ago
Also, if both threads are pinned to separate cores and nothing else is supposed to run on those cores, it is pointless to use anything but spinlocks, as there is no other thread that could make better use of the core (and you probably do not want the core to go into a low-power state waiting for an interrupt).

You're discounting energy use. This is a bad strategy on a battery powered device.
In addition to energy, power use is another reason; parking a core will allow it to cool down thermally, so that when it is put back in use (milli)seconds later, it can run at a higher clock speed for longer.
Intel has a low-power PAUSE instruction that is literally a ‘rep nop’. I assume Arm has one too.
That's not extremely low power compared to real low power states. The main advantage of PAUSE is the scheduling of the other hyperthread (if it exists) and maybe not generating a gratuitous L1 / MESI workload at a crazy rate (well if programmed correctly that should be quite cheap in lots of cases, but still...). To my knowledge this does not cut any clock, so the power economy is going to be minimal.
IIRC the mov imm, %ecx; rep nop sequences are somewhat special cased by modern architectures (and this fact is the only reason why you even would want to execute such code). On the other hand the energy savings are mostly negligible and it is simply an SMT-level equivalent of sched_yield()
Actually, I heard that the last few generations of Intel CPUs (from Skylake onward) enter a low-power state more aggressively on PAUSE, and that the latency of getting out of a PAUSE went up from tens of cycles to hundreds. No first-hand testing though.
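For anyone who wants to see what "spin with PAUSE" looks like in practice, here is a minimal sketch of a test-and-test-and-set spinlock in Rust. The names `SpinLock`, `with` and `spin_count` are made up for illustration. `std::hint::spin_loop()` is the portable spelling of the hint; on x86 it lowers to PAUSE. The sketch does no thread pinning (the standard library has no affinity API), so it only shows the waiting loop, not the dedicated-core setup discussed above.

```rust
use std::cell::UnsafeCell;
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical minimal spinlock, for illustration only.
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: all access to `value` is serialized through `locked`.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub fn new(value: T) -> Self {
        SpinLock {
            locked: AtomicBool::new(false),
            value: UnsafeCell::new(value),
        }
    }

    /// Runs `f` with exclusive access to the protected value.
    /// Note: if `f` panics, the lock is never released (sketch only).
    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        loop {
            // Try to take the lock.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            // Wait on plain loads so the cache line is not hammered with
            // writes, and issue the spin hint (PAUSE on x86) each iteration.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
        let result = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}

/// Spawns `threads` threads that each increment a shared counter
/// `iters` times under the spinlock, then returns the final count.
pub fn spin_count(threads: usize, iters: usize) -> usize {
    let lock = Arc::new(SpinLock::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..iters {
                    lock.with(|n| *n += 1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    lock.with(|n| *n)
}
```

Calling `spin_count(4, 10_000)` should give back 40000 if the lock is doing its job. Whether the hint saves any meaningful power is exactly the open question in this subthread; the code does not settle that.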
Yes, you wouldn't use this strategy on a battery powered device. It is for very specialised applications.
> and nothing else is supposed to run on those cores

That's quite the corner case.

This is exactly the situation for a well-balanced parallel work queue. You want to start as many threads as there are cores and run them full tilt, pulling work off the queue until it is empty. This is very common if you're running a large-scale cluster that is dedicated to a particular task (e.g. servicing a special kind of query, encoding videos, rendering), or even for a parallel Photoshop filter.
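A rough sketch of that pattern in Rust, assuming the "queue" is just a fixed list of items and a shared atomic index (a real work queue would likely be more elaborate). The function name `parallel_sum_of_squares` and the squaring workload are placeholders. It starts one worker per available core and each worker keeps claiming the next index until the list is exhausted; it does not pin threads, since the standard library offers no affinity call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical example workload: sum of squares, computed by one
/// worker thread per available core pulling items off a shared list.
pub fn parallel_sum_of_squares(items: Vec<u64>) -> u64 {
    // One worker per core the OS reports; fall back to 1 if unknown.
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    let items = Arc::new(items);
    // Index of the next unclaimed item; this plays the role of the queue head.
    let next = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let items = Arc::clone(&items);
            let next = Arc::clone(&next);
            thread::spawn(move || {
                let mut local = 0u64;
                loop {
                    // Claim one item; each index is handed out exactly once.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= items.len() {
                        break; // queue drained
                    }
                    local += items[i] * items[i];
                }
                local
            })
        })
        .collect();

    // Combine the per-worker partial sums.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

For example, `parallel_sum_of_squares((1..=100).collect())` returns 338350. Because workers claim items one at a time, a slow item on one core does not leave the others idle, which is the "well-balanced" part.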
> This is exactly the situation for a well-balanced parallel work queue.

What if your work queues are running on a multitasking operating system that runs services? And what about a hypervisor?

For this technique you generally dedicate some core(s) to those miscellaneous threads and flag the rest as unschedulable unless a thread is specifically assigned to them.

If you’re not sharing cores between VMs it’s typical to do the same at the hypervisor layer.

This is the normal use case for any DPDK software. I think anyone involved in HPC or high-speed networking knows that this is pretty common.
Yes, in practice you have to dedicate the whole machine to a specific application, but the one-thread-per-isolated-core model is a proven one for high-performance/low-latency applications.