Hacker News
by uudecoded 1693 days ago
I am curious, since Intel is relying more and more on P and E cores as well, is there any reference or research available for optimizing multithreaded userland process tasks with varying QoS?

A lot of the pthreads books I see are from the late 90s. Is there a more recent reference? What's the best way to write cross-platform (e.g. not Grand Central Dispatch) multithreaded apps with these new chip architectures?

4 comments

I just ran into this tweet about Intel: https://twitter.com/DeepSchneider/status/1456314755380097027

They move everything that isn't foreground to an efficiency core, which is awful for compiling or video processing.

There's apparently a BIOS option that will use ScrollLock for disabling the efficiency cores entirely.

Thank you for sharing this, it's interesting - I've also gotten the impression (but lack citation) that Intel E cores are targeted at thermal isolation instead of power minimization as the M1 may target.

This is front of mind for me since reading a Cloudflare blog regarding AVX-512 instructions invoking dynamic frequency scaling to manage power/thermal capacity on chip. (https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...)

If this is happening on Xeons, it's probably happening on consumer dies as well, in addition to other non-obvious power/performance optimizations. Perhaps this is why Alder Lake is pumping up the TDP[1]?

edit: [1] https://news.ycombinator.com/item?id=29106860

> They move everything that isn't foreground to an efficiency core, which is awful for compiling or video processing.

Windows has had that (foreground boost) for a long time, Intel probably piggybacks on it. It'll be interesting to see how it will behave on Linux, which AFAIK never had that mechanism (except perhaps on Android).

For Linux, I believe it will dispatch based on the niceness level and overall CPU utilization - past a certain threshold, it will start putting work at default or higher priority onto the performance cores.
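To make the niceness part concrete, here is a minimal sketch of tagging a background worker thread with a higher nice value on Linux. It assumes Linux's per-thread behaviour of `setpriority` (other systems apply it to the whole process or reject the thread ID), and `run_with_niceness` is a made-up helper name. Whether the scheduler then prefers an efficiency core for that thread is up to the kernel; this only supplies the hint.

```python
import os
import threading


def run_with_niceness(work, niceness=10):
    """Run work() in a thread after trying to raise that thread's nice value.

    Hypothetical helper for illustration. Returns a dict with the nice value
    that ended up applied (None if it could not be set) and work()'s result.
    """
    result = {"nice": None}

    def target():
        try:
            # Kernel thread ID; on Linux, PRIO_PROCESS with a TID targets
            # only that thread rather than the whole process.
            tid = threading.get_native_id()
            current = os.getpriority(os.PRIO_PROCESS, tid)
            # Never ask for a lower nice value than we already have, since
            # raising priority normally needs extra privileges.
            os.setpriority(os.PRIO_PROCESS, tid, max(current, niceness))
            result["nice"] = os.getpriority(os.PRIO_PROCESS, tid)
        except (OSError, AttributeError):
            # Unsupported platform or restricted sandbox: run unhinted.
            pass
        result["value"] = work()

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return result
```

Something like `run_with_niceness(lambda: sum(range(10**6)))` would run the sum in a deprioritized thread while the calling thread keeps its original nice value.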

For the Mac, I believe you have equivalent access to scheduling through POSIX threads and GCD, but the scheduling configuration is likely far more approachable in GCD.

Also: On M1, there is an added capability to run in a stricter memory model to speed up x86_64 emulation. This is only available on the performance cores, which is one of the reasons people observe non-native code draining the battery quicker.

M1's cores are homogenous and all of them support TSO.
Saying that the M1's cores are homogenous is pretty misleading/confusing, as the Icestorm and Firestorm cores are rather different. big.LITTLE/DynamIQ-type architectures are usually considered heterogeneous even if all the actors share an ISA (because you can't treat all the cores

But as to the latter assertion, you're indeed correct per Joe Groff (Swift compiler engineer at Apple): https://twitter.com/jckarter/status/1332045390057639939

> The A12 only supported TSO on the performance cores. The M1 supports it on all cores.

Yeah, when I said "homogenous" I was solely referring to the ISA. Trying to enable TSO on a Tempest core will fail with an undefined instruction exception, but I think A12Z is ISA homogenous in userspace.
My understanding is that as long as you specify the QoS correctly, GCD takes care of it (as it has done for Apple Ax SoCs on iPhone).
I think people are just now starting that research and blog posts like this one are all we have so far.
Asymmetric multiprocessing has been a big topic of research for many, many years.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C33&q=asy...

Yeah, but I'd bet 90% of that research makes wacky assumptions that don't apply to real processors. When real hardware becomes available you start over from scratch. (Source: I am a former CS researcher.)
Hasn't ARM's big.LITTLE architecture been the norm in widely used processors for a decade or so?