by hajile 745 days ago
I think the answer here is dedicated cores of different types on the same die.

Some cores will be high-performance, OoO CPU cores.

Now you make another core with the same ISA, but built for a different workload. It should be in-order. It should have a narrow ALU with fairly basic branch prediction. Most of the core will be occupied by two 1024-bit SIMD units and an 8-16x SMT implementation to hide the latency of the threads.

If your CPU and/or OS detects that a thread is packed with SIMD instructions, it will move the thread over to the wide, slow core with latency hiding. Normal threads with low SIMD instruction counts will be put through the high-performance CPU core.
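
A rough sketch of what that placement decision could look like, assuming the OS can sample per-thread counters for retired instructions and retired SIMD instructions. The `ThreadSample` type, the `pick_core` function, and the 0.5 threshold are all made up for illustration; a real scheduler would need hysteresis and migration-cost accounting on top of this.

```python
from dataclasses import dataclass

# Hypothetical cutoff: fraction of retired instructions that must be SIMD
# before a thread counts as "SIMD-heavy". Not taken from any real scheduler.
SIMD_FRACTION_THRESHOLD = 0.5


@dataclass
class ThreadSample:
    """Counter snapshot for one thread over a sampling window (hypothetical)."""
    retired_instructions: int
    retired_simd_instructions: int


def pick_core(sample: ThreadSample,
              threshold: float = SIMD_FRACTION_THRESHOLD) -> str:
    """Return 'wide' for the in-order SIMD core, 'fast' for the OoO core."""
    if sample.retired_instructions == 0:
        # No data yet: default to the general-purpose core.
        return "fast"
    simd_fraction = (sample.retired_simd_instructions
                     / sample.retired_instructions)
    return "wide" if simd_fraction >= threshold else "fast"
```

With these assumptions, a thread that retired 800 SIMD instructions out of 1000 lands on the wide core, and one with 50 out of 1000 stays on the high-performance core.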

3 comments

Having different vector widths on different cores isn't currently feasible, even with SVE. So all cores would need to support 1024-bit SIMD.

I think it's reasonable for the non-SIMD focused cores to do so via splitting into multiple micro-ops or double/quadruple/whatever pumping.
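
To make the multi-pumping idea concrete, here is a toy model in Python, assuming 32-bit lanes and a 256-bit physical datapath (both numbers are illustrative, as is the `pumped_add` name). One architectural 1024-bit add is issued as several passes over a narrower unit, so software sees the same result and only the pass count changes.

```python
LANE_BITS = 32  # assumed element width for this sketch


def pumped_add(a, b, logical_bits=1024, datapath_bits=256):
    """Add two logical vectors lane by lane using a narrower datapath.

    Returns (result, passes), where passes is how many trips through the
    narrow unit were needed for one architectural instruction.
    """
    lanes = logical_bits // LANE_BITS
    lanes_per_pass = datapath_bits // LANE_BITS
    if len(a) != lanes or len(b) != lanes:
        raise ValueError(f"expected {lanes} lanes per operand")

    result = []
    passes = 0
    for start in range(0, lanes, lanes_per_pass):
        chunk_a = a[start:start + lanes_per_pass]
        chunk_b = b[start:start + lanes_per_pass]
        # Each pass handles one datapath-sized slice; wrap at 32 bits.
        result.extend((x + y) & 0xFFFFFFFF for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return result, passes
```

In this model a 256-bit datapath needs four passes per 1024-bit add and a 512-bit datapath needs two, which is the throughput cost the general-purpose cores would pay for ISA compatibility.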

I do think that would be an interesting design to experiment with.

I actually think the CPU and GPU meeting at the idea of SIMT would be very apropos. AVX-512/AVX10 has mask registers which work just like CUDA lanes in the sense of allowing lockstep iteration while masking off lanes where it “doesn’t happen” to preserve the illusion of thread individuality. With a mask register, an AVX lane is now a CUDA thread.
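
A small simulation of that lane-as-thread picture, in plain Python rather than real intrinsics. The `masked_op` and `simt_branch` helpers are hypothetical names; the point is only that both sides of a per-lane `if/else` run over every lane in lockstep, and the mask decides which lanes actually commit a result.

```python
def masked_op(values, mask, op):
    """Apply op only to lanes whose mask bit is set; other lanes pass through."""
    return [op(v) if m else v for v, m in zip(values, mask)]


def simt_branch(values):
    """Per-lane `if v < 0: v = -v else: v = v * 2`, executed in lockstep."""
    # The compare produces a mask, one bit per lane.
    mask = [v < 0 for v in values]
    inverse = [not m for m in mask]

    # "Then" side: issued for all lanes, committed only where mask is set.
    values = masked_op(values, mask, lambda v: -v)
    # "Else" side: issued for all lanes, committed only where mask is clear.
    values = masked_op(values, inverse, lambda v: v * 2)
    return values
```

For the input `[-3, 4, -1, 0]` this yields `[3, 8, 1, 0]`: each lane behaves as if it took its own branch, even though every lane stepped through the same two instructions.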

Obviously there are compromises in terms of bandwidth but it’s also a lot easier to mix into a broader program if you don’t have to send data across the bus, which also gives it other potential use-cases.

But, if you take the CUDA lane idea one step further and add Independent Thread Scheduling, you can also generalize the idea of these lanes having their own “independent” instruction pointer and flow, which means you’re free to reorder and speculate across the whole 1024b window, independently of your warp/execution width.

The optimization problem you solve is now to move all instruction pointers until they hit a threadfence, with the optimized/lowest-total-cost execution. And technically you may not know specifically where that fence is going to be! Things like self-modifying code are another headache, and they aren’t allowed in GPGPU either. There will certainly be some idioms that don’t translate well, but I think that stuff is thankfully rare in AVX code.

This is what is happening now with NPUs and other co-processors. It's just not fully OS-managed/directed yet, but Microsoft is most likely working on that part at least.

The key part is that there are now far more use cases than there were in the early Bulldozer days, and that the current main CPU design does not compromise on vector performance like the original AMD design did (outside of extreme cases of very wide vector instructions).

And they are also targeting new use cases such as edge compute AI rather than trying to push the industry to move traditional applications towards GPU compute with HSA.

I've had thoughts along the same lines, but this would require big changes in kernel schedulers, ELF to provide the information, and probably other things.
+1: Heterogeneous/non-uniform core configurations always require a lot of very complex adjustments to the kernel schedulers and core binding policies. Even now, after almost a decade of big.LITTLE configurations (from Arm) and/or chiplet designs (from AMD), (Linux) kernel scheduling still requires a lot of tuning for things like games. Adding cores with very different performance characteristics would probably require thread scheduling to be delegated to the CPU itself, with only hints from the kernel scheduler.
There are a couple of methods that could be used.

Static analysis would probably work in this case because the in-order core would be very GPU-like while the other core would not.

In cases where performance characteristics are closer, the OS could switch cores, monitor the runtimes, and add metadata about which core worked best (potentially even about which core worked best at which times).
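
As a sketch of the second method, assuming the OS can key measurements by some task identifier and time each run: try every core type once, store the runtimes as metadata, then prefer whichever core type has the lowest mean so far. The `PlacementHints` class and its method names are invented for this example and do not correspond to any real kernel interface.

```python
from collections import defaultdict
from statistics import mean


class PlacementHints:
    """Per-task runtime metadata used to choose a core type (hypothetical)."""

    def __init__(self, core_types=("fast", "wide")):
        self.core_types = tuple(core_types)
        # task -> core type -> list of observed runtimes in seconds
        self._samples = defaultdict(lambda: defaultdict(list))

    def record(self, task, core, runtime_s):
        """Store one observed runtime for a task on a given core type."""
        self._samples[task][core].append(runtime_s)

    def next_core(self, task):
        """Pick the core type to use for the task's next run."""
        tried = self._samples[task]
        # Exploration: make sure every core type has been measured once.
        for core in self.core_types:
            if not tried[core]:
                return core
        # Exploitation: lowest mean runtime observed so far wins.
        return min(self.core_types, key=lambda c: mean(tried[c]))
```

Usage would be a loop of `core = hints.next_core(task)`, run and time the task, then `hints.record(task, core, elapsed)`. Tracking which core worked best at which times, as suggested above, would need the samples to carry timestamps or phase labels, which this sketch leaves out.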