by dogma1138 746 days ago
AMD tried that with HSA in the past, and it doesn’t really work. Unless your CPU can magically offload vector processing to the GPU or another sub-processor, you are still reliant on new code to get this working, which means you break backward compatibility with previously compiled code.

The best-case scenario here is that the compiler does all the heavy lifting, but more realistically you’ll end up having to make developers switch to a whole new programming paradigm.


I understand that you can't convince developers to rewrite/recompile their applications for a processor that breaks compatibility. I'm wondering how many existing applications would be negatively impacted by cutting down vector throughput. With some searching, I see that some applications, like Firefox, make only mild use of it. However, there are applications that would be negatively affected, such as noise suppression in Microsoft Teams, and crypto acceleration in libssl and the Linux kernel. Acceleration of crypto functions seems essential enough to warrant not touching vector throughput, so it seems vector operations are here to stay in CPUs.
Modern hash table implementations use vector instructions for lookups:

- Folly: https://github.com/facebook/folly/blob/main/folly/container/...

- Abseil: https://abseil.io/about/design/swisstables
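To make the idea concrete, here is a minimal sketch of a group probe: compare a whole group of one-byte control tags against the wanted tag in one pass, then only inspect the slots that matched. This is not Folly's or Abseil's actual code. It uses a portable SIMD-within-a-register trick on a 64-bit word instead of real vector instructions, and the names `pack_group` and `match_byte` are made up for illustration.

```python
GROUP_WIDTH = 8                    # control bytes examined per probe step (assumed)
LSB = 0x0101010101010101           # 0x01 in every byte
LOW7 = 0x7F7F7F7F7F7F7F7F          # 0x7F in every byte
MASK64 = 0xFFFFFFFFFFFFFFFF        # Python ints are unbounded, so clamp to 64 bits


def pack_group(ctrl_bytes):
    """Pack 8 control bytes (each 0..255) into one 64-bit word, slot 0 lowest."""
    if len(ctrl_bytes) != GROUP_WIDTH:
        raise ValueError("a group holds exactly 8 control bytes")
    return int.from_bytes(bytes(ctrl_bytes), "little")


def match_byte(group_word, tag):
    """Return the slot indexes whose control byte equals tag (0..255).

    XOR turns every matching byte into 0x00. The expression below then sets
    the top bit of each zero byte, for all 8 bytes at once. Adding 0x7F to the
    low 7 bits never carries into the next byte, so there are no false hits.
    """
    x = group_word ^ (LSB * tag)
    zero_flags = ~(((x & LOW7) + LOW7) | x | LOW7) & MASK64
    return [i for i in range(GROUP_WIDTH) if (zero_flags >> (8 * i + 7)) & 1]


ctrl = [0x12, 0x80, 0x12, 0x00, 0x13, 0xFF, 0x12, 0x01]
candidates = match_byte(pack_group(ctrl), 0x12)   # slots 0, 2 and 6
```

A real table would then compare the full key only in the candidate slots, so most probes touch one cache line of tags and very few keys.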

Sure; but it’s hard to do and very few programs get optimised to this point. Before reaching for vector instructions, I’ll:

- Benchmark, and verify that the code is hot.

- Rewrite from Python, Ruby, JS into a systems language (if necessary). Honorary mention for C# / Go / Java, which are often fast enough.

- Change to better data structures. Bad data structure choices are still so common.

- Reduce heap allocations. They’re more expensive than you think, especially when you take into account the effect on the CPU cache.

Do those things well, and you can often get 3 or more orders of magnitude improved performance. At that point, is it worth reaching for SIMD intrinsics? Maybe. But I just haven’t written many programs where fast code written in a fast language (c, rust, etc) still wasn’t fast enough.
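As a rough illustration of "benchmark first, then fix the data structure", here is a small sketch that times the same membership count with a list and with a set. The function names and sizes are hypothetical, and the timings depend entirely on the machine, so treat the numbers as something to measure rather than something to quote.

```python
import timeit


def count_hits_list(haystack, needles):
    """Count needles present in haystack using a linear scan per lookup."""
    return sum(1 for n in needles if n in haystack)


def count_hits_set(haystack, needles):
    """Same result, but build a hash set once and do O(1) average lookups."""
    lookup = set(haystack)
    return sum(1 for n in needles if n in lookup)


def compare(size=2000, repeats=3):
    """Return (best_list_seconds, best_set_seconds) for one synthetic workload."""
    haystack = list(range(size))
    needles = list(range(0, 2 * size, 2))   # half of these are in haystack
    # Check that both versions agree before trusting any timing.
    if count_hits_list(haystack, needles) != count_hits_set(haystack, needles):
        raise RuntimeError("implementations disagree")
    t_list = min(timeit.repeat(lambda: count_hits_list(haystack, needles),
                               number=1, repeat=repeats))
    t_set = min(timeit.repeat(lambda: count_hits_set(haystack, needles),
                              number=1, repeat=repeats))
    return t_list, t_set


if __name__ == "__main__":
    t_list, t_set = compare()
    print(f"list: {t_list:.6f}s  set: {t_set:.6f}s")
```

If the measured gap is already large at this level, that is usually the cheaper win to take before thinking about intrinsics.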

I think it would be different if languages like rust had a high level wrapper around simd that gave you similar performance to hand written simd. But right now, simd is horrible to use and debug. And you usually need to write it per-architecture. Even Intel and AMD need different code paths because Intel has dropped AVX-512 from its consumer chips.

Outside generic tools like Unicode validation, JSON parsing and video decoding, I doubt modern simd gets much use. LLVM does what it can but ….

Indeed, people really fixate on “slow languages” but for all but the most demanding of applications, the right algorithm and data structures make the lion’s share of the difference.
Reaching for SIMD intrinsics or an abstraction has been historically quite painful in C and C++. But cross-platform SIMD abstractions in C#, Swift and Mojo are changing the picture. You can write a vectorized algorithm in C# and practically not lose performance versus hand-intrinsified C, and CoreLib heavily relies on that.
Newer SoCs come with co-processors such as NPUs so it’s just a question of how long it would take for those workloads to move there.

And this would highly depend on how ubiquitous they’ll become and how standardized the APIs will be so you won’t have to target IHV specific hardware through their own libraries all the time.

Basically we need a DirectX equivalent for general purpose accelerated compute.

It’s a lot more work to push data to a GPU or NPU than to just do a couple of vector ops. Crypto is important enough that many architectures have hardware accelerators just for that.
For servers no, but we’re talking about endpoints here. Also this isn’t only about reducing the existing vector bandwidth but also about not increasing it outside of dedicated co-processors.
I think the answer here is dedicated cores of different types on the same die.

Some cores will be high-performance, OoO CPU cores.

Now you make another core with the same ISA, but built for a different workload. It should be in-order. It should have a narrow ALU with fairly basic branch prediction. Most of the core will be occupied with two 1024-bit SIMD units and an 8-16x SMT implementation to hide the latency of the threads.

If your CPU and/or OS detects that a thread is packed with SIMD instructions, it will move the thread over to the wide, slow core with latency hiding. Normal threads with low SIMD instruction counts will be put through the high-performance CPU core.
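A minimal sketch of that placement rule, assuming the hardware or OS can sample how many of a thread's retired instructions were SIMD. The counter names, the `ThreadSample` type and the 0.5 threshold are all hypothetical; nothing here corresponds to an existing scheduler interface.

```python
from dataclasses import dataclass

SIMD_RATIO_THRESHOLD = 0.5   # assumed cut-off, not a measured value


@dataclass
class ThreadSample:
    """Hypothetical per-thread counters collected over one sampling window."""
    total_instructions: int
    simd_instructions: int


def pick_core(sample, threshold=SIMD_RATIO_THRESHOLD):
    """Return "wide" for SIMD-heavy threads and "fast" for everything else."""
    if sample.total_instructions == 0:
        return "fast"   # nothing retired yet: default to the OoO core
    ratio = sample.simd_instructions / sample.total_instructions
    return "wide" if ratio >= threshold else "fast"


print(pick_core(ThreadSample(total_instructions=1000, simd_instructions=800)))
print(pick_core(ThreadSample(total_instructions=1000, simd_instructions=50)))
```

A real policy would also need hysteresis so a thread that hovers around the threshold is not migrated back and forth every window.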

Different vector widths for different cores isn't currently feasible, even with SVE. So all cores would need to support 1024-bit SIMD.

I think it's reasonable for the non-SIMD focused cores to do so via splitting into multiple micro-ops or double/quadruple/whatever pumping.
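Roughly what double/quad pumping means, as a toy model: the architectural vector keeps its full length, but a narrower datapath walks over it in several passes. This is only a sketch with Python lists standing in for lanes; `pumped_add` and its parameters are made-up names, and real hardware would split into micro-ops rather than loop.

```python
def pumped_add(a, b, physical_lanes):
    """Add two equal-length vectors on a datapath only physical_lanes wide.

    Returns (result, passes). The result is the same for any datapath width;
    only the number of passes over the vector changes.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    if physical_lanes <= 0:
        raise ValueError("physical_lanes must be positive")
    result = []
    passes = 0
    for start in range(0, len(a), physical_lanes):
        chunk_a = a[start:start + physical_lanes]
        chunk_b = b[start:start + physical_lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return result, passes


# A 1024-bit vector of 32-bit elements has 32 lanes.
a = list(range(32))
b = [1] * 32
full_width = pumped_add(a, b, physical_lanes=32)   # one pass on the wide core
quad_pumped = pumped_add(a, b, physical_lanes=8)   # four passes on a 256-bit datapath
```

Both calls return the same 32 sums, which is the point: software sees one vector width, and each core type pays for it in passes instead of in silicon.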

I do think that would be an interesting design to experiment with.

I actually think the CPU and GPU meeting at the idea of SIMT would be very apropos. AVX-512/AVX10 has mask registers which work just like CUDA lanes in the sense of allowing lockstep iteration while masking off lanes where it “doesn’t happen” to preserve the illusion of thread individuality. With a mask register, an AVX lane is now a CUDA thread.
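Here is a small scalar model of that masking idea, assuming each list element is one lane. Every lane walks through both sides of a branch in lockstep, and the mask decides where results are actually written. `masked_op` is a hypothetical helper, not an AVX-512 or CUDA API.

```python
def masked_op(dst, mask, op, *srcs):
    """Apply op lane by lane; lanes whose mask bit is False keep dst's value."""
    return [op(*args) if m else d for d, m, *args in zip(dst, mask, *srcs)]


def collatz_step_lockstep(xs):
    """Per-lane `x // 2 if x is even else 3 * x + 1`, without per-lane branching."""
    even = [x % 2 == 0 for x in xs]     # the compare produces a mask
    odd = [not m for m in even]         # complemented mask for the else side
    xs = masked_op(xs, even, lambda x: x // 2, xs)       # "then" side
    xs = masked_op(xs, odd, lambda x: 3 * x + 1, xs)     # "else" side
    return xs


print(collatz_step_lockstep([6, 7, 8, 1]))   # [3, 22, 4, 4]
```

Each lane looks like its own little thread taking its own branch, but the instruction stream is shared and both sides always execute.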

Obviously there are compromises in terms of bandwidth but it’s also a lot easier to mix into a broader program if you don’t have to send data across the bus, which also gives it other potential use-cases.

But, if you take the CUDA lane idea one step further and add Independent Thread Scheduling, you can also generalize the idea of these lanes having their own “independent” instruction pointer and flow, which means you’re free to reorder and speculate across the whole 1024b window, independently of your warp/execution width.

The optimization problem you solve is now to move all instruction pointers until they hit a threadfence, with the optimized/lowest-total-cost execution. And technically you may not know where that fence is specifically going to be! Things like self-modifying code are another headache, and they aren’t allowed in GPGPU either. There will certainly be some idioms that don’t translate well, but I think that stuff is at least thankfully rare in AVX code.

This is what is happening now with NPUs and other co-processors. It’s just not fully OS managed / directed yet, but Microsoft is most likely working on that part at least.

The key part is that there are now far more use cases than there were in the early Bulldozer days, and that the current main CPU design does not compromise on vector performance like the original AMD design did (outside of extreme cases of very wide vector instructions).

And they are also targeting new use cases such as edge compute AI rather than trying to push the industry to move traditional applications towards GPU compute with HSA.

I've had thoughts along the same lines, but this would require big changes in kernel schedulers, ELF to provide the information, and probably other things.
+1: Heterogeneous/non-uniform core configurations always require a lot of very complex adjustment to the kernel schedulers and core binding policies. Even now, after almost a decade of big.LITTLE configurations (from Arm) and chiplet designs (from AMD), (Linux) kernel scheduling still requires a lot of tuning for things like games. Adding cores with very different performance characteristics would probably require thread scheduling to be delegated to the CPU itself, with only hints from the kernel scheduler.
There are a couple methods that could be used.

Static analysis would probably work in this case because the in-order core would be very GPU-like while the other core would not.

In cases where performance characteristics are closer, the OS could switch cores, monitor the runtimes, and add metadata about which core worked best (potentially even about which core worked best at which times).
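The "try both, remember what worked" approach could look something like this sketch. Everything here is hypothetical: the `CoreAffinityHints` class, the task names, and the idea that runtimes are recorded in seconds per core type. It only shows the bookkeeping, not how an OS would actually migrate threads.

```python
from collections import defaultdict
from statistics import mean


class CoreAffinityHints:
    """Remember observed runtimes per (task, core type) and suggest a core."""

    def __init__(self):
        # task name -> core type -> list of observed runtimes in seconds
        self._runtimes = defaultdict(lambda: defaultdict(list))

    def record(self, task, core_type, runtime_s):
        """Store one observed runtime for a task on a given core type."""
        self._runtimes[task][core_type].append(runtime_s)

    def best_core(self, task, default="fast"):
        """Return the core type with the lowest mean runtime seen so far."""
        samples = self._runtimes.get(task)
        if not samples:
            return default   # no history yet: fall back to the default core
        return min(samples, key=lambda core: mean(samples[core]))


hints = CoreAffinityHints()
hints.record("encode", "fast", 2.0)
hints.record("encode", "wide", 0.5)
hints.record("encode", "wide", 0.7)
print(hints.best_core("encode"))    # wide
print(hints.best_core("unknown"))   # fast
```

Keying the history by time of day or input size as well would cover the "which core worked best at which times" part, at the cost of more metadata to store.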

Persuading people to write their C++ as a graph for heterogeneous execution hasn't gone well. The machinery works though, and it's the right thing for heterogeneous compute, so should see adoption from XLA / pytorch etc.