Hacker News
by andrewia 746 days ago
It's interesting to see that modern processor optimization still revolves around balancing hardware for specific tasks. In this case, the vector scheduler has been separated from the integer scheduler, and the integer pipeline has been made much wider. I'm sure it made sense for this revision, but I wonder if things will change in a few generations and the pendulum will swing back to simplifying and integrating more parts of the arithmetic scheduler(s) and ALUs.

It's also interesting to see that FPGA integration hasn't gone far, and good vector performance is still important (if less important than integer). I wonder what percentage of consumer and professional workloads make significant use of vector operations, and how much GPU and FPGA offload would alleviate the need for good vector performance. I only know of vector operations in the context of multimedia processing, which is also suited for GPU acceleration.

5 comments

> good vector performance is still important (if less important than integer)

This is in part (major part IMHO) because few languages support vector operations as first class operators. We are still trapped in the tyranny that assumes a C abstract machine.

And so because so few languages support vectors, the instruction mix doesn’t emphasize them, therefore there’s less incentive to work on new language paradigms, and we remain trapped in a suboptimal loop.

I’m not claiming there are any villains here, we’re just stuck in a hill-climbing failure.

It’s not obvious that that’s what’s happened here. Eg vector scheduling is separated but there are more units for actually doing certain vector operations. It may be that lots of vector workloads are more limited by memory bandwidth than ILP so adding another port to the scheduler mightn’t add much. Being able to run other parts of the cpu faster when vectorised instructions aren’t being used could be worth a lot.
That matches with recent material I've read on vectorized workloads: memory bandwidth can become the limiting factor.
Always nice to see people rediscovering the roofline model.
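For anyone who hasn't met the roofline model mentioned above, here is a minimal sketch. The machine numbers (1000 GFLOP/s peak, 50 GB/s of memory bandwidth) are made up for illustration, and the function names are mine, not from any tool.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute bound."""
    return peak_gflops / bandwidth_gb_s


# Hypothetical machine, numbers chosen only for illustration.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 50.0

# float32 c[i] = a[i] + b[i]: 1 FLOP per 12 bytes moved (two loads, one store).
vector_add_intensity = 1 / 12

print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, vector_add_intensity))
print(ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_S))
```

On this imaginary machine a streaming vector add sits far to the left of the ridge point, so extra vector issue ports would not help it; only more bandwidth or more work per byte would.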
But isn’t that why we have things like CUDA? Who exactly is “we” here, people who only have access to CPUs? :)
I’m not saying that you cannot write vector code, but that it’s typically a special case. CUDA APIs and annotations are bolted onto existing languages rather than reflecting languages with vector operations as natural first class operations.

C or Java have no concept of `a + b` being a vector operation the way a language like, say, APL does. You can come closer in C++, but in the end the memory model of C and C++ hobbles you. FORTRAN is better in this regard.
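To make "`a + b` as a vector operation" concrete, here is a toy sketch in Python (picked only for brevity). The `Vec` class is hypothetical, and this says nothing about the machine code a real compiler would emit; it only shows the difference in what the source expresses.

```python
class Vec:
    """Hypothetical toy vector type where `a + b` means element-wise addition."""

    def __init__(self, *items):
        self.items = list(items)

    def __add__(self, other):
        if len(self.items) != len(other.items):
            raise ValueError("length mismatch")
        return Vec(*(x + y for x, y in zip(self.items, other.items)))

    def __eq__(self, other):
        return isinstance(other, Vec) and self.items == other.items

    def __repr__(self):
        return "Vec(" + ", ".join(repr(x) for x in self.items) + ")"


a = Vec(1, 2, 3)
b = Vec(10, 20, 30)

# Whole-vector form: the intent ("add these vectors") is visible in one expression.
print(a + b)  # Vec(11, 22, 33)

# Scalar-loop form: the same result, but the vector intent has to be rediscovered.
out = []
for i in range(len(a.items)):
    out.append(a.items[i] + b.items[i])
print(out)  # [11, 22, 33]
```

In the second form a compiler has to prove the loop is safe to vectorize (no aliasing, no loop-carried dependence) before it can use vector instructions.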

I see two options from this perspective.

It is always possible to inline assembler in C, and present vector operators as functions in a library.

Otherwise, R does perceive vectors, so another language that performs well might be a better choice. Julia comes to mind, but I have little familiarity with it.

With Java, linking the JRE via JNI would be an (ugly) option.

Makes sense. I guess that’s why some python libs use it under the hood
What about Rust?
When the data is generated on the CPU, shoveling it to the GPU to do a single or a few vector operations and then shoveling it back to the CPU to continue will most likely cost more time than it saves.

And CUDA is Nvidia specific.
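A back-of-the-envelope sketch of that trade-off. Every number here (CPU and GPU throughput, PCIe bandwidth, launch overhead) is an assumption picked for illustration, not a measurement of any real system.

```python
def cpu_time_s(n_bytes, passes, cpu_gb_s=20.0):
    """Time for the CPU to stream over the data `passes` times."""
    return passes * n_bytes / (cpu_gb_s * 1e9)


def gpu_time_s(n_bytes, passes, gpu_gb_s=500.0, pcie_gb_s=25.0,
               launch_overhead_s=10e-6):
    """Time to copy the data over, run `passes` passes on the GPU, and copy back."""
    transfer = 2 * n_bytes / (pcie_gb_s * 1e9)  # host to device, then device to host
    compute = passes * n_bytes / (gpu_gb_s * 1e9)
    return launch_overhead_s + transfer + compute


def offload_pays_off(n_bytes, passes):
    return gpu_time_s(n_bytes, passes) < cpu_time_s(n_bytes, passes)


one_mb = 1_000_000
print(offload_pays_off(one_mb, passes=1))   # False
print(offload_pays_off(one_mb, passes=10))  # True
```

With these assumed rates the round trip over the bus alone costs more per byte than one CPU pass, so a single operation never wins; offload only pays once enough work is done per byte transferred.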

Doesn’t CUDA also let you execute on the CPU? I wonder how efficiently.
No - a CUDA program consists of parts that run on the CPU as well as on the GPU, but the CPU (aka host) code is just orchestrating the process - allocating memory, copying data to/from the GPU, and queuing CUDA kernels to run on the GPU. All the work (i.e. running kernels) is done on the GPU.

There are other libraries (e.g. OpenMP, Intel's oneAPI) and languages (e.g. SYCL) that do let the same code be run on either CPU or GPU.

When you use a GPU, you are using a different processor with a different ISA, running its own barebones OS, with which you communicate mostly by pushing large blocks of memory through the PCIe bus. It’s a very different feel from, say, adding AVX512 instructions to your program flow.
The CPU vector performance is important for throughput-oriented processing of data e.g. databases. A powerful vector implementation gives you most of the benefits of an FPGA for a tiny fraction of the effort but has fewer limitations than a GPU. This hits a price-performance sweet spot for a lot of workloads and the CPU companies have been increasingly making this a first-class "every day" feature of their processors.
AMD tried that with HSA in the past; it doesn’t really work. Unless your CPU can magically offload vector processing to the GPU or another sub-processor, you are still reliant on new code to get this working, which means you break backward compatibility with previously compiled code.

The best case scenario here is if you can have the compiler do all the heavy lifting but more realistically you’ll end up having to make developers switch to a whole new programming paradigm.

I understand that you can't convince developers to rewrite/recompile their applications for a processor that breaks compatibility. I'm wondering how many existing applications would be negatively impacted by cutting down vector throughput. With some searching, I see that some applications, like Firefox, make mild use of it. However, there are applications that would be negatively affected, such as noise suppression in Microsoft Teams, and crypto acceleration in libssl and the Linux kernel. Acceleration of crypto functions seems essential enough to warrant not touching vector throughput, so it seems vector operations are here to stay in CPUs.
Modern hash table implementations use vector instructions for lookups:

- Folly: https://github.com/facebook/folly/blob/main/folly/container/...

- Abseil: https://abseil.io/about/design/swisstables
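A rough sketch of the idea behind those lookups, emulated in plain Python. Swiss-table style designs keep one control byte per slot (a 7-bit hash fragment for full slots, a marker with the high bit set for empty ones) and compare a whole group of control bytes against the probe's fragment at once. The loop in `match_mask` stands in for what a byte-wise vector compare plus a movemask would do in one or two instructions; the helper names and the toy data are assumptions, not taken from either library.

```python
EMPTY = 0x80       # control byte for an empty slot (high bit set)
GROUP_WIDTH = 16   # number of control bytes compared per probe


def match_mask(ctrl_group, h2):
    """Emulate byte-wise compare + movemask: bit i is set if ctrl_group[i] == h2."""
    mask = 0
    for i, c in enumerate(ctrl_group):
        if c == h2:
            mask |= 1 << i
    return mask


def candidates(mask):
    """Yield the indices of set bits, lowest first."""
    while mask:
        low = mask & -mask
        yield low.bit_length() - 1
        mask ^= low


def find(ctrl_group, slots, key, h2):
    """Return the slot index holding `key`, or None if it is not in this group."""
    for i in candidates(match_mask(ctrl_group, h2)):
        if slots[i] == key:  # full key compare only on fragment matches
            return i
    return None


ctrl = [EMPTY] * GROUP_WIDTH
slots = [None] * GROUP_WIDTH
ctrl[3], slots[3] = 0x2A, "apple"
ctrl[9], slots[9] = 0x2A, "pear"   # same 7-bit fragment, different key

print(bin(match_mask(ctrl, 0x2A)))      # 0b1000001000
print(find(ctrl, slots, "pear", 0x2A))  # 9
```

The point is that most non-matching slots are rejected by the one-byte compare, so the expensive full key comparison runs only on the few candidates.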

Sure; but it’s hard to do and very few programs get optimised to this point. Before reaching for vector instructions, I’ll:

- Benchmark, and verify that the code is hot.

- Rewrite from Python, Ruby, JS into a systems language (if necessary). Honorary mention for C# / Go / Java, which are often fast enough.

- Change to better data structures. Bad data structure choices are still so common.

- Reduce heap allocations. They’re more expensive than you think, especially when you take into account the effect on the cpu cache

Do those things well, and you can often get 3 or more orders of magnitude improved performance. At that point, is it worth reaching for SIMD intrinsics? Maybe. But I just haven’t written many programs where fast code written in a fast language (c, rust, etc) still wasn’t fast enough.

I think it would be different if languages like rust had a high level wrapper around simd that gave you similar performance to hand written simd. But right now, simd is horrible to use and debug. And you usually need to write it per-architecture. Even Intel and amd need different code paths because Intel has dumped avx512 on its consumer chips.

Outside generic tools like Unicode validation, json parsing and video decoding, I doubt modern simd gets much use. Llvm does what it can but ….

Indeed, people really fixate on “slow languages” but for all but the most demanding of applications, the right algorithm and data structures make the lion’s share of the difference.
Reaching for SIMD intrinsics or an abstraction has been historically quite painful in C and C++. But cross-platform SIMD abstractions in C#, Swift and Mojo are changing the picture. You can write a vectorized algorithm in C# and practically not lose performance versus hand-intrinsified C, and CoreLib heavily relies on that.
Newer SoCs come with co-processors such as NPUs so it’s just a question of how long it would take for those workloads to move there.

And this would highly depend on how ubiquitous they’ll become and how standardized the APIs will be so you won’t have to target IHV specific hardware through their own libraries all the time.

Basically we need a DirectX equivalent for general purpose accelerated compute.

It’s a lot more work to push data to a GPU or NPU than to just do a couple of vector ops. Crypto is important enough that many architectures have hardware accelerators just for that.
For servers no, but we’re talking about endpoints here. Also this isn’t only about reducing the existing vector bandwidth but also about not increasing it outside of dedicated co-processors.
I think the answer here is dedicated cores of different types on the same die.

Some cores will be high-performance, OoO CPU cores.

Now you make another core with the same ISA, but built for a different workload. It should be in-order. It should have a narrow ALU with fairly basic branch prediction. Most of the core will be occupied with two 1024-bit SIMD units and an 8-16x SMT implementation to hide the latency of the threads.

If your CPU and/or OS detects that a thread is packed with SIMD instructions, it will move the thread over to the wide, slow core with latency hiding. Normal threads with low SIMD instruction counts will be put through the high-performance CPU core.
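A minimal sketch of that placement rule, assuming the hardware exposes per-thread counters for retired SIMD and total instructions. The counter inputs, the 0.5 threshold and the core names are all hypothetical.

```python
def pick_core(simd_instructions, total_instructions, threshold=0.5):
    """Return "wide" for SIMD-heavy threads, "fast" for everything else."""
    if total_instructions == 0:
        return "fast"  # no data yet: default to the high-performance core
    simd_fraction = simd_instructions / total_instructions
    return "wide" if simd_fraction >= threshold else "fast"


print(pick_core(simd_instructions=900, total_instructions=1000))  # wide
print(pick_core(simd_instructions=20, total_instructions=1000))   # fast
```

A real scheduler would also need hysteresis so that threads near the threshold do not bounce between cores.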

Different vector widths for different cores isn't currently feasible, even with SVE. So all cores would need to support 1024-bit SIMD.

I think it's reasonable for the non-SIMD focused cores to do so via splitting into multiple micro-ops or double/quadruple/whatever pumping.
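As a sketch of what "pumping" means: the architectural vector stays the same width, and a core with a narrower datapath processes it in several passes. The function below is an emulation with lists standing in for registers; the lane counts assume 32-bit elements.

```python
def pumped_add(a, b, native_lanes):
    """Add two equal-length vectors in passes of at most `native_lanes` lanes.

    Returns the result and the number of passes (micro-ops) it took.
    """
    if len(a) != len(b):
        raise ValueError("length mismatch")
    out = []
    passes = 0
    for start in range(0, len(a), native_lanes):
        stop = start + native_lanes
        out.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
        passes += 1
    return out, passes


# A 1024-bit vector of 32-bit elements has 32 lanes.
a = list(range(32))
b = [1] * 32

_, passes_256 = pumped_add(a, b, native_lanes=8)    # 256-bit datapath
_, passes_512 = pumped_add(a, b, native_lanes=16)   # 512-bit datapath
print(passes_256, passes_512)  # 4 2
```

Software sees the same 1024-bit result either way; only throughput differs between the core types.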

I do think that would be an interesting design to experiment with.

I actually think the CPU and GPU meeting at the idea of SIMT would be very apropos. AVX-512/AVX10 has mask registers which work just like CUDA lanes in the sense of allowing lockstep iteration while masking off lanes where it “doesn’t happen” to preserve the illusion of thread individuality. With a mask register, an AVX lane is now a CUDA thread.
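A small emulation of that "lane as thread" illusion, with lists standing in for vector registers. Each lane behaves as if it ran its own `if`/`else`, but every lane steps through the same instructions and a per-lane mask decides which result sticks. This sketch computes both branches and blends; real masked instructions can instead suppress the operation on inactive lanes, with the same visible result. The function names are mine.

```python
def masked_select(mask, if_true, if_false):
    """Per-lane blend: take if_true[i] where mask[i] is set, else if_false[i]."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def negate_or_double(xs):
    """Per-lane logic `y = -x if x < 0 else 2 * x`, executed in lockstep."""
    mask = [x < 0 for x in xs]         # one predicate bit per lane, like a mask register
    then_branch = [-x for x in xs]     # computed for every lane
    else_branch = [2 * x for x in xs]  # computed for every lane
    return masked_select(mask, then_branch, else_branch)


print(negate_or_double([-3, 1, 0, -7]))  # [3, 2, 0, 7]
```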

Obviously there are compromises in terms of bandwidth but it’s also a lot easier to mix into a broader program if you don’t have to send data across the bus, which also gives it other potential use-cases.

But, if you take the CUDA lane idea one step further and add Independent Thread Scheduling, you can also generalize the idea of these lanes having their own “independent” instruction pointer and flow, which means you’re free to reorder and speculate across the whole 1024b window, independently of your warp/execution width.

The optimization problem you solve is now to move all instruction pointers until they hit a threadfence, with the optimized/lowest-total-cost execution. And technically you may not know where that fence is specifically going to be! Things like self-modifying code are another headache, and they aren’t allowed in GPGPU either. There certainly will be some idioms that don’t translate well, but I think that stuff is at least thankfully rare in AVX code.

This is what’s happening now with NPUs and other co-processors. It’s just not fully OS managed / directed yet, but Microsoft is most likely working on that part at least.

The key part is that now there are far more use cases than there were in the early dozer days and that the current main CPU design does not compromise on vector performance like the original AMD design did (outside of extreme cases of very wide vector instructions).

And they are also targeting new use cases such as edge compute AI rather than trying to push the industry to move traditional applications towards GPU compute with HSA.

I've had thoughts along the same lines, but this would require big changes in kernel schedulers, ELF to provide the information, and probably other things.
+1: Heterogeneous/non-uniform core configurations always require a lot of very complex adjustments to the kernel schedulers and core binding policies. Even now, after almost a decade of big.LITTLE (from Arm) configurations and/or chiplet designs (from AMD), (Linux) kernel scheduling still requires a lot of tuning for things like games etc. Adding cores with very different performance characteristics would probably require thread scheduling to be delegated to the CPU itself, with only hints from the kernel scheduler.
There are a couple methods that could be used.

Static analysis would probably work in this case because the in-order core would be very GPU-like while the other core would not.

In cases where performance characteristics are closer, the OS could switch cores, monitor the runtimes, and add metadata about which core worked best (potentially even about which core worked best at which times).
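A toy sketch of the monitor-and-remember approach. The class, the core names and the idea of keying on a task name are all assumptions for illustration; a real scheduler would work with very different data and constraints.

```python
from collections import defaultdict


class CorePlacement:
    """Hypothetical bookkeeping: remember runtimes per (task, core kind)."""

    def __init__(self, core_kinds=("fast", "wide")):
        self.core_kinds = tuple(core_kinds)
        self.samples = defaultdict(list)

    def record(self, task, core_kind, runtime_s):
        """Store one measured runtime for this task on this kind of core."""
        self.samples[(task, core_kind)].append(runtime_s)

    def choose(self, task):
        """Try each core kind once, then prefer the lowest mean runtime."""
        for kind in self.core_kinds:
            if not self.samples[(task, kind)]:
                return kind

        def mean_runtime(kind):
            runs = self.samples[(task, kind)]
            return sum(runs) / len(runs)

        return min(self.core_kinds, key=mean_runtime)


placement = CorePlacement()
placement.record("encode", "fast", 2.0)
placement.record("encode", "wide", 1.0)
print(placement.choose("encode"))  # wide
```

Keeping the samples as a list rather than a single best value leaves room for the time-of-day style metadata mentioned above.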

Persuading people to write their C++ as a graph for heterogeneous execution hasn't gone well. The machinery works though, and it's the right thing for heterogeneous compute, so should see adoption from XLA / pytorch etc.
As CPU cores get larger and larger it makes sense to always keep looking for opportunities to decouple things. AMD went with separate schedulers in the Athlon three architectural overhauls ago and hasn't reversed their decision.
> It's interesting to see that modern processor optimization still revolves around balancing hardware for specific tasks

Asking sincerely: what’s specifically so interesting about that? That is what I would naively expect.

It's also important to note that in modern hardware the processor core proper is just one piece in a very large system.

Hardware designers are adding a lot of speciality hardware, they're just not putting it into the core, which also makes a lot of sense.

https://www.researchgate.net/figure/Architectural-specializa...