Hacker News new | ask | show | jobs
by andrewia 746 days ago
I understand that you can't convince developers to rewrite/recompile their applications for a processor that breaks compatibility. I'm wondering how many existing applications would be negatively impacted by cutting down vector throughput. With some searching, I see that some applications make mild use of it like Firefox. However there are applications that would negatively affected, such as noise suppression in Microsoft Teams, and crypto acceleration in libssl and the Linux kernel. Acceleration of crypto functions seems essential enough to warrant not touching vector throughput, so it seems vector operations are here to stay in CPUs.
2 comments

Modern hash table implementations use vector instructions for lookups:

- Folly: https://github.com/facebook/folly/blob/main/folly/container/...

- Abseil: https://abseil.io/about/design/swisstables

Sure; but it’s hard to do and very few programs get optimised to this point. Before reaching for vector instructions, I’ll:

- Benchmark, and verify that the code is hot.

- Rewrite from Python, Ruby, JS into a systems language (if necessary). Honorary mention for C# / Go / Java, which are often fast enough.

- Change to better data structures. Bad data structure choices are still so common.

- Reduce heap allocations. They’re more expensive than you think, especially when you take into account the effect on the cpu cache

Do those things well, and you can often get 3 or more orders of magnitude improved performance. At that point, is it worth reaching for SIMD intrinsics? Maybe. But I just haven’t written many programs where fast code written in a fast language (c, rust, etc) still wasn’t fast enough.

I think it would be different if languages like rust had a high level wrapper around simd that gave you similar performance to hand written simd. But right now, simd is horrible to use and debug. And you usually need to write it per-architecture. Even Intel and amd need different code paths because Intel has dumped avx2.

Outside generic tools like Unicode validation, json parsing and video decoding, I doubt modern simd gets much use. Llvm does what it can but ….

Indeed, people really fixate on “slow languages” but for all but the most demanding of applications, the right algorithm and data structures makes the lions share of the difference.
Reaching for SIMD intrinsics or an abstraction has been historically quite painful in C and C++. But cross-platform SIMD abstractions in C#, Swift and Mojo are changing the picture. You can write a vectorized algorithm in C# and practically not lose performance versus hand-intrinsified C, and CoreLib heavily relies on that.
Newer SoCs come with co-processors such as NPUs so it’s just a question of how long it would take for those workloads to move there.

And this would highly depend on how ubiquitous they’ll become and how standardized the APIs will be so you won’t have to target IHV specific hardware through their own libraries all the time.

Basically we need a DirectX equivalent for general purpose accelerated compute.

It’s a lot more work to push data to a GPU or NPU than to just to a couple vector ops. Crypto is important enough many architectures have hardware accelerators just for that.
For servers no, but we’re talking about endpoints here. Also this isn’t only about reducing the existing vector bandwidth but also about not increasing it outside of dedicated co-processors.