| HN Mirror



	by mgaunard 35 days ago
	For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.

2 comments

jandrewrogers 35 days ago

For some algorithms you have to compromise the data layout for compatibility across the widest range of microarchitectures, which means nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap between those two implementations can be vast.

In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people who use SIMD are people who care about state-of-the-art performance, so portability takes a distant back seat.


mgaunard 35 days ago

For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.

That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.

We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
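A minimal sketch of what overload dispatching on instruction-set tags can look like. This is not Boost.SIMD's or Eve's actual code: the tag names, `impl`, `add_impl` and `add` are all hypothetical, and the "SSE2" overload is a plain blocked loop standing in for intrinsics so the example stays portable.

```cpp
#include <cstddef>

// Hypothetical ISA tags. Inheritance orders them from generic to specialized.
struct scalar_tag {};
struct sse2_tag : scalar_tag {};
struct avx2_tag : sse2_tag {};

// Reports which implementation ran, so the dispatch is observable.
enum class impl { scalar, sse2 };

// Generic fallback: reachable from every tag via derived-to-base conversion.
inline impl add_impl(scalar_tag, const float* a, const float* b,
                     float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
    return impl::scalar;
}

// Specialized overload for sse2_tag and any tag derived from it.
// A real library would use intrinsics here; this stand-in processes
// blocks of 4 elements followed by a scalar tail.
inline impl add_impl(sse2_tag, const float* a, const float* b,
                     float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) out[i + k] = a[i + k] + b[i + k];
    for (; i < n; ++i) out[i] = a[i] + b[i];
    return impl::sse2;
}

// Entry point: overload resolution picks the closest matching base tag.
template <class Tag>
impl add(Tag tag, const float* a, const float* b, float* out, std::size_t n) {
    return add_impl(tag, a, b, out, n);
}
```

Calling `add(avx2_tag{}, ...)` selects the `sse2_tag` overload because no AVX2-specific one exists and `sse2_tag` is the nearest base. Adding a more specialized implementation later only requires one more overload.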


mattip 35 days ago

NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
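To make the bloat concern concrete, here is a rough sketch of runtime dispatch through a function pointer. It is not NumPy's mechanism. The function names are hypothetical, `sum_wide` is a portable stand-in for a kernel compiled for a wider ISA, and the CPU check assumes GCC or Clang on x86 (elsewhere it simply reports no AVX2). Every extra target you support means one more compiled copy of each kernel.

```cpp
#include <cstddef>

using kernel_fn = float (*)(const float*, std::size_t);

// Baseline copy of the kernel: one accumulator, works everywhere.
float sum_baseline(const float* x, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += x[i];
    return total;
}

// Stand-in for a copy compiled for a wider ISA: 8 independent accumulators,
// then a scalar tail. In a real build this would live in its own translation
// unit compiled with the matching target flags.
float sum_wide(const float* x, std::size_t n) {
    float acc[8] = {};
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (std::size_t k = 0; k < 8; ++k) acc[k] += x[i + k];
    float total = 0.0f;
    for (std::size_t k = 0; k < 8; ++k) total += acc[k];
    for (; i < n; ++i) total += x[i];
    return total;
}

// Assumption: GCC/Clang on x86. Other toolchains fall back to "no AVX2".
bool cpu_has_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

kernel_fn select_sum() { return cpu_has_avx2() ? sum_wide : sum_baseline; }

// Public entry point: the selection runs once, on first call.
float sum(const float* x, std::size_t n) {
    static const kernel_fn fn = select_sum();
    return fn(x, n);
}
```

The two kernels can give slightly different results on general floating-point input because they add in a different order, which is another cost of shipping several variants.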


hansvm 35 days ago

NumPy is interesting in that regard since its dispatch mechanism adds up to a lot of overhead. There are a lot of problems where a naive list comprehension is faster, even when SIMD could be used to great effect.


camel-cdr 35 days ago

The data layout can often be done dynamically based on your target architecture.
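One way to read this, as a sketch only: store the data in blocks whose width is picked at runtime from the target's vector size (an array-of-structures-of-arrays layout). `PointsAoSoA`, `make_points` and `sum_xy` are hypothetical names, and the vector size is passed in by hand here rather than detected.

```cpp
#include <cstddef>
#include <vector>

// 2D points stored as blocks of `lanes` x values followed by `lanes` y values.
// The block width is a runtime value, so one binary can adapt the layout.
class PointsAoSoA {
public:
    // `lanes` must be non-zero. Storage is padded with zeros to whole blocks.
    PointsAoSoA(std::size_t count, std::size_t lanes)
        : count_(count), lanes_(lanes),
          data_((count + lanes - 1) / lanes * lanes * 2, 0.0f) {}

    float& x(std::size_t i) { return data_[block_start(i) + i % lanes_]; }
    float& y(std::size_t i) { return data_[block_start(i) + lanes_ + i % lanes_]; }

    std::size_t size() const { return count_; }
    std::size_t lanes() const { return lanes_; }
    std::size_t storage_floats() const { return data_.size(); }
    const float* data() const { return data_.data(); }

private:
    // Index of the first float of the block that holds point i.
    std::size_t block_start(std::size_t i) const { return i / lanes_ * lanes_ * 2; }

    std::size_t count_;
    std::size_t lanes_;
    std::vector<float> data_;
};

// Sum of x*y over all points. Padding lanes are zero, so they add nothing.
inline float sum_xy(const PointsAoSoA& p) {
    const std::size_t w = p.lanes();
    const float* d = p.data();
    float total = 0.0f;
    for (std::size_t b = 0; b < p.storage_floats(); b += 2 * w)
        for (std::size_t k = 0; k < w; ++k)
            total += d[b + k] * d[b + w + k];
    return total;
}

// Hypothetical helper: derive the block width from a vector size in bytes
// (for example 16 for SSE-class registers, 32 for AVX-class registers).
inline PointsAoSoA make_points(std::size_t count, std::size_t vector_bytes) {
    return PointsAoSoA(count, vector_bytes / sizeof(float));
}
```

The catch is that every kernel touching the container has to be written against the runtime width, and data persisted in one layout needs converting for another.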


portly 35 days ago

Don't let the best be the enemy of the good. I got amazing performance by swapping for-loops for some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
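As an illustration of the kind of simple pattern meant here (my example, not the commenter's code): a `y = a*x + y` loop rewritten as a 4-wide SSE main loop plus a scalar tail. The function name `saxpy` is just a conventional label. It assumes a compiler that defines `__SSE__` on x86 targets; everywhere else only the scalar loop is compiled.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Computes y[i] = a * x[i] + y[i] for i in [0, n).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    // Main loop: 4 floats per iteration, using unaligned loads and stores.
    const __m128 va = _mm_set1_ps(a);
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);
        const __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
#endif
    // Scalar tail (or the whole loop when SSE is not available).
    for (; i < n; ++i) y[i] = a * x[i] + y[i];
}
```

The shape is always the same: broadcast constants once, process full vectors in the hot loop, finish the leftovers with scalar code.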


imtringued 35 days ago

The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.

If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some cross-batch operations. Then the compiler can use this mini-DSL to lower your SIMD code to actual instructions.

The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function-local, variable-size data types that are never exposed in the ABI. The compiler developers must increase the quality of their auto-vectorizers.
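A rough sketch of what such a width-agnostic batch type could look like in today's C++. `batch` and `dot` are hypothetical names. The width `N` is a template parameter here, so it is still fixed at compile time per instantiation, which is weaker than the truly size-less type described above. The kernel itself never names a concrete width, and the lane loops are left for the compiler to vectorize.

```cpp
#include <array>
#include <cstddef>

// Hypothetical batch of N lanes with element-wise and cross-lane operations.
template <class T, std::size_t N>
struct batch {
    std::array<T, N> lane{};

    static batch load(const T* p) {
        batch b;
        for (std::size_t i = 0; i < N; ++i) b.lane[i] = p[i];
        return b;
    }

    friend batch operator+(batch a, const batch& b) {
        for (std::size_t i = 0; i < N; ++i) a.lane[i] += b.lane[i];
        return a;
    }

    friend batch operator*(batch a, const batch& b) {
        for (std::size_t i = 0; i < N; ++i) a.lane[i] *= b.lane[i];
        return a;
    }

    // Cross-lane operation: horizontal sum of all lanes.
    T reduce_add() const {
        T total{};
        for (std::size_t i = 0; i < N; ++i) total += lane[i];
        return total;
    }
};

// Dot product written once against the batch interface, for any width N.
template <std::size_t N>
float dot(const float* a, const float* b, std::size_t n) {
    using vec = batch<float, N>;
    vec acc;
    std::size_t i = 0;
    for (; i + N <= n; i += N) acc = acc + vec::load(a + i) * vec::load(b + i);
    float total = acc.reduce_add();
    for (; i < n; ++i) total += a[i] * b[i];  // scalar tail
    return total;
}
```

Instantiating `dot<4>` or `dot<8>` runs the same source with a different batch width, which is the part a library can do now without language changes.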


SkiFire13 35 days ago

> The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.

This will work only for the most basic SIMD usages.

> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.

This will take decades because you cannot change existing architectures/processors.


camel-cdr 35 days ago

> This will take decades because you cannot change existing architectures/processors.

I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful baseline you can target. But this will take a lot of time.


SkiFire13 35 days ago

> AVX-512

Which subset though? Some of them are not supported even by recent CPUs (e.g. from 2024).

Not to mention Alder Lake not supporting AVX-512.


sgerenser 35 days ago

Yeah, AVX-512 is basically dead as a universal target for x86; the future is now AVX10. But I believe there is a reasonable subset that will work on both.


janwas 35 days ago

This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.


vrighter 35 days ago

What you effectively said is "there should be only one ISA".

Because if that were all it took, why wouldn't it also apply to every other instruction set too?
