Hacker News
by portly 29 days ago
Don't let the best be the enemy of the good. I got amazing performance by swapping for-loops for some simple SIMD patterns. Moreover, by doing this I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.

The problem is that you're better off defining SIMD-friendly data structures and letting the compiler figure it out than hand-coding the actual SIMD operations.
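
To make "SIMD-friendly data structures" concrete, here is a minimal sketch, assuming a structure-of-arrays layout is what is meant. The `Particles` struct and `integrate` function are hypothetical names for illustration. Whether the loop actually gets vectorized depends on the compiler, flags, and target, so treat it as the shape to aim for rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays layout (hypothetical example type): each field is
// contiguous in memory, so a loop over one field reads consecutive floats.
struct Particles {
    std::vector<float> x, y, vx, vy;
};

// Advance every particle by dt. No intrinsics: the loop body is branch-free
// and iterations are independent, which is the kind of loop auto-vectorizers
// are usually able to handle.
void integrate(Particles& p, float dt) {
    const std::size_t n = p.x.size();
    assert(p.y.size() == n && p.vx.size() == n && p.vy.size() == n);

    float* x = p.x.data();
    float* y = p.y.data();
    const float* vx = p.vx.data();
    const float* vy = p.vy.data();

    for (std::size_t i = 0; i < n; ++i) {
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
    }
}
```

With an array-of-structs layout (`struct { float x, y, vx, vy; }` in one vector), the same loop would touch strided memory, which is typically harder for the compiler to turn into full-width loads and stores.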

If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some cross-batch operations. Then the compiler can make use of this mini DSL to lower your SIMD code to actual instructions.
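
A rough sketch of what such a type could look like in today's C++, with the caveat that the lane count here is a compile-time constant rather than a truly variable-size type. `Batch`, `BATCH_LANES`, `load`, `add`, `mul`, `reduce_sum`, and `dot` are all made-up names for illustration, not an existing library's API. The point is only that the calling code never mentions a register width.

```cpp
#include <cstddef>

// Hypothetical knob: the lane count is chosen by the build, not by call sites.
// A real implementation would derive it from the target's vector width.
#ifndef BATCH_LANES
#define BATCH_LANES 8
#endif

// An opaque "register" of floats; user code only refers to Batch::lanes.
struct Batch {
    static constexpr std::size_t lanes = BATCH_LANES;
    float v[lanes];
};

inline Batch load(const float* p) {
    Batch b;
    for (std::size_t i = 0; i < Batch::lanes; ++i) b.v[i] = p[i];
    return b;
}

inline Batch add(Batch a, Batch b) {
    for (std::size_t i = 0; i < Batch::lanes; ++i) a.v[i] += b.v[i];
    return a;
}

inline Batch mul(Batch a, Batch b) {
    for (std::size_t i = 0; i < Batch::lanes; ++i) a.v[i] *= b.v[i];
    return a;
}

// Cross-batch (horizontal) operation: collapse all lanes into one value.
inline float reduce_sum(Batch a) {
    float s = 0.0f;
    for (std::size_t i = 0; i < Batch::lanes; ++i) s += a.v[i];
    return s;
}

// Dot product written only in terms of Batch; works for any lane count.
float dot(const float* a, const float* b, std::size_t n) {
    Batch acc{};  // all lanes zero-initialized
    std::size_t i = 0;
    for (; i + Batch::lanes <= n; i += Batch::lanes)
        acc = add(acc, mul(load(a + i), load(b + i)));
    float sum = reduce_sum(acc);
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail for leftovers
    return sum;
}
```

Changing `BATCH_LANES` changes how the work is chunked without touching `dot`, which is the property being asked for. Note that the summation order differs from a plain scalar loop, so floating-point results can differ slightly in general.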

The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function-local, variable-size data types that are never exposed in the ABI. The compiler developers must increase the quality of their auto-vectorizers.

> The problem is that you're better off defining SIMD-friendly data structures and letting the compiler figure it out than hand-coding the actual SIMD operations.

This will work only for the most basic uses of SIMD.

> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.

This will take decades because you cannot change existing architectures/processors.

> This will take decades because you cannot change existing architectures/processors.

I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful baseline you can target. But this will take a lot of time.

> AVX-512

Which subset, though? Some subsets are not supported even by some recent CPUs (e.g., ones from 2024).

Not to mention Alder Lake not supporting AVX-512.

Yeah, AVX-512 is basically dead as a universal target for x86; the future is now AVX10. But I believe there is a reasonable subset that will work on both.

It's a little dramatic to say AVX-512 is dead versus AVX10; rather, I would say that AVX10 finalizes a universally available set of AVX-512 extensions. For AVX10.1, there's essentially no difference after Intel backed out of reducing the vector length.

For at least the next decade, AVX-512 will be the high-performance target, reaching all of the Zen 4/5/6 CPUs as well as whatever AVX10-enabled CPUs Intel produces.

This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.

What you effectively said is "there should be only one ISA".

Because if that was all it took, why wouldn't it also apply to every other instruction set too?