There are dozens of libraries, frameworks, and compiler toolchains that try to abstract away SIMD capabilities, but I don't think it's a great approach.
The only 2 approaches that still make sense to me:
A. Writing serial vectorization-aware code in a native compiled language, hoping your compiler will auto-vectorize.
B. Implementing natively for every hardware platform, as the ISA differences are too big to efficiently abstract away anything beyond 128-bit register float multiplication and addition.
This article is, in a way, an attempt to show how big the differences are even for simple data-parallel floating-point tasks.
Numerics in .NET are not a high-level abstraction and do out of the box what many mature vectorized libraries end up doing themselves - there is significant overlap between NEON, SSE* and, if we overlook vector width, AVX2/512 and WASM's PackedSIMD.
.NET has roughly three vector APIs:
- Vector<T>, a platform-defined-width vector that exposes a common set of operations
- Vector64/128/256/512<T>, which have a wider API than the previous one
- Platform intrinsics - basically immintrin.h
Notably, platform intrinsics use the respective VectorXXX<T> types, which allows writing the common parts of an algorithm in a portable way and applying platform intrinsics only in the specific areas where it makes sense. Also, some methods have 'Unsafe' and 'Native' variants that let a vector operation exhibit platform-specific behavior (shuffles, for example), since in many situations this is still the desired output for the common case.
.NET's compiler produces codegen for these that is competitive with GCC and sometimes Clang. It's gotten particularly good at lowering AVX512.
I will respectfully disagree with your statement, with the caveat that I mostly dabble in arithmetic with 128b/256b float and int vectors.
Using C or C++ with vector extensions (GCC/Clang) or Rust (nightly) std::simd is very easy, and you get code that is portable to different CPUs and ISAs.
But most importantly, they have a zero-cost fallback to CPU-specific intrinsics when you need them. An f32x8 can be passed at zero cost as __m256 to any core::arch::x86_64 _mm256_* intrinsic (or the immintrin.h equivalents in C++ land).
You gain portable arithmetic, swizzles, and SIMD vector types, and lose nothing. Not having to write everything twice for x86_64 and aarch64 is a huge win even if it doesn't quite cover everything.
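To make the "portable first, intrinsics where needed" idea concrete, here is a minimal sketch in C with GCC/Clang vector extensions. The names f32x4, i32x4 and lanes_less_than are made up for illustration. The comparison is written once with a portable operator; only the step that packs the lane results into a bitmask uses an x86 intrinsic, and the cast to __m128 is a plain reinterpretation of the same 16 bytes. Other targets take a generic fallback.

```c
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_movemask_ps, __m128 */
#endif

/* Hypothetical type names; any 16-byte vector typedef works the same way. */
typedef float f32x4 __attribute__((vector_size(16)));
typedef int   i32x4 __attribute__((vector_size(16)));

/* Returns a 4-bit mask with bit k set when a[k] < b[k]. */
static unsigned lanes_less_than(f32x4 a, f32x4 b) {
    /* Portable part: lane-wise compare, each lane becomes 0 or -1. */
    i32x4 lt = a < b;
#if defined(__SSE__)
    /* x86 part: reinterpret the same bits as __m128 and gather the sign bits. */
    return (unsigned)_mm_movemask_ps((__m128)lt);
#else
    /* Generic fallback: pick one bit out of each all-ones/all-zeros lane. */
    return (unsigned)((lt[0] & 1) | (lt[1] & 2) | (lt[2] & 4) | (lt[3] & 8));
#endif
}
```

The same pattern should extend to 256-bit types and the _mm256_* intrinsics, assuming the build enables AVX.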
Additionally, you can use wider vectors than your hardware supports: the compiler is able to split your f64x64 into 128-, 256- or 512-bit registers as needed, depending on the compile target.
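A small sketch of the wider-than-hardware case, again with GCC/Clang vector extensions and illustrative names (f32x16, mul_add_16). The 64-byte vector type is declared regardless of the target; on a plain SSE2 build I would expect the compiler to lower the arithmetic to four 128-bit operations per step, and to fewer, wider ones when AVX or AVX-512 is enabled. Values go through memcpy and pointers so the function signature does not depend on how a 64-byte vector is passed by value.

```c
#include <string.h>  /* memcpy */

/* Hypothetical name: 16 floats, 64 bytes, possibly wider than any register. */
typedef float f32x16 __attribute__((vector_size(64)));

/* out[i] = a[i] * b[i] + c[i] for 16 consecutive floats. */
void mul_add_16(const float *a, const float *b, const float *c, float *out) {
    f32x16 va, vb, vc;
    /* Unaligned-safe loads; compilers turn these into vector loads. */
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    memcpy(&vc, c, sizeof vc);

    /* One expression; the compiler splits it to fit the target's registers. */
    f32x16 r = va * vb + vc;

    memcpy(out, &r, sizeof r);
}
```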
There's the middle-ground approach of having primarily target-specific operations, but with the intersecting ones named the same, and making it easy to build custom abstractions on top of them to paper over the differences in whatever way makes the most sense for the given application. That's the approach https://github.com/mlochbaum/Singeli takes.
There's a good amount of stuff that can clearly utilize SIMD without much platform-specificness, but doesn't easily autovectorize - early-exit checks in a loop, packed bit boolean stuff, some data rearranging, probing hashmap checks, some very-short-variable-length-loop things. And while there might often be some parts that do just need to be entirely target-specific, they'll usually be surrounded by stuff that doesn't (the loop, trip count calculation, loads/stores, probably some arithmetic).
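As a rough illustration of the early-exit case, here is a sketch of a linear search written with GCC/Clang vector extensions and no target-specific intrinsics at all. The names (i32x4, i64x2, find_i32) are invented for the example, and the "any lane matched" test is done by OR-ing the two 64-bit halves of the compare result, which is only one of several ways to do it.

```c
#include <stddef.h>  /* size_t, ptrdiff_t */
#include <stdint.h>  /* int32_t, int64_t */
#include <string.h>  /* memcpy */

/* Hypothetical type names for 16-byte vectors. */
typedef int32_t i32x4 __attribute__((vector_size(16)));
typedef int64_t i64x2 __attribute__((vector_size(16)));

/* Returns the index of the first element equal to needle, or -1. */
static ptrdiff_t find_i32(const int32_t *data, size_t n, int32_t needle) {
    const i32x4 splat = {needle, needle, needle, needle};
    size_t i = 0;

    /* Vector loop: 4 elements per step, exits as soon as any lane matches. */
    for (; i + 4 <= n; i += 4) {
        i32x4 v;
        memcpy(&v, data + i, sizeof v);    /* unaligned-safe load */
        i32x4 hit = (v == splat);          /* lanes become 0 or -1 */
        i64x2 wide = (i64x2)hit;           /* same bits, viewed as 2 x 64 */
        if ((wide[0] | wide[1]) != 0) {
            for (size_t k = 0; k < 4; ++k) /* locate the first matching lane */
                if (hit[k]) return (ptrdiff_t)(i + k);
        }
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        if (data[i] == needle) return (ptrdiff_t)i;
    return -1;
}
```

The loop structure, trip count, loads and comparison are all portable here; only the any-lane test is a spot where a target-specific instruction (a movemask on x86, a horizontal max on NEON) might be worth swapping in.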