by corysama 1289 days ago
Bad news. For SIMD there are no cross-platform intrinsics. Intel intrinsics map directly to SSE/AVX instructions, and ARM intrinsics map directly to NEON instructions.
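To make the point about per-platform intrinsics concrete, here is a minimal sketch of what the same element-wise add looks like when written against each instruction set by hand. The function name add_f32 is made up for illustration; the intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps, vld1q_f32, vaddq_f32, vst1q_f32) are the standard SSE and NEON ones.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
  #include <xmmintrin.h>   /* SSE intrinsics */
  #define USE_SSE 1
#elif defined(__ARM_NEON)
  #include <arm_neon.h>    /* NEON intrinsics */
  #define USE_NEON 1
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
void add_f32(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
#if defined(USE_SSE)
    /* x86: four floats per iteration with SSE */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#elif defined(USE_NEON)
    /* ARM: four floats per iteration with NEON */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop on any other platform */
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Every new operation needs another pair of #if branches like this, which is the maintenance cost the libraries listed here try to hide.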

For cross-platform, your best bet is probably https://github.com/VcDevel/std-simd

There's https://eigen.tuxfamily.org/index.php?title=Main_Page, but it's tremendously complicated for anything other than large-scale linear algebra.

And there's https://github.com/microsoft/DirectXMath, but it has obvious biases :P

I beg to differ :) std::experimental::simd has a very limited set of operations: mostly just math, very few shuffles/swizzles. Last I checked, it also only worked in a recent version of GCC.

We do indeed have cross-platform intrinsics here: github.com/google/highway. Disclosure: I am the main author.

cool; thanks for pointing out your project!

Do you have any advice on how someone limited to C99/C11 can still leverage the wisdom and techniques inside it?

:) Tricky. Is it an option to build some source files with C++, and use C functions (the usual FFI) as the interface between them?

Not really, unfortunately, and it’s a pre-existing framework for teaching a class, so simplicity of compilation is extra important. Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project, but not possible with vector extensions like https://clang.llvm.org/docs/LanguageExtensions.html#vectors-... ?

> Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

Agreed. Usually the interface would be something like RunEntireAlgorithm(), not DotProduct().
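A rough sketch of that coarse-grained boundary, written in plain C so it compiles on its own. The names project_rows and dot are hypothetical. In the split suggested earlier, the body would sit in a C++ source file and the exported function would be declared extern "C"; the shape of the interface is the point here, not the language of the body.

```c
#include <stddef.h>

/* Internal helper: never crosses the boundary, so the compiler is free to
   inline it into the loop below. */
static inline float dot(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += x[i] * y[i];
    return acc;
}

/* The one exported entry point (hypothetical name). It processes every row
   in a single call, so the cost of the non-inlined call is paid once per
   batch rather than once per dot product. */
void project_rows(const float *rows, size_t num_rows, size_t dim,
                  const float *dir, float *out) {
    for (size_t r = 0; r < num_rows; ++r)
        out[r] = dot(rows + r * dim, dir, dim);
}
```

The caller sees only project_rows; whatever SIMD happens inside dot stays behind the boundary and can still be inlined there.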

> For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project but not possible with vector extensions

Sure. Vector extensions are OKish for simple math, but JPEG XL includes nontrivial cross-lane operations such as transpose and boundary handling for convolution. __builtin_shufflevector requires a known vector length, and can be pessimized (the compiler may fuse two shuffles into one general all-to-all permute, which is more expensive than two simple shuffles).
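For readers who have not used the vector extensions, here is a small sketch of both cases mentioned above: simple math, which works with ordinary operators, and a cross-lane shuffle, where the vector width and the lane indices must be compile-time constants. The type name v4f and both function names are made up for illustration. The fallback branch is an assumption for compilers that lack __builtin_shufflevector (for example older GCC releases).

```c
/* 4 x float vector type, GCC/Clang vector extension (fixed 16-byte width). */
typedef float v4f __attribute__((vector_size(16)));

/* Allow the check below to compile on compilers without __has_builtin. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Simple math: plain operators apply lane by lane. */
v4f mul_add(v4f a, v4f b, v4f c) {
    return a * b + c;
}

/* Cross-lane operation: reverse the four lanes. The indices 3, 2, 1, 0 must
   be constants, and the function is tied to exactly four lanes. */
v4f reverse_lanes(v4f v) {
#if __has_builtin(__builtin_shufflevector)
    return __builtin_shufflevector(v, v, 3, 2, 1, 0);
#else
    /* Fallback: rebuild the vector lane by lane. */
    v4f r = { v[3], v[2], v[1], v[0] };
    return r;
#endif
}
```

Because the lane count is baked into the typedef, the same source cannot describe a vector whose length is only known at run time.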

Also, vqsort (https://github.com/google/highway/tree/master/hwy/contrib/so...) consists almost entirely of operations not supported by the extensions, and it actually works out of the box on variable-length RISC-V and SVE, which the compiler extensions cannot do.

This is very helpful; thank you.

Intel wrote a header that maps NEON intrinsics onto SSE to help people port to x86 Android: https://github.com/intel/ARM_NEON_2_x86_SSE
Just a heads up, as far as I know that’s more of a porting/learning tool than a production tool.

I remember us looking deeply into this and deciding to hand-write the SSE intrinsics. They usually map 1:1 but we had some unexpected differences in algorithm output between the x86 binary and the ARM binary when compiled with this.

But this was also back in 2019 or so, so maybe it’s better now!
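The comment above does not say which intrinsics caused the differences, so the following is only an illustration of how a "1:1" mapping can diverge, not a diagnosis of that case. One well-known example is float-to-int conversion: NEON's vcvtq_s32_f32 saturates out-of-range values and turns NaN into 0, while SSE2's _mm_cvttps_epi32 returns 0x80000000 for both. The function name below is hypothetical, and the SSE2 branch is a sketch of a fix-up that reproduces the NEON result.

```c
#include <stdint.h>

#if defined(__SSE2__)
  #include <emmintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

/* Hypothetical helper: convert 4 floats to int32 with NEON semantics
   (truncate toward zero, saturate on overflow, NaN becomes 0). */
void f32x4_to_s32x4_saturating(const float in[4], int32_t out[4]) {
#if defined(__SSE2__)
    __m128 v = _mm_loadu_ps(in);
    /* Truncating convert; out-of-range and NaN lanes come back as 0x80000000. */
    __m128i r = _mm_cvttps_epi32(v);
    /* Zero the NaN lanes (cmpord is all-ones only where v is not NaN). */
    r = _mm_and_si128(r, _mm_castps_si128(_mm_cmpord_ps(v, v)));
    /* Lanes >= 2^31 should be INT32_MAX: 0x80000000 ^ 0xFFFFFFFF = 0x7FFFFFFF. */
    __m128 too_big = _mm_cmpge_ps(v, _mm_set1_ps(2147483648.0f));
    r = _mm_xor_si128(r, _mm_castps_si128(too_big));
    _mm_storeu_si128((__m128i *)out, r);
#elif defined(__ARM_NEON)
    /* NEON already has the desired behaviour. */
    vst1q_s32(out, vcvtq_s32_f32(vld1q_f32(in)));
#else
    /* Scalar reference with the same semantics. */
    for (int i = 0; i < 4; ++i) {
        float x = in[i];
        if (x != x)                     out[i] = 0;
        else if (x >= 2147483648.0f)    out[i] = INT32_MAX;
        else if (x < -2147483648.0f)    out[i] = INT32_MIN;
        else                            out[i] = (int32_t)x;
    }
#endif
}
```

A translation layer that skips this kind of fix-up for speed would agree with NEON on ordinary inputs and differ only on edge cases, which is the sort of mismatch that tends to show up late.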