Hacker News new | ask | show | jobs
by an1sotropy 1289 days ago
Thanks. It would be simpler if I were working on only one platform that I knew supported a specific set of instructions. But, even though the code does involve some convolutions and things that are a good fit for SIMD, it needs to be cross-platform, so that the intrinsics should compile to SSE/AVX on Intel and NEON (?) on ARM where possible, but to something slow but workable on older chips. Delineating and illustrating using the most cross-platform intrinsics is what I'm looking for guidance on.
1 comments

Bad news. For SIMD there are not cross-platform intrinsics. Intel intrinsics map directly to SSE/AVX instructions and ARM intrinsics map directly to NEON instructions.

For cross-platform, your best bet is probably https://github.com/VcDevel/std-simd

There's https://eigen.tuxfamily.org/index.php?title=Main_Page But, it's tremendously complicated for anything other than large-scale linear algebra.

And, there's https://github.com/microsoft/DirectXMath But, it has obvious biases :P

I beg to differ :) std::experimental::simd has a very limited set of operations: mostly just math, very few shuffles/swizzles. Last I checked, it also only worked in a recent version of GCC.

We do indeed have cross-platform intrinsics here: github.com/google/highway. Disclosure: I am the main author.

cool; thanks for pointing out your project!

Do you have any advice on how someone limited to c99/c11 can still leverage the wisdom and techniques inside it?

:) Tricky. Is it an option to build some source files with C++, and use C functions (the usual FFI) as the interface between them?
Not really, unfortunately, and it’s a pre-existing framework for teaching a class, so simplicity of compilation is extra important. Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project, but not possible with vector extensions like https://clang.llvm.org/docs/LanguageExtensions.html#vectors-... ?

> Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

Agreed. Usually the interface would be something like RunEntireAlgorithm(), not DotProduct().

> For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project but not possible with vector extensions

Sure. Vector extensions are OKish for simple math but JPEG XL includes nontrivial cross-lane operations such as transpose and boundary handling for convolution. __builtin_shufflevector requires a known vector length, and can be pessimized (fusing two into one general all-to-all permute which is more expensive than two simple shuffles).

Also, vqsort (https://github.com/google/highway/tree/master/hwy/contrib/so...) almost entirely consists of operations not supported by the extensions, and actually works out of the box on variable-length RISC-V and SVE, which compiler extensions cannot.

Intel wrote a header that maps NEON intrinsics onto SSE to help people port to x86 Android: https://github.com/intel/ARM_NEON_2_x86_SSE
Just a heads up, as far as I know that’s more of a porting/learning tool than a production tool.

I remember us looking deeply into this and decided to hand write the SSE intrinsics. They usually map 1:1 but we had some unexpected differences in algorithm output between the x86 binary and the ARM binary when compiled with this.

But this was also back in 2019 or so, maybe it’s better now!