| HN Mirror



	by an1sotropy 1289 days ago
	Thanks. It would be simpler if I were working on only one platform that I knew supported a specific set of instructions. But, even though the code does involve some convolutions and things that are a good fit for SIMD, it needs to be cross-platform, so the intrinsics should compile to SSE/AVX on Intel and NEON (?) on ARM where possible, and to something slow but workable on older chips. What I'm looking for is guidance on identifying and using the most cross-platform intrinsics.

1 comments

corysama 1289 days ago

Bad news. For SIMD there are no cross-platform intrinsics. Intel intrinsics map directly to SSE/AVX instructions and ARM intrinsics map directly to NEON instructions.
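To illustrate what that one-to-one mapping means in plain C, here is a minimal sketch, not taken from any of the projects in this thread. The helper name add_f32 and the macro ADD_F32_BACKEND are hypothetical. It assumes the compiler defines __SSE2__ or __ARM_NEON when those instruction sets are enabled (GCC and Clang do), and it falls back to a scalar loop otherwise.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>            /* SSE2 header; also pulls in the SSE float ops */
#define ADD_F32_BACKEND "sse"
#elif defined(__ARM_NEON)
#include <arm_neon.h>             /* NEON intrinsics */
#define ADD_F32_BACKEND "neon"
#else
#define ADD_F32_BACKEND "scalar"  /* slow but workable fallback */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Four floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    /* Same loop shape, but every intrinsic has a different name. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when no SIMD path was compiled in. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Even for a trivial add, each platform needs its own branch, which is the duplication that wrapper libraries try to hide.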

For cross-platform, your best bet is probably https://github.com/VcDevel/std-simd

There's Eigen (https://eigen.tuxfamily.org/index.php?title=Main_Page), but it's tremendously complicated for anything other than large-scale linear algebra.

And there's DirectXMath (https://github.com/microsoft/DirectXMath), but it has obvious biases :P


janwas 1289 days ago

I beg to differ :) std::experimental::simd has a very limited set of operations: mostly just math, very few shuffles/swizzles. Last I checked, it also only worked in a recent version of GCC.

We do indeed have cross-platform intrinsics here: github.com/google/highway. Disclosure: I am the main author.


an1sotropy 1289 days ago

cool; thanks for pointing out your project!

Do you have any advice on how someone limited to c99/c11 can still leverage the wisdom and techniques inside it?


janwas 1288 days ago

:) Tricky. Is it an option to build some source files with C++, and use C functions (the usual FFI) as the interface between them?


an1sotropy 1288 days ago

Not really, unfortunately, and it’s a pre-existing framework for teaching a class, so simplicity of compilation is extra important. Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project, but not possible with vector extensions like https://clang.llvm.org/docs/LanguageExtensions.html#vectors-... ?


janwas 1288 days ago

> Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.

Agreed. Usually the interface would be something like RunEntireAlgorithm(), not DotProduct().
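As a rough sketch of what such a coarse-grained boundary might look like from the C side: the function name smooth3_run and the 3-tap kernel are hypothetical, chosen only for illustration. In the arrangement described above, the body would live in a separately compiled C++ file and be exported with C linkage; a plain C stand-in is used here so the example compiles on its own.

```c
#include <stddef.h>

/* Hypothetical coarse-grained entry point: one call processes the whole
   buffer, so the call overhead is paid once rather than once per element.
   Computes a [1/4, 1/2, 1/4] smoothing filter, clamping at both ends. */
void smooth3_run(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t left  = (i == 0)     ? 0 : i - 1;  /* clamp at the start */
        size_t right = (i + 1 == n) ? i : i + 1;  /* clamp at the end */
        out[i] = 0.25f * in[left] + 0.5f * in[i] + 0.25f * in[right];
    }
}
```

The caller never sees a per-element function, so the inner loop can be vectorized and inlined entirely on the far side of the boundary.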

> For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project but not possible with vector extensions

Sure. Vector extensions are OKish for simple math but JPEG XL includes nontrivial cross-lane operations such as transpose and boundary handling for convolution. __builtin_shufflevector requires a known vector length, and can be pessimized (fusing two into one general all-to-all permute which is more expensive than two simple shuffles).
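For readers new to the vector extensions being discussed, here is a small sketch assuming GCC or Clang. The type name f32x4 and the function names are hypothetical. Note that the vector length is baked into the type and the shuffle indices are fixed at compile time, which is the "known vector length" limitation mentioned above. GCC's __builtin_shuffle is used as a stand-in when not compiling with Clang.

```c
/* GCC/Clang extension: a vector of four floats (16 bytes).
   The lane count is part of the type and cannot vary at run time. */
typedef float f32x4 __attribute__((vector_size(16)));

/* Simple math: ordinary operators apply lane by lane. */
f32x4 scale_add(f32x4 x, f32x4 y, float s)
{
    f32x4 vs = {s, s, s, s};   /* broadcast the scalar by hand */
    return x * vs + y;
}

/* Cross-lane operation: reverse the four lanes. */
f32x4 reverse4(f32x4 v)
{
#if defined(__clang__)
    /* Indices must be compile-time constants. */
    return __builtin_shufflevector(v, v, 3, 2, 1, 0);
#else
    /* GCC spelling: the indices are given as an integer vector. */
    typedef int i32x4 __attribute__((vector_size(16)));
    const i32x4 idx = {3, 2, 1, 0};
    return __builtin_shuffle(v, idx);
#endif
}
```

The arithmetic is pleasant to write, but anything that depends on the number of lanes, such as the reversal, has to be rewritten for each vector width.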

Also, vqsort (https://github.com/google/highway/tree/master/hwy/contrib/so...) almost entirely consists of operations not supported by the extensions, and actually works out of the box on variable-length RISC-V and SVE, which compiler extensions cannot.


jmgao 1289 days ago

Intel wrote a header that maps NEON intrinsics onto SSE to help people port to x86 Android: https://github.com/intel/ARM_NEON_2_x86_SSE


teux 1289 days ago

Just a heads up, as far as I know that’s more of a porting/learning tool than a production tool.

I remember us looking deeply into this and deciding to hand-write the SSE intrinsics. The intrinsics usually map 1:1, but we had some unexpected differences in algorithm output between the x86 binary and the ARM binary when compiled with this header.

But this was also back in 2019 or so, maybe it’s better now!
