by teux 1289 days ago
Not OP but also work with this.

There are some tutorials, but honestly the best thing is to just use them.

Write an image processing routine that does something like apply a Gaussian blur to a black and white image. The C++ code for this is everywhere. You have a fixed kernel (a 2D matrix), and for each pixel you repeat a multiplication and an addition for each element in the kernel.
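
A minimal scalar sketch of that routine, as a starting point. The 3x3 integer kernel, the function name blur3x3, and the choice to copy border pixels unchanged are assumptions made for illustration, not something from a particular library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed 3x3 Gaussian kernel with integer weights; the weights sum to 16.
static const int kKernel[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};

// Blur a row-major 8-bit grayscale image.
// Border pixels are copied unchanged to keep the sketch short.
std::vector<std::uint8_t> blur3x3(const std::vector<std::uint8_t>& src,
                                  std::size_t width, std::size_t height) {
    assert(src.size() == width * height);
    std::vector<std::uint8_t> dst(src);
    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            int sum = 0;
            // The repeated multiply-and-add: one per kernel element.
            for (std::size_t ky = 0; ky < 3; ++ky) {
                for (std::size_t kx = 0; kx < 3; ++kx) {
                    sum += kKernel[ky][kx] *
                           src[(y + ky - 1) * width + (x + kx - 1)];
                }
            }
            // Divide by the kernel sum (16), rounding to nearest.
            dst[y * width + x] = static_cast<std::uint8_t>((sum + 8) / 16);
        }
    }
    return dst;
}
```

The inner two loops are the part that later gets replaced by intrinsics; everything else stays the same.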

Write it in C++ or Rust. Then read the Arm SIMD manual, find the instructions that do the math you want, and switch it over to intrinsics. You are doing exactly the same operations with the intrinsics as with the plain C++, just 8 or 16 of them at a time.

Run them side by side for parity and to check speed, tweak the simd, etc.
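
A rough sketch of that side-by-side setup, assuming a single horizontal pass with a 3-tap kernel (0.25, 0.5, 0.25) over floats. The function names are made up for illustration. The NEON path is only compiled when the compiler defines __ARM_NEON; elsewhere the "SIMD" function just falls back to the scalar loop so the file still builds:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Scalar reference: out[i] = 0.25*in[i] + 0.5*in[i+1] + 0.25*in[i+2].
// The input must hold n_out + 2 floats.
void blur_row_scalar(const float* in, float* out, std::size_t n_out) {
    for (std::size_t i = 0; i < n_out; ++i) {
        out[i] = 0.25f * in[i] + 0.5f * in[i + 1] + 0.25f * in[i + 2];
    }
}

// Same math, four outputs per iteration when NEON is available.
void blur_row_simd(const float* in, float* out, std::size_t n_out) {
    assert(in != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n_out; i += 4) {
        float32x4_t left = vld1q_f32(in + i);       // in[i..i+3]
        float32x4_t mid = vld1q_f32(in + i + 1);    // in[i+1..i+4]
        float32x4_t right = vld1q_f32(in + i + 2);  // in[i+2..i+5]
        float32x4_t acc = vmulq_n_f32(left, 0.25f);
        acc = vmlaq_n_f32(acc, mid, 0.5f);          // acc += mid * 0.5
        acc = vmlaq_n_f32(acc, right, 0.25f);       // acc += right * 0.25
        vst1q_f32(out + i, acc);
    }
#endif
    // Leftover elements (or everything, when NEON is not available).
    blur_row_scalar(in + i, out + i, n_out - i);
}

// Parity check: both outputs agree within a tolerance.
bool rows_match(const std::vector<float>& a, const std::vector<float>& b,
                float tol) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] - b[i]) > tol) return false;
    }
    return true;
}
```

Comparing with a tolerance rather than exact equality is a deliberate choice: a fused multiply-add on one path and separate multiply and add on the other can round differently for general kernel weights.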

Arm has good (well, okay) documentation:

https://developer.arm.com/documentation/den0018/a/?lang=en

https://arm-software.github.io/acle/neon_intrinsics/advsimd....

* Edit: you also have to do this on a supported architecture. Raspberry Pis have NEON at least on the 3. Not sure about the 4, but I believe it does too!


Adding on:

Go to https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Start with SSE, SSE2, SSE3

Write small functions in https://godbolt.org/ . Watch the assembly and the program output.
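
For instance, a small function of the kind that is easy to paste into Compiler Explorer. This is only a sketch: the name add_arrays is made up, and the guard assumes a compiler that defines __SSE2__ (GCC and Clang do on x86-64); otherwise it compiles to the plain loop:

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2; also pulls in the SSE header
#endif

// Adds two float arrays element-wise, four lanes at a time where SSE is available.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail for the last n % 4 elements (or the whole array without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

In the generated assembly, the vector loop should show up as addps, which makes it easy to match each intrinsic to its instruction.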

> https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Intel's Intrinsics Guide is exactly what I used and it was before I learned about Compiler Explorer.

I already had a large and thorough suite of unit tests for inputs and expected outputs, including happy and sad paths. So it was pretty easy to poke around and learn what works, what doesn't.

It was definitely time intensive (took about three months for about 50 lines of code) but it also saved the company a few million dollars in hardware (DNA analysis software to compare a couple TiB of data requires a _lot_ of performance). I have since moved to a different company, partly because I never saw a bonus for saving all that money.

The intrinsics guide does a good job of showing what's available, but it does not do a good job of documenting how each instruction actually works... many intrinsics are missing pseudocode, and some pseudocode has ambiguous cases. When something went awry, I used GDB in assembly mode to compare that table against the register content instruction by instruction and figure out where I had misunderstood something.

Frustratingly, some operations are available at 64 bits but not bigger, some at 128 bits but not bigger, etc. So I wrote up a rough draft in LibreOffice Calc with 64, 128, and 256 columns to follow the bits around every intended operation. I then correlated it against the intrinsics guide to determine which instructions are available to me at which bit sizes. For a given test run, one row in the spreadsheet was colored by what the original data contained, another row held what I needed the answer to be for that test case, and another row's cells were auto-colored green or red depending on whether the register after a candidate set of instructions matched the desired output. Any time I had to move columns around (the data was 4 bits wide), I'd color a set of 4 columns to follow where they go during swizzling.

I know both gcc (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) and clang (https://clang.llvm.org/docs/LanguageExtensions.html#vectors-...) have vector extensions that are intended to make it easier to write SIMD code (you can write c = a + b; instead of having to know the specific vector instruction needed to add two vectors, for example), but I don’t know how well these reach that goal.

Are these helpful, or not good enough to write performant vector code?
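
For readers who haven't seen them, a small sketch of what the vector extensions look like in practice. The typedef name v4f and the function scale_add are made up for illustration; the vector_size attribute itself is the documented GCC/Clang extension. Whether the generated code is fast enough is exactly the open question above, so this only shows the syntax:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: four floats packed into one 16-byte vector type.
typedef float v4f __attribute__((vector_size(16)));

// out[i] = a[i] * scale + b[i], written with ordinary operators, no intrinsics.
void scale_add(const float* a, const float* b, float scale, float* out,
               std::size_t n) {
    assert(n % 4 == 0);  // simplification for the sketch: no scalar tail
    const v4f vscale = {scale, scale, scale, scale};
    for (std::size_t i = 0; i < n; i += 4) {
        v4f va, vb;
        // memcpy avoids alignment assumptions about the input pointers.
        std::memcpy(&va, a + i, sizeof va);
        std::memcpy(&vb, b + i, sizeof vb);
        v4f vc = va * vscale + vb;  // element-wise on all four lanes
        std::memcpy(out + i, &vc, sizeof vc);
    }
}
```

The compiler picks the instructions for the target, so the same source builds for x86 and Arm; shuffles and other lane rearrangements are less convenient to express this way.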

I’ll try them!

Thanks. It would be simpler if I were working on only one platform that I knew supported a specific set of instructions. But, even though the code does involve some convolutions and things that are a good fit for SIMD, it needs to be cross-platform, so the intrinsics should compile to SSE/AVX on Intel and NEON (?) on ARM where possible, but to something slow but workable on older chips. Delineating and illustrating the most cross-platform intrinsics is what I'm looking for guidance on.

Bad news. For SIMD there are no cross-platform intrinsics. Intel intrinsics map directly to SSE/AVX instructions and ARM intrinsics map directly to NEON instructions.

For cross-platform, your best bet is probably https://github.com/VcDevel/std-simd

There's Eigen (https://eigen.tuxfamily.org/index.php?title=Main_Page), but it's tremendously complicated for anything other than large-scale linear algebra.

And there's DirectXMath (https://github.com/microsoft/DirectXMath), but it has obvious biases :P
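
Without a library, one common way to cope with per-platform intrinsics is to pick a code path at compile time and keep a scalar fallback. A minimal sketch, assuming GCC/Clang-style macros (__SSE2__, __ARM_NEON); the function names saxpy and simd_backend are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Reports which path was compiled in; handy when testing on several machines.
const char* simd_backend() {
#if defined(__SSE2__)
    return "SSE";
#elif defined(__ARM_NEON)
    return "NEON";
#else
    return "scalar";
#endif
}

// y[i] += a * x[i], using whichever instruction set the compiler targets.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128 va = _mm_set1_ps(a);
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vst1q_f32(y + i, vmlaq_n_f32(vy, vx, a));  // vy + vx * a
    }
#endif
    // Scalar path: the tail, or the whole array on chips with neither.
    for (; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

This is the "slow but workable" fallback in its simplest form; the cost is that every kernel has to be written once per instruction set.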

I beg to differ :) std::experimental::simd has a very limited set of operations: mostly just math, very few shuffles/swizzles. Last I checked, it also only worked in a recent version of GCC.

We do indeed have cross-platform intrinsics here: github.com/google/highway. Disclosure: I am the main author.

Cool; thanks for pointing out your project!

Do you have any advice on how someone limited to C99/C11 can still leverage the wisdom and techniques inside it?

:) Tricky. Is it an option to build some source files with C++, and use C functions (the usual FFI) as the interface between them?

Not really, unfortunately, and it’s a pre-existing framework for teaching a class, so simplicity of compilation is extra important. Also, if I try to isolate the SIMD bits in C++, I’ll lose the opportunity to have them inlined, which will defeat the purpose of the optimization.

For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project, but not possible with vector extensions like https://clang.llvm.org/docs/LanguageExtensions.html#vectors-... ?

Intel wrote a header that maps NEON intrinsics onto SSE to help people port to x86 Android: https://github.com/intel/ARM_NEON_2_x86_SSE

Just a heads up, as far as I know that’s more of a porting/learning tool than a production tool.

I remember us looking deeply into this and deciding to hand-write the SSE intrinsics instead. The intrinsics usually map 1:1, but we had some unexpected differences in algorithm output between the x86 binary and the ARM binary when compiled with this.

But this was also back in 2019 or so, maybe it’s better now!