| Not OP, but I also work with this. There are some tutorials, but honestly the best thing is to just use them. Write an image processing routine that does something like apply a Gaussian blur to a black and white image. The C++ code for this is everywhere. You have a fixed kernel (a 2D matrix), and for each pixel you do repeated multiplication and addition, one multiply-add per element in the kernel. Write it in C++ or Rust. Then read the Arm SIMD manual, find the instructions that do the math you want, and switch it over to intrinsics. You are doing the exact same operations with the intrinsics as in the plain C++, just 8 or 16 of them at a time. Run the two versions side by side to check parity and speed, tweak the SIMD, etc. Arm has good (well, okay) documentation: https://developer.arm.com/documentation/den0018/a/?lang=en https://arm-software.github.io/acle/neon_intrinsics/advsimd.... * Edit: you also have to do this on a supported architecture. Raspberry Pis have NEON at least in the 3s. Not sure about the 4s, but I believe so too! |
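To make the scalar-then-intrinsics workflow concrete, here is a minimal sketch, not anyone's production code. It assumes a simplified 1-D, 3-tap `[1 2 1]/4` kernel along a single row instead of a full 2-D Gaussian, and it copies the edge pixels through unchanged. The names `Row`, `blur_row_scalar`, `blur_row_simd` and `check_parity` are made up for illustration. The NEON path is only compiled when `__ARM_NEON` is defined; elsewhere the "SIMD" function just runs the scalar loop, so the parity check still builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

using Row = std::vector<std::uint8_t>;  // hypothetical alias: one row of 8-bit pixels

// Scalar reference: out[x] = (in[x-1] + 2*in[x] + in[x+1] + 2) >> 2.
// Edge pixels are copied unchanged (an arbitrary border choice for this sketch).
Row blur_row_scalar(const Row& in) {
    Row out(in);  // the copy takes care of both edge pixels
    for (std::size_t x = 1; x + 1 < in.size(); ++x) {
        out[x] = static_cast<std::uint8_t>((in[x - 1] + 2 * in[x] + in[x + 1] + 2) >> 2);
    }
    return out;
}

// Same arithmetic, 8 pixels per iteration when NEON is available.
Row blur_row_simd(const Row& in) {
    Row out(in);
    const std::size_t n = in.size();
    std::size_t x = 1;
#if defined(__ARM_NEON)
    // x + 8 < n keeps the load at in[x + 1 .. x + 8] inside the row.
    for (; x + 8 < n; x += 8) {
        const uint8x8_t left   = vld1_u8(&in[x - 1]);
        const uint8x8_t center = vld1_u8(&in[x]);
        const uint8x8_t right  = vld1_u8(&in[x + 1]);
        uint16x8_t sum = vaddl_u8(left, right);       // widen to 16 bits: left + right
        sum = vaddq_u16(sum, vshll_n_u8(center, 1));  // + 2 * center
        vst1_u8(&out[x], vrshrn_n_u16(sum, 2));       // (sum + 2) >> 2, narrowed to 8 bits
    }
#endif
    // Scalar tail (and the whole row when NEON is not available).
    for (; x + 1 < n; ++x) {
        out[x] = static_cast<std::uint8_t>((in[x - 1] + 2 * in[x] + in[x + 1] + 2) >> 2);
    }
    return out;
}

// Side-by-side parity check: both versions must agree byte for byte.
void check_parity(const Row& in) {
    assert(blur_row_scalar(in) == blur_row_simd(in));
}
```

Call `check_parity` on a few rows of different lengths (including ones that are not a multiple of 8) before timing anything; a mismatch usually means an off-by-one in the loop bounds or a rounding difference between the scalar math and the narrowing instruction.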
Go to https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
Start with SSE, SSE2, SSE3
Write small functions in https://godbolt.org/ and watch both the assembly and the program output.
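As a starting point for the steps above, this is the kind of small function you could paste into the compiler explorer. It is only a sketch: `add_scalar` and `add_sse` are hypothetical names, and the SSE path is guarded by `__SSE__` (defined by GCC and Clang on x86-64) so the file still builds on other targets, where it falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#endif

// Plain element-wise add, one float at a time.
std::vector<float> add_scalar(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Same add, four floats per iteration with SSE when it is available.
std::vector<float> add_sse(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    std::size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= a.size(); i += 4) {
        const __m128 va = _mm_loadu_ps(&a[i]);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb));
    }
#endif
    // Leftover elements (or everything, when SSE is not available).
    for (; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}
```

In the assembly pane, the body of the SSE loop should come down to a couple of loads, an `addps` (or `vaddps` if you enable AVX) and a store. Try different optimization levels too: at higher levels the compiler may auto-vectorize `add_scalar` on its own, which is worth seeing next to the hand-written version.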