by Guzba 1143 days ago
Unless something has changed, I really wish Zig was open to SIMD intrinsics. Imo, if you're manually writing SIMD, you are doing complex performance-oriented programming and you really do end up needing to know what tools the instruction set you're using gives you. E.g. arm64 has pretty cool interlacing/deinterlacing which would be goofy to re-create on amd64, and there is subtlety to multiplication and lots of other things. SIMD instructions also sidestep lots of compiler-ey stuff like strict aliasing: types don't matter, sizes and lane positions do. It is an interesting beast.
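
To make the interlacing/deinterlacing point concrete, here is a rough scalar sketch of what a 3-way deinterleaving load and interleaving store do (on arm64 these correspond to the LD3/ST3 family of instructions). The function names `deinterleave3` and `interleave3` are made up for illustration, and this only models the data movement, not performance.

```python
# Scalar model of a 3-way deinterleaving load and interleaving store.
# Function names are hypothetical; this only illustrates lane movement.

def deinterleave3(data):
    """Split [r0, g0, b0, r1, g1, b1, ...] into three separate streams."""
    assert len(data) % 3 == 0, "length must be a multiple of 3"
    return data[0::3], data[1::3], data[2::3]

def interleave3(r, g, b):
    """Inverse operation: merge three streams back into one."""
    assert len(r) == len(g) == len(b)
    out = []
    for triple in zip(r, g, b):
        out.extend(triple)
    return out

pixels = [10, 20, 30, 11, 21, 31, 12, 22, 32]
r, g, b = deinterleave3(pixels)
print(r, g, b)
print(interleave3(r, g, b) == pixels)
```

On an instruction set without such loads, the same result generally takes a sequence of shuffles, which is the "goofy to re-create" part.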

This has been mentioned before:

https://github.com/ziglang/zig/issues/7702

I don't think anyone disagrees about the need for intrinsics. In fact, I have actually taken a crack at implementing the AVX512 intrinsics in the Zig compiler as builtin functions on my personal fork of the repo. But it is a non-trivial task - there are over 450 distinct instructions across the entire AVX512 feature set, and over 100 for AVX2. And I'm only focusing on support for the LLVM backend, which does the heavy lifting in the codegen phase. Getting the register allocation and instruction scheduling correct for all the intrinsics in the self-hosted backend would involve a lot more work.

What I do for D is implement the intrinsics following the semantics of the x86 instructions. Targeting x86, x86_64, arm32, and arm64 with D compilers, that smooths out the differences. It's a lot of work, and very similar to the simd-everywhere library that does it for C++. There is not so much impedance mismatch between x86 and arm. I wish more people would understand that you absolutely need such intrinsics for fast software; there is no way around that. You're not going to write your 4x-at-once pow function for each arch, and you won't find a better name for `_mm_madd_epi16`. (EDIT: I guess nowadays you could do that, but taking ARM semantics as the source of truth.)

https://github.com/AuburnSounds/intel-intrinsics
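
For readers who have not met `_mm_madd_epi16`, a scalar reference model of the x86 semantics may help: multiply corresponding signed 16-bit lanes, then add each adjacent pair of products into one 32-bit lane. The sketch below is only an illustration of those semantics; `madd_epi16_model` and `wrap_i32` are hypothetical helper names, not part of any library mentioned here.

```python
# Scalar model of x86 _mm_madd_epi16 (PMADDWD) on a 128-bit vector.
# Helper names are hypothetical; lanes are modeled as Python ints.

def wrap_i32(x):
    """Wrap an integer to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

def madd_epi16_model(a, b):
    """a, b: 8 signed 16-bit lanes each. Returns 4 signed 32-bit lanes."""
    assert len(a) == len(b) == 8
    assert all(-32768 <= v <= 32767 for v in list(a) + list(b))
    # Lane i of the result sums the products of input lanes 2i and 2i+1.
    return [wrap_i32(a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1])
            for i in range(4)]

print(madd_epi16_model([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))
# The only overflowing case: all four inputs of a pair are -32768.
print(madd_epi16_model([-32768] * 8, [-32768] * 8))
```

A port to another architecture has to reproduce both the adjacent-pair ordering and the wraparound in that one corner case, which is where the effort goes.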

Mostly agree, but there is actually a mismatch between madd_epi16 and Arm. Implementing either one's semantics on the other requires ~5 instructions, but if we generalize the definition to allow reordering (e.g. Highway's ReorderWidenMulAccumulate [1]), it's only 2 instructions.

1: https://github.com/google/highway/blob/master/g3doc/quick_re...
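
A small scalar sketch of why allowing reordering helps: if the caller only needs the total (as in a dot product), it does not matter which products share a 32-bit lane. The two pairings below are assumptions chosen for illustration (adjacent pairs in the x86 style, and low half plus high half as one plausible Arm-friendly choice); this is not Highway's actual implementation, and overflow is ignored.

```python
# Two ways of pairing 16-bit products into 32-bit lanes.
# Both function names and the specific pairings are illustrative assumptions.

def pair_adjacent(a, b):
    """x86-style: lane i = a[2i]*b[2i] + a[2i+1]*b[2i+1]."""
    return [a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]
            for i in range(4)]

def pair_low_high(a, b):
    """Alternative: lane i = a[i]*b[i] + a[i+4]*b[i+4]."""
    return [a[i] * b[i] + a[i + 4] * b[i + 4] for i in range(4)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, 1, 1, 1, 1, 1, 1, 1]

adjacent = pair_adjacent(a, b)
low_high = pair_low_high(a, b)
print(adjacent, low_high)              # per-lane values differ
print(sum(adjacent) == sum(low_high))  # the final reduction agrees
```

An API that promises only the final sum leaves each target free to pick whichever pairing is cheapest for it.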

Indeed, and your comment led me to find additional issues with my port of _mm_madd_epi16.

I agree it would perhaps be possible to find better semantics for SIMD that kinda gloss over all the differences. That would be cleaner but require a lot of names. Well I suppose that's what Highway does, isn't it?

:) Yes indeed! Always happy to discuss suggestions for new intrinsics via GitHub issues.
I have not been monitoring the SIMD situation in Zig so it is nice to hear that there is some general support for intrinsics even if they are not yet added.

Thanks for your effort working on an implementation too. I am aware how large these instruction sets have gotten, so I can certainly imagine at least some of the effort of the undertaking.

Writing SIMD code with intrinsics is kind of ugly / non-portable and close to assembly language.

But it is useful and given the peculiarities of those SIMD instructions, I am not convinced that it will ever be sufficient to use "vectorized" types + a few hints and let the compiler do the work. That would be nice though.

I understand the hesitation of a language design team to replicate the full intrinsics mess; they are probably hoping to find something better.

In the meantime we can still fall back to C to write SIMD-heavy code.

For anybody interested in this, here is an article discussing a very similar problem using arm neon intrinsics, also using the interleaved loads: https://branchfree.org/2019/04/01/fitting-my-head-through-th...
Or even all target-specific intrinsics - not limited to SIMD ones.