by p0nce 1145 days ago
What I do for D is implement the intrinsics following the semantics of the x86 instructions. They target x86, x86_64, arm32, and arm64 across D compilers, which smooths out the differences. It's a lot of work, and very similar to the simd-everywhere library that does it for C++. There is not much impedance mismatch between x86 and ARM. I wish more people understood that you absolutely need such intrinsics for fast software; there is no way around that. You're not going to write your 4x-at-once pow function for each arch, and you won't find a better name for `_mm_madd_epi16`. (EDIT: I guess nowadays you could do that, but taking ARM semantics as the source of truth.)

https://github.com/AuburnSounds/intel-intrinsics
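
For readers who don't have the instruction memorized, here is a minimal scalar sketch of what `_mm_madd_epi16` computes: multiply corresponding signed 16-bit lanes, then add each adjacent pair of 32-bit products. The names `madd_epi16_ref` and `madd_epi16_example` are made up for illustration; this is not code from the intel-intrinsics repository, just a portable model of the x86 semantics that a port would have to reproduce.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of _mm_madd_epi16 (hypothetical helper, not a library API):
// result lane i = a[2i]*b[2i] + a[2i+1]*b[2i+1], computed in 32 bits.
std::array<int32_t, 4> madd_epi16_ref(const std::array<int16_t, 8>& a,
                                      const std::array<int16_t, 8>& b) {
    std::array<int32_t, 4> r{};
    for (std::size_t i = 0; i < 4; ++i) {
        // Each 16x16 product always fits in a signed 32-bit integer.
        const int32_t p0 = int32_t{a[2 * i]} * int32_t{b[2 * i]};
        const int32_t p1 = int32_t{a[2 * i + 1]} * int32_t{b[2 * i + 1]};
        // The pair sum only overflows when all four inputs are -32768; the
        // instruction wraps in that case, so add as unsigned to avoid signed
        // overflow (the cast back is modular on mainstream compilers and
        // guaranteed modular since C++20).
        r[i] = static_cast<int32_t>(static_cast<uint32_t>(p0) +
                                    static_cast<uint32_t>(p1));
    }
    return r;
}

// Usage example: with b all ones, each lane is the sum of an adjacent pair of a.
void madd_epi16_example() {
    const std::array<int16_t, 8> a{1, 2, 3, 4, 5, 6, 7, 8};
    const std::array<int16_t, 8> b{1, 1, 1, 1, 1, 1, 1, 1};
    const std::array<int32_t, 4> r = madd_epi16_ref(a, b);
    assert(r[0] == 3 && r[1] == 7 && r[2] == 11 && r[3] == 15);
}
```

A scalar reference like this is mostly useful as a test oracle when checking a per-architecture implementation lane by lane.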


Mostly agree, but there is actually a mismatch between madd_epi16 and Arm. Implementing either platform's semantics on the other requires ~5 instructions, but if we generalize the definition to allow reordering (e.g. Highway's ReorderWidenMulAccumulate [1]), it's only 2 instructions.

1: https://github.com/google/highway/blob/master/g3doc/quick_re...
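
To make the "allow reordering" idea concrete, here is a rough scalar sketch. It assumes the relaxed contract only promises that the sum of all accumulator lanes equals the full dot product, and it assumes the lane assignment that a widening multiply-accumulate on the low and high halves would naturally give (for example `vmlal_s16` plus `vmlal_high_s16` on AArch64). The function names are hypothetical and this is not Highway's implementation; inputs are assumed small enough that nothing overflows.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

using I16x8 = std::array<int16_t, 8>;
using I32x4 = std::array<int32_t, 4>;

// x86-style pairing: lane i = a[2i]*b[2i] + a[2i+1]*b[2i+1].
// (Sketch only: assumes the pair sum does not overflow.)
I32x4 pairwise_madd(const I16x8& a, const I16x8& b) {
    I32x4 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        r[i] = int32_t{a[2 * i]} * int32_t{b[2 * i]} +
               int32_t{a[2 * i + 1]} * int32_t{b[2 * i + 1]};
    }
    return r;
}

// Relaxed variant (hypothetical name): products of the low four lanes go to
// sum0 and products of the high four lanes go to sum1. Which product lands in
// which lane is treated as unspecified; only the overall total is meaningful.
void reorder_widen_mul_acc(const I16x8& a, const I16x8& b,
                           I32x4& sum0, I32x4& sum1) {
    for (std::size_t i = 0; i < 4; ++i) {
        sum0[i] += int32_t{a[i]} * int32_t{b[i]};
        sum1[i] += int32_t{a[i + 4]} * int32_t{b[i + 4]};
    }
}

// Horizontal sum of all lanes, widened to 64 bits.
int64_t lane_total(const I32x4& v) {
    return std::accumulate(v.begin(), v.end(), int64_t{0});
}

// Usage example: individual lanes differ, but the totals agree.
void reorder_example() {
    const I16x8 a{1, 2, 3, 4, 5, 6, 7, 8};
    const I16x8 b{1, -1, 2, -2, 3, -3, 4, -4};
    const I32x4 paired = pairwise_madd(a, b);
    I32x4 sum0{}, sum1{};
    reorder_widen_mul_acc(a, b, sum0, sum1);
    assert(paired[0] != sum0[0] + sum1[0]);  // lane contents are not the same
    assert(lane_total(paired) == lane_total(sum0) + lane_total(sum1));
}
```

The point of the looser contract is that a dot-product style loop only ever needs the final horizontal sum, so each architecture can keep whichever lane layout is cheapest for it.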

Indeed, and your comment led me to find additional issues with my port of _mm_madd_epi16.

I agree it would perhaps be possible to find better semantics for SIMD that kinda gloss over all the differences. That would be cleaner but would require a lot of names. Well, I suppose that's what Highway does, isn't it?

:) Yes indeed! Always happy to discuss suggestions for new intrinsics via GitHub issues.