|
|
|
|
|
by p0nce
1145 days ago
|
|
What I do for D is implement the intrinsics following the semantics of the x86 instructions. Target x86, x86_64, arm32, arm64 with D compilers, that smoothes out the difference. It's a lot of work, and very similar to the simd-everywhere library that does it for C++. There is not so much impendence mismatch between x86 and arm.
I wish more people would understand that you absolutely need such intrinsics for fast software, there is no way around that. You're not going to write your 4x-at-once pow function for each arch, also you won't find a better name for `_mm_madd_epi16`. (EDIT: I guess nowadays you could do that but with taking ARM semantics as source of truth). https://github.com/AuburnSounds/intel-intrinsics |
|
1: https://github.com/google/highway/blob/master/g3doc/quick_re...