|
> I'd like to commend the authors for embarking on this. As the crazy bastard who started SIMDe, thanks! I'm very concerned about portability, so unfortunately just using clang isn't really an option. Most of my code has to work not only on GCC and clang, but also on MSVC (/me cries) and ICC, and I generally try to make it work on other compilers (PGI, IAR, etc.). In my experience you're right, clang does do much better with vectors that don't match the hardware by default, but in SIMDe we actually have explicit fallbacks which call shorter functions twice and the result is pretty good on both compilers. For example, here is what `_mm256_add_ps` looks like on GCC and clang when targeting SSE2: <https://godbolt.org/z/n68Ecn>. Length-agnostic instruction sets like SVE are definitely very interesting, but honestly I'd rather see them de-emphasize non-portable APIs like SVE and instead work on improving the compiler's ability to recognize the relevant code patterns to work with things like OpenMP SIMD (which, to be clear, does not require the OpenMP runtime). I'd also be happy to see more builtins which work cross-platform… for example, I'd love to see builtins for saturated operations which could easily be auto-vectorized by the compiler when used in an OpenMP SIMD loop. Currently for new SIMD code, I start with an OpenMP SIMD implementation, then profile. If I see any spots which perform particularly poorly and/or are particularly hot, I'll work on some optimized implementations for those spots using intrinsics. In the future I hope to need to hand-optimize less code, but for now I think this offers a pretty good trade-off between portability, performance, and development time. |
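To make the two ideas in the comment concrete, here is a rough C sketch. It is not SIMDe's actual code: the `vec128f`/`vec256f` types and every function name below are hypothetical, chosen only for illustration. The first part shows a wide operation built by calling the narrower one twice. The second part shows a scalar saturating add used inside an OpenMP SIMD loop, which the compiler is then free to vectorize.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector types for illustration (not SIMDe's real types). */
typedef struct { float f32[4]; } vec128f;      /* 128-bit: 4 floats      */
typedef struct { vec128f half[2]; } vec256f;   /* 256-bit: two 128-bit halves */

/* The "shorter" function: a 128-bit element-wise add. */
static vec128f vec128f_add(vec128f a, vec128f b) {
    vec128f r;
    #pragma omp simd
    for (int i = 0; i < 4; i++) {
        r.f32[i] = a.f32[i] + b.f32[i];
    }
    return r;
}

/* Explicit fallback: the 256-bit add calls the 128-bit add twice,
 * once per half, instead of relying on the compiler to split it. */
static vec256f vec256f_add(vec256f a, vec256f b) {
    vec256f r;
    r.half[0] = vec128f_add(a.half[0], b.half[0]);
    r.half[1] = vec128f_add(a.half[1], b.half[1]);
    return r;
}

/* Scalar saturating add: clamps to UINT8_MAX instead of wrapping. */
static inline uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* OpenMP SIMD loop over the scalar helper. The pragma only asks the
 * compiler to vectorize; no OpenMP runtime is involved. */
void sat_add_u8_array(size_t n, const uint8_t *a, const uint8_t *b,
                      uint8_t *out) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = sat_add_u8(a[i], b[i]);
    }
}
```

With GCC and clang, `-fopenmp-simd` enables the `simd` pragmas without linking the OpenMP runtime. Without the flag the pragmas are ignored and the code still compiles and gives the same results, which is what makes this a reasonable starting point before profiling and hand-optimizing hot spots with intrinsics.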