|
I'm the lead developer of SIMDe. I wouldn't say that portability is the main focus. The first step is to get portable implementations up and running, but a huge number of functions have optimized implementations for NEON, AltiVec/VSX, and WASM SIMD 128, and we're working on adding more. We go to a lot of trouble to get good performance on multiple architectures, basically writing each implementation several times and using ifdefs to pick the fastest version available on a given architecture. Even just for the portable implementations, we use a lot of hints to help the compiler auto-vectorize. Almost every portable implementation has a loop which uses a pragma to try to get the compiler to do the right thing (OpenMP SIMD, clang loop-specific pragmas, GCC ivdep, etc.). On top of that we take advantage of lots of compiler-specific features to speed things up where possible, including GCC-style vector extensions, __builtin_shuffle/__builtin_shufflevector, and __builtin_convertvector. SIMDe is never going to be as fast as someone who knows what they're doing writing an optimized implementation for a given target. However, it should be as fast as (or faster than) someone who is just doing a direct port and trying to match the existing code as closely as possible. |
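To make the ifdef-plus-pragma pattern concrete, here is a minimal sketch in C. It is not SIMDe's actual code: the `example_u32x4` type and both `example_add_u32x4*` functions are hypothetical names made up for illustration, and real implementations cover far more targets and cases. It only shows the general shape: a native path chosen at compile time, and a portable loop with a vectorization hint as the fallback.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#  include <arm_neon.h>
#elif defined(__SSE2__)
#  include <emmintrin.h>
#endif

/* Hypothetical 128-bit vector of four unsigned 32-bit lanes (illustration only). */
typedef struct { uint32_t v[4]; } example_u32x4;

/* Portable fallback: a plain loop plus a compiler-specific hint to vectorize it. */
static inline example_u32x4
example_add_u32x4_portable(example_u32x4 a, example_u32x4 b) {
  example_u32x4 r;
#if defined(__clang__)
#  pragma clang loop vectorize(enable)
#elif defined(__GNUC__)
#  pragma GCC ivdep
#endif
  for (size_t i = 0; i < 4; i++) {
    r.v[i] = a.v[i] + b.v[i]; /* unsigned add wraps around, like the SIMD instructions */
  }
  return r;
}

/* Dispatching version: use a native instruction when the target has one. */
static inline example_u32x4
example_add_u32x4(example_u32x4 a, example_u32x4 b) {
#if defined(__ARM_NEON)
  example_u32x4 r;
  vst1q_u32(r.v, vaddq_u32(vld1q_u32(a.v), vld1q_u32(b.v)));
  return r;
#elif defined(__SSE2__)
  example_u32x4 r;
  __m128i va = _mm_loadu_si128((const __m128i *) a.v);
  __m128i vb = _mm_loadu_si128((const __m128i *) b.v);
  _mm_storeu_si128((__m128i *) r.v, _mm_add_epi32(va, vb));
  return r;
#else
  return example_add_u32x4_portable(a, b);
#endif
}
```

Calling `example_add_u32x4` gives the same lanes on every target; only the instructions behind it change. In this sketch the order of the `#if` branches is what decides which version counts as "fastest available".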
[0]: http://web.archive.org/web/20140807014206/http://x264dev.mul...