It's much better to use any of the numerous SIMD wrappers such as libsimdpp or Vc and get various benefits for free. It's possible to target everything from SSE and NEON to AVX512 with what is essentially a single code path.
Realistically the vast majority of C and C++ codebases today will never touch anything more than x86 and ARM, and I wouldn't be surprised if most never even get past x86, so I don't buy the portability argument. Portability between SSE and AVX is a better argument.
But in any case, if you're using SIMD in anger, chances are you have hard performance requirements that you really care about, and a one size fits all approach is going to leave valuable performance on the table. Whether you just have to target your own servers, or any x86 CPU made in the past 6 years, or that plus NEON-equipped ARMs, it will probably be worth the effort to duplicate the code paths, especially in comparison to the initial effort of figuring out how to vectorize your problem in the first place.
And while it's nowhere near "leftpad", if you really want an SIMD wrapper and know what you're doing, it should be well within your capabilities to write your own. Maybe not quite as spiffy as the one on github, but when I get anywhere close to assembly I find that I get more value out of doing everything from scratch and truly understanding what I'm dealing with, rather than leaving anything in someone else's hands.
> Realistically the vast majority of C and C++ codebases today will never touch anything more than x86 and ARM, and I wouldn't be surprised if most never even get past x86, so I don't buy the portability argument.
Just recently a Gentoo developer ported GHC to m68k and found some portability issues, which he fixed in the process, benefiting all architectures. This is also why OpenBSD devs are still on gcc3.
RISC and POWER are just two very modern ISAs worth mentioning, and not something you can easily ignore. We need more ISAs, like in the past, not just two. It's very dangerous to limit ourselves to just ARM/x86; diversity is a plus for writing more correct code and having more options. lowRISC is a nice fit for many things, as is POWER, while of course ARM and x86 are here to stay. I'd count Nvidia's and AMD's GPUs as the other major architectures, but we don't usually deal with GPUs directly at that level. You choose the right chip for the job, just as phones select different SoCs for different use cases.
The idea that compiling your code for 68000 or MIPS can reveal bugs in your code does not change the fact that x86 and ARM are pretty much the only relevant CPU architectures that all but the most entrenched of government contractors could ship a product on today or in the foreseeable future that would have any use for SIMD. If you actually have a need to do extensive SIMD optimizations (say, it could shave 5ms off your frame time in a game, or save you $XXXXXX/year in your data center), PowerPC does not enter your mind at any moment.
You see it as weeding out bugs and future proofing your code in case x86 or ARM disappears tomorrow, I see it as a load of completely wasted work and optimization opportunities.
Also lowRISC learned nearly nothing from the past 20 years of CPU architecture advancement. It is not modern, it is a naive copy of a very outdated design.
By saying single code path, I don't mean single instruction stream. libsimdpp, for example, supports building the same code for different instruction sets, linking the results into the same executable and then dispatching dynamically. Doing this by hand would mean that either:
- lots of time is wasted creating slightly different versions of code. I'm talking about e.g. AVX vs. AVX2 for floating-point code not SSE2 vs. AVX.
- micro-optimization opportunities are wasted by only coding for major revisions of the instruction set
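To make "dispatching dynamically" concrete, here is a rough hand-rolled sketch in C of what the by-hand version looks like. It is not libsimdpp's API; it assumes GCC or Clang on x86 (for the `target` attribute and `__builtin_cpu_supports`), and the function names `add_arrays*` are made up for illustration. Every additional instruction set means another copy of the kernel, which is the duplication the list above is about.

```c
#include <stddef.h>

/* Baseline version: built with whatever the default compiler flags allow. */
static void add_arrays_baseline(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>

/* Same kernel, but this one function is compiled for AVX even if the
 * rest of the file is not (GCC/Clang "target" attribute). */
__attribute__((target("avx")))
static void add_arrays_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)          /* scalar tail */
        dst[i] = a[i] + b[i];
}
#endif

typedef void (*add_arrays_fn)(float *, const float *, const float *, size_t);

/* Pick the best implementation the running CPU supports. */
static add_arrays_fn select_add_arrays(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return add_arrays_avx;
#endif
    return add_arrays_baseline;
}

/* Public entry point: resolves the implementation on first call.
 * (No synchronization here; a real library would want some.) */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    static add_arrays_fn impl = NULL;
    if (!impl)
        impl = select_add_arrays();
    impl(dst, a, b, n);
}
```

Callers only ever see `add_arrays`; which body runs is decided at run time on the machine the binary lands on.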
Even when optimal performance may only be achieved via completely different approaches, the SIMD wrappers are easier to use, because they present a consistent interface. Any specialized instructions may be used by simply falling back to native intrinsics.
Thus I don't see much benefit in writing SIMD code without a wrapper. The only advantage of going without one is that it's harder to shoot oneself in the foot, which is easy to do with naive use of these wrappers, e.g. if one doesn't actually look at the generated assembly code.
Yeah, I understood what you meant, I've used wrappers like that before. My contention was with your original comment,
>It's possible to target everything from SSE and NEON to AVX512 with what is essentially a single code path.
the practice of which does not generally make the best usage of any particular instruction set, emulating certain operations that aren't available on a platform with multiple instructions, etc. It might be good enough for many light optimization jobs, in which case I'd say go for it, you're doing so much better than the vast majority of programmers writing Python or whatever. But what I was trying to argue was that if you really need to crunch the hell out of some numbers, then you probably have a small set of target platforms that you can justify directly using intrinsics (or even assembly) for.
This claim, however:
>I'm talking about e.g. AVX vs. AVX2 for floating-point code not SSE2 vs. AVX.
is a lot more reasonable, but you could do the same with some strategically placed #ifdefs with native intrinsics or assembly.
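For what it's worth, a minimal sketch of the "strategically placed #ifdefs" approach in C might look like the following. The kernel name `saxpy` and the choice of operation are just for illustration; the predefined macros `__AVX__` and `__FMA__` are the ones GCC and Clang set from flags like -mavx and -mfma (strictly speaking FMA is its own feature bit rather than part of AVX2, even though the two showed up together on many CPUs).

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>
#endif

/* y[i] = a * x[i] + y[i] */
void saxpy(float *y, const float *x, float a, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    const __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
#if defined(__FMA__)
        /* One fused instruction where the target has it. */
        vy = _mm256_fmadd_ps(va, vx, vy);
#else
        /* Plain AVX: separate multiply and add. */
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
#endif
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    /* Scalar tail, and the whole loop when AVX is not enabled. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The two variants differ in a single line here, which is why the #ifdef stays manageable; it gets uglier as the differences spread through the kernel.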
Not sure about "single code path". Differences among SIMD flavors are significant; there are cases where one-to-one translation is either impossible or impractical. A clear example is the AVX2 instructions that operate on 128-bit lanes rather than on whole 256-bit registers.
And wrappers exist in the C++ ecosystem; C programmers are stuck with intrinsics.
> And wrappers exist in the C++ ecosystem; C programmers are stuck with intrinsics.
If you can accept working with GNU extensions that are available in recent-ish GCC and Clang (but not MSVC, not sure about Intel ICC), there are pretty nice vector extensions [0].
With them you can get standard binary operators working for arithmetic (+,-,*,/ etc) and shuffling with __builtin_shuffle (GCC; Clang spells it __builtin_shufflevector). These are CPU independent, the same code compiles neatly to ARM NEON as well as x86 SSE+AVX+FMA. All you need is a typedef with an __attribute__.
The vector extension functions don't cover the whole instruction sets, but the vector types are compatible with __m128 and NEON native formats, so you can resort to intrinsics when necessary.
However, for a lot of SIMD tasks I encounter, just basic arithmetic + shuffles is more than 80% of what I need.
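As a small illustration of the "arithmetic + shuffles" style, here is a sketch of a 3D cross product written with the GNU vector extensions. The type names `v4f`/`v4i`, the `V4F_YZXW` macro and `cross3` are my own for this example and are not taken from the linked collection; the macro papers over the fact that GCC and Clang use different shuffle builtins.

```c
/* Four packed floats / ints; all it takes is a typedef with an attribute. */
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* Rotate lanes (x, y, z, w) -> (y, z, x, w). */
#if defined(__clang__)
#define V4F_YZXW(v) __builtin_shufflevector((v), (v), 1, 2, 0, 3)
#else
#define V4F_YZXW(v) __builtin_shuffle((v), (v4i){1, 2, 0, 3})
#endif

/* Cross product of the xyz parts of a and b; the w lane comes out as 0.
 * Uses the identity cross(a, b) = (a * b.yzx - a.yzx * b).yzx */
static inline v4f cross3(v4f a, v4f b)
{
    v4f t = a * V4F_YZXW(b) - V4F_YZXW(a) * b;
    return V4F_YZXW(t);
}
```

There is no x86 or ARM specific code in there; the compiler picks SSE, AVX or NEON instructions depending on the target flags.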
If you want to see some examples, take a look at my collection of 3d graphics and physics related SIMD routines [1]. (note: this project could use some help, let me know if you're interested in doing something with it or porting some of the hand optimized routines to more used math libs like glm)
> If you can accept working with GNU extensions that are available in recent-ish GCC and Clang
I do my private projects in C++, so that's not an issue for me there, but at my current company we use also MSVC. I wish we could abandon that compiler and work with GCC or clang only.
> However, for a lot of SIMD tasks I encounter, just basic arithmetic + shuffles is more than 80% of what I need.
Your remaining 20% is my 80%. :)
> ... but at my current company we use also MSVC. I wish we could abandon that compiler and work with GCC or clang only.
Good news! These days you can produce MSVC compatible binaries with Clang or even use Clang as a compiler from the C++ IDE.
Whether or not you can do this in practice is another matter, but it can be done.
> Your remaining 20% is my 80%. :)
Yeah, if you look at my examples, they're rather straightforward arithmetic with 4 dimensional vectors. There's very little need for any integer arithmetic or more exotic combinations of operations. A little fused multiply-and-add here and there.
But I haven't seen a better method for this, most of the code is CPU-agnostic and will compile to x86 or ARM code using all the available instruction sets (depending on compiler arguments, e.g. -mavx2 or -march=native). I really haven't seen a SIMD math lib with so little duplication for different CPUs elsewhere.
The property of AVX and AVX2 you mentioned actually helps with having a single code path. If the SIMD wrapper allows parameterization on vector width (most do), you can simply increase the vector width when compiling for AVX and that's it.
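A rough sketch of what "parameterization on vector width" can look like, using the GNU vector extensions as a stand-in for a wrapper (this is not any particular wrapper's API). `VEC_BYTES`, `vfloat` and `scale_add` are names invented for the example; the only thing that changes between SSE and AVX builds is the width constant.

```c
#include <stddef.h>
#include <string.h>

/* Width is a compile-time parameter: 32 bytes with AVX, 16 otherwise.
 * It can also be overridden with -DVEC_BYTES=... */
#ifndef VEC_BYTES
#  if defined(__AVX__)
#    define VEC_BYTES 32
#  else
#    define VEC_BYTES 16
#  endif
#endif

typedef float vfloat __attribute__((vector_size(VEC_BYTES)));
enum { VEC_LANES = VEC_BYTES / sizeof(float) };

/* dst[i] = a[i] * k + b[i], written once for any vector width. */
void scale_add(float *dst, const float *a, const float *b, float k, size_t n)
{
    vfloat vk;
    for (int j = 0; j < VEC_LANES; ++j)   /* broadcast k into every lane */
        vk[j] = k;

    size_t i = 0;
    for (; i + VEC_LANES <= n; i += VEC_LANES) {
        vfloat va, vb;
        memcpy(&va, a + i, sizeof va);     /* unaligned loads */
        memcpy(&vb, b + i, sizeof vb);
        vfloat r = va * vk + vb;
        memcpy(dst + i, &r, sizeof r);     /* unaligned store */
    }
    for (; i < n; ++i)                     /* scalar tail */
        dst[i] = a[i] * k + b[i];
}
```

This works without further thought only because every lane is independent; the kernel contains no shuffles or other cross-lane operations.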
I understand your point, however it's not as simple as it seems. Of course, for trivial code the transition between different SIMD flavors could be seamless. But the world is cruel. :)
Think about shuffle instructions (pshufb): the lookup vectors for the instruction are different in SSE and AVX2. Even if an AVX2 vector could be created by duplicating the SSE vector, this must be the programmer's decision.
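To show what the lane issue means for pshufb, here is a plain-C scalar model of the two instructions rather than real intrinsics, so it runs anywhere. The function names are made up; the behaviour modelled (index high bit zeroes the byte, low four bits select within the same 128-bit lane) is how pshufb and the 256-bit vpshufb are documented.

```c
#include <stdint.h>

/* Scalar model of SSSE3 pshufb (_mm_shuffle_epi8):
 * each index byte picks one of 16 table bytes, or 0 if its high bit is set. */
void model_pshufb(uint8_t out[16], const uint8_t table[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; ++i)
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Scalar model of AVX2 vpshufb (_mm256_shuffle_epi8):
 * two independent pshufb operations, one per 128-bit lane.
 * An index in the upper lane can never reach a byte in the lower lane. */
void model_vpshufb(uint8_t out[32], const uint8_t table[32], const uint8_t idx[32])
{
    for (int lane = 0; lane < 2; ++lane)
        model_pshufb(out + 16 * lane, table + 16 * lane, idx + 16 * lane);
}

/* Porting a 16-entry lookup table to AVX2: the table has to be
 * duplicated into both lanes, which is a decision the programmer makes. */
void widen_lut(uint8_t wide[32], const uint8_t lut[16])
{
    for (int i = 0; i < 16; ++i)
        wide[i] = wide[i + 16] = lut[i];
}
```

For a 16-entry lookup the duplication is all it takes. A shuffle that needs to move bytes across the 128-bit boundary cannot be expressed as a single vpshufb at all.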
Another example is an algorithm that uses the video-encoding instruction mpsadbw to locate substrings (http://0x80.pl/articles/sse4_substring_locate.html#introduct...). The AVX2 instruction vmpsadbw operates on 128-bit lanes, and parts of the algorithm have to be rewritten to fit this limitation.
Would you be able to point me towards a shipping product/library that does this? It's easy to find examples of people hardcoding x64 assembly (x264, zlib, libyuv) but I haven't stumbled across anybody making good use of a high level wrapper.
Though I must note that in this case the SIMD wrapper has significant problems. Due to certain design decisions, the wrapper performs suboptimally on mixed float-integer code on AVX, for example.