Hacker News
by clevernickname 3733 days ago
Realistically the vast majority of C and C++ codebases today will never touch anything more than x86 and ARM, and I wouldn't be surprised if most never even get past x86, so I don't buy the portability argument. Portability between SSE and AVX is a better argument.

But in any case, if you're using SIMD in anger, chances are you have hard performance requirements that you really care about, and a one size fits all approach is going to leave valuable performance on the table. Whether you just have to target your own servers, or any x86 CPU made in the past 6 years, or that plus NEON-equipped ARMs, it will probably be worth the effort to duplicate the code paths, especially in comparison to the initial effort of figuring out how to vectorize your problem in the first place.

And while it's nowhere near "leftpad", if you really want an SIMD wrapper and know what you're doing, it should be well within your capabilities to write your own. Maybe not quite as spiffy as the one on github, but when I get anywhere close to assembly I find that I get more value out of doing everything from scratch and truly understanding what I'm dealing with, rather than leaving anything in someone else's hands.

2 comments

> Realistically the vast majority of C and C++ codebases today will never touch anything more than x86 and ARM, and I wouldn't be surprised if most never even get past x86, so I don't buy the portability argument.

Just recently a Gentoo developer ported GHC to m68k, found some portability issues, and fixed them in the process, which benefits all architectures. This is also why OpenBSD devs are still on gcc3.

RISC and POWER, to mention just two, are very modern ISAs and not something you can easily ignore. We need more ISAs, like in the past, not just two. It's very dangerous to limit ourselves to just ARM/x86; diversity is a plus for writing more correct code and having more options. lowRISC is a nice fit for many things, as is POWER, while of course ARM and x86 are here to stay. I'd count Nvidia's and AMD's GPUs as the other major architectures, but we don't usually deal with GPUs directly at that level. You choose the right chip for the job, just as phones select different SoCs for different use cases.

The idea that compiling your code for 68000 or MIPS can reveal bugs in your code does not change the fact that x86 and ARM are pretty much the only relevant CPU architectures that all but the most entrenched of government contractors could ship a product on today or in the foreseeable future that would have any use for SIMD. If you actually have a need to do extensive SIMD optimizations (say, it could shave 5ms off your frame time in a game, or save you $XXXXXX/year in your data center), PowerPC does not enter your mind at any moment.

You see it as weeding out bugs and future-proofing your code in case x86 or ARM disappears tomorrow; I see it as a load of completely wasted work and missed optimization opportunities.

Also lowRISC learned nearly nothing from the past 20 years of CPU architecture advancement. It is not modern, it is a naive copy of a very outdated design.

By saying single code path, I don't mean single instruction stream. libsimdpp, for example, supports building the same code for different instruction sets, linking the results into the same executable, and then dispatching dynamically. Doing this by hand would mean that either:

- lots of time is wasted creating slightly different versions of code. I'm talking about e.g. AVX vs. AVX2 for floating-point code not SSE2 vs. AVX.

- micro-optimization opportunities are wasted by only coding for major revisions of the instruction set
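
To make "dispatching dynamically" concrete, here is a minimal sketch of hand-written runtime dispatch. It is not how libsimdpp does it (the comment above says libsimdpp builds the same source for several instruction sets); this version assumes GCC or Clang on x86 and uses the per-function `target` attribute plus `__builtin_cpu_supports`. The names `sum`, `sum_scalar`, `sum_avx`, and `pick_sum` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: GCC/Clang on x86. Elsewhere only the scalar path is compiled.
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

// Portable baseline, always available.
float sum_scalar(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#ifdef HAVE_X86_DISPATCH
// Same algorithm, with AVX enabled for this one function only.
__attribute__((target("avx")))
float sum_avx(const float* data, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (float v : lanes) total += v;       // horizontal sum of the 8 lanes
    for (; i < n; ++i) total += data[i];    // leftover elements
    return total;
}
#endif

using SumFn = float (*)(const float*, std::size_t);

// Pick an implementation based on what the running CPU reports.
SumFn pick_sum() {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx")) return sum_avx;
#endif
    return sum_scalar;
}

// Public entry point: the choice is made once, on first call.
float sum(const float* data, std::size_t n) {
    static const SumFn impl = pick_sum();
    assert(data != nullptr || n == 0);
    return impl(data, n);
}
```

Every additional instruction set means another near-copy of `sum_avx` plus another branch in `pick_sum`, which is the duplicated effort the two bullets above describe. Note that the two paths add in a different order, so results can differ in the last bits for general floating-point input.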

Even when optimal performance may only be achieved via completely different approaches, the SIMD wrappers are easier to use, because they present a consistent interface. Any specialized instructions may be used by simply falling back to native intrinsics.
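
As a rough illustration of "consistent interface, native intrinsics underneath", here is a toy 4-lane float wrapper. It is a sketch only, not the API of libsimdpp or any real library; `f32x4`, `load`, `store`, `splat`, and `saxpy4` are hypothetical names. It assumes SSE2 or NEON headers when the corresponding macros are defined and falls back to plain arrays otherwise.

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>
  #define VEC_SSE2 1
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  #define VEC_NEON 1
#endif

// Hypothetical wrapper type: one interface, per-ISA storage.
// The native member `v` stays public so callers can drop to intrinsics.
struct f32x4 {
#if defined(VEC_SSE2)
    __m128 v;
#elif defined(VEC_NEON)
    float32x4_t v;
#else
    float v[4];
#endif
};

inline f32x4 load(const float* p) {
#if defined(VEC_SSE2)
    return { _mm_loadu_ps(p) };
#elif defined(VEC_NEON)
    return { vld1q_f32(p) };
#else
    return { { p[0], p[1], p[2], p[3] } };
#endif
}

inline void store(float* p, f32x4 a) {
#if defined(VEC_SSE2)
    _mm_storeu_ps(p, a.v);
#elif defined(VEC_NEON)
    vst1q_f32(p, a.v);
#else
    for (int i = 0; i < 4; ++i) p[i] = a.v[i];
#endif
}

inline f32x4 splat(float x) {
#if defined(VEC_SSE2)
    return { _mm_set1_ps(x) };
#elif defined(VEC_NEON)
    return { vdupq_n_f32(x) };
#else
    return { { x, x, x, x } };
#endif
}

inline f32x4 operator+(f32x4 a, f32x4 b) {
#if defined(VEC_SSE2)
    return { _mm_add_ps(a.v, b.v) };
#elif defined(VEC_NEON)
    return { vaddq_f32(a.v, b.v) };
#else
    f32x4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
#endif
}

inline f32x4 operator*(f32x4 a, f32x4 b) {
#if defined(VEC_SSE2)
    return { _mm_mul_ps(a.v, b.v) };
#elif defined(VEC_NEON)
    return { vmulq_f32(a.v, b.v) };
#else
    f32x4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
#endif
}

// User code is written once against the wrapper: y = a * x + y for 4 floats.
inline void saxpy4(float a, const float* x, float* y) {
    assert(x != nullptr && y != nullptr);
    store(y, splat(a) * load(x) + load(y));
}
```

`saxpy4` never mentions an instruction set, and where a platform has a specialized instruction the caller can still reach through `.v` and use the native intrinsic directly.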

Thus I don't see much benefit in writing SIMD code without a wrapper. The only advantage is that it's harder to shoot oneself in the foot than with naive use of these wrappers, e.g. if one doesn't actually look at the generated assembly code.

Yeah, I understood what you meant, I've used wrappers like that before. My contention was with your original comment,

>It's possible to target everything from SSE and NEON to AVX512 with what is essentially a single code path.

the practice of which does not generally make the best usage of any particular instruction set, emulating certain operations that aren't available on a platform with multiple instructions, etc. It might be good enough for many light optimization jobs, in which case I'd say go for it, you're doing so much better than the vast majority of programmers writing Python or whatever. But what I was trying to argue was that if you really need to crunch the hell out of some numbers, then you probably have a small set of target platforms that you can justify directly using intrinsics (or even assembly) for.
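
One example of the kind of emulation meant here, as a sketch under assumptions rather than a description of what any particular wrapper emits: a per-lane select is a single instruction with SSE4.1 (`_mm_blendv_ps`), but on plain SSE2 it has to be built from AND, ANDNOT, and OR. The function name `select4` and the mask convention (each lane all-ones or all-zeros) are chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE4_1__)
  #include <smmintrin.h>
#elif defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>
  #define SELECT_SSE2 1
#endif

// out[i] = mask[i] ? a[i] : b[i] for 4 floats.
// Assumption: every mask lane is either 0 or 0xFFFFFFFF.
void select4(const std::uint32_t* mask, const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; ++i)
        assert(mask[i] == 0u || mask[i] == 0xFFFFFFFFu);
#if defined(__SSE4_1__)
    // One instruction: takes the second operand where the mask sign bit is set.
    __m128 m = _mm_castsi128_ps(_mm_loadu_si128(reinterpret_cast<const __m128i*>(mask)));
    _mm_storeu_ps(out, _mm_blendv_ps(_mm_loadu_ps(b), _mm_loadu_ps(a), m));
#elif defined(SELECT_SSE2)
    // Three instructions emulating the blend: (m & a) | (~m & b).
    __m128 m = _mm_castsi128_ps(_mm_loadu_si128(reinterpret_cast<const __m128i*>(mask)));
    __m128 picked_a = _mm_and_ps(m, _mm_loadu_ps(a));
    __m128 picked_b = _mm_andnot_ps(m, _mm_loadu_ps(b));
    _mm_storeu_ps(out, _mm_or_ps(picked_a, picked_b));
#else
    // Scalar fallback for targets without SSE.
    for (int i = 0; i < 4; ++i) out[i] = mask[i] ? a[i] : b[i];
#endif
}
```

Behind a wrapper's `select`-style call, both x86 versions look identical at the source level, so the extra instructions on the older target only show up if you read the generated assembly.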

This claim, however:

>I'm talking about e.g. AVX vs. AVX2 for floating-point code not SSE2 vs. AVX.

is a lot more reasonable, but you could do the same with some strategically placed #ifdefs with native intrinsics or assembly.
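
A sketch of what such a strategically placed #ifdef could look like for floating-point AVX vs. AVX2 code, assuming the variant is chosen at compile time via the usual `__AVX2__` / `__AVX__` macros. Reversing 8 floats is used as the example because AVX2 has a cross-lane permute (`_mm256_permutevar8x32_ps`) while plain AVX needs two steps; `reverse8` is a made-up name.

```cpp
#include <cassert>

#if defined(__AVX__)
#include <immintrin.h>
#endif

// Writes the 8 floats of `in` to `out` in reverse order.
// `in` and `out` must not be the same buffer.
void reverse8(const float* in, float* out) {
    assert(in != out);
#if defined(__AVX2__)
    // AVX2: a single cross-lane permute with an index vector.
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    _mm256_storeu_ps(out, _mm256_permutevar8x32_ps(_mm256_loadu_ps(in), idx));
#elif defined(__AVX__)
    // AVX: swap the two 128-bit halves, then reverse within each half.
    __m256 v = _mm256_loadu_ps(in);
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);
    _mm256_storeu_ps(out, _mm256_permute_ps(swapped, _MM_SHUFFLE(0, 1, 2, 3)));
#else
    // No AVX at compile time: plain scalar code.
    for (int i = 0; i < 8; ++i) out[i] = in[7 - i];
#endif
}
```

Only the few lines that differ between the two instruction sets sit inside the #ifdef; everything around them is shared. The cost is that the binary is fixed to one variant per build unless it is combined with some form of runtime dispatch.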