by nemequ1729 2146 days ago
(Lead developer of SIMDe here.)

It sounds like you actually know what you're doing, so in this case you're probably right, at least if all you do is compile your x86 code with SIMDe.

That said, SIMDe also provides support for other architectures, notably WASM SIMD 128 and AltiVec/VSX, as well as portable implementations which work everywhere, including on CPUs I'd never heard of until people told me SIMDe was working well on them (I'm thinking of Kalray, which supports vectors but doesn't have an API and instead relies on compiler auto-vectorization support).

One use case for SIMDe which may be interesting for you is that you can freely mix calls to different APIs. Say, for example, that you already have a bunch of x86 code written and want a NEON port. You can add SIMDe and you get a NEON port basically for free, then you can start adding some ifdefs to add optimizations for NEON without having to rewrite the whole thing. SIMDe doesn't in any way prevent you from optimizing your NEON (or whatever) port.
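
A minimal sketch of that workflow, assuming SIMDe is available on the include path as `simde/x86/sse.h`. The kernel `scale_add_f32` and the macro `SKETCH_HAVE_SIMDE` are hypothetical names made up for illustration; the `simde_mm_*` calls are the SIMDe-prefixed forms of the usual SSE intrinsics. The x86-style path is what you would start with, and the `__ARM_NEON` branch is the kind of hand-tuned override you might add later.

```c
#include <stddef.h>

/* Assumption: SIMDe is on the include path. If it is not found, the sketch
   falls back to the plain scalar loop so that it still builds. */
#if defined(__has_include)
#  if __has_include("simde/x86/sse.h")
#    include "simde/x86/sse.h"
#    define SKETCH_HAVE_SIMDE 1 /* hypothetical macro, not part of SIMDe */
#  endif
#endif

#if defined(__ARM_NEON)
#  include <arm_neon.h>
#endif

/* Hypothetical kernel: dst[i] = a[i] * scale + b[i] */
void scale_add_f32(float *dst, const float *a, const float *b,
                   float scale, size_t n) {
  size_t i = 0;
#if defined(__ARM_NEON)
  /* Hand-written NEON path, added after the initial port. */
  const float32x4_t vs = vdupq_n_f32(scale);
  for (; i + 4 <= n; i += 4) {
    float32x4_t va = vld1q_f32(a + i);
    float32x4_t vb = vld1q_f32(b + i);
    vst1q_f32(dst + i, vmlaq_f32(vb, va, vs)); /* vb + va * vs */
  }
#elif defined(SKETCH_HAVE_SIMDE)
  /* Existing SSE code, with the intrinsics spelled through SIMDe. */
  const simde__m128 vs = simde_mm_set1_ps(scale);
  for (; i + 4 <= n; i += 4) {
    simde__m128 va = simde_mm_loadu_ps(a + i);
    simde__m128 vb = simde_mm_loadu_ps(b + i);
    simde_mm_storeu_ps(dst + i, simde_mm_add_ps(simde_mm_mul_ps(va, vs), vb));
  }
#endif
  /* Scalar tail; also the whole loop when neither path is compiled in. */
  for (; i < n; i++) {
    dst[i] = a[i] * scale + b[i];
  }
}
```

Functions that never get a dedicated NEON branch simply keep going through the SIMDe path, so the port can be tuned one hot spot at a time.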

The way I tend to look at it is that SIMDe never makes your code slower, only more portable.

1 comments

I might give it a try next time I need to do something for AMD64 + NEON. Not going to rewrite code I have already written, debugged and shipped.

Also, about this

> we have an extensive test suite to verify our implementations

Don’t forget about the MXCSR register in that suite, esp. its rounding bits. I avoid changing it as much as possible because the state is preserved across context switches and causes funny things in OpenMP and other thread pools, but not everyone is aware of that. Also, there’s a non-trivial amount of code written for SSE < 4.1 (SSE 4.1 introduced a proper rounding instruction, roundps) where you are sometimes forced to mess with the MXCSR rounding bits because the alternatives are much slower.

> Not going to rewrite code I have already written, debugged and shipped.

I wouldn't, either. At least unless you want a WASM/AltiVec/etc. version. But if you already have good implementations SIMDe probably won't help.

OTOH, if all you have is an x86 implementation and a portable fallback, the SIMDe version of the x86 implementation will probably be faster than your portable version. That's what happened with MMseqs2 (<https://github.com/soedinglab/MMseqs2>).

> Don’t forget about MXCSR register in that suite

Yeah, rounding is definitely a PITA. It's actually something I completely screwed up on in the beginning of the project and had to go back and correct :(. We do have some tests now which fiddle with the rounding mode to verify correctness, but we could definitely use more. And obviously we can't always set a dedicated register to control behavior, so on some platforms `_mm_getcsr`/`_mm_setcsr`/`_MM_GET_ROUNDING_MODE`/`_MM_SET_ROUNDING_MODE` become `fegetround`/`fesetround`, which probably won't be a problem but still makes me uncomfortable.

The other area where we could really use more tests is replicating behavior for NaNs. By default we try to replicate the behavior of the function we're trying to emulate, but we currently only test NaN handling on a few functions :(. If you use `-ffast-math` or `-ffinite-math-only`, though, we disable that code (compilers define `__FINITE_MATH_ONLY__`) and just use the fastest implementation we can.