by Const-me
2146 days ago
I might give it a try next time I need to do something for AMD64 + NEON. I’m not going to rewrite code I have already written, debugged and shipped.
Also, about this:
> we have an extensive test suite to verify our implementations
Don’t forget about the MXCSR register in that suite, especially its rounding bits. I avoid changing it as much as possible because the state is preserved across context switches and causes funny things in OpenMP and other thread pools, but not everyone is aware of that. Also, there’s a non-trivial amount of code written for SSE < 4.1 (4.1 introduced a proper rounding instruction, roundps) where you are sometimes forced to mess with the MXCSR rounding bits because the alternatives are much slower.
|
I wouldn't, either. At least unless you want a WASM/AltiVec/etc. version. But if you already have good implementations SIMDe probably won't help.
OTOH, if all you have is an x86 implementation and a portable fallback, the SIMDe version of the x86 implementation will probably be faster than your portable version. That's what happened with MMseqs2 (<https://github.com/soedinglab/MMseqs2>).
> Don’t forget about MXCSR register in that suite
Yeah, rounding is definitely PITA. It's actually something I completely screwed up on in the beginning of the project and had to go back and correct :(. We do have some tests now which fiddle with the rounding mode to verify correctness, but could definitely use more, and obviously we can't always set a dedicated register to control behavior, so on some platforms `_mm_getcsr`/`_mm_setcsr`/`_MM_GET_ROUNDING_MODE`/`_MM_SET_ROUNDING_MODE` becomes `fegetround`/`fesetround`, which probably won't be a problem but still makes me uncomfortable.
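To make the MXCSR-to-`fenv` translation concrete, here is a minimal sketch of how the rounding-control bits could be mapped onto the C99 rounding macros. This is not SIMDe's actual code: `sketch_mxcsr_to_fenv` and the `SKETCH_MM_*` constants are hypothetical names, the constants just mirror the documented values of `_MM_ROUND_NEAREST`, `_MM_ROUND_DOWN`, `_MM_ROUND_UP` and `_MM_ROUND_TOWARD_ZERO` so the snippet builds without `<xmmintrin.h>`, and it assumes the target defines all four `FE_*` macros.

```c
#include <fenv.h>

/* Hypothetical local copies of the x86 MXCSR rounding-control values
 * (bits 13-14), so the sketch does not need <xmmintrin.h>. */
#define SKETCH_MM_ROUND_NEAREST     0x0000u
#define SKETCH_MM_ROUND_DOWN        0x2000u
#define SKETCH_MM_ROUND_UP          0x4000u
#define SKETCH_MM_ROUND_TOWARD_ZERO 0x6000u
#define SKETCH_MM_ROUND_MASK        0x6000u

/* Translate the rounding bits of an MXCSR-style value into the matching
 * <fenv.h> rounding macro. All other MXCSR bits are ignored. */
static int sketch_mxcsr_to_fenv(unsigned int mxcsr) {
    switch (mxcsr & SKETCH_MM_ROUND_MASK) {
        case SKETCH_MM_ROUND_NEAREST:     return FE_TONEAREST;
        case SKETCH_MM_ROUND_DOWN:        return FE_DOWNWARD;
        case SKETCH_MM_ROUND_UP:          return FE_UPWARD;
        case SKETCH_MM_ROUND_TOWARD_ZERO: return FE_TOWARDZERO;
    }
    return -1; /* not reached: the mask admits only the four values above */
}
```

A portable stand-in for `_MM_SET_ROUNDING_MODE(mode)` could then call `fesetround(sketch_mxcsr_to_fenv(mode))`. One caveat with that approach: `fesetround` changes the rounding mode for all floating-point code in the thread, not only the emulated SSE operations.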
The other area where we could really use more tests is replicating behavior for NaNs. By default we try to replicate the behavior of the function we're trying to emulate, but we currently only test NaN handling on a few functions :(. If you use `-ffast-math` or `-ffinite-math-only`, though, we disable that code (compilers signal this via `__FINITE_MATH_ONLY__`) and just use the fastest implementation we can.
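As a rough illustration of that NaN guard, here is a scalar sketch modeled on `_mm_min_ps`, which returns its second operand whenever either input is NaN. The names `sketch_native_min` and `sketch_min_f32` are hypothetical, and `sketch_native_min` is only a stand-in for a target primitive that ignores NaNs; none of this is SIMDe's actual implementation.

```c
/* Stand-in for a target-native min that ignores a single NaN input
 * (similar in spirit to C's fminf). Hypothetical, for illustration only. */
static float sketch_native_min(float a, float b) {
    if (a != a) return b;          /* a is NaN */
    if (b != b) return a;          /* b is NaN */
    return (a < b) ? a : b;
}

/* Scalar model of an emulated x86-style min. */
static float sketch_min_f32(float a, float b) {
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
    /* The user promised there are no NaNs: skip the fix-up entirely. */
    return sketch_native_min(a, b);
#else
    /* Replicate the x86 rule: if either input is NaN, return the
     * second operand, whatever it is. */
    if (a != a || b != b) return b;
    return sketch_native_min(a, b);
#endif
}
```

With default compiler flags, `sketch_min_f32(NAN, 1.0f)` yields `1.0f` while `sketch_min_f32(1.0f, NAN)` yields NaN, which is the asymmetry a NaN test has to pin down. With `-ffinite-math-only` the extra comparison disappears and the result for NaN inputs is no longer specified.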