| > Not going to rewrite code I have already written, debugged and shipped. I wouldn't, either. At least not unless you want a WASM/AltiVec/etc. version. But if you already have good implementations, SIMDe probably won't help. OTOH, if all you have is an x86 implementation and a portable fallback, the SIMDe version of the x86 implementation will probably be faster than your portable version. That's what happened with MMseqs2 (<https://github.com/soedinglab/MMseqs2>). > Don’t forget about MXCSR register in that suite Yeah, rounding is definitely a PITA. It's actually something I completely screwed up at the beginning of the project and had to go back and correct :(. We do have some tests now which fiddle with the rounding mode to verify correctness, but we could definitely use more. Obviously we can't always set a dedicated register to control behavior, so on some platforms `_mm_getcsr`/`_mm_setcsr`/`_MM_GET_ROUNDING_MODE`/`_MM_SET_ROUNDING_MODE` become `fegetround`/`fesetround`, which probably won't be a problem but still makes me uncomfortable. The other area where we could really use more tests is replicating behavior for NaNs. By default we try to replicate the NaN behavior of the function we're emulating, but we currently only test NaN handling on a few functions :(. If you use `-ffast-math` or `-ffinite-math-only` (compilers define `__FINITE_MATH_ONLY__`), though, we disable that code and just use the fastest implementation we can. |