Hacker News new | ask | show | jobs
by red2awn 1066 days ago
I experimented with different optimizations and ended with 128x speedup. The improvement mainly comes from manual SIMD intrinsics, but you can go a long way just by making the code more auto-vectorization friendly as some other comments have mentioned. See:

https://ipthomas.com/blog/2023/07/n-times-faster-than-c-wher...