Hacker News
by conistonwater 1909 days ago
I wonder how the inner loop here (https://github.com/oliverphilcox/Keplers-Goat-Herd/blob/3a0b...) with N_it=5 ends up twice as slow as the inner loop here (https://github.com/oliverphilcox/Keplers-Goat-Herd/blob/3a0b...) with N_it=18. Reading the code, I wouldn't expect the second loop to be two times faster at all, and I've spent a lot of time optimizing numerical code. Is it possible that the compiler managed to vectorize the faster loop but not the slower one, or something like that? Or is it specifically that the slower loop has too many divisions and they are too expensive? Or is it the N_it-1 extra evaluations of sincos?
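To make the per-iteration cost comparison concrete, here is a hypothetical sketch of two loop shapes. It is not the repository's code, and the function names and loop bodies are assumptions for illustration only. Loop A is a Newton-Raphson iteration for Kepler's equation E - e*sin(E) = M, which pays for one sin/cos pair and one division on every pass. Loop B computes sin and cos at n equally spaced angles with a single sin/cos call up front and only multiplies and adds inside the loop, via the angle-addition formulas. A loop shaped like B can run more iterations than one shaped like A and still be cheaper. Whether the linked loops actually follow these patterns is not confirmed here, and Python timings say nothing about compiled or vectorized code, so this only illustrates operation counts.

```python
import math


def kepler_newton(M, e, n_it):
    """Hypothetical loop A: Newton-Raphson for E - e*sin(E) = M.

    Each iteration evaluates sin(E) and cos(E) and performs one division,
    so n_it iterations cost n_it sin/cos pairs and n_it divisions.
    """
    E = M  # simple starting guess (an assumption, not the repository's choice)
    for _ in range(n_it):
        s, c = math.sin(E), math.cos(E)
        E -= (E - e * s - M) / (1.0 - e * c)  # one division per iteration
    return E


def sincos_by_rotation(h, n):
    """Hypothetical loop B: (sin, cos) at the n angles k*h, k = 0..n-1.

    Only one sin/cos pair (for the step h) is evaluated; each iteration then
    applies the angle-addition formulas, which use only multiplies and adds.
    """
    sh, ch = math.sin(h), math.cos(h)
    sk, ck = 0.0, 1.0  # sin(0), cos(0)
    out = []
    for _ in range(n):
        out.append((sk, ck))
        # Rotate by h: sin(a+h) = sin a cos h + cos a sin h,
        #              cos(a+h) = cos a cos h - sin a sin h
        sk, ck = sk * ch + ck * sh, ck * ch - sk * sh
    return out


M, e = 1.0, 0.5
E = kepler_newton(M, e, n_it=5)
print(f"E = {E:.12f}, residual = {E - e * math.sin(E) - M:.1e}")

h = 2 * math.pi / 18
pairs = sincos_by_rotation(h, 18)
max_err = max(
    abs(s - math.sin(k * h)) + abs(c - math.cos(k * h))
    for k, (s, c) in enumerate(pairs)
)
print(f"max recurrence error over 18 points: {max_err:.1e}")
```

The recurrence trades library calls for a small, slowly accumulating rounding error, which stays negligible for a few dozen steps.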

I see a lot of sin and cos calls in both of those, and SSE and AVX have no instructions that compute either function on vectors, so I doubt vectorization explains it.