I wonder how the inner loop here (https://github.com/oliverphilcox/Keplers-Goat-Herd/blob/3a0b...) with N_it=5 ends up 2 times slower than the inner loop here (https://github.com/oliverphilcox/Keplers-Goat-Herd/blob/3a0b...) with N_it=18. The second loop doesn't look twice as fast at all, and I've spent a lot of time optimizing numerical code. Is it possible that the compiler managed to vectorize the faster loop but not the slower one, or something like that? Or is it specifically that there are too many divisions and they are too expensive? Or is it the N_it-1 extra evaluations of sincos?