by janwas
1480 days ago
Thanks :) I'd be interested to hear how it goes for you. Agreed that 4x unrolling gets most of the low-hanging fruit without excessive code size. I saw only very slightly better performance on SKX with 8x. You're right that it's nicer when the compiler can decide about the unrolling, for example knowing whether we have 16 or 32 registers. The unsafe/fast-math flags are pretty dangerous, though :/ https://simonbyrne.github.io/notes/fastmath/
Especially when they enable flush-to-zero, which would be unacceptable for a library loaded into some other application.
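
To make the trade-off concrete, here is a minimal sketch of 4x unrolling done by hand for a float sum. The function name `SumUnrolled4` is hypothetical and this is only an illustration, not code from any particular library. The four independent accumulators spell out the reassociation in the source, so it does not depend on fast-math flags such as `-fassociative-math`, which is what GCC otherwise requires before it will reorder a floating-point reduction.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): sums n floats using four
// independent accumulators, i.e. a manual 4x unroll of the reduction.
float SumUnrolled4(const float* x, std::size_t n) {
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  std::size_t i = 0;
  // Main loop: four additions per iteration with no dependency between them,
  // so they can overlap in the pipeline.
  for (; i + 4 <= n; i += 4) {
    s0 += x[i + 0];
    s1 += x[i + 1];
    s2 += x[i + 2];
    s3 += x[i + 3];
  }
  // Combine the partial sums, then handle the remaining 0..3 elements.
  float sum = (s0 + s1) + (s2 + s3);
  for (; i < n; ++i) {
    sum += x[i];
  }
  return sum;
}
```

The result can differ in the last bits from a strictly sequential sum, because the additions happen in a different order. The difference from a fast-math build is that the reordering is chosen explicitly here and nothing changes denormal handling for the rest of the process.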