by brigade
4244 days ago
Well, if it's between auto-vectorization and intrinsics... Lately I've been rather disappointed by how little there is to gain from eliminating the register spills that compilers produce from intrinsics on modern CPUs, with their wide decode/issue, 16 registers, and dual load pipelines. By the time a loop is complex enough that the compiler spills, the extra load/store uops are almost free from a microbenchmark perspective. The macro gains from smaller code and reduced cache usage are a bit bigger, but still depressingly minor for the effort expended. If you care about 32-bit x86, with its 8 registers, that's another story, of course.