Well, if it's between auto-vectorization and intrinsics...
Lately I've been rather disappointed in how minimal the gains are in reducing register spills from intrinsics on modern CPUs, with their wide decode/issue, 16 registers, and dual load pipelines. By the time a loop is complex enough that a compiler spills, the extra load/store uops are almost free from a microbenchmark perspective. The macro gains from smaller code and reduced cache usage are a bit bigger, but still depressingly minor for the effort expended.
But if you care about 32-bit x86, that's another story, of course.
So, one of the real reasons reducing register spills does not help is not related to what you suggest. It's because modern x86 CPUs play games with what looks like "memory" to you, so you aren't actually spilling into "memory" anyway :)
That's the magic I'm talking about, because it's not true: memory is memory, and stack memory isn't treated specially by the processor. What the processor does have is a store buffer, which applies to all memory accesses and is what store forwarding uses to bypass L1.
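To make the store-forwarding point concrete, here is a rough sketch in C. The function names are made up for illustration, and using `volatile` to imitate a spill is my assumption, not something a compiler literally emits. The second loop stores its accumulator to a stack slot and reloads it on every iteration, so each reload can be served from the store buffer rather than waiting on L1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: the accumulator can stay in a register for the whole loop. */
uint64_t sum_in_register(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

/* Hypothetical "spilled" variant: volatile forces a store to the stack
 * slot and a reload from it on every iteration. The reload hits a store
 * that is still in flight, which is the case store forwarding handles. */
uint64_t sum_via_stack_slot(const uint32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    volatile uint64_t slot = 0;
    for (size_t i = 0; i < n; i++)
        slot = slot + data[i];
    return slot;
}
```

Both functions return the same sum, so timing them over a large array is one way to see what the extra store/reload costs on a given CPU. Note that this sketch puts the reload on the loop-carried dependency chain, so it exposes forwarding latency; a typical compiler spill is usually off the critical path and should look cheaper than this.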