by brigade 4244 days ago
Well, if it's between auto-vectorization and intrinsics...

Lately I've been rather disappointed by how little there is to gain from reducing the register spills that compilers generate from intrinsics on modern CPUs. With their wide decode/issue, 16 registers, and dual load pipelines, by the time a loop is complex enough that a compiler spills, the extra load/store uops are almost free from a microbenchmark perspective. The macro gains from smaller code and reduced cache usage are a bit bigger, but still depressingly minor for the effort expended.

But if you care about 32-bit x86, that's another story, of course.

1 comment

So, one of the real reasons reducing register spills does not help is unrelated to what you suggest: modern x86 processors play games with what looks like "memory" to you, so you aren't actually spilling into "memory" anyway :)
You (and a lot of people) make it sound like it's magic, but it's not - http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation...
It's not magic. But it's not what that blog post is talking about.

On some of these processors, 128 bytes of stack or so is not really "memory" (in the sense of being stored with memory), so spilling is not that bad.

That's the magic I'm talking about, and it's not true: memory is memory, and stack memory isn't treated specially by the processor. What it does have is a store buffer, which applies to all memory accesses and is what store forwarding uses to bypass L1.
I'm simply going to disagree with you on this one, because I can't make my evidence public :)
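
For anyone who wants to poke at the store-forwarding point themselves, here is a minimal sketch in C of the two cases being argued about. The function names are hypothetical, and using `volatile` to imitate a spill is an assumption of this sketch, not something anyone in the thread did. Note that the forced reload sits on the loop-carried dependency chain, so this is closer to a worst case (forwarding latency) than to a typical compiler spill, which is usually off the critical path.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: the accumulator is an ordinary local, so an optimizing
 * compiler is free to keep it in a register for the whole loop. */
uint64_t sum_in_register(const uint32_t *data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += data[i];
    }
    return acc;
}

/* Imitated spill (assumption of this sketch): volatile forces a store
 * to the stack slot and a reload from it on every iteration, so each
 * load reads a value that was just stored to the same address. */
uint64_t sum_with_forced_spill(const uint32_t *data, size_t n) {
    volatile uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = acc + data[i];
    }
    return acc;
}
```

Both functions return the same sum; only the generated loads and stores differ. Timing them over a large array with optimizations on (and checking the disassembly to confirm the store/reload is really there) gives a rough feel for what a store followed by a reload of the same address costs on a given CPU.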