| HN Mirror



	by nullc 4291 days ago
	> video codecs, etc.) prefer the intrinsics.

	Prefer assembly. Intrinsics usually make a disaster of register allocation and you lose much of your performance to needless load/stores.


brigade 4291 days ago

Well, if it's between auto-vectorization and intrinsics...

Lately I've been rather disappointed by how minimal the gains from reducing register spills from intrinsics are on modern CPUs, with their wide decode/issue, 16 registers, and dual load pipelines. By the time a loop is complex enough that a compiler spills, the extra load/store uops are almost free from a microbenchmark perspective. The macro gains from smaller code and reduced cache usage are a bit bigger, but still depressingly minor for the effort expended.

But if you care about 32-bit x86, that's another story, of course.
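
A rough way to probe the spill-cost claim above is to compare a loop whose accumulator can stay in a register with one that is forced through memory on every iteration. The sketch below is only an illustration: the function names are hypothetical, and `volatile` is used as a stand-in for a compiler-generated spill, which it only approximates. It also puts the memory round trip on the loop-carried dependency chain, which is close to a worst case; a spill off the critical path would cost less.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: the compiler is free to keep acc in a register. */
uint64_t sum_in_register(const uint32_t *data, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = acc + data[i];
    }
    return acc;
}

/* Hypothetical stand-in for a spill: volatile forces one load and one
   store of acc per iteration, so the value round-trips through memory. */
uint64_t sum_with_forced_spill(const uint32_t *data, size_t n) {
    volatile uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = acc + data[i];
    }
    return acc;
}
```

Both functions return the same sum; timing them over a large array (with whatever cycle counter or benchmark harness is at hand) gives a crude upper bound on what one spilled value costs on a given CPU.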


DannyBee 4291 days ago

So, one of the real reasons reducing register spills does not help is not related to what you suggest; it's because modern x86 processors play games with what looks like "memory" to you, so you aren't actually spilling into "memory" anyway :)


brigade 4291 days ago

You (and a lot of people) make it sound like it's magic, but it's not - http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation...


DannyBee 4291 days ago

It's not magic. But it's not what that blog post is talking about.

On some of these processors, 128 bytes of stack or so is not really "memory" (in the sense of being stored with memory), so spilling is not that bad.


brigade 4291 days ago

That's the magic I'm talking about because it's not true; memory is memory and stack memory isn't treated specially by the processor. What it does have is a store buffer, which applies to all memory accesses and is what store forwarding uses to bypass L1.
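
The disagreement above is, in principle, testable: run the same store-then-reload pattern against a stack slot and a heap slot and time both. The sketch below only sets up the two cases; the function names are hypothetical, `volatile` is used so the compiler actually emits the store and the load, and nothing here asserts which side of the thread is right.

```c
#include <stdint.h>
#include <stdlib.h>

/* Store to *slot, then immediately load it back, n times.
   The load hits an address with a store still in flight, which is
   the pattern store-to-load forwarding is meant to serve. */
uint64_t store_then_load(volatile uint64_t *slot, uint64_t n) {
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; i++) {
        *slot = i;       /* store */
        total += *slot;  /* dependent load of the same address */
    }
    return total;
}

/* Same pattern with the slot on the stack. */
uint64_t run_on_stack(uint64_t n) {
    volatile uint64_t slot = 0;
    return store_then_load(&slot, n);
}

/* Same pattern with the slot on the heap.
   Returns 0 if the allocation fails (illustrative error handling). */
uint64_t run_on_heap(uint64_t n) {
    volatile uint64_t *slot = malloc(sizeof *slot);
    if (slot == NULL) {
        return 0;
    }
    uint64_t total = store_then_load(slot, n);
    free((void *)slot);
    return total;
}
```

If the two variants take about the same time per iteration on a given CPU, that would be consistent with stores being handled the same way regardless of where the address lives; a consistent gap would point the other way.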


DannyBee 4291 days ago

I'm simply going to disagree with you on this one, because I can't make my evidence public :)
