

by Nyan 1456 days ago

Nice!

> For deferring the madd specifically, using two separate sum2 vectors and splitting the `mad` vector into two

Actually, the idea was to accumulate into 16-bit sums and only do the madd to 32-bit every 4 loop iterations. I'm not sure splitting it up like that actually helps, since an OoO processor can easily hide the latency, and it could actually be detrimental by adding more uops.
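To make the deferred widening concrete, here's a rough scalar model in C. It isn't anyone's actual implementation. It assumes 16-byte vectors with byte weights 16..1, so each of the 8 lanes mimics what `_mm_maddubs_epi16` would produce, and the final lane sum stands in for the `_mm_madd_epi16` widening step. The names `adler32_ref` and `adler32_deferred` are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define ADLER_MOD 65521u

/* Reference: plain byte-at-a-time Adler-32. */
static uint32_t adler32_ref(const uint8_t *p, size_t n) {
    uint32_t s1 = 1, s2 = 0;
    for (size_t i = 0; i < n; i++) {
        s1 = (s1 + p[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}

/* Scalar model of deferring the widen-to-32-bit step (assumed layout:
 * 16-byte vectors, weights 16..1, 8 lanes of 16 bits each). */
static uint32_t adler32_deferred(const uint8_t *p, size_t n) {
    uint32_t s1 = 1, s2 = 0;
    size_t off = 0;
    for (; off + 64 <= n; off += 64) {
        uint16_t acc16[8] = {0};            /* 16-bit lane accumulators */
        for (int v = 0; v < 4; v++) {       /* 4 vectors per widening */
            const uint8_t *d = p + off + 16 * v;
            uint32_t bytesum = 0;
            for (int j = 0; j < 8; j++) {
                /* One lane: two adjacent bytes times their weights.
                 * Worst case is 255 * (16 + 15) = 7905 per vector, so four
                 * vectors give at most 31620, which still fits in a signed
                 * 16-bit lane. */
                int pair = (16 - 2 * j) * d[2 * j] + (15 - 2 * j) * d[2 * j + 1];
                acc16[j] = (uint16_t)(acc16[j] + pair);
                bytesum += (uint32_t)d[2 * j] + d[2 * j + 1];
            }
            s2 += 16 * s1;                  /* s1 as it was before this vector */
            s1 += bytesum;
        }
        uint32_t wide = 0;                  /* the deferred widening step */
        for (int j = 0; j < 8; j++) wide += acc16[j];
        s2 = (s2 + wide) % ADLER_MOD;
        s1 %= ADLER_MOD;
    }
    for (size_t i = off; i < n; i++) {      /* leftover bytes, one at a time */
        s1 = (s1 + p[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}
```

The 4-iteration limit falls out of the lane bound in the comment: a fifth vector could push a lane past 32767, which a signed 16-bit madd would misread. With wider vectors and larger weights the safe count would be smaller.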

One thing to note is that you've got a dependent add chain on `sum2_v`, so using two independent sums instead of one could help.
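A scalar sketch of what I mean by two independent sums, again in C and only modelling the weighted term of the second sum (no `16 * s1` term, no modulo). The function names and the 16-byte block size are assumptions for illustration, not taken from your code. Whether it's actually faster would need measuring.

```c
#include <stdint.h>
#include <stddef.h>

/* Weighted contribution of one 16-byte block: sum of (16 - i) * d[i]. */
static uint32_t block_weighted_sum(const uint8_t *d) {
    uint32_t w = 0;
    for (int i = 0; i < 16; i++) w += (uint32_t)(16 - i) * d[i];
    return w;
}

/* One accumulator: every add has to wait for the previous add. */
static uint32_t sum2_single(const uint8_t *p, size_t nblocks) {
    uint32_t sum2 = 0;
    for (size_t b = 0; b < nblocks; b++)
        sum2 += block_weighted_sum(p + 16 * b);
    return sum2;
}

/* Two accumulators: the two add chains don't depend on each other,
 * so they can be in flight at the same time; combined once at the end. */
static uint32_t sum2_split(const uint8_t *p, size_t nblocks) {
    uint32_t sum2_a = 0, sum2_b = 0;
    size_t b = 0;
    for (; b + 2 <= nblocks; b += 2) {
        sum2_a += block_weighted_sum(p + 16 * b);
        sum2_b += block_weighted_sum(p + 16 * (b + 1));
    }
    if (b < nblocks)                        /* odd block count */
        sum2_a += block_weighted_sum(p + 16 * b);
    return sum2_a + sum2_b;
}
```

Both versions give the same result since unsigned addition can be reordered freely; the only difference is the length of the dependency chain.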

> Plus I am sure there are many other opportunities to optimize this I have not thought of :)

Other implementations I've seen don't go any further, e.g. https://github.com/zlib-ng/zlib-ng/blob/develop/arch/x86/adl... and https://github.com/veluca93/fpnge/blob/9a9fc023870bacd06674f...

So perhaps, as you allude to, it isn't really worth it.