Hacker News
by pjc50 1928 days ago
It's commonly used. A huge number of equations look like (a * b) + (c * d) + ... and so on. If that's the operation you're doing, it saves an additional instruction, and therefore instruction bandwidth and cache. Compared with the cost of the multiply itself, the extra add is a very small amount of overhead.

Having looked in the ARM reference manual, I found that the "MUL" instruction is just an alias for MADD with zero as the addend!
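
To make that concrete, here is a rough Python sketch of the idea, not real ARM semantics. The names `madd`, `mul` and `sum_of_products` are made up for illustration, and it uses plain integers so rounding doesn't come into it.

```python
def madd(a, b, addend):
    """Toy model of a multiply-add: one step computing a * b + addend."""
    return a * b + addend


def mul(a, b):
    """Plain multiply, modelled as a multiply-add whose addend is zero."""
    return madd(a, b, 0)


def sum_of_products(xs, ys):
    """Compute x0*y0 + x1*y1 + ... as a chain of multiply-adds."""
    acc = 0
    for x, y in zip(xs, ys):
        # One multiply-add per term, instead of a multiply followed by an add.
        acc = madd(x, y, acc)
    return acc


if __name__ == "__main__":
    print(sum_of_products([1, 2, 3], [4, 5, 6]))
    print(mul(6, 7))
```

With n terms, the loop does n multiply-adds where separate instructions would need n multiplies plus n adds, which is where the saving in instruction bandwidth comes from.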

I can't find timings for this instruction with 30 seconds of googling. Has anyone got a spec with instruction timings?

1 comment

The Apple M1 can do four fused multiply-adds per cycle, with a latency of 4 cycles. Interestingly enough, it seems that the latency on the vector FMA is even lower. With four 32-bit floats per 128-bit vector, that works out to 16 float FMAs per cycle.
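
The arithmetic behind that figure, as a back-of-the-envelope sketch. The 128-bit vector width and 32-bit float size are my assumptions about the NEON registers; the issue rate and latency are the numbers quoted above. The last line is just throughput times latency, i.e. roughly how many independent FMAs need to be in flight to keep the units busy, and should be treated as an estimate rather than a measured figure.

```python
FMA_PER_CYCLE = 4    # from the comment above: four FMAs issued per cycle
LATENCY_CYCLES = 4   # from the comment above: FMA latency in cycles
VECTOR_BITS = 128    # assumption: NEON vector register width
FLOAT_BITS = 32      # assumption: single-precision floats

# Number of floats that fit in one vector register.
lanes = VECTOR_BITS // FLOAT_BITS

# Peak float FMAs per cycle when every issued FMA is a full vector op.
float_fmas_per_cycle = FMA_PER_CYCLE * lanes

# Throughput times latency: independent FMA instructions needed in flight
# to avoid stalling on the result of an earlier one.
independent_fmas_in_flight = FMA_PER_CYCLE * LATENCY_CYCLES

if __name__ == "__main__":
    print(lanes, float_fmas_per_cycle, independent_fmas_in_flight)
```

If the vector FMA latency really is lower than 4 cycles, the number of independent operations needed drops accordingly.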

Source: https://dougallj.github.io/applecpu/firestorm-simd.html