by gpderetta 3513 days ago
Multiplies are significantly cheaper than divisions in most recent processors.

First of all, latency is the most important parameter: after memory bandwidth, the latency of the longest dependency chain is typically the bottleneck, especially for floating-point code.

For example, on Skylake float muls have a 4-cycle latency (same as adds and MADs), vs at least 14 cycles for divisions.
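To make the dependency-chain point concrete, here is a minimal sketch in C. The names dot_chain and dot_split are made up for illustration, and the choice of four accumulators is an assumption, not a tuned value. In the first version every add waits on the previous one, so the loop is bound by add latency; the second splits the work into independent chains that the core can overlap.

```c
#include <stddef.h>

/* Single accumulator: each add depends on the result of the previous
   one, so the whole loop is one long dependency chain. */
float dot_chain(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four independent accumulators (hypothetical count): four shorter
   chains that can be in flight at the same time. The summation order
   changes, so rounding may differ from dot_chain. */
float dot_split(const float *a, const float *b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the second version reorders the floating-point additions, a compiler will generally not do this rewrite on its own unless told that reassociation is acceptable.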

But even when only considering throughput, Skylake has two fully pipelined MAD units and can start 2 multiplies every clock cycle, while its single division unit is only partially pipelined and can start a new div only every fourth clock cycle (it is also, IIRC, only 128 bits wide, so 256-bit vector divs are more expensive still).

Avoiding divs (and mods) is still worth optimising for.
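A common way to do this when the divisor is loop-invariant is to divide once and multiply by the reciprocal. The sketch below is only an illustration; scale_div and scale_mul are hypothetical names, not from any library.

```c
#include <stddef.h>

/* Straightforward version: one division per element. */
void scale_div(float *out, const float *in, size_t n, float d) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] / d;
}

/* One division up front, then one multiply per element.
   x * (1/d) is rounded twice, so results can differ from x / d
   in the last bit unless 1/d is exactly representable. */
void scale_mul(float *out, const float *in, size_t n, float d) {
    const float inv = 1.0f / d;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * inv;
}
```

Since the two versions are not bit-identical in general, compilers only make this substitution themselves under relaxed floating-point flags (GCC's -freciprocal-math, which -ffast-math implies), so it usually has to be written by hand where the extra rounding is acceptable.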