
	by lacedeconstruct 61 days ago
	The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable

3 comments

exyi 61 days ago

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.


account42 60 days ago

Shift right isn't even relevant here: if you shift before converting to float, all your values end up 0, and if you want to divide afterwards, it's no longer a simple shift.


exyi 60 days ago

Exactly. Although if you do >> 8 while working with uint8, it will be the fastest :)
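
To make the two points above concrete, here is a minimal C sketch. It assumes 8-bit channel values being normalized by 256, which the thread does not state, and both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: shift first, convert later.
   A uint8_t holds at most 255, so v >> 8 is always 0. */
static uint8_t shift_then_convert(uint8_t v) {
    return (uint8_t)(v >> 8);
}

/* Hypothetical helper: convert first, then scale.
   The value is already a float here, so the scaling is a float multiply
   (exact for the power-of-two factor 1/256), not an integer shift. */
static float convert_then_scale(uint8_t v) {
    return (float)v * (1.0f / 256.0f);
}
```

For instance, shift_then_convert(255) is 0, while convert_then_scale(128) is 0.5f.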


userbinator 60 days ago

> It's 3 cycles for float multiplication (and 1 for shift right):

3x faster

> In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

50% faster


Tuna-Fish 61 days ago

FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.


pixelesque 61 days ago

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.
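
A minimal sketch of what doing it manually might look like in C. The function name and signature are hypothetical, and the result can differ from a true division in the last bit when the reciprocal is not exactly representable.

```c
#include <stddef.h>

/* Hypothetical helper: scale n floats by 1/divisor.
   The single division is hoisted out of the loop; each element then
   costs one multiply. Results may differ from data[i] / divisor by a
   rounding error unless 1/divisor is exactly representable. */
static void scale_by_reciprocal(float *data, size_t n, float divisor) {
    const float inv = 1.0f / divisor;
    for (size_t i = 0; i < n; ++i) {
        data[i] *= inv;
    }
}
```

Writing the reciprocal out by hand keeps the transformation local to this loop, whereas a flag like -ffast-math applies to everything compiled with it.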


mgaunard 61 days ago

That's only valid to do if the reciprocal is representable exactly.


hansvm 61 days ago

That's not totally true. It's sufficient to be exactly representable, but you only need the reciprocal rounding error to be small enough to guarantee the multiplication rounding step fixes it across the entire range of numerators. For IEEE754 f16 values, there are 28 such extra values, the positive and negative sides of 1705/x where x is a power of 2 at least as great as 2048.


mgaunard 60 days ago

Interesting, but pretty limited corner case. Would compilers even identify those 28 values and do the transformation?


hansvm 59 days ago

Maybe for f16. The compiler's implementation could just check every numerator to see if the transformation is safe. The corner cases are messy, though: they're not quickly brute-forceable for f32 and not brute-forceable at all for f64, so I doubt they'd bother, especially when I bet those constants have shown up literally zero times across all programs.
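
As a rough illustration of the "check the numerators" idea, here is a C sketch for f32 (not f16, which standard C lacks). It only samples bit patterns unless stride is 1, so a zero count is evidence rather than proof. The function names are hypothetical, and the code assumes default compiler flags (no -ffast-math).

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Count sampled finite numerators x for which x / c and x * (1/c)
   produce different floats. stride == 1 visits every bit pattern
   (slow); larger strides visit every stride-th pattern only. */
static uint64_t count_mismatches(float c, uint32_t stride) {
    const float inv = 1.0f / c;
    uint64_t mismatches = 0;
    if (stride == 0) {
        stride = 1;  /* avoid an infinite loop */
    }
    for (uint64_t b = 0; b <= UINT32_MAX; b += stride) {
        const float x = float_from_bits((uint32_t)b);
        if (!isfinite(x)) {
            continue;  /* skip infinities and NaNs */
        }
        if (x / c != x * inv) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With a power-of-two divisor such as 256.0f the count is 0, since both forms compute the same exact value before rounding. With 3.0f the count is nonzero; x = 5.0f is one numerator where the two results differ.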


Sesse__ 61 days ago

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)
