| HN Mirror


by dosshell 789 days ago

> I can get away with a smaller sized float

When talking about not assuming optimizations...

32-bit float is slower than 64-bit float on reasonably modern x86-64.

The reason is that 32-bit float is emulated using 64-bit.

Of course, if you have several floats you need to optimize for cache.

3 comments

jcranmer 789 days ago

Um... no. This is 100% completely and totally wrong.

x86-64 requires the hardware to support SSE2, which has native single-precision and double-precision instructions for floating-point (e.g., scalar multiply is MULSS and MULSD, respectively). Both the single precision and the double precision instructions will take the same time, except for DIVSS/DIVSD, where the 32-bit float version is slightly faster (about 2 cycles latency faster, and reciprocal throughput of 3 versus 5 per Agner's tables).

You might be thinking of x87 floating-point units, where all arithmetic is done internally using 80-bit floating-point types. But all x86 chips in like the last 20 years have had SSE units--which are faster anyways. Even in the days when x87 was the main floating-point unit, 32-bit float wasn't any slower, since all floating-point operations took the same time independent of format. It might be slower if you insisted that code compilation strictly follow IEEE 754 rules, but what everybody did instead was to not do that, and that's why things like Java's strictfp or C's FLT_EVAL_METHOD were born. Even in that case, however, 32-bit floats would likely be faster than 64-bit for the simple fact that 32-bit floats can safely be emulated in 80-bit without fear of double rounding but 64-bit floats cannot.
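The double-rounding point can be illustrated with a minimal C sketch. It uses `double` as a stand-in for the wider format (an assumption for portability; x87 would use its 80-bit type), and the helper names `mul_via_double` and `matches_native` are hypothetical. The product of two 24-bit significands needs at most 48 bits, so it is exact in a 53-bit double and the conversion back to float is the only rounding.

```c
/* Hypothetical helper: multiply two floats in a wider format (double here,
   standing in for x87's 80-bit type), then round once back to float. */
float mul_via_double(float a, float b) {
    /* Two 24-bit significands give at most a 48-bit product, which fits in
       double's 53-bit significand, so this product is exact. */
    double wide = (double)a * (double)b;
    return (float)wide; /* the only rounding step */
}

/* Returns 1 when the widened product equals the native float product. */
int matches_native(float a, float b) {
    volatile float native = a * b; /* volatile forces a store at float precision */
    return mul_via_double(a, b) == native;
}
```

Calling `matches_native(1.1f, 3.3f)` should return 1 on an SSE build. The same trick does not carry over to 64-bit doubles emulated in 80-bit, because a 64-bit significand is too narrow to hold an exact double product.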


dosshell 789 days ago

I agree with you. Thinking about it more, it should take the same time. I remember learning this in ~2016, and I ran a performance test on Skylake (Windows, VS2015) that seemed to confirm it. I think I remember that I only tested with addsd/addss. Definitely not x87. But as always, if the result cannot be reproduced... I stand corrected until then.


dosshell 789 days ago

I tried to reproduce it on Ivy Bridge (Windows VS20122) and failed (mulss and mulsd) [0]. Single and double precision take the same time. I also found a behavior where the first batch of iterations takes more time regardless of precision. It is possible that this tricked me last time.
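For anyone who wants to repeat this kind of test, here is a rough C sketch of a scalar multiply timing loop that discards a warm-up batch first. This is not the code from the gist; the function names are hypothetical, and the `volatile` accumulator adds a load and store per iteration, so the numbers only show whether the two precisions differ, not the true instruction latency.

```c
#include <time.h>

/* Time `iters` dependent float multiplies; returns elapsed CPU seconds. */
double time_mul_f32(long iters) {
    volatile float x = 1.0f;   /* volatile keeps the loop from being removed */
    const float k = 1.0000001f;
    clock_t start = clock();
    for (long i = 0; i < iters; i++) x = x * k;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same loop with doubles. */
double time_mul_f64(long iters) {
    volatile double x = 1.0;
    const double k = 1.0000001;
    clock_t start = clock();
    for (long i = 0; i < iters; i++) x = x * k;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Run one discarded warm-up batch per type, then record the measured batch. */
void compare_mul(long iters, double *f32_seconds, double *f64_seconds) {
    (void)time_mul_f32(iters); /* warm-up: first batch tends to be slower */
    (void)time_mul_f64(iters);
    *f32_seconds = time_mul_f32(iters);
    *f64_seconds = time_mul_f64(iters);
}
```

Calling `compare_mul(100000000L, &a, &b)` and printing both values a few times gives a rough comparison. Without the warm-up calls, whichever type runs first may look slower for reasons unrelated to precision.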

[0] https://gist.github.com/dosshell/495680f0f768ae84a106eb054f2...

Sorry for the confusion and spreading false information.


tombert 789 days ago

Sure, I clarified this in a sibling comment, but I kind of meant that I will use the slower "money" or "decimal" types by default. Usually those are more accurate and less error-prone, and then if it actually matters I might go back to a floating point or integer-based solution.
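As a small illustration of why a decimal or integer-based representation is less error-prone for money, here is a C sketch that adds ten dimes both ways. The function names are hypothetical, and plain integer cents stand in for a real "money" or "decimal" type, which C does not provide in its standard library.

```c
/* Sum `n` dimes as binary doubles; 0.10 has no exact binary representation,
   so each addition can carry a small rounding error. */
double sum_dimes_double(int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++) total += 0.10;
    return total;
}

/* Sum `n` dimes as integer cents; every step is exact. */
long sum_dimes_cents(int n) {
    long total = 0;
    for (int i = 0; i < n; i++) total += 10;
    return total;
}
```

On a typical x86-64 build, `sum_dimes_double(10)` does not compare equal to `1.0`, while `sum_dimes_cents(10)` is exactly 100.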


sgerenser 789 days ago

I think this is only true if using x87 floating point, which anything computationally intensive generally avoids these days in favor of SSE/AVX floats. In the latter case, for a given vector width, the CPU can process twice as many 32-bit floats as 64-bit floats per clock cycle.
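The "twice as many" figure follows directly from the register width. A tiny C sketch of the arithmetic, with a hypothetical helper name and assuming the usual 4-byte float and 8-byte double:

```c
#include <stddef.h>

/* Number of elements of `elem_bytes` bytes that fit in one vector register
   of `vector_bits` bits. */
size_t lanes(size_t vector_bits, size_t elem_bytes) {
    return vector_bits / (8 * elem_bytes);
}
```

A 128-bit SSE register gives `lanes(128, sizeof(float))` = 4 versus `lanes(128, sizeof(double))` = 2, and a 256-bit AVX register gives 8 versus 4.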


dosshell 789 days ago

Yes, as I wrote, it is only true for one float value.

SIMD/MIMD will benefit from working on smaller widths. This is true not only because they do more work per clock but also because memory is slow. Super slow compared to the CPU. Optimization is largely about avoiding cache misses.

(But remember that the cache line is 64 bytes, so reading a single value smaller than that will take the same time. So it does not matter in theory when comparing one f32 against one f64)
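The cache-line arithmetic can be sketched in a few lines of C. The helper name is hypothetical, the 64-byte line size is taken from the comment above, and the array is assumed to start on a line boundary.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumed line size, as stated above */

/* Cache lines covered by `count` contiguous elements of `elem_bytes` bytes,
   assuming the first element is aligned to a line boundary. */
size_t cache_lines_touched(size_t count, size_t elem_bytes) {
    size_t bytes = count * elem_bytes;
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES; /* round up */
}
```

A single value touches one line either way, so `cache_lines_touched(1, sizeof(float))` and `cache_lines_touched(1, sizeof(double))` are both 1. For 1000 values the gap shows up: 63 lines for floats versus 125 for doubles.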
