| HN Mirror


by boulos 660 days ago

This seems to keep coming up, and I see confusion in the comments. There is a standard: IEEE 754-2008. There are additional things people add like approximate reciprocals and approximate sqrt. But if you don't use those, and you don't make an association error, you get consistent results.

The question here with association for summation is what you want to match. OP chose to match the scalar for-loop equivalent. You can just as easily make an 8-wide or 16-wide "virtual vector" and use that instead.

I suspect that an 8-wide virtual vector is the right default for people currently: Intel systems since Haswell support it, as does all recent AMD hardware, and if you're using vectorization, you can afford to pay some overhead on Arm with a double-width virtual vector. You don't often gain enough from AVX-512 to make the default 16-wide, but if you wanted to focus on Skylake+ (really Cascade Lake+) or Genoa+ systems, it would be a fine choice.
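
To make the "virtual vector" idea concrete, here is a minimal Python sketch (not taken from astcenc or the linked article). The function names, the zero-initialised lane accumulators, and the pairwise lane reduction at the end are all assumptions chosen for illustration. The point is only that once the lane count and reduction order are fixed, every platform performs the same additions in the same order, whatever its native SIMD width.

```python
def sum_scalar(xs):
    """Plain left-to-right summation, like a scalar for-loop."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_virtual_vector(xs, lanes=8):
    """Sum with a fixed number of 'virtual' lanes (hypothetical helper).

    Element i is accumulated into lane i % lanes, then the lanes are
    combined with a fixed pairwise tree. `lanes` must be a power of two.
    """
    if lanes < 1 or lanes & (lanes - 1):
        raise ValueError("lanes must be a power of two")
    acc = [0.0] * lanes
    for i, x in enumerate(xs):
        acc[i % lanes] += x
    # Fixed reduction order: fold the upper half onto the lower half.
    width = lanes
    while width > 1:
        half = width // 2
        for j in range(half):
            acc[j] = acc[j] + acc[j + half]
        width = half
    return acc[0]


# The two association orders can legitimately disagree on the same input.
data = [1e16, 1.0, -1e16, 1.0]
scalar_result = sum_scalar(data)
vector_result = sum_virtual_vector(data, lanes=8)
```

On this input the scalar loop absorbs the first `1.0` into `1e16` and loses it, while the lane-wise sum keeps the two small values apart until the end, so the results differ. Neither is "wrong" under IEEE 754; they are just different association orders, which is why you have to pick one and use it everywhere.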

2 comments

LegionMammal978 660 days ago

> OP chose to match the scalar for-loop equivalent.

Isn't it the other way around? The scalar for-loop was changed to match the vector loop's associativity. "To solve this problem for astcenc I decided to change our reference no-SIMD implementation to use 4-wide vectors."

subharmonicon 660 days ago

The latest standard is from 2019: https://ieeexplore.ieee.org/document/8766229

There is still some flexibility in implementation, for example how and whether FMAs are formed for a given compiler when FP_CONTRACT is ON, and in the standard itself in things like when tininess is detected.
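
A small Python sketch of why FMA formation matters. Python's plain `a * b + c` rounds twice, while a fused multiply-add rounds once. The `fused_multiply_add` helper below is hypothetical: it emulates the single rounding with exact rational arithmetic, which is an assumption made for illustration and not how hardware or any compiler implements it.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """a * b is rounded to double, then the sum is rounded again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a * b + c exactly, then round once.

    Fraction arithmetic on floats is exact, and converting the final
    Fraction to float rounds to the nearest double.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact product is 1 - 2**-60, which rounds to 1.0 as a double.
unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
```

Here the unfused form loses the `2**-60` term when the product is rounded, while the fused form keeps it. A compiler that contracts `a * b + c` into an FMA on one target but not another will therefore produce different bits from the same source.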

But to your point, if you stick to the basic operations in the standard, don’t enable FP_CONTRACT or FENV_ACCESS in C/C++, have a bug-free compiler, and don’t use fast-math, you’re good to go.

[edit to add a caveat about compile-time constant folding which is a whole can of worms]

[edit again to point out that the C/C++ standards allow for implementations to compute intermediate results at higher precision, so a compliant implementation can use all 80 bits on x87 when computing expressions]