Hacker News
by kardos 659 days ago
My takeaway from the linked post is that the author is more concerned with floating point invariance across platforms than speed (although improvements in speed are of course welcome).

If the data is confined to a certain range of exponents, one could reduce the size of the accumulator, perhaps significantly.
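
As a rough illustration of that point, here is a back-of-the-envelope sketch of how wide an exact fixed-point accumulator would need to be for IEEE 754 doubles whose exponents fall in a known range. The function names (`accumulator_bits`, `exponent_range`) are hypothetical, and this is not how xsum lays out its accumulators; it only counts bits under the stated assumptions.

```python
import math

MANTISSA_BITS = 52   # explicit fraction bits of an IEEE 754 double
MIN_NORMAL_EXP = -1022  # smallest unbiased exponent of a normal double

def accumulator_bits(e_min, e_max, n_terms):
    """Bits for an exact fixed-point sum of n_terms doubles whose
    unbiased exponents lie in [e_min, e_max] (hypothetical helper)."""
    # Bit positions from 2**(e_min - 52) up to 2**e_max inclusive.
    value_bits = (e_max - e_min) + MANTISSA_BITS + 1
    # Headroom so n_terms additions cannot overflow: ceil(log2(n_terms)).
    carry_bits = max(n_terms - 1, 0).bit_length()
    # One extra bit for the sign.
    return value_bits + carry_bits + 1

def exponent_range(values):
    """Smallest and largest unbiased exponent among the nonzero inputs."""
    # frexp returns a mantissa in [0.5, 1), so the unbiased exponent is e - 1.
    # Subnormals share the lowest bit position of the smallest normal exponent.
    exps = [max(math.frexp(v)[1] - 1, MIN_NORMAL_EXP) for v in values if v != 0.0]
    return min(exps), max(exps)

full_range = accumulator_bits(-1022, 1023, 1)
narrow_range = accumulator_bits(-10, 10, 1024)
```

Under these assumptions the full double range needs a little over 2000 bits, while data confined to exponents between -10 and 10 fits in well under 128 bits even with room for 1024 terms.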

Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

1 comment

The author wants the best possible performance as long as the result is reproducible. I think this is evident from the fact that even the non-SIMD code uses 4-wide vectors, so the author is clearly willing to trade accuracy for higher performance, provided the result stays reproducible.
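
To illustrate why a scalar fallback might mirror a 4-wide layout, here is a minimal sketch, not the author's code. Both function names are hypothetical, and the pairwise order in which the four lanes are combined at the end is an assumption. The point is only that fixing the lane assignment fixes the order of additions, so a scalar loop and a 4-wide SIMD loop following the same scheme would round identically, while a plain left-to-right sum may round differently.

```python
def sum_sequential(xs):
    """Plain left-to-right summation."""
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_4lane(xs):
    """Scalar emulation of a 4-wide vector sum: element i always goes to
    lane i % 4, and the lanes are combined in a fixed (assumed) order."""
    lanes = [0.0, 0.0, 0.0, 0.0]
    for i, x in enumerate(xs):
        lanes[i % 4] += x
    # Fixed reduction order; a SIMD version would have to use the same one.
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])

xs = [1e17, 1.0, -1e17, 1.0]
sequential_result = sum_sequential(xs)
lane_result = sum_4lane(xs)
```

For this input the sequential loop returns 1.0 and the four-lane version returns 0.0, while the exact sum is 2.0. Neither is accurate, but each is reproducible as long as the order of operations is pinned down.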

> Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

I don't know where you got that number, because xsum-paper.pdf clearly indicates a larger performance difference. I'm specifically looking at the ratio between the minimum of any superaccumulator results and the minimum of any simple sum results. Among the results I think are relevant today (x86-64, no earlier than 2012), the AMD Opteron 6348 is the only case where the actual difference is only about 1.5x; everything else is much higher.