by kardos 661 days ago
Exact floating-point accumulation is more or less solved with xsum [1] -- would it work in this context?

[1] https://gitlab.com/radfordneal/xsum

Because it would be slower when the exact calculation is not necessary. The xsum paper does have performance numbers, but all of them come from processors that are at least a decade old, and almost every result indicates that superaccumulators are still 4--8x slower than the simple sum (but faster than traditional Kahan summation). Superaccumulators require extensive scatter-gather operations due to their large memory footprint, and I think the gap has probably widened further today, as they would be harder to vectorize efficiently.
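
For readers who haven't seen the three approaches side by side, here is a minimal Python sketch. It is not xsum: `math.fsum` from the standard library stands in for an exact, correctly rounded sum, and `naive_sum` / `kahan_sum` are illustrative names of my own. It says nothing about the relative speeds discussed above, only about accuracy.

```python
import math

def naive_sum(xs):
    # Plain left-to-right accumulation. Written as an explicit loop because
    # the built-in sum() may apply its own compensation for floats in
    # recent CPython versions.
    total = 0.0
    for x in xs:
        total += x
    return total

def kahan_sum(xs):
    # Classic Kahan compensated summation: `comp` carries the rounding
    # error of the previous addition into the next one.
    total = 0.0
    comp = 0.0
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

# One large term followed by many terms too small to change it individually.
data = [1.0] + [1e-16] * 100_000

naive = naive_sum(data)   # every tiny term is rounded away
kahan = kahan_sum(data)   # recovers the tiny terms to within a few ulps
exact = math.fsum(data)   # stand-in for an exact accumulator
```

Kahan needs several dependent floating-point operations per element, which is one reason a well-tuned superaccumulator can beat it even while losing to the simple sum.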
My takeaway from the linked post is that the author is more concerned with floating point invariance across platforms than speed (although improvements in speed are of course welcome).

If the data is confined to a certain range of exponents, one could reduce the size of the accumulator, perhaps significantly.
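
A rough sketch of that idea in Python, assuming (hypothetically) that every input is a multiple of 2**-40, so a fixed-point integer accumulator is enough. `FRAC_BITS`, `to_fixed` and `fixed_sum` are made-up names, and Python's arbitrary-precision ints hide the fact that a C version would need only a few 64-bit words instead of xsum's full-range accumulator.

```python
import math

# Hypothetical assumption about the data: no input has bits below 2**-40.
FRAC_BITS = 40

def to_fixed(x):
    # Scale by 2**FRAC_BITS (exact for finite doubles, barring overflow)
    # and reject inputs that would lose low-order bits.
    scaled = math.ldexp(x, FRAC_BITS)
    if scaled != math.floor(scaled):
        raise ValueError("input has bits below 2**-FRAC_BITS")
    return int(scaled)

def fixed_sum(xs):
    # Integer addition is exact and associative, so the result does not
    # depend on summation order. Only the final conversion rounds.
    acc = 0
    for x in xs:
        acc += to_fixed(x)
    return math.ldexp(float(acc), -FRAC_BITS)

data = [0.5, 2.0 ** -40, 1024.0, -1024.0, 0.25]
forward = fixed_sum(data)
backward = fixed_sum(list(reversed(data)))
```

The narrower the exponent range you can guarantee, the fewer bits the accumulator needs; the cost is a range check (or an assumption) on every input.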

Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

The author wants the best possible performance as long as the result is reproducible. I think this is evident from the fact that the non-SIMD code uses 4-wide vectors, so the author is definitely willing to trade accuracy if higher performance with a reproducible result is possible.
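
To make the "4-wide even without SIMD" point concrete, here is a small Python sketch of the general technique, not the author's actual code. Both functions (names are mine) assign element i to lane i % 4 and combine the lanes in one fixed order, so a scalar fallback and a chunked, SIMD-style loop produce bit-identical results.

```python
import random

LANES = 4  # fixed lane count, independent of what the hardware offers

def combine(acc):
    # One fixed reduction order for the four partial sums.
    return (acc[0] + acc[1]) + (acc[2] + acc[3])

def lane_sum_scalar(xs):
    # Scalar fallback: element i goes to lane i % LANES.
    acc = [0.0] * LANES
    for i, x in enumerate(xs):
        acc[i % LANES] += x
    return combine(acc)

def lane_sum_chunked(xs):
    # SIMD-style loop: process LANES elements per step, lane by lane.
    acc = [0.0] * LANES
    full = len(xs) - len(xs) % LANES
    for base in range(0, full, LANES):
        chunk = xs[base:base + LANES]
        acc = [a + x for a, x in zip(acc, chunk)]
    # Leftover elements still land in the lane their index maps to.
    for lane, x in enumerate(xs[full:]):
        acc[lane] += x
    return combine(acc)

random.seed(0)
data = [random.uniform(-1e6, 1e6) for _ in range(1003)]
scalar_result = lane_sum_scalar(data)
chunked_result = lane_sum_chunked(data)
```

The result is reproducible across both code paths, but it is generally not the same value a strict left-to-right sum would give, which is the trade being described.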

> Re 4-8x -- the large option in xsum was benchmarked at less than 2x the cost of a direct sum. Not so bad?

I don't know where you got that number, because xsum-paper.pdf clearly indicates a larger performance difference. I'm specifically looking at the ratio between the minimum of any superaccumulator result and the minimum of any simple sum result, and among the results I think are relevant today (x86-64, no earlier than 2012), the AMD Opteron 6348 is the only case where the actual difference is only about 1.5x; everything else hovers much higher.