| > Now, one might have hoped to derive some bound on the relative error of sums with a small number of addends. My point is that already with three terms this is a lost cause. Difficult? Yes. Lost cause? Nope. Although it really helps to define error in terms of your largest addend. The proper solution to addition / subtraction problems in IEEE754 Floating point is to sort the numbers by magnitude. That ensures that the smallest numbers have the best chance of being "detected" in the floating point operation. For example... 1 + 2^53 + 1 needs to be sorted into 1 + 1 + 2^53. Or 1 + 2^53 + 1 - 2^53 needs to be sorted into 1 + 1 + 2^53 - 2^53 (which equals 2. Hurrah! We're exactly correct!). If all the numbers are sorted from smallest to largest, then the worst case error is relative to 2^53, the largest number in this addend sequence. I admit that I'm cheating and moving the goalposts a little bit... But that's what you gotta do when you work with Doubles. ---------- > Well, the general case is that there might be a subtraction. In IEEE arithmetic the sum (or difference) of two numbers is guaranteed to be within 1 ulp, which is a very good bound on the relative error. The problem begins when you feed this slightly inaccurate result into some other computation that magnifies the error. Overall, this statement rings true. In practice, all that sorting and stuff only minimizes the cancellation error. Because the "relative error" as you define it is most important in multiplication and division problems, and as you noted... relative error is unbounded. But if you keep positives and negatives separated, you can ensure that only one cancellation occurs in any sequence. IE: 2 * (25 * sum(array1) - sum(array2)) Normally, this has a cancellation error, which is then multiplied with the 2 (which makes the cancellation error grow). You distribute the multiply and split the arrays as follows: 50 * (sum(array1_positives) - sum(array1_negatives)) - 2 * (sum(array2_positives) - sum(array2_negatives)) Then you rearrange the sequence to minimize your cancellations. 50 * sum(array1_positives)+ 2 * sum(array2_negatives) - ( 2 * sum(array2_positives) + 50 * sum(array1_negatives)) This sequence guarantees one cancellation error at the very end, and it also ensures that there are no multiplications that occur after the cancellation error. I wouldn't call this a "lost cause", but this sort of finagling is tedious, error prone, and certainly difficult. But necessary if you want to maximize accuracy in a long-running simulation. |
Of course, if you move the goalposts and measure the error compared to the magnitude of the largest number it gets significantly more manageable! Such a situation is benign, relatively speaking.
In my primary domain of computational geometry we don't have this luxury as the sign must be correct. Even ordering by magnitude is insufficient then; for example evaluating 2^54 - 2^54 + 2 - 1 using this method gives 0 when it should have been 1 (and incidentally, in this case a simple left associative ordering would have gotten it right so it's not even guaranteed to improve matters -- not that anyone said otherwise, but it's yet another example that extreme caution is warranted).
Eventually you end up with ideas like http://www.cs.cmu.edu/~quake/robust.html. The effort that goes into getting even the sign of a sum of products correct in every case is absurd.
Unfortunately, most of the computational geometry code I see usually involves starting with the naive implementation and when things go awry various tricks like Kahan sums or ordering by magnitude are applied hoping that the problem goes away or becomes sufficiently obscure, but never is there an attempt to clarify the exact numerical requirements of the result. Even expensive industrial software is prone to this and will sometimes crash if you happen to present it with a slightly unusual piece of geometry.