Hacker News new | ask | show | jobs
by kraghen 3129 days ago
It gets even worse: (1 + 2 ^ 53) - 2 ^ 53 evaluates to 0 while 1 + (2 ^ 53 - 2 ^ 53) evaluates to 1 (the correct result), which means that even the operation of adding three floating point numbers has unbounded relative error in the general case.

This is a great source of frustration in computational geometry, where this turns from a quantitative issue to a qualitative one, since if we get the sign of an expression wrong our data structures might no longer reflect the actual topology of the problem.

1 comments

> which means that even the operation of adding three floating point numbers has unbounded relative error in the general case.

Its not quite "in the general case".

In "any case with subtraction", there is unbounded relative error. And remember that addition with negative numbers can become subtraction.

Error-management is very important. But if you can GUARANTEE that your operations have the same sign, and that you're only using addition, multiplication, and division... then error is easier to track.

Its that subtraction (aka: cancellation error) that gets ya.

It is true that cancellation errors are often the source of wrong results.

However, computations involving IEEE floating point arithmetic can go horribly wrong even if there is no addition and no subtraction whatsoever, no divisions, and only a very small number of rounding errors. Here is one such example:

http://people.eecs.berkeley.edu/~wkahan/WrongR.pdf

Even professional users working in this domain will have a hard time to locate the cause of such problems involving IEEE floating point arithmetic. The format is extremely error-prone to use for casual users and experts alike.

Ehhh... I think that paper kinda "tricks" you with point #3.

> Only 257 algebraic operations, so no hordes of rounding errors

Rounding errors in Floating-point math grows exponentially in the common case. So 257 operations is more than enough to wipe out 53-bits of precision. Its not "hordes" of rounding errors that cause issues. Its the exponential nature of rounding errors, exponentially increasing every step of the way.

Yes, that is true.
Well, the general case is that there might be a subtraction. In IEEE arithmetic the sum (or difference) of two numbers is guaranteed to be within 1 ulp, which is a very good bound on the relative error. The problem begins when you feed this slightly inaccurate result into some other computation that magnifies the error.

Now, one might have hoped to derive some bound on the relative error of sums with a small number of addends. My point is that already with three terms this is a lost cause.

Of course with extra assumptions on the nature of the input it's possible to state something stronger. In my example of computational geometry this is never the case since unstable arithmetic corresponds to geometric singularities and the real world is never in general position.

> Now, one might have hoped to derive some bound on the relative error of sums with a small number of addends. My point is that already with three terms this is a lost cause.

Difficult? Yes. Lost cause? Nope. Although it really helps to define error in terms of your largest addend.

The proper solution to addition / subtraction problems in IEEE754 Floating point is to sort the numbers by magnitude. That ensures that the smallest numbers have the best chance of being "detected" in the floating point operation.

For example... 1 + 2^53 + 1 needs to be sorted into 1 + 1 + 2^53.

Or 1 + 2^53 + 1 - 2^53 needs to be sorted into 1 + 1 + 2^53 - 2^53 (which equals 2. Hurrah! We're exactly correct!).

If all the numbers are sorted from smallest to largest, then the worst case error is relative to 2^53, the largest number in this addend sequence.

I admit that I'm cheating and moving the goalposts a little bit... But that's what you gotta do when you work with Doubles.

----------

> Well, the general case is that there might be a subtraction. In IEEE arithmetic the sum (or difference) of two numbers is guaranteed to be within 1 ulp, which is a very good bound on the relative error. The problem begins when you feed this slightly inaccurate result into some other computation that magnifies the error.

Overall, this statement rings true. In practice, all that sorting and stuff only minimizes the cancellation error. Because the "relative error" as you define it is most important in multiplication and division problems, and as you noted... relative error is unbounded.

But if you keep positives and negatives separated, you can ensure that only one cancellation occurs in any sequence. IE:

2 * (25 * sum(array1) - sum(array2))

Normally, this has a cancellation error, which is then multiplied with the 2 (which makes the cancellation error grow). You distribute the multiply and split the arrays as follows:

50 * (sum(array1_positives) - sum(array1_negatives)) - 2 * (sum(array2_positives) - sum(array2_negatives))

Then you rearrange the sequence to minimize your cancellations.

50 * sum(array1_positives)+ 2 * sum(array2_negatives) - ( 2 * sum(array2_positives) + 50 * sum(array1_negatives))

This sequence guarantees one cancellation error at the very end, and it also ensures that there are no multiplications that occur after the cancellation error.

I wouldn't call this a "lost cause", but this sort of finagling is tedious, error prone, and certainly difficult. But necessary if you want to maximize accuracy in a long-running simulation.

The lost cause I am referring to is deriving a useful relative error bound on evaluating a+b+c. The implied conclusion is not that floating point arithmetic itself is a lost cause, merely that it's a minefield from step one.

Of course, if you move the goalposts and measure the error compared to the magnitude of the largest number it gets significantly more manageable! Such a situation is benign, relatively speaking.

In my primary domain of computational geometry we don't have this luxury as the sign must be correct. Even ordering by magnitude is insufficient then; for example evaluating 2^54 - 2^54 + 2 - 1 using this method gives 0 when it should have been 1 (and incidentally, in this case a simple left associative ordering would have gotten it right so it's not even guaranteed to improve matters -- not that anyone said otherwise, but it's yet another example that extreme caution is warranted).

Eventually you end up with ideas like http://www.cs.cmu.edu/~quake/robust.html. The effort that goes into getting even the sign of a sum of products correct in every case is absurd.

Unfortunately, most of the computational geometry code I see usually involves starting with the naive implementation and when things go awry various tricks like Kahan sums or ordering by magnitude are applied hoping that the problem goes away or becomes sufficiently obscure, but never is there an attempt to clarify the exact numerical requirements of the result. Even expensive industrial software is prone to this and will sometimes crash if you happen to present it with a slightly unusual piece of geometry.

Sorting and then summing is a commonly used technique. Still, there are also better methods such as Kahan summation:

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Using pairwise addition is a very good alternative with the added benefit that it can be easily parallelized.

Personally, I would gladly use a number format that does not require such special considerations for summing a set of numbers, and still yields good results. I am really looking forward to alternative representations.