Hacker News new | ask | show | jobs
by lmilcin 1895 days ago
> The misleading action is to compare the input notation of 0.1 and 0.2 to the printed output of the sum, rather than consistently compare nothing but values printed using the same precision.

I think the problem is the act of caring for the least significant bits.

If you care for least significant bits of a floating point number it means you are doing something wrong. FP numbers should be treated as approximations.

More specifically, the problem above is assuming that floating point addition is associative to the point of giving you results that you can compare. In floating point order of operations matters for the least significant bits.

FP operations should be treated as incurring inherent error on each operation.

IEEE standard is there to make it easier to do repeatable calculations (for example be able to find regression in your code, compare against another implementation) and for you to be able to reason about the magnitude of the error.

1 comments

Problem is, most people think that 0.3 being an approximation refers to the fact that it's one significant figure measurement of some sort, like 0.3V on a multimeter. Not that it's inherently an approximation.

Pencil-and-paper floating-point numbers like 1.23 x 10^5 are approximations of measurements (if we are doing science or engineering), but are inherently exact. Calculators bear that out, because calculators use base 10 floating-point, like pencil-and-paper calculations.

0.3 being inexact is only an artifact of the floating-point system being in a different base. No matter how many digits we throw at it, we cannot represent 0.3 in binary floating point. Not 64 bits, not 1024 bits, not 65535 bits.

If we use binary notation for floating-point numbers, they likewise become exact, in terms of representation. The inexactness we deal with then is the familiar type that we know from pencil-and-paper calculations: truncation to a certain number of digits after performing an operation like addition or multiplication.

But that truncation will not happen in a calculation in which both input operands are exactly represented, and the result is also exactly representable!!!

If base ten were used, 0.1 + 0.2 would be 0.3, exactly.

If we use power-of-two values, and combinations thereof, we don't have the problem:

  1> (= 0.625 (+ 0.125 0.5))
  t
No problem.

  2> (set *print-flo-precision* 17)
  17
  3> 0.625
  0.625
  4> 0.5
  0.5
  5> 0.125
  0.125
  6> (+ 0.125 0.5)
  0.625
No junk digits.

  7> (= 0.25 (sqrt 0.0625))
  t
Wee ...