Hacker News new | ask | show | jobs
by augustt 2241 days ago
Didn't everyone learn scientific notation in high school? It's pretty much exactly that, and you could put the coefficient/exponent into whatever bit pattern you'd like.
4 comments

It is scientific notation with a twist. The twist is the implicit 1. in the mantissa.

There is no easy way to make it work in decimal. So you need to learn how fractional parts work in binary, then move up to scientific notation in binary. Then you probably noticed that all numbers start with 1, so you can knock it off to save space. Oh, and add a special case for zero, because it is the only number that doesn't start with 1, and if you are feeling adventurous, there are subnormal numbers.

That's why I find the explanation in the article absolutely brilliant. By treating the exponent as a window, not only you don't need to bother with scientific notation in binary, but zero and subnormal numbers fit very nicely.

In fact, I've been working for years with floating point numbers and feel stupid for not realizing it works just as described in the article. And feeling stupid after reading something is usually a very good sign. I clicked, expected an article I could dismiss in a typical Hacker News fashion, but no, not this time.

It's an implicit 1 or 0 (because it's binary floating point). The implicit 0 is for subnormal numbers. That's the part people usually don't explain when they're first introducing it. It's literally {1,0}.xxxxxx where x is also a 1 or 0. I.e. a binary floating point number in scientific notation. I've seen a lot of explanations that kind of gloss over that part of it (not saying you don't understand it, just that even the article doesn't make that clear enough IMO).
In practice, subnormal are very rarely used. Most compiler disable subnormals when compiling with anything other than -O0. It takes over a hundred cycle to complete an operation.

Demo: #include <stdio.h>

    int
    main ()
    {
    
        volatile float v;
        float acc = 0;
        float den = 1.40129846432e-45;
    
        for (size_t i; i < (1ul<<33); i++) {
            acc += den;
        }
    
        v = acc;
    
        return 0;
    }
With -01: $ gcc float.c -o float -O1 && time ./float ./float 8.93s user 0.00s system 99% cpu 8.933 total

With -O0: $ gcc float.c -o float -O1 && time ./float ./float 20.60s user 0.00s system 99% cpu 20.610 total

I fixed your example a bit and here is what I get

  /tmp>cat t.c
  #include <stdio.h>
  int main() {
      float acc = 0;
      float den = 1.40129846432e-45;
      for (size_t i = 0; i < (1ul<<33); i++) acc += den;
      printf("%g\n", acc);
      return 0;
  }
  /tmp>gcc -O3 t.c && time ./a.out
  2.35099e-38
  ./a.out  5.94s user 0.00s system 99% cpu 5.944 total
  /tmp>gcc -O3 -ffast-math t.c && time ./a.out
  0
  ./a.out  1.50s user 0.00s system 99% cpu 1.502 total
So subnormal numbers are supported in -O3 unless you specify -ffast-math. And it definitely makes a difference (gcc 9.3, Debian 11, Ryzen 7 3700X).

EDIT That one is interesting too

  /tmp>clang -O3 t.c && time ./a.out
  2.35099e-38
  ./a.out  6.04s user 0.00s system 99% cpu 6.044 total
  /tmp>clang -O3 -ffast-math t.c && time ./a.out
  0
  ./a.out  0.10s user 0.00s system 99% cpu 0.101 total
clang (9.0.1) performs about the same without -ffast-math; but with it, it managed to optimize the loop away.
I looked at the asm generate from my original example and they generate very different codes, gcc applies other optimization when compiled with -O1.

I've been fighting the compiler to generate a minimal working example of the subnormals, but didn't have any success.

Some things take need to be taken in account (from the top of my head):

- Rounding. You don't want to get stuck in the same number. - The FPU have some accumulator register that are larger than the floating point register. - Using more register than the architecture has it not trivial because the register renaming and code reordering. The CPU might optimize in a way that the data never leaves those register.

Trying to make a mwe, I found this code:

    #include <stdio.h>
    
    int
    main ()
    {
        double x = 5e-324;
        double acc = x;
    
        for (size_t i; i < (1ul<<46); i++) {
                acc += x;
    
        }
    
        printf ("%e\n", acc);
    
        return 0;
    }

Runs is fraction of seconds with -O0:

    gcc double.c -o double -O0
But takes forever (killed after 5 minutes) with -O1:

    gcc double.c -o double -O1
I'm using gcc (Arch Linux 9.3.0-1) 9.3.0 on i7-8700

I also manage to create a code that sometimes run in 1s, but in others would take 30s. Didn't matter if I recompiled.

Floating point is hard.

Shouldn't i be initialized?
lol, how the compiler didn't warn about uninitialized variable
Shouldn't i be initialized?
It's not exactly scientific notation because a leading bit is assumed for normalized numbers.

The window and offset explanation naturally accounts for the leading bit, whereas the scientific notation explanation needs further explanation to explain how the leading bit works.

In short, 10^2 * .001 is not allowed in floating point. You can't have a mantissa that starts with 0. The leading bit means all mantissas must be greater than 1.

Without this understanding you won't have an intuitive understanding of the range and precision of floating point, which is why I think the window+offset explanation is much more natural.

I think this is worth saying about the leading bit (why 0.001 is not a valid mantissa):

In binary, we always know the first non-zero digit of any number - so there's no need to write it down. We know it's 1 because, well, binary.

So we don't waste space, and only write down all the other digits, and then use the exponent to put the 'point' into the right place.

We save a bit of space doing this.

Thats not the right way of thinking about it. Sure, the first non-zero digit of any binary number is 1, but who is to say that the number has a bit that isn't 0? Couldn't the number be zero?

The mantissa can only be a value from [1,2).

i.e. in scientific notation: exponent * mantissa, the mantissa cannot be less than 1, or >= 2 in floating point.

So given that the mantissa can go from 000000 to 111111, what should the values represnet? Obviously it should represent the values from [1,2). Calling it a leading bit is more confusing than it needs to be. Its better to just call it an assumed minimum value.

> Thats not the right way of thinking about it. Sure, the first non-zero digit of any binary number is 1, but who is to say that the number has a bit that isn't 0? Couldn't the number be zero?

When I was taught scientific notation in middle school, high school, and at college, it was always explicitly stated that:

1. You can have only one digit before the decimal point.

2. That digit cannot be 0 (unless your number is 0, of course). So 0.3 * 10^5 is not scientific notation.

This is no different. The "twist" is that there is only one possible non-zero bit, whereas in decimal it could be [1-9].

I think this explanation is neat. However, the "usual" formulation is just scientific notation, with the optimization that one bit is redundant.

Personally, I prefer the alternative notation: M * 2^(exponent-precision+1), with M being a p-bit integer. It's easier to work with when you know that M is always an integer, and you don't need to deal with fractions in base 2. In fact, FP made a lot more sense when I took this formulation in decimal, and worked with that.

I think most people do learn scientific notation, but the correspondence with floating point representation is probably not learned. It is not necessarily an obvious connection for a person who uses floats as "decimal numbers", which is the mental model I assume many programmers have. This post is enlightening because it shows how obvious the connection is.
You just repeated the "how everybody hates floating point to be explained to them" section of this article.