Hacker News new | ask | show | jobs
by GuB-42 2241 days ago
It is scientific notation with a twist. The twist is the implicit 1. in the mantissa.

There is no easy way to make it work in decimal. So you need to learn how fractional parts work in binary, then move up to scientific notation in binary. Then you probably noticed that all numbers start with 1, so you can knock it off to save space. Oh, and add a special case for zero, because it is the only number that doesn't start with 1, and if you are feeling adventurous, there are subnormal numbers.

That's why I find the explanation in the article absolutely brilliant. By treating the exponent as a window, not only you don't need to bother with scientific notation in binary, but zero and subnormal numbers fit very nicely.

In fact, I've been working for years with floating point numbers and feel stupid for not realizing it works just as described in the article. And feeling stupid after reading something is usually a very good sign. I clicked, expected an article I could dismiss in a typical Hacker News fashion, but no, not this time.

1 comments

It's an implicit 1 or 0 (because it's binary floating point). The implicit 0 is for subnormal numbers. That's the part people usually don't explain when they're first introducing it. It's literally {1,0}.xxxxxx where x is also a 1 or 0. I.e. a binary floating point number in scientific notation. I've seen a lot of explanations that kind of gloss over that part of it (not saying you don't understand it, just that even the article doesn't make that clear enough IMO).
In practice, subnormal are very rarely used. Most compiler disable subnormals when compiling with anything other than -O0. It takes over a hundred cycle to complete an operation.

Demo: #include <stdio.h>

    int
    main ()
    {
    
        volatile float v;
        float acc = 0;
        float den = 1.40129846432e-45;
    
        for (size_t i; i < (1ul<<33); i++) {
            acc += den;
        }
    
        v = acc;
    
        return 0;
    }
With -01: $ gcc float.c -o float -O1 && time ./float ./float 8.93s user 0.00s system 99% cpu 8.933 total

With -O0: $ gcc float.c -o float -O1 && time ./float ./float 20.60s user 0.00s system 99% cpu 20.610 total

I fixed your example a bit and here is what I get

  /tmp>cat t.c
  #include <stdio.h>
  int main() {
      float acc = 0;
      float den = 1.40129846432e-45;
      for (size_t i = 0; i < (1ul<<33); i++) acc += den;
      printf("%g\n", acc);
      return 0;
  }
  /tmp>gcc -O3 t.c && time ./a.out
  2.35099e-38
  ./a.out  5.94s user 0.00s system 99% cpu 5.944 total
  /tmp>gcc -O3 -ffast-math t.c && time ./a.out
  0
  ./a.out  1.50s user 0.00s system 99% cpu 1.502 total
So subnormal numbers are supported in -O3 unless you specify -ffast-math. And it definitely makes a difference (gcc 9.3, Debian 11, Ryzen 7 3700X).

EDIT That one is interesting too

  /tmp>clang -O3 t.c && time ./a.out
  2.35099e-38
  ./a.out  6.04s user 0.00s system 99% cpu 6.044 total
  /tmp>clang -O3 -ffast-math t.c && time ./a.out
  0
  ./a.out  0.10s user 0.00s system 99% cpu 0.101 total
clang (9.0.1) performs about the same without -ffast-math; but with it, it managed to optimize the loop away.
I looked at the asm generate from my original example and they generate very different codes, gcc applies other optimization when compiled with -O1.

I've been fighting the compiler to generate a minimal working example of the subnormals, but didn't have any success.

Some things take need to be taken in account (from the top of my head):

- Rounding. You don't want to get stuck in the same number. - The FPU have some accumulator register that are larger than the floating point register. - Using more register than the architecture has it not trivial because the register renaming and code reordering. The CPU might optimize in a way that the data never leaves those register.

Trying to make a mwe, I found this code:

    #include <stdio.h>
    
    int
    main ()
    {
        double x = 5e-324;
        double acc = x;
    
        for (size_t i; i < (1ul<<46); i++) {
                acc += x;
    
        }
    
        printf ("%e\n", acc);
    
        return 0;
    }

Runs is fraction of seconds with -O0:

    gcc double.c -o double -O0
But takes forever (killed after 5 minutes) with -O1:

    gcc double.c -o double -O1
I'm using gcc (Arch Linux 9.3.0-1) 9.3.0 on i7-8700

I also manage to create a code that sometimes run in 1s, but in others would take 30s. Didn't matter if I recompiled.

Floating point is hard.

Shouldn't i be initialized?
lol, how the compiler didn't warn about uninitialized variable
Compiler warnings are not straightforward and depend on many things; compiler version, optimization settings, warning settings.

  $ gcc-9 -Wuninitialized -O0 float.c  # NO WARNINGS!!!

  $ gcc-9 -O1 float.c  # NO WARNINGS!!!
  
  $ gcc-9 -Wuninitialized -O1 float.c
  float.c: In function 'main':
  float.c:9:5: warning: 'i' is used uninitialized in this function [-Wuninitialized]
      9 |     for (size_t i; i < (1ul << 33); i++) {
        |     ^~~
Shouldn't i be initialized?