> And the computation is not free either, though at sufficiently large sizes the memory accesses should dominate sin and tan, I think.

Looks like I was wrong about this! You might want to retry your experiments with cheaper operations than sin and tan.

I wrote a little C benchmark to test this more:

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

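    // Two passes over the data: the first stores tan(a[i]) in b,
    // the second reads b back and overwrites it with sin(b[i]).
    extern void sinTanSeparate(double *a, double *b, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = tan(a[i]);
        }
        for (int i = 0; i < n; i++) {
            b[i] = sin(b[i]);
        }
    }

    // One pass: the intermediate tan(a[i]) never goes through memory.
    extern void sinTanFused(double *a, double *b, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = sin(tan(a[i]));
        }
    }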

    #define N (128 * 1024 * 1024)
    #define RUNS 5

    double a[N];
    double b[N];

    int main(void) {
        clock_t start, end;

        printf("will do %d runs over %zu MB of data\n\n",
               RUNS, sizeof a / (1024 * 1024));

        for (int i = 0; i < RUNS; i++) {
            start = clock();
            sinTanSeparate(a, b, N);
            end = clock();
            printf("separate: %f sec\n", ((double) end - start) / CLOCKS_PER_SEC);
        }

        printf("\n");

        for (int i = 0; i < RUNS; i++) {
            start = clock();
            sinTanFused(a, b, N);
            end = clock();
            printf("fused:    %f sec\n", ((double) end - start) / CLOCKS_PER_SEC);
        }

        return 0;
    }

Compiling this with gcc -O3 gives:

    will do 5 runs over 1024 MB of data
    
    separate: 1.461349 sec
    separate: 1.020120 sec
    separate: 1.019002 sec
    separate: 1.019888 sec
    separate: 1.018454 sec
    
    fused:    1.014774 sec
    fused:    1.014724 sec
    fused:    1.013895 sec
    fused:    1.016440 sec
    fused:    1.013729 sec

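So almost no difference: the fused version is roughly 0.5% faster once things settle, which I think would turn out to be significant with enough runs. Interestingly, although C is not JIT compiled, even here there is a "warmup" effect in the first run. I guess these are initial page faults or something: the arrays are zero-initialized globals, so the OS probably only maps their pages when they are first touched.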
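If the warmup really is page faults, touching every page once before the timed loops should move that cost out of the first measurement. Here is a minimal sketch of that idea, not something I have run against the benchmark above. `touchPages` and `ASSUMED_PAGE_SIZE` are names I made up, and the 4096-byte page size is an assumption that holds on many x86-64 Linux systems but not everywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4096-byte pages. Common, but not universal. */
#define ASSUMED_PAGE_SIZE 4096

/* Hypothetical helper: write one element per assumed page so the OS has to
   map every page of buf. Overwrites the touched elements with 0.0, so only
   use it on buffers whose contents do not matter yet.
   Returns the number of strided writes performed. */
size_t touchPages(double *buf, size_t n) {
    assert(buf != NULL || n == 0);
    /* volatile keeps the compiler from dropping the stores */
    volatile double *p = buf;
    const size_t step = ASSUMED_PAGE_SIZE / sizeof(double);
    size_t writes = 0;
    for (size_t i = 0; i < n; i += step) {
        p[i] = 0.0;
        writes++;
    }
    /* The last page can be missed if buf is not page aligned. */
    if (n > 0) {
        p[n - 1] = 0.0;
    }
    return writes;
}
```

Calling `touchPages(a, N)` and `touchPages(b, N)` at the top of `main` would be the way to try it. If the first "separate" run then lines up with the others, that supports the page fault guess.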

But if we now comment out the #include of <math.h> and instead use some cheap "fake" implementations of sin and tan:

    // #include <math.h>
    #define tan(x) ((x) + 1)
    #define sin(x) ((x) + 2)

we get very different behavior:

    will do 5 runs over 1024 MB of data

    separate: 0.548558 sec
    separate: 0.154741 sec
    separate: 0.151271 sec
    separate: 0.150542 sec
    separate: 0.151337 sec

    fused:    0.078880 sec
    fused:    0.074742 sec
    fused:    0.078313 sec
    fused:    0.076987 sec
    fused:    0.077729 sec

Here the computation is so cheap that other effects dominate, presumably memory traffic: the separate version makes two passes over b where the fused version makes one, and you get about a 2x difference.
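A rough way to sanity-check the memory traffic reading is to count array-sized streams per variant and see what bandwidth the timings would imply. This is a back-of-the-envelope model, not a measurement: it ignores caches and any extra traffic from write-allocate, and the 0.151 and 0.077 second inputs are just my eyeballed steady-state values from the output above. The function and constant names are made up for the sketch.

```c
#include <stdio.h>

/* Rough model: how many array-sized streams each variant moves. */
enum {
    SEPARATE_STREAMS = 4, /* read a, write b, then read b, write b */
    FUSED_STREAMS = 2     /* read a, write b */
};

/* Implied bandwidth in GiB/s for `streams` passes over arrays of
   `gibPerArray` GiB each, taking `seconds` seconds in total. */
double impliedBandwidth(int streams, double gibPerArray, double seconds) {
    return streams * gibPerArray / seconds;
}

/* Prints the implied bandwidth for the approximate steady-state timings.
   Each array is 128 Mi doubles, i.e. 1 GiB. */
void printTrafficModel(void) {
    printf("separate: %.1f GiB/s\n",
           impliedBandwidth(SEPARATE_STREAMS, 1.0, 0.151));
    printf("fused:    %.1f GiB/s\n",
           impliedBandwidth(FUSED_STREAMS, 1.0, 0.077));
}
```

Under this model both variants come out at roughly 26 GiB/s, which is at least consistent with both being limited by memory bandwidth rather than by the arithmetic.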