Hacker News new | ask | show | jobs
by tom_mellior 2419 days ago
Dragging a large data structure through the cache only once rather than twice can be beneficial. Also, GC will clean the intermediate array, but (depending on the specifics of the language and the data types involved) it might first have to make yet another scan through the data structure. So yes, it's linear, but possibly 2-3x slower.
1 comments

Then why do I have the following results?

    const a = [...Array(1000000).keys()];
    const m = 8;
    let leftAvg = .0;
    for(let _ of Array(m)) {
        const t0 = performance.now();
        a.map(Math.tan).map(Math.sin);
        const t1 = performance.now();
        leftAvg += (t1 - t0)/m;
    }
    let rightAvg = .0;
    for(let _ of Array(m)) {
        const t0 = performance.now();
        a.map(x => Math.sin(Math.tan(x)));
        const t1 = performance.now();
        rightAvg += (t1 - t0)/m;
    }
    console.log(leftAvg, rightAvg, leftAvg/rightAvg);
    // JS Firefox 70:  264 360.75 0.7318087318087318




    var a = Enumerable.Range(0, 10000000).Select(x => (double)x).ToArray();
    double[] xs1 = null;
    double[] xs2 = null;
    var m = 16;

    var leftAvg = .0;
    foreach (var _ in Enumerable.Range(0, m))
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        xs1 = a.Select(Math.Tan).ToArray().Select(Math.Sin).ToArray();
        watch.Stop();
        leftAvg += (double)watch.ElapsedMilliseconds / m;
    }

    var rightAvg = .0;
    foreach (var _ in Enumerable.Range(0, m))
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        xs2 = a.Select(x => Math.Sin(Math.Tan(x))).ToArray();
        watch.Stop();
        rightAvg += (double)watch.ElapsedMilliseconds / m;
    }

    Console.WriteLine($"{leftAvg} {rightAvg} {leftAvg / rightAvg}");
    // C# Results: 505.75 602.25 0.839767538397675
Hard to tell without more information, and probably more runs. Did the code run often enough for the JIT compiler to warm up sufficiently? If you're running in interpreted mode, or only baseline compiled mode, you have other overheads. Also, and this is admittedly something I should not have glossed over: The reading and writing costs of fused maps might be sped up by the factors I mentioned, but in your loops you also have allocations, which have their own costs and can complicate things. And the computation is not free either, though at sufficiently large sizes the memory accesses should dominate sind and tan, I think.

It also matters how this code is compiled exactly. The C# version (I know nothing about C# or how good its compiler is) looks like it must first allocate some kind of dynamic stream, and only when ToArray() is called can it allocate the final array, so there might be extra copying. Maybe the compiler is smart enough to optimize a sequence of arr.Select().ToArray() to allocate a target array of the size of arr right away, I don't know.

Also, the JavaScript version uses a smaller array than the C# version, is that on purpose? 1000000 unboxed doubles are only 8 MB, which is not very big: On the machine I'm typing this on, L3 cache is 6 MB.

My advice would be to run the JavaScript version many times, for many more than 8 iterations, and with sizes increasing stepwise up to a GB or so. Also try replacing the maps with preallocated arrays and hand-written loops that contain only the computations, not the allocations. I know this sounds like I'm trying to give you homework, which I'm not, but benchmarking is hard, and there are many factors to take into account.

> And the computation is not free either, though at sufficiently large sizes the memory accesses should dominate sind and tan, I think.

Looks like I was wrong about this! You might want to retry your experiments with cheaper operations than sin and tan.

I wrote a little C benchmark to test this more:

    #include <stdio.h>
    #include <time.h>
    #include <math.h>

    extern void sinTanSeparate(double *a, double *b, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = tan(a[i]);
        }
        for (int i = 0; i < n; i++) {
            b[i] = sin(b[i]);
        }
    }

    extern void sinTanFused(double *a, double *b, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = sin(tan(a[i]));
        }
    }

    #define N (128 * 1024 * 1024)
    #define RUNS 5

    double a[N];
    double b[N];

    int main(void) {
        clock_t start, end;

        printf("will do %d runs over %zu MB of data\n\n",
               RUNS, sizeof a / (1024 * 1024));

        for (int i = 0; i < RUNS; i++) {
            start = clock();
            sinTanSeparate(a, b, N);
            end = clock();
            printf("separate: %f sec\n", ((double) end - start) / CLOCKS_PER_SEC);
        }

        printf("\n");

        for (int i = 0; i < RUNS; i++) {
            start = clock();
            sinTanFused(a, b, N);
            end = clock();
            printf("fused:    %f sec\n", ((double) end - start) / CLOCKS_PER_SEC);
        }

        return 0;
    }

Compiling this with gcc -O3 gives:

    will do 5 runs over 1024 MB of data
    
    separate: 1.461349 sec
    separate: 1.020120 sec
    separate: 1.019002 sec
    separate: 1.019888 sec
    separate: 1.018454 sec
    
    fused:    1.014774 sec
    fused:    1.014724 sec
    fused:    1.013895 sec
    fused:    1.016440 sec
    fused:    1.013729 sec
So almost no difference, though with enough runs I think this would be significant. Interestingly, although C is not JIT compiled, even here there is a "warmup" effect. I guess these are initial page faults or something.

But if we now comment out <math.h> and instead use some cheap "fake" implementations of in and tan:

    // #include <math.h>
    #define tan(x) (x + 1)
    #define sin(x) (x + 2)
we get very different behavior:

    will do 5 runs over 1024 MB of data

    separate: 0.548558 sec
    separate: 0.154741 sec
    separate: 0.151271 sec
    separate: 0.150542 sec
    separate: 0.151337 sec

    fused:    0.078880 sec
    fused:    0.074742 sec
    fused:    0.078313 sec
    fused:    0.076987 sec
    fused:    0.077729 sec
Here the computation is so cheap that it's really other effects that dominate, and you get a 2x difference.