by manholio 1481 days ago
Seems like you can have your cake and eat it too, by manually parallelizing the code into something like this:

    double A4 = A+A+A+A;
    double Z = 3*A+B;
    double Y1 = C;
    double Y2 = A+B+C;

    int i;
    // ... setup unroll when LEN is odd...

    for(i=0; i<LEN; i++) {
        data[i] = Y1;
        data[++i] = Y2;
        Y1 += Z;
        Y2 += Z;
        Z += A4;
    }

Probably not entirely functional as written, but you get the idea: unroll the loop so that the data-dependent paths can each run in parallel. For the machine being considered, a 4-step unroll should achieve maximum performance, but of course you get all the fun things that come with hard-coding the architecture into your software.

The idea is right, but some details are wrong. You need a separate Z for each Y. But even if that's done, it is indeed faster.
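
For reference, here is a minimal sketch of what "a separate Z for each Y" could look like. It assumes the loop is meant to compute data[i] = A*i*i + B*i + C (which matches Y1 = C and Y2 = A+B+C above) and that the length is even. The function name and the Z1/Z2/A8 variables are mine, not from the comment above. With a stride of 2, each lane's first difference is f(i+2) - f(i) and the shared second difference is 8*A.

```c
#include <assert.h>
#include <stddef.h>

/* Fill data[i] = A*i*i + B*i + C using two independent running sums.
   Hypothetical helper: assumes len is even, like the loop in the comment. */
void fill_quadratic_2way(double *data, size_t len, double A, double B, double C)
{
    assert(len % 2 == 0);

    double Y1 = C;               /* f(0) */
    double Y2 = A + B + C;       /* f(1) */
    double Z1 = 4*A + 2*B;       /* f(2) - f(0) */
    double Z2 = 8*A + 2*B;       /* f(3) - f(1) */
    const double A8 = 8*A;       /* second difference at stride 2 */

    for (size_t i = 0; i < len; i += 2) {
        data[i]     = Y1;
        data[i + 1] = Y2;
        Y1 += Z1;                /* lane 1 depends only on Y1, Z1 */
        Y2 += Z2;                /* lane 2 depends only on Y2, Z2 */
        Z1 += A8;
        Z2 += A8;
    }
}
```

The two lanes share no loop-carried variable, so the additions for Y1/Z1 and Y2/Z2 can overlap. Note that running sums accumulate rounding error differently from evaluating the formula directly, so results may not be bit-identical for large lengths.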
I'm shocked and aghast to hear you found a bug in my code - I assure you it compiled and ran flawlessly in my brain.
Looks like someone wrote pretty good parallelizable code on the original question, here: https://stackoverflow.com/a/72333152
That code isn't faster for me, while manholio's is. And his gets even faster with 4x parallelization.
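
A rough sketch of how the 4x variant might look, again assuming the target formula is data[i] = A*i*i + B*i + C. The function name, the LANES macro and the Y/Z arrays are hypothetical, and this is not necessarily the code that was benchmarked. Each lane k starts at f(k), steps by f(k+LANES) - f(k), and all lanes share the second difference 2*A*LANES*LANES.

```c
#include <stddef.h>

#define LANES 4  /* hypothetical unroll factor; 4 matches the "4x" above */

/* Fill data[i] = A*i*i + B*i + C with LANES independent running sums. */
void fill_quadratic_4way(double *data, size_t len, double A, double B, double C)
{
    double Y[LANES], Z[LANES];
    const double step = 2.0 * A * LANES * LANES;  /* shared second difference */

    for (int k = 0; k < LANES; k++) {
        Y[k] = A*k*k + B*k + C;                              /* f(k) */
        Z[k] = A*(2.0*k*LANES + LANES*LANES) + B*LANES;      /* f(k+LANES) - f(k) */
    }

    size_t i = 0;
    for (; i + LANES <= len; i += LANES) {
        for (int k = 0; k < LANES; k++) {   /* fixed trip count, easy to unroll */
            data[i + k] = Y[k];
            Y[k] += Z[k];
            Z[k] += step;
        }
    }

    /* Tail: Y[k] already holds f(i + k) for the leftover elements. */
    for (size_t k = 0; i < len; i++, k++)
        data[i] = Y[k];
}
```

Whether the compiler keeps the lanes in registers or vectorizes the inner loop depends on the compiler and flags, so any speedup should be measured rather than assumed.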