Hacker News new | ask | show | jobs
by sampo 1477 days ago
> Unroll the dependency until you are longer than the SIMD width.

> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

    void compute()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A128 = 128*A;
        double Y[8], Z[8];
    
        Y[0] =               C;
        Y[1] =     A +   B + C;
        Y[2] =   4*A + 2*B + C;
        Y[3] =   9*A + 3*B + C;
        Y[4] =  16*A + 4*B + C;
        Y[5] =  25*A + 5*B + C;
        Y[6] =  36*A + 6*B + C;
        Y[7] =  49*A + 7*B + C;
        Z[0] =  64*A + 8*B;
        Z[1] =  80*A + 8*B;
        Z[2] =  96*A + 8*B;
        Z[3] = 112*A + 8*B;
        Z[4] = 128*A + 8*B;
        Z[5] = 144*A + 8*B;
        Z[6] = 160*A + 8*B;
        Z[7] = 176*A + 8*B;
    
        int i;
        for(i=0; i<LEN; i+=8) {
            data[i  ] = Y[0];
            data[i+1] = Y[1];
            data[i+2] = Y[2];
            data[i+3] = Y[3];
            data[i+4] = Y[4];
            data[i+5] = Y[5];
            data[i+6] = Y[6];
            data[i+7] = Y[7];
            Y[0] += Z[0];
            Y[1] += Z[1];
            Y[2] += Z[2];
            Y[3] += Z[3];
            Y[4] += Z[4];
            Y[5] += Z[5];
            Y[6] += Z[6];
            Y[7] += Z[7];
            Z[0] += A128;
            Z[1] += A128;
            Z[2] += A128;
            Z[3] += A128;
            Z[4] += A128;
            Z[5] += A128;
            Z[6] += A128;
            Z[7] += A128;
        }
    }
2 comments

> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.

I'm always "surprised" by the fact that CPUs care more about bandwidth rather than latency these days. A lot of CPUs (Intel, AMD, ARM, etc. etc.) support 1x or even 2x SIMD-multiplications per clock tick, even though they take 5 clock ticks to execute.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days (throughput wise), even though its a big-deal latency wise.

So getting rid of those multiplies and cutting down the latency (ie: using only add statements) barely helps at all, maybe with no measurable difference.

One of these days, I'll actually remember that fact, lol.

On my machine, your code is faster for smaller LEN values. I'm not sure why this is though.
8x 64-bit is 512-bit, which is designed for AVX512. You'll probably need AVX512 to fully benefit from unrolling x8.

4x 64-bit is 256-bit, which requires special compiler flags for 256-bit AVX2, but most x86 CPUs should support them these days.

2x64-bit is 128-bit, which fits in default SSE 128-bit SIMD with default GCC / Visual Studio compiler flags.