

	by zeusk 1488 days ago
	How would the second approach be vectorized, given that each iteration's input depends on the previous iteration's output?


dragontamer 1488 days ago

Unroll the dependency until you are longer than the SIMD width.

Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Or in other words: i+7 can depend on i-1 with no problem.
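
A minimal sketch of that idea, using a made-up recurrence (not the one from the SO post) where each element depends on the element 8 positions earlier. The names `stride8_recurrence`, `x`, `d`, and `STRIDE` are hypothetical. Since the 8 iterations of the inner loop never read each other's results, they can in principle be executed as one 8-wide SIMD add:

```c
#include <assert.h>
#include <stddef.h>

#define STRIDE 8  /* assumed SIMD width, in elements */

/* x[i] depends on x[i-STRIDE], never on x[i-1] .. x[i-STRIDE+1], so the
   STRIDE iterations of the inner loop are independent of each other.
   The caller must initialize the first STRIDE elements of x. */
void stride8_recurrence(double *x, const double *d, size_t n)
{
    assert(n % STRIDE == 0);
    for (size_t i = STRIDE; i < n; i += STRIDE) {
        for (size_t k = 0; k < STRIDE; k++) {   /* candidate for one SIMD add */
            x[i + k] = x[i + k - STRIDE] + d[i + k];
        }
    }
}
```

Whether a given compiler actually vectorizes the inner loop depends on flags and on whether it can rule out aliasing between `x` and `d`.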


sampo 1487 days ago

> Unroll the dependency until you are longer than the SIMD width.

> Ex: as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other, you can vectorize to SIMD-width 8.

Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

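    // f(i) = A*i*i + B*i + C. Assumes data[] and LEN are defined as in the
    // SO post, with LEN a multiple of 8.
    void compute()
    {
        const double A = 1.1, B = 2.2, C = 3.3;
        const double A128 = 128*A;  // each Z[k] grows by 2*A*8*8 per step of 8
        double Y[8], Z[8];          // Y[k] = f(i+k), Z[k] = f(i+k+8) - f(i+k)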
    
        Y[0] =               C;
        Y[1] =     A +   B + C;
        Y[2] =   4*A + 2*B + C;
        Y[3] =   9*A + 3*B + C;
        Y[4] =  16*A + 4*B + C;
        Y[5] =  25*A + 5*B + C;
        Y[6] =  36*A + 6*B + C;
        Y[7] =  49*A + 7*B + C;
        Z[0] =  64*A + 8*B;
        Z[1] =  80*A + 8*B;
        Z[2] =  96*A + 8*B;
        Z[3] = 112*A + 8*B;
        Z[4] = 128*A + 8*B;
        Z[5] = 144*A + 8*B;
        Z[6] = 160*A + 8*B;
        Z[7] = 176*A + 8*B;
    
        int i;
        for(i=0; i<LEN; i+=8) {
            data[i  ] = Y[0];
            data[i+1] = Y[1];
            data[i+2] = Y[2];
            data[i+3] = Y[3];
            data[i+4] = Y[4];
            data[i+5] = Y[5];
            data[i+6] = Y[6];
            data[i+7] = Y[7];
            Y[0] += Z[0];
            Y[1] += Z[1];
            Y[2] += Z[2];
            Y[3] += Z[3];
            Y[4] += Z[4];
            Y[5] += Z[5];
            Y[6] += Z[6];
            Y[7] += Z[7];
            Z[0] += A128;
            Z[1] += A128;
            Z[2] += A128;
            Z[3] += A128;
            Z[4] += A128;
            Z[5] += A128;
            Z[6] += A128;
            Z[7] += A128;
        }
    }


dragontamer 1487 days ago

> Do you mean like this? I get this to about as fast as the first "unoptimized" version in the SO post, but not faster.

Yeah, something like that. I haven't double-checked your math, but the idea is what I was going for.
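
One way to double-check the math is to compare the stride-8 recurrence against direct evaluation of A*i*i + B*i + C. The sketch below is self-contained; `LEN`, `data`, `compute_stride8`, and `max_rel_error` are names chosen here for illustration (the SO post's own definitions may differ), and `LEN` is kept small and a multiple of 8.

```c
#define LEN 1024            /* assumed size; must be a multiple of 8 */
static double data[LEN];

static const double A = 1.1, B = 2.2, C = 3.3;

/* Same scheme as the code above, with the initial values computed in a
   loop: Y[k] = f(k), Z[k] = f(k+8) - f(k) = (64 + 16k)*A + 8*B. */
void compute_stride8(void)
{
    double Y[8], Z[8];
    for (int k = 0; k < 8; k++) {
        Y[k] = A*k*k + B*k + C;
        Z[k] = (64 + 16*k)*A + 8*B;
    }
    for (int i = 0; i < LEN; i += 8) {
        for (int k = 0; k < 8; k++) {
            data[i + k] = Y[k];
            Y[k] += Z[k];
            Z[k] += 128*A;   /* f(i+16) - 2*f(i+8) + f(i) = 128*A */
        }
    }
}

/* Largest relative difference between data[] and the direct formula.
   The direct value is always positive for these constants. */
double max_rel_error(void)
{
    double worst = 0.0;
    for (int i = 0; i < LEN; i++) {
        double exact = A*i*i + B*i + C;
        double err = data[i] - exact;
        if (err < 0) err = -err;
        err /= exact;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling `compute_stride8()` and then `max_rel_error()` should give a value close to machine precision; a bug in the Y or Z initializers would show up as a large error.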

I'm always "surprised" by how much more modern CPUs care about throughput than latency. A lot of CPUs (Intel, AMD, ARM, etc.) can start one or even two SIMD multiplications per clock tick, even though each one takes about 5 clock ticks to finish.

I guess the original "simple" code may have had a multiply in there, but that's not a big deal these days throughput-wise, even though it's a big deal latency-wise.

So getting rid of those multiplies and cutting down the latency (i.e. using only add statements) barely helps at all, maybe with no measurable difference.

One of these days, I'll actually remember that fact, lol.
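
As a rough illustration of the latency/throughput distinction (a generic example, not code from the thread), compare a sum with one accumulator to a sum with four. The function names are made up. In the first, every add has to wait for the previous one; in the second, the four adds per iteration are independent, so the CPU can overlap them:

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the previous add, so the loop is
   limited by add latency. */
double sum_one_chain(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: the adds within one iteration don't
   depend on each other, so the loop can get closer to the CPU's add
   throughput. n is assumed to be a multiple of 4. */
double sum_four_chains(const double *x, size_t n)
{
    assert(n % 4 == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The two versions add in a different order, so for general floating-point input the results can differ in the last bits.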


btdmaster 1487 days ago

On my machine, your code is faster for smaller LEN values. I'm not sure why, though.


dragontamer 1487 days ago

8x 64-bit is 512-bit, which is the AVX-512 register width. You'll probably need AVX-512 to fully benefit from unrolling x8.

4x 64-bit is 256-bit, which needs compiler flags to enable AVX/AVX2 (e.g. -mavx2 with GCC), though most x86 CPUs support it these days.

2x 64-bit is 128-bit, which fits in 128-bit SSE2 and works with default GCC / Visual Studio compiler flags on x86-64.
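
To try the other widths, the same recurrence can be written for an arbitrary lane count. This is only a sketch: `WIDTH` and `fill_quadratic` are hypothetical names, and the code is plain C that relies on the compiler to vectorize the inner loop rather than using intrinsics.

```c
#include <assert.h>
#include <stddef.h>

#ifndef WIDTH
#define WIDTH 4   /* assumed lane count: 2 (SSE2), 4 (AVX), 8 (AVX-512) */
#endif

/* Fill out[i] = A*i*i + B*i + C for i in [0, n) using WIDTH independent
   recurrences. n must be a multiple of WIDTH. */
void fill_quadratic(double *out, size_t n, double A, double B, double C)
{
    assert(n % WIDTH == 0);
    const double w = WIDTH;
    double Y[WIDTH], Z[WIDTH];
    for (int k = 0; k < WIDTH; k++) {
        Y[k] = A*k*k + B*k + C;          /* f(k) */
        Z[k] = A*(2*k*w + w*w) + B*w;    /* f(k+w) - f(k) */
    }
    const double step = 2*A*w*w;         /* constant second difference */
    for (size_t i = 0; i < n; i += WIDTH) {
        for (int k = 0; k < WIDTH; k++) {
            out[i + k] = Y[k];
            Y[k] += Z[k];
            Z[k] += step;
        }
    }
}
```

With `WIDTH` set to 8 this reduces to the constants in the x8 code upthread: Z[k] = (64 + 16k)*A + 8*B and a step of 128*A.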


yongjik 1487 days ago

If they were integer variables, I guess the compiler would have done that, but you can't really do that with floats because i+A+A is not necessarily i+2*A. (Of course, in this particular example, the difference doesn't matter for the programmer, but the compiler doesn't know that!)

I think there's some gcc option that enables these "dangerous" optimizations. -ffast-math, or something like that?
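
A small sketch of why the compiler can't make that rewrite on its own. The function name is made up, and `volatile` is only there to keep the compiler from folding the arithmetic at compile time:

```c
/* Returns 1 if (x + a) + a and x + 2*a round to different doubles. */
int reassociation_changes_result(double x, double a)
{
    volatile double step_by_step = x + a;   /* rounded once here */
    step_by_step = step_by_step + a;        /* and again here */
    volatile double combined = x + 2*a;     /* rounded only once */
    return step_by_step != combined;
}
```

For x = 9007199254740992.0 (2^53, above which adjacent doubles are 2 apart) and a = 1.0, each single add of 1.0 rounds back to x, while x + 2.0 is exactly representable, so the function returns 1. For small values such as x = 1.0, a = 1.0 it returns 0.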


ummonk 1487 days ago

No, the compiler would have been unlikely to figure out the math to coalesce 8 recursive additions into one operation.
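
For reference, the algebra in question, assuming the SO post's polynomial f(i) = A i^2 + B i + C and a stride of 8:

```latex
\begin{aligned}
f(i+8) - f(i) &= A\left((i+8)^2 - i^2\right) + 8B = 16A\,i + 64A + 8B,\\
\bigl(f(i+16) - f(i+8)\bigr) - \bigl(f(i+8) - f(i)\bigr) &= 16A \cdot 8 = 128A.
\end{aligned}
```

The first line gives the Z[k] initializers in the x8 code upthread (set i = k), and the second gives its A128 constant.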


zeusk 1487 days ago

> as long as i, i+1, i+2, i+3, ... i+7 are not dependent on each other

I really don't see how that works in improving this.

You can only calculate i+8; to calculate i+9 you depend on i+8. And you can't go in strides either, since i+16 depends on i+15, which you haven't calculated yet, unless you want to intermix the stateful and non-stateful code. I'd rather not go there.
