| HN Mirror


by cogman10 1490 days ago

The reverse is true.

SIMD is harder because you have to have a uniform operation across a set of data.

Imagine a for loop that looks like this:

    int[] x, y, z;
    int[] p, d, q;

    for (int i = 0; i < size; ++i) {
       p[i] = x[i] / z[i];
       d[i] = z[i] * x[i];
       q[i] = y[i] + z[i];
    }

For SIMD, this is a complicated mess for the compiler to unravel. What the compiler would LIKE to do is turn this into 3 for loops and use the SIMD instructions to perform those operations in parallel.
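For illustration, the "3 for loops" transformation mentioned here is usually called loop fission. The following is only a sketch in C: the function names `fused_int` and `fissioned_int` are hypothetical, and it assumes that the output arrays do not overlap the inputs or each other and that no element of `z` is zero.

```c
#include <stddef.h>

/* Original shape: three independent statements in one loop body. */
void fused_int(size_t size, const int *x, const int *y, const int *z,
               int *p, int *d, int *q) {
    for (size_t i = 0; i < size; ++i) {
        p[i] = x[i] / z[i];   /* assumes z[i] != 0 */
        d[i] = z[i] * x[i];
        q[i] = y[i] + z[i];
    }
}

/* After loop fission: each loop applies one uniform operation.
   Only equivalent if p, d, q do not alias x, y, z (assumed here). */
void fissioned_int(size_t size, const int *x, const int *y, const int *z,
                   int *p, int *d, int *q) {
    for (size_t i = 0; i < size; ++i)
        p[i] = x[i] / z[i];
    for (size_t i = 0; i < size; ++i)
        d[i] = z[i] * x[i];
    for (size_t i = 0; i < size; ++i)
        q[i] = y[i] + z[i];
}
```

Both versions compute the same results; the fissioned form reads `x` and `z` from memory more than once, which is one reason a compiler may prefer to keep the loop fused when it can.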

The Itanium optimization, however, is a lot easier. The compiler can see that none of p, d, or q depends on the results of the other statements (that is, q[i] doesn't depend on p[i]). As a result, the entire thing can be packed into a single operation.

Now, of course, modern out-of-order (OoO) processors can do the same optimization, so maybe it's not a huge win? Still, it would have been something worth exploring more (IMO), but market forces killed it. Moving that sort of optimization out of the processor hardware and into the compiler software seems like it could lead to some nice power/performance benefits.

3 comments

jcranmer 1490 days ago

That loop is actually nicely vectorizable, at least assuming that you replace int with float (there is no integer division vector instruction on x86).

All of the array accesses are uniform, so the resulting vector code is roughly:

  for (i = 0 .. size by vector width) {
    r0 = vector load x[i..i + vw]
    r1 = vector load y[i..i + vw]
    r2 = vector load z[i..i + vw]
    r3 = r0 / r2
    r4 = r2 * r0
    r5 = r1 + r2
    vector store r3 to p[i..i + vw]
    vector store r4 to d[i..i + vw]
    vector store r5 to q[i..i + vw]
  }

(and probably unroll the loop for good measure). No need to fission the loop to vectorize here.
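As a rough C counterpart to the pseudocode above, here is a sketch of the same loop with `float` elements. The function name `kernel_float` is hypothetical, and the `restrict` qualifiers are an added assumption that none of the six arrays overlap. With that promise, compilers such as GCC and Clang can typically auto-vectorize a loop of this shape at higher optimization levels without fissioning it, though the exact output depends on the compiler, flags, and target.

```c
#include <stddef.h>

/* One fused loop over float data. restrict tells the compiler the
   arrays do not alias (an assumption of this sketch), so the three
   statements can each become a vector operation on the same loads. */
void kernel_float(size_t size,
                  const float *restrict x,
                  const float *restrict y,
                  const float *restrict z,
                  float *restrict p,
                  float *restrict d,
                  float *restrict q) {
    for (size_t i = 0; i < size; ++i) {
        p[i] = x[i] / z[i];   /* vector divide exists for float on x86 */
        d[i] = z[i] * x[i];
        q[i] = y[i] + z[i];
    }
}
```

Without `restrict`, a compiler usually has to add runtime overlap checks or fall back to scalar code, since a store to `p[i]` could otherwise change a later `x[i]` or `z[i]`.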


sifar 1490 days ago

And any VLIW compiler worth its salt would bundle the load, div/mul/ALU, and store into one instruction packet.


sifar 1490 days ago

>> For SIMD, this is a complicated mess for the compiler to unravel

This is trivially vectorizable for SIMD and would fit nicely in a VLIW packet too. The only issue is that if any access hit a runtime memory stall, the entire pipeline would stall.

With predication, modern SIMD can even vectorize loops containing if conditions, like the one below.

    int[] x, y, z;
    int[] p, d, q;

    for (int i = 0; i < size; ++i) {
       p[i] = x[i] / z[i];
       d[i] = z[i] * x[i];
       if (i > n) {
         q[i] = y[i] + z[i];
       } else {
         q[i] = y[i];
       }
    }
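One way to picture what predication does to the `q` part of this loop is if-conversion: the branch becomes a per-element mask, so every iteration executes the same instructions. The C sketch below is only an illustration, not what any particular compiler emits. The function names `q_branchy` and `q_masked` are hypothetical, and the mask trick assumes two's-complement `int`.

```c
#include <stddef.h>

/* Branchy form of the q update, as written in the loop above. */
void q_branchy(size_t size, size_t n,
               const int *y, const int *z, int *q) {
    for (size_t i = 0; i < size; ++i) {
        if (i > n) {
            q[i] = y[i] + z[i];
        } else {
            q[i] = y[i];
        }
    }
}

/* If-converted form: the condition becomes a mask that is all ones
   when i > n and zero otherwise (two's complement assumed), so the
   loop body has no control flow. */
void q_masked(size_t size, size_t n,
              const int *restrict y, const int *restrict z,
              int *restrict q) {
    for (size_t i = 0; i < size; ++i) {
        int mask = -(int)(i > n);
        q[i] = y[i] + (z[i] & mask);
    }
}
```

Hardware predication (for example, mask registers) applies the same idea per vector lane: lanes where the condition is false simply do not contribute the `z[i]` term.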


hajile 1490 days ago

VLIW architecture is so bad that AMD and Nvidia couldn't make it work well with embarrassingly parallel graphics code. AMD first moved from VLIW-5 to VLIW-4 because they couldn't find enough data to reliably keep unit 5 busy.

AMD then followed Nvidia into the world of SIMD/SIMT because it offered better real-world performance for the majority of applications.

VLIW has been tried repeatedly only to be replaced with something that worked better.
