Hacker News
by jcranmer 1492 days ago
That loop is actually nicely vectorizable, at least assuming that you replace int with float (there is no integer division vector instruction on x86).

All of the array accesses are uniform, so the resulting vector code is roughly:

  for (i = 0 .. size by vector width) {
    r0 = vector load x[i..i + vw]
    r1 = vector load y[i..i + vw]
    r2 = vector load z[i..i + vw]
    r3 = r0 / r2
    r4 = r2 * r0
    r5 = r1 + r2
    vector store r3 to p[i..i + vw]
    vector store r4 to d[i..i + vw]
    vector store r5 to q[i..i + vw]
  }
(and probably unroll the loop for good measure). No need to fission the loop to vectorize here.
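
For reference, here is a minimal sketch of a scalar loop that would produce the vector code above. The original loop isn't quoted in this comment, so the function name `fused_loop`, the `float` element type, and the `restrict` qualifiers are assumptions reconstructed from the pseudocode, not the original poster's code:

```c
#include <stddef.h>

/* Hypothetical reconstruction of the loop, inferred from the pseudocode:
 * three input arrays (x, y, z) and three output arrays (p, d, q).
 * `restrict` is an assumption here: it promises the compiler that the
 * arrays do not overlap, which is what lets it vectorize without
 * emitting runtime alias checks. */
void fused_loop(size_t size,
                const float *restrict x,
                const float *restrict y,
                const float *restrict z,
                float *restrict p,
                float *restrict d,
                float *restrict q)
{
    /* Every access is at index i with unit stride, so each statement
     * maps onto one vector load/op/store per iteration group. */
    for (size_t i = 0; i < size; i++) {
        p[i] = x[i] / z[i];  /* r3 = r0 / r2 */
        d[i] = z[i] * x[i];  /* r4 = r2 * r0 */
        q[i] = y[i] + z[i];  /* r5 = r1 + r2 */
    }
}
```

Whether a given compiler actually vectorizes this depends on its version and flags, but with the no-alias guarantee and float elements there should be nothing forcing it to split the loop first.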
1 comments

And any VLIW compiler worth its salt would bundle the load, div/mul/ALU, and store into one instruction packet.