|
In the post, multiplications and 2 additions are not faster than 2 additions. The post compares (1) loop code that can be vectorized, as loop rounds are independent and do not depend on the result from the previous round, and (2) an "optimization" that makes calculations shorter, but also makes each loop round depend on the result of the previous round, so this cannot be vectorized. |
The first version is simple. The second version is more complicated. The simplicity is because the first version can be represented as a formula; imagine trying to write out the second version as a formula, and the complexity becomes obvious.
(The addition loop would have to be written as a recurrence relation, which of course means every step depends on the previous step.)
Complexity isn’t always obvious. It’s sometimes common in C programs to write a loop like dst[index++] = src[i], particularly in deeply nested for-loops. In my experience it’s almost always worth rewriting it so that the indices are computable entirely from the loop iteration variables, with no state. It helps you build a mental map of the operations, because you can think of it in terms of geometry (memcpy = copying rectangles onto other larger rectangles) whereas when the index is stateful it becomes quite a lot harder to visualize in higher dimensions. At least for me.
We’ve been building a compiler at Groq for our custom ML hardware. (Roughly, “turn pytorch into our ISA with no extra programmer effort.”) I used to think of posts like this as “Well, modern compilers are so fast, who really cares about autovectorization?” — it turns out you care when you need to write your own compiler. :)
MLIR is pretty cool. It makes a lot of transformations like this pretty easy. The MLIR “krnl” dialect can also automatically transform nested loops into tiled iteration. Graphics devs will know what I mean — no need to loop over 8x8 blocks of pixels manually, just write “for x in range(width): for y in range(height): …” and set the block size to 8.