Hacker News new | ask | show | jobs
by loeg 1484 days ago
The other essential insight is that the second version has data dependencies between loop iterations. Without that, the tldr is incomplete.
1 comments

That is the 'why' autovectorization fails. But it seems solvable in this case pretty easily actually.

EDIT: Solvable by a human. I dunno much about compilers though, so I dunno if there's some kind of language issue that would prevent the unrolling of this dependency.

I think the point GP is making is that even without vectorization, the data dependency causes stalls even in normal, single data instructions. That is, a data dependency between iterations of loops will hurt performance even for non-vectorizable calculations (or on CPUs with high ILP but no really good vector instructions, which granted is probably a small pool since both those things came about at about the same time).
I think that is the point Peter Cordes is trying to make (quite politely but firmly) over there on stackoverflow. It's not only about autovectorization. The main point there is the loop-carried dependency that prevents both the compiler (autovec) and the processor (ILP) to do their thing.

Loop-carried dependency is the big culprit here. I wish we had a culture of writing for loops with the index a constant inside the loop, as in the Ada for statement, and not the clever C while loops or for with two running variables... Simpler loop syntax makes so many static analyses 'easier' and kind of forces the brain to think in bounded independant steps, or to reach for higher level constructs (e.g. reduce).

I wonder if it becomes a math problem to optimize then, like Euler solving the sum of 1-100 by adding 1 to 100 and multiplying by the 50 pairs of numbers that operation created to get 5050?
It's a different issue altogether. Even in vectorized code artificial data dependencies are an issue. E.g., for fast dot products (not actually usually applicable on most input sizes because of memory bandwidth, but related problems can be CPU bound) you'll roughly double your execution speed by alternating between two running totals and merging those at the end, and that gain is mostly orthogonal to the question of whether those running totals are vectorized.

Edit: And many modern compilers have been doing that particular optimization for a few years anyway, but it's still an important idea to keep in mind for any non-trivial graph of operations.