|
|
|
|
|
by dragontamer
2760 days ago
|
|
Fully agree. In most cases on modern systems, small loops should remain compact as possible, to stay in the uop cache. The "for" loop overhead (the inc, cmp, and jmp instructions) effectively execute in parallel. Modern systems are highly out-of-order and the for-loop overhead is virtually nil. |
|
For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.
It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.
So this "unrolling rarely helps" idea is just as much of a myth as "unrolling never helps". The main problem with unrolling is that the compilers usually don't do it intelligently - usually loops are unrolling if some kind of threshold is met, depending on compiler options - but this always happens in kind of a feed-forward way, rather thank a feed-back way, which would involve unrolling the loop and analyzing the benefit and costs after further optimization passes.