|
|
|
|
|
by jandrewrogers
4707 days ago
|
|
This article overlooks a major factor in bit-twiddling performance on modern CPUs: saturation of the execution ports in a CPU core. An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations every clock cycle mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may be sitting idle for a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.) Consequently, algorithms with a few extra operations but more operation parallelism will frequently be faster than an equivalent algorithm where the operations are necessarily serialized in the CPU. Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as using splitting the algorithm across two accumulator variables and combining them at the end when any normal programmer would just use one variable to achieve the same thing can have a large impact on performance. I once doubled the performance of a bit-twiddling algorithm simply by taking the algorithm and using three variables instead of one. The algorithm was identical but the use of three registers exposed the available parallelism to the CPU. |
|
Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper threading will take advantage of them too).
Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.
All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.
[1] http://www.anandtech.com/show/6355/intels-haswell-architectu...