Hacker News new | ask | show | jobs
by jandrewrogers 4707 days ago
This article overlooks a major factor in bit-twiddling performance on modern CPUs: saturation of the execution ports in a CPU core.

An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations every clock cycle mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may be sitting idle for a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.)

Consequently, algorithms with a few extra operations but more operation parallelism will frequently be faster than an equivalent algorithm where the operations are necessarily serialized in the CPU.

Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as using splitting the algorithm across two accumulator variables and combining them at the end when any normal programmer would just use one variable to achieve the same thing can have a large impact on performance. I once doubled the performance of a bit-twiddling algorithm simply by taking the algorithm and using three variables instead of one. The algorithm was identical but the use of three registers exposed the available parallelism to the CPU.

3 comments

This is an excellent point, however, there are a few things to keep in mind: first, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating-point since we’re talking about bit twiddling).

Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper threading will take advantage of them too).

Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.

All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.

[1] http://www.anandtech.com/show/6355/intels-haswell-architectu...

Both GCC and Clang are surprisingly mediocre at this kind of optimization. I write a lot of extreme performance integer algorithms and those compilers only seem to find "obvious" parallel instruction schedules about half the time even in isolated contexts.

Fortunately, it is pretty simple to induce the desired optimization from the C code without resorting to much cleverness. The compilers miss these optimizations often enough that I frequently double check if I care. Still, it requires fairly detailed knowledge of the microarchitecture.

I do not do microarchitecture optimization work very often. The last time I did, it was to design a faster, better hash function to replace Google's CityHash (and the result was faster and stronger). For most codes, memory behaviors dominate with respect to performance.

> Both GCC and Clang are surprisingly mediocre at this kind of optimization.

Clang, at least, has gotten significantly better in the past year (I haven’t used GCC for a while), but there’s certainly room for improvement. Please report bugs for cases that your compiler misses if you can spare a few minutes.

Have you published your hash function?
Not yet but will soon. It is not just one function but an entire family of functions with some interesting aspects beyond just the algorithms.

The hash functions were algorithmically optimized around a scaffolding I designed that guaranteed certain performance characteristics and easy analyzability. It has literally produced many, many thousands of high quality hash functions. Tens of thousands of CPU hours have been burned on the optimization process, which is still running, and the ones that have been put into use so far were early samples pulled out of that process but the hash functions still being processed in the pipeline are statistically more robust than the earlier versions. Much easier than trying to design them the old fashioned way. At some point soon, since the optimization is converging, I will evaluate the most promising parts of the phase space to select the strongest and most aesthetic functions and publish those into the public domain.

how does it compare to, say, tabulation as in

Mihai Patrascu, Mikkel Thorup: The Power of Simple Tabulation Hashing. J. ACM 59(3): 14 (2012) http://doi.acm.org/10.1145/2220357.2220361 (free version: http://arxiv.org/abs/1011.5200)

or also

Mihai Patrascu, Mikkel Thorup: Twisted Tabulation Hashing. SODA 2013: 209-228 http://knowledgecenter.siam.org/0236-000005/

This is exactly what I was planning on doing one of these days. After looking at CityHash I was sure it could be beat.
This is an excellent point, however, there are a few things to keep in mind: first, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating-point since we’re talking about bit twiddling).

For people who know what they're doing (e.g. DarkShikari), the compiler doesn't stand a chance: http://www.scribd.com/doc/137419114/Introduction-to-AVX2-opt... (via https://news.ycombinator.com/item?id=5598010)

P.S. Scribd stinks. The important numbers are on the hidden second and third pages. Do people know that Scribd is doing this to their documents? That document is CC-BY-NC -- charging money to read pages 2 and 3 or download the original is not NC.

Found the original here: http://mailman.videolan.org/pipermail/x264-devel/attachments...

Non-trivial vectorization is an entirely different sort of optimization than re-association to extract ILP (especially in the integer domain where the vector instructions often have no exact non-vector analogue).

Yes, there are people who have the knowledge and experience to regularly beat the compiler. I do it professionally. My point is not "don't bother optimizing, let the compiler do it". My point is "this particular micro-optimization is less valuable in the real world than focused benchmarks suggest, and good compilers often do it (this one specific optimization) for you, anyway."

Was there a followup to that article? I'm pretty sure the NDA period should be up by now and I'm wondering if the author posted a more in-depth explanation and overview.
Absolutely true.

Note that micro-fusion can allow you to push the number of uops retired per cycle to 5-6 (SNB would be 3 ALU, 2 loads, Haswell could be 4 ALU, 2 loads), as the issue/retire rates are on the fused domain.

It's a far cry from RISC - like a load/store machine.

All that being said micro-fusion only works with an ALU op and load from the same instruction and you obviously have to be careful to ensure that the load applies to the appropriate operand (tricky given the non-orthogonal, frequently 2-address form of the instructions).

You also need to have loads to do. Adding spurious loads just to boost uop retirement rate would be an interesting choice. :)
Very very good points! Relatedly: for any performance sensitive code, reading the relevant version of the Intel Optimization manual + a book like Hacker's Delight will lead to a lot of good understanding of these trick.

(admission: i'm spending a lot of my time staring at ways to make it really really easy to write fast numerical codes, so thinking about the ports on modern CPUs is very very helpful)

Could you recommend a good reading (a book preferably) for learning about CPU architecture. I'm aware of Intel Software Developer's Manual. What other resources are worth reading?
Computer Architecture: A Quantitative Approach by Hennesey and Patterson is the definitive text, and it is good (at least the 3rd edition I read back in 02).

One way to put the ideas you learn into practice is to try to write an optimized large NxN matrix multiplication routine. Start in C, converting the kernel to x86 code. Also disassemble the generated C code in the kernel to see what the compiler is doing. See how close you can get to the theoretical peak CPU performance.

It is fun stuff. While this kind of optimization is rarely needed, doing so (like learning lisp) will make you a better programmer.