Hacker News new | ask | show | jobs
by stephencanon 4707 days ago
This is an excellent point, however, there are a few things to keep in mind: first, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating-point since we’re talking about bit twiddling).

Second, bit-reversal never exists in a vacuum. There are other operations taking place around it, which will fill in unused execution resources, thanks to out-of-order execution. (And as you note, hyper threading will take advantage of them too).

Third, even though there are six ports (actually, 8 ports and 4 ALUs in Haswell![1]), that i7 can still only retire 4 fused uops per cycle, so in practice one thread cannot saturate all of the execution ports, no matter how cleverly it is optimized.

All of this combines to mean that the fastest bit-reversal in isolation may not be the fastest bit-reversal in situ, which is much more important. Actually evaluating that is much more complex, but it does tend to tip things away from chasing too much ILP slightly more than isolated timing does.

[1] http://www.anandtech.com/show/6355/intels-haswell-architectu...

3 comments

Both GCC and Clang are surprisingly mediocre at this kind of optimization. I write a lot of extreme performance integer algorithms and those compilers only seem to find "obvious" parallel instruction schedules about half the time even in isolated contexts.

Fortunately, it is pretty simple to induce the desired optimization from the C code without resorting to much cleverness. The compilers miss these optimizations often enough that I frequently double check if I care. Still, it requires fairly detailed knowledge of the microarchitecture.

I do not do microarchitecture optimization work very often. The last time I did, it was to design a faster, better hash function to replace Google's CityHash (and the result was faster and stronger). For most codes, memory behaviors dominate with respect to performance.

> Both GCC and Clang are surprisingly mediocre at this kind of optimization.

Clang, at least, has gotten significantly better in the past year (I haven’t used GCC for a while), but there’s certainly room for improvement. Please report bugs for cases that your compiler misses if you can spare a few minutes.

Have you published your hash function?
Not yet but will soon. It is not just one function but an entire family of functions with some interesting aspects beyond just the algorithms.

The hash functions were algorithmically optimized around a scaffolding I designed that guaranteed certain performance characteristics and easy analyzability. It has literally produced many, many thousands of high quality hash functions. Tens of thousands of CPU hours have been burned on the optimization process, which is still running, and the ones that have been put into use so far were early samples pulled out of that process but the hash functions still being processed in the pipeline are statistically more robust than the earlier versions. Much easier than trying to design them the old fashioned way. At some point soon, since the optimization is converging, I will evaluate the most promising parts of the phase space to select the strongest and most aesthetic functions and publish those into the public domain.

how does it compare to, say, tabulation as in

Mihai Patrascu, Mikkel Thorup: The Power of Simple Tabulation Hashing. J. ACM 59(3): 14 (2012) http://doi.acm.org/10.1145/2220357.2220361 (free version: http://arxiv.org/abs/1011.5200)

or also

Mihai Patrascu, Mikkel Thorup: Twisted Tabulation Hashing. SODA 2013: 209-228 http://knowledgecenter.siam.org/0236-000005/

This is exactly what I was planning on doing one of these days. After looking at CityHash I was sure it could be beat.
This is an excellent point, however, there are a few things to keep in mind: first, compilers can (and do) perform this optimization for you (ignoring details about re-associating floating-point since we’re talking about bit twiddling).

For people who know what they're doing (e.g. DarkShikari), the compiler doesn't stand a chance: http://www.scribd.com/doc/137419114/Introduction-to-AVX2-opt... (via https://news.ycombinator.com/item?id=5598010)

P.S. Scribd stinks. The important numbers are on the hidden second and third pages. Do people know that Scribd is doing this to their documents? That document is CC-BY-NC -- charging money to read pages 2 and 3 or download the original is not NC.

Found the original here: http://mailman.videolan.org/pipermail/x264-devel/attachments...

Non-trivial vectorization is an entirely different sort of optimization than re-association to extract ILP (especially in the integer domain where the vector instructions often have no exact non-vector analogue).

Yes, there are people who have the knowledge and experience to regularly beat the compiler. I do it professionally. My point is not "don't bother optimizing, let the compiler do it". My point is "this particular micro-optimization is less valuable in the real world than focused benchmarks suggest, and good compilers often do it (this one specific optimization) for you, anyway."

Was there a followup to that article? I'm pretty sure the NDA period should be up by now and I'm wondering if the author posted a more in-depth explanation and overview.
Absolutely true.

Note that micro-fusion can allow you to push the number of uops retired per cycle to 5-6 (SNB would be 3 ALU, 2 loads, Haswell could be 4 ALU, 2 loads), as the issue/retire rates are on the fused domain.

It's a far cry from RISC - like a load/store machine.

All that being said micro-fusion only works with an ALU op and load from the same instruction and you obviously have to be careful to ensure that the load applies to the appropriate operand (tricky given the non-orthogonal, frequently 2-address form of the instructions).

You also need to have loads to do. Adding spurious loads just to boost uop retirement rate would be an interesting choice. :)