|
Thanks for writing this. It's great to see more articles this in HN. The assembly in the article looks like it was compiled without optimization, which is going to change the exact numbers quite a bit. Contrary to what it sounds like, "without optimization" really means something like "output some strange dialect of antiquated assembly that's 10 times slower than it needs to be". It's really only useful if you are debugging the compiler. In particular, it's pushing all the variables onto the stack even though it doesn't need to, so that they are there in case you need to check them manually. 99% of the time you want to be compiling with -02, -03, or -0fast. With gcc -O3 for x64, here's what they look like. The functions are shorter, faster, and in some ways easier to understand: align_1:
leal 7(%rdi), %eax
andl $-8, %eax
ret
The first integer arg always comes in %rdi/%edi. The compiler has recognized the idiom, and transformed it into (x + 7) & (0b11..11000). LEA is "load effective address", and in this case is a slightly shorter way of writing 'ADD', but no faster. align_2:
cmpl $8, %edi
movl $8, %eax
jbe .L6
.L5:
addl $8, %eax
cmpl %eax, %edi
ja .L5
rep ret
.L6:
rep ret
"If x < 8 (jbe == jump before or equal), return 8. Otherwise keep adding 8 until it's greater (jump after), and return that". The return value is whatever is in %eax on return. The 'rep ret' is a funny way of saying return, apparently done to be slightly faster on ancient AMD processors. At this point it's cruft. And there is no real reason to have two return statements --- that's the kind of wacky thing compilers like to do even when optimizing. align_3:
movl %edi, %edi
subq $8, %rsp
cvtsi2sdq %rdi, %xmm0
mulsd .LC0(%rip), %xmm0
call ceil
mulsd .LC1(%rip), %xmm0
addq $8, %rsp
cvttsd2siq %xmm0, %rax
ret
.LC0:
.long 0
.long 1069547520
.LC1:
.long 0
.long 1075838976
The first 'movl' wipes out any bits greater than 32 if they are there. This is because x was defined as an int, and not a long. In general, you might be better off always using long unless there is a specific reason not to. Then we convert the int to floating point. Floating point on modern systems uses the 128-bit XMM registers that are also used for vectors. We put it into %xmm0, because that is register for the first floating point or vector arg. Keeping the result in %xmm0, multiply by what is presumably a floating point constant equal to ALIGN and then call ceil(). The result is also returned in %xmm0. Then because multiplication is much faster than division, multiply by ALIGN and then call ceil(). Then because multiplication is much faster than division, multiply by another constant that is is approximately (and depending on the value of the alignment, this might be a significant fact) equal to 1/align. Convert that back to a 32-bit "signed integer quadword" (siq) and return it.The trickiness with optimizing (which is probably why you used the mangled unoptimized version) is that the compiler has a tendency to optimize them out completely so you end up testing an empty loop. Your choices are either work in assembly directly so what you see is what you get, or to be crafty and come up with something that prevents the compiler from doing so: a checksum, or perhaps a comparison that you know the result of but hopefully the compiler doesn't. Here are the timings in cycles on a Haswell system. I compiled with '-fno-inline', which would be faster, but might skew the results more in this case, even though function calls are fast on x64. baseline: 5.3 cycles for a loop that returns x and compares against zero. align_1: 5.0 cycles including the loop. Yes, in this case adding the two instructions of the function actually took fewer cycles than baseline loop. It's more than ILP voodoo, it's a whole way of life! align_2: 62 cycles per iteration. Horrible, but not as bad as the uncompiled mess. But it worth pointing out that while you don't want to use a loop here, there are much faster loops. Perhaps while (x % align != 0) { x++ }? This gets you down to 11 cycles, only 6 more than the baseline. Interestingly, it's this fast because the processor apparently can perfectly predict the number of loop iterations after the first 30, which is a little scary. align_3: 19 cycles per call. Faster than the silly loop, but really not a way you want to approach this problem unless you've really thought through all the floating point details. I put my code up at https://gist.github.com/nkurz/985470b01b999e67d04b.
At the bottom are more numbers provided by Likwid for things like number of instructions, number of branch errors, etc. |
I don't agree. That's trading off useful cache for almost invisible micro optimizations.
As a general rule, it's always better to use the smallest size possible.