Hacker News new | ask | show | jobs
by nkurz 4295 days ago
Thanks for writing this. It's great to see more articles this in HN.

The assembly in the article looks like it was compiled without optimization, which is going to change the exact numbers quite a bit. Contrary to what it sounds like, "without optimization" really means something like "output some strange dialect of antiquated assembly that's 10 times slower than it needs to be". It's really only useful if you are debugging the compiler. In particular, it's pushing all the variables onto the stack even though it doesn't need to, so that they are there in case you need to check them manually. 99% of the time you want to be compiling with -02, -03, or -0fast.

With gcc -O3 for x64, here's what they look like. The functions are shorter, faster, and in some ways easier to understand:

  align_1:
        leal    7(%rdi), %eax
        andl    $-8, %eax
        ret
The first integer arg always comes in %rdi/%edi. The compiler has recognized the idiom, and transformed it into (x + 7) & (0b11..11000). LEA is "load effective address", and in this case is a slightly shorter way of writing 'ADD', but no faster.

  align_2:
                cmpl    $8, %edi
                movl    $8, %eax
                jbe     .L6
        .L5:
                addl    $8, %eax
                cmpl    %eax, %edi
                ja      .L5
                rep ret
        .L6:
                rep ret
"If x < 8 (jbe == jump before or equal), return 8. Otherwise keep adding 8 until it's greater (jump after), and return that". The return value is whatever is in %eax on return. The 'rep ret' is a funny way of saying return, apparently done to be slightly faster on ancient AMD processors. At this point it's cruft. And there is no real reason to have two return statements --- that's the kind of wacky thing compilers like to do even when optimizing.

  align_3:
                movl    %edi, %edi
                subq    $8, %rsp
                cvtsi2sdq       %rdi, %xmm0
                mulsd   .LC0(%rip), %xmm0
                call    ceil
                mulsd   .LC1(%rip), %xmm0
                addq    $8, %rsp
                cvttsd2siq      %xmm0, %rax
                ret
        .LC0:
                .long   0
                .long   1069547520
        .LC1:
                .long   0
                .long   1075838976
The first 'movl' wipes out any bits greater than 32 if they are there. This is because x was defined as an int, and not a long. In general, you might be better off always using long unless there is a specific reason not to. Then we convert the int to floating point. Floating point on modern systems uses the 128-bit XMM registers that are also used for vectors. We put it into %xmm0, because that is register for the first floating point or vector arg. Keeping the result in %xmm0, multiply by what is presumably a floating point constant equal to ALIGN and then call ceil(). The result is also returned in %xmm0. Then because multiplication is much faster than division, multiply by ALIGN and then call ceil(). Then because multiplication is much faster than division, multiply by another constant that is is approximately (and depending on the value of the alignment, this might be a significant fact) equal to 1/align. Convert that back to a 32-bit "signed integer quadword" (siq) and return it.

The trickiness with optimizing (which is probably why you used the mangled unoptimized version) is that the compiler has a tendency to optimize them out completely so you end up testing an empty loop. Your choices are either work in assembly directly so what you see is what you get, or to be crafty and come up with something that prevents the compiler from doing so: a checksum, or perhaps a comparison that you know the result of but hopefully the compiler doesn't.

Here are the timings in cycles on a Haswell system. I compiled with '-fno-inline', which would be faster, but might skew the results more in this case, even though function calls are fast on x64.

baseline: 5.3 cycles for a loop that returns x and compares against zero.

align_1: 5.0 cycles including the loop. Yes, in this case adding the two instructions of the function actually took fewer cycles than baseline loop. It's more than ILP voodoo, it's a whole way of life!

align_2: 62 cycles per iteration. Horrible, but not as bad as the uncompiled mess. But it worth pointing out that while you don't want to use a loop here, there are much faster loops. Perhaps while (x % align != 0) { x++ }? This gets you down to 11 cycles, only 6 more than the baseline. Interestingly, it's this fast because the processor apparently can perfectly predict the number of loop iterations after the first 30, which is a little scary.

align_3: 19 cycles per call. Faster than the silly loop, but really not a way you want to approach this problem unless you've really thought through all the floating point details.

I put my code up at https://gist.github.com/nkurz/985470b01b999e67d04b. At the bottom are more numbers provided by Likwid for things like number of instructions, number of branch errors, etc.

1 comments

>In general, you might be better off always using long unless there is a specific reason not to

I don't agree. That's trading off useful cache for almost invisible micro optimizations.

As a general rule, it's always better to use the smallest size possible.

I phrase it as "might be" since I realize it's controversial, but I think that rule is outdated. Yes, having a large static array would be a fine specific reason to use the smallest size possible. But what would be the benefit when you are dealing with the argument to a function in an example like this?

Most of the time the argument starts in a register, is passed in a register, and returned in a register. Using a smaller size often just means that the compiler adds some unneeded conversions as in this case. Usually this doesn't matter, but when it does, the benefit is almost always in favor of the simpler rule of always using 64-bit variables.

One argument? Sure, probably no difference. When you start using longs as local variables and arguments, even if they're all in registers in one function, if you call other functions inside they are going to be pushed on the stack. It all adds up and suddenly you're getting L1 cache misses.

Anyway, unnecessary conversions mostly go away when you use link time optimizations (fwhole-program or flto in gcc).

A reasonable argument, although not one I seem to run up against, perhaps because I'm rarely concerned about high performance when writing functions with that many layers of subcalls.

By contrast, when trying to optimize inner loops, I frequently encounter cases where the front-end limitation of 4 micro-ops per cycle is a limiting factor, and getting rid of any extraneous instruction is a speedup. And rather than worrying about a deep stack causing L1 data misses, I'm more concerned with missing L1 instruction cache, or with the extra micro-ops causing me to miss the ~1000 slot decoded micro-op cache.

These concerns are clearly at opposite ends of the performance spectrum, and which should dominate probably depends on the problem at hand.

(I glanced at your comment history. Welcome to HN! You have good insights. Please stick around.)