by ice799 5808 days ago
building the code in the bug report as a 64bit binary and various system information: http://gist.github.com/483494

and

testing harness, scripts to build it, and to run it: http://gist.github.com/483524 -- yes i was too lazy to make a makefile.

you will still need to construct some command line fu to split the results into separate files so you can load them into whatever maths program you want.
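If you would rather skip the command line fu, here is a rough Python sketch of the same splitting step. It assumes the harness prints one line per run shaped like `test1 usecs: 1167590`; that format is a guess on my part, not something taken from the gists, so adjust the regex to whatever the harness really prints. The names `group_timings` and `summarize` are made up for illustration, and the sample standard deviation is just one reasonable choice.

```python
import re
import statistics
from collections import defaultdict

# Hypothetical line format (an assumption, not taken from the gists):
#   "<test name> usecs: <number>"
LINE_RE = re.compile(r"^\s*(\w+)\s+usecs:\s*([0-9.eE+]+)\s*$")


def group_timings(lines):
    """Collect per-run timings into one list per test name."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip anything that is not a timing line
            groups[match.group(1)].append(float(match.group(2)))
    return dict(groups)


def summarize(groups):
    """Return {test name: (mean, sample stddev)}; stddev is 0.0 for a single run."""
    summary = {}
    for name, values in groups.items():
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = (statistics.mean(values), spread)
    return summary
```

Feeding it the lines of a results file, e.g. `summarize(group_timings(open("results.txt")))` with a hypothetical `results.txt`, should give one average and stddev per test.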

1 comments

Thanks for the information.

Your microbenchmark appears to be alignment sensitive. With your assembly code on my machine (a 2.66 GHz Core 2 Quad) running for 250 tests I get:

  test1 usecs: avg 1.16759e+06 stddev 13917.6
  test2 usecs: avg 1.31382e+06 stddev 405.725

which are similar results to yours.

But if you add .align 8 right before the definition of test2 in the assembly file (i.e. make it be 8-byte aligned, just like test1), I get the following numbers:

  test1 usecs: avg 1.15972e+06 stddev 17004
  test2 usecs: avg 1.1264e+06 stddev 754.44

so the code that "doesn't use frame pointers" is actually slightly faster, as you might expect.

Additionally, if I simply modify your testcase to use 16-byte alignment, rather than 8-byte alignment, I get the following numbers:

  test1 usecs: avg 1.15895e+06 stddev 15764.7
  test2 usecs: avg 1.12657e+06 stddev 941.606

I think aligning both test functions to 8 bytes at least makes things fair, but you can see that minor changes in alignment can cause big changes in the timings.
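To put a number on "big changes", here is a small Python sketch that turns the three pairs of averages quoted above into relative differences. The averages are copied from the runs above; the helper name `relative_change` and the dictionary labels are mine.

```python
def relative_change(test1_avg, test2_avg):
    """Percent by which test2 is slower (positive) or faster (negative) than test1."""
    return (test2_avg - test1_avg) / test1_avg * 100.0


# (test1 avg, test2 avg) in usecs, copied from the three runs quoted above.
runs = {
    "test2 unaligned": (1.16759e+06, 1.31382e+06),
    "both 8-byte aligned": (1.15972e+06, 1.1264e+06),
    "both 16-byte aligned": (1.15895e+06, 1.12657e+06),
}

for label, (test1_avg, test2_avg) in runs.items():
    print(f"{label}: test2 vs test1 {relative_change(test1_avg, test2_avg):+.1f}%")
```

By this arithmetic test2 goes from roughly 12.5% slower than test1 to a little under 3% faster once both functions share the same alignment.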

You can see the assembly sources I used: http://gist.github.com/483840

FWIW, the code that uses movs rather than pushes and pops ought to be faster since (generally speaking for larger prologues and epilogues) you can execute a series of movs in parallel, whereas your pushes and pops are serialized, since they're all updating a common resource (the stack pointer). Empirical testing on benchmarks like SPEC2k has borne this out, both on x86 and x86-64. (You ought to be able to see this effect with gcc, depending on what cpu you use for the -mtune switch.) As you noted, this strategy carries a size penalty, since movs are somewhat larger than pushes and pops.

I'll also note that on my machine, with gcc saying it's:

  @nightcrawler:~$ gcc --version
  gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3

I get identical assembly when compiling the testcase from the PR with and without -fomit-frame-pointer. (I should have noted the gcc version I was using, just as you did. My bad.) Furthermore, for:

  @nightcrawler:~$ gcc-4.3 --version
  gcc-4.3 (Ubuntu 4.3.4-10ubuntu1) 4.3.4

I also get identical assembly. On one of the servers at work, with:

  @nightcrawler:~$ ssh henry7 gcc --version
  gcc (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4)

I get identical assembly. Finally, also at work, with:

  @nightcrawler:~$ ssh henry7 /usr/local/tools/gcc-4.3.3/bin/i686-pc-linux-gnu-gcc --version
  i686-pc-linux-gnu-gcc (Sourcery G++ 4.3-83) 4.3.2

which is a somewhat patched version of GCC circa 4.3.2, I get identical assembly. So with four different flavors of GCC, there's no difference on the testcase in the PR with and without -fomit-frame-pointer. I'd be willing to bet that there are no differences with 4.5.x and mainline GCC as well. It looks like Debian may just have a peculiar set of patches to its version of GCC.
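If anyone wants to repeat the comparison on their own compiler, something like the following Python sketch should do it. It is not the exact procedure I used: the function names are invented for illustration, the path you pass in is whatever you saved the testcase as, and the only compiler flags it relies on are `-S`, `-o -` and `-fomit-frame-pointer`. Any optimization flags have to be supplied by the caller, since they can change the result.

```python
import difflib
import subprocess


def compile_to_asm(source_path, extra_flags=(), compiler="gcc"):
    """Compile a C file to assembly on stdout (-S -o -) and return it as text."""
    cmd = [compiler, "-S", "-o", "-", *extra_flags, source_path]
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


def asm_diff(asm_a, asm_b):
    """Return unified-diff lines between two assembly listings; empty if identical."""
    return list(
        difflib.unified_diff(
            asm_a.splitlines(),
            asm_b.splitlines(),
            fromfile="default",
            tofile="-fomit-frame-pointer",
            lineterm="",
        )
    )


def frame_pointer_diff(source_path, base_flags=(), compiler="gcc"):
    """Diff the assembly produced with and without -fomit-frame-pointer."""
    default_asm = compile_to_asm(source_path, base_flags, compiler)
    omit_asm = compile_to_asm(
        source_path, [*base_flags, "-fomit-frame-pointer"], compiler
    )
    return asm_diff(default_asm, omit_asm)
```

An empty list from `frame_pointer_diff("testcase.c")` (with `testcase.c` being a hypothetical file name) would correspond to "identical assembly" above; passing `compiler="gcc-4.3"` or a cross-compiler path checks another toolchain.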

EDIT: formatting fixes.