by nkurz 4575 days ago

Do you say this as someone familiar with assembly and GCC? My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly, and that a 2x speedup over GCC is not uncommon.

The original author's code isn't available for this example, but I put together something I think is comparable. It may still have silly bugs, but my initial results on Sandy Bridge are something like:

  icc 13.0.1 -O3 -march=native -fno-inline wrong-loop: 1.35 s
  icc 13.0.1 -O3 -march=native -fno-inline right-loop: 0.78 s
  icc 13.0.1 -O3 -march=native -finline-functions wrong-loop: 0.22 s
  icc 13.0.1 -O3 -march=native -finline-functions right-loop: 0.22 s

  gcc 4.8.0  -O3 -march=native -fno-inline wrong-loop: 1.38 s
  gcc 4.8.0  -O3 -march=native -fno-inline right-loop: 1.14 s
  gcc 4.8.0  -O3 -march=native -finline-functions wrong-loop: 1.35 s
  gcc 4.8.0  -O3 -march=native -finline-functions right-loop: 1.14 s

There are all sorts of things I might be doing differently (or wrong), but I'm printing out a total-of-totals so I know it's at least going through the loops. It's possible that is a fast-math optimization, but I wouldn't be betting on GCC -O3 to be close to optimal.
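
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a timing harness that accumulates a total-of-totals so the loop results are actually consumed. This is not the code from the pastebin: the array sizes, the kernel, and the names `sum_of_squares` and `run_benchmark` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define ROWS 64    /* hypothetical sizes, not taken from the thread */
#define COLS 256

/* Hypothetical kernel standing in for the loop being measured. */
static void sum_of_squares(float dst[ROWS], float src[ROWS][COLS])
{
    for (int y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (int x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}

/* Runs the kernel `reps` times, returns the total-of-totals, and
 * writes the elapsed CPU time in seconds to *seconds. */
double run_benchmark(int reps, double *seconds)
{
    static float src[ROWS][COLS];
    static float dst[ROWS];
    double grand_total = 0.0;

    assert(reps >= 0 && seconds != NULL);

    for (int y = 0; y < ROWS; y++)
        for (int x = 0; x < COLS; x++)
            src[y][x] = 1.0f;

    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        sum_of_squares(dst, src);
        /* Fold every row total into one number so the work has a
         * visible result that can be printed by the caller. */
        for (int y = 0; y < ROWS; y++)
            grand_total += dst[y];
    }
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return grand_total;
}
```

Printing the returned total alongside the time shows that the loops ran, but it does not by itself stop an aggressive compiler from hoisting or simplifying repeated work, so checking the generated assembly is still worthwhile.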

4 comments

sharpneli 4574 days ago

Did you use restrict?

I made a simple test:

  void nsum(float **v, float *acc, int n, int vc)
  {
      int j, i;
      for (i = 0; i < n; i++)
          for (j = 0; j < vc; j++)
              acc[i] += v[j][i] * v[j][i];
  }

And then I tested the same function with a different declaration:

  void nsum(float * restrict * v, float * restrict acc, int n, int vc)

The version without the restrict qualifier ran in 1.01 s; the version with restrict ran in 0.45 s. Both were compiled with identical flags (just -O3) using the ancient gcc 4.4.5 (the vectorizer is enabled by default at -O3 even in this version).

That's a roughly 2x speedup from a simple pointer qualifier.
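
As a self-contained sketch of the comparison described above, the two declarations can sit side by side in one C99 file. The names `nsum_plain` and `nsum_restrict` are made up here so both versions can coexist; the loop body is the one from the test, and no timing is claimed for this sketch.

```c
#include <assert.h>

/* Baseline: acc may alias any row of v, so the compiler has to assume
 * each store to acc[i] could change what a later v[j][i] load returns. */
void nsum_plain(float **v, float *acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}

/* Same loop with the restrict-qualified declaration. The caller now
 * promises that acc does not overlap the rows reached through v;
 * breaking that promise is undefined behavior. */
void nsum_restrict(float *restrict *v, float *restrict acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
```

Both functions return identical results for non-overlapping inputs; only what the optimizer is allowed to assume differs, so any speedup depends on the compiler version and flags.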

nkurz 4574 days ago

Normally I'd use restrict and float pointers, but since I was trying to repeat what the original poster did, I used fixed arrays instead. Because of this, I did not see a difference with 'restrict'. But I might be missing something, or might have messed up with the array indexing. The generated GCC optimized function is 500 instructions long, and thus difficult to scan. I put my untested test code up here: http://pastebin.com/qB0DfkXN

sharpneli 4574 days ago

At least on this ancient version of gcc, restrict helps even with the fixed-size array argument.

Without it, the code of sum_of_squares_1 is as follows:

  400913:       f3 0f 11 07             movss  %xmm0,(%rdi)
  400917:       f3 0f 10 48 34          movss  0x34(%rax),%xmm1
  40091c:       f3 0f 59 c9             mulss  %xmm1,%xmm1
  400920:       f3 0f 58 c8             addss  %xmm0,%xmm1
  400924:       f3 0f 11 0f             movss  %xmm1,(%rdi)
  400928:       f3 0f 10 40 38          movss  0x38(%rax),%xmm0
  40092d:       f3 0f 59 c0             mulss  %xmm0,%xmm0
  400931:       f3 0f 58 c1             addss  %xmm1,%xmm0
  400935:       f3 0f 11 07             movss  %xmm0,(%rdi)
  400939:       f3 0f 10 48 3c          movss  0x3c(%rax),%xmm1

As you can see, it stores dst[y] on each iteration. With a function definition of:

  void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])

the disassembly becomes completely different. However, the speed of the end result did not really change that much.
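
The repeated store happens because, without restrict, the compiler must allow for dst overlapping src, so each write to dst[y] might change a later src load. A hedged sketch of one common alternative, which is not something the pastebin code is confirmed to do, is to accumulate in a local variable and store once per row. The ROWS and COLS values and both function names below are assumptions; the thread does not give them.

```c
#include <assert.h>

#define ROWS 4   /* hypothetical sizes; the thread does not state them */
#define COLS 8

/* Accumulates directly into dst[y], the pattern that produced a store
 * per iteration in the disassembly when restrict was absent. */
void sum_of_squares_store(float dst[ROWS], float src[ROWS][COLS])
{
    assert(dst != 0 && src != 0);
    for (int y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (int x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}

/* Accumulates into a local, which cannot alias src, and writes the
 * row total to dst[y] once. */
void sum_of_squares_local(float dst[ROWS], float src[ROWS][COLS])
{
    assert(dst != 0 && src != 0);
    for (int y = 0; y < ROWS; y++) {
        float total = 0.0f;
        for (int x = 0; x < COLS; x++)
            total += src[y][x] * src[y][x];
        dst[y] = total;
    }
}
```

The two versions agree whenever dst and src do not overlap. Whether the local accumulator changes the run time at all would have to be measured, since the restrict version reportedly made little difference to speed here.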

Could you throw objdump -d of the best icc output to pastebin? I'm interested to see what kind of code it produces.

nkurz 4574 days ago

icc -fno-alias -Wall -std=c99 -finline-functions -Ofast -march=native loop-optimization.c -o loop

http://pastebin.com/qjEPy6Y0

Late night here in California. Good night!

exDM69 4574 days ago

> My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly

The problem with inline assembler is that it is almost untouchable by the optimizer. By adding some inline asm, you may inhibit a lot of optimization that could give better perf overall.

For this kind of task it is often a lot better to use intrinsics (e.g. xmmintrin.h for SSE) or compiler extensions such as __attribute__((vector_size(16))). This way you can use the CPU features you have available while still allowing the optimizer to do high-level optimizations.
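
To make the vector_size suggestion concrete, here is a minimal sketch of a sum of squares using the GCC/Clang vector extension. The typedef name `v4sf` and the function `sum_of_squares_v4` are hypothetical, the code assumes a compiler that supports vector subscripting, and it is not taken from anyone's code in this thread.

```c
#include <assert.h>
#include <string.h>

/* Four packed floats (16 bytes); a GCC/Clang extension, not ISO C. */
typedef float v4sf __attribute__((vector_size(16)));

float sum_of_squares_v4(const float *x, int n)
{
    assert(x != 0 && n >= 0);

    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;

    /* Main loop: four elements per step, one partial sum per lane. */
    for (; i + 4 <= n; i += 4) {
        v4sf v;
        memcpy(&v, x + i, sizeof v);   /* load that is safe for unaligned x */
        acc += v * v;
    }

    /* Combine the four lanes, then handle the leftover elements. */
    float total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)
        total += x[i] * x[i];
    return total;
}
```

Because the operations stay ordinary C expressions, the compiler remains free to schedule, unroll, or inline them. Note that summing per lane changes the order of the floating-point additions compared with a scalar loop, so results can differ in the last bits for general inputs.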

nkurz 4574 days ago

While there is lots to be said for the maintainability of intrinsics, I have found inline assembly to be significantly better for performance. And this is precisely because it inhibits the compiler from blindly performing 'optimizations' in the section of code you've already optimized. This thread offers an example and some numbers: http://software.intel.com/en-us/forums/topic/480004

gillianseed 4574 days ago

I was under the impression that the parts of performance-oriented programs that typically get converted to assembly are small, profiled hotspots such as very tight loops. As such, I doubt there's any real performance to be gained from the high-level optimizations that intrinsics/extensions make possible for that code.

But I'm certainly no expert in this area, so take my opinion with a large grain of salt.

nkurz 4574 days ago

Jarek (the author) fixed the link and his (very clean) code is available again at: http://www.lshift.net/wp-content/uploads/2013/10/ve.c

gillianseed 4574 days ago

Could you try with -Ofast which enables -ffast-math and post the results?

nkurz 4574 days ago

I get no significant difference with -Ofast for either icc or gcc. My code is still untested and quite possibly buggy, but I put it up here: http://pastebin.com/qB0DfkXN
