by sharpneli 4575 days ago
Did you use restrict?

I made a simple test:

  void nsum(float **v, float *acc, int n, int vc)
  {
      int j, i;
      for (i = 0; i < n; i++)
          for (j = 0; j < vc; j++)
              acc[i] += v[j][i] * v[j][i];
  }

And then I tested the same function with a different declaration:

  void nsum(float * restrict * v, float * restrict acc, int n, int vc)

The version without the restrict qualifier had a 1.01s runtime. The version with restrict had a 0.45s runtime. Both were compiled with identical flags (just -O3) using the ancient gcc 4.4.5. (The vectorizer is enabled by default at -O3 even in this version.)

That's a better than 2x speedup from a simple change to the pointer declarations.
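A minimal sketch of the two variants side by side, for anyone who wants to reproduce this. The names nsum_plain and nsum_restrict are hypothetical, and the assert is only an added sanity check. The usual explanation is that without restrict the compiler has to assume a store to acc[i] might modify some v[j][i], which limits how far it can keep acc[i] in a register or vectorize the loop; restrict promises that no such overlap exists. Whether a particular compiler actually exploits the promise is not guaranteed, so measure before relying on it.

```c
#include <assert.h>

/* Plain version: acc and the rows of v are both float pointers, so the
   compiler must allow for the possibility that they overlap. */
void nsum_plain(float **v, float *acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);  /* added sanity check, not in the original */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}

/* Same loop, but the declaration promises that acc and the rows of v
   do not overlap. The body is unchanged. */
void nsum_restrict(float *restrict *v, float *restrict acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);  /* added sanity check, not in the original */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
```

Both functions compute the same result for non-overlapping buffers. Passing an acc that overlaps a row of v to the restrict version breaks the promise and is undefined behavior, so only the plain version is safe for aliased buffers.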


Normally I'd use restrict and float pointers, but since I was trying to repeat what the original poster did, I used fixed arrays instead. Because of this, I did not see a difference with 'restrict'. But I might be missing something, or might have messed up with the array indexing. The generated GCC optimized function is 500 instructions long, and thus difficult to scan. I put my untested test code up here: http://pastebin.com/qB0DfkXN
At least on this ancient version of gcc, restrict helps even with the fixed-size array argument.

Without it, the code of sum_of_squares_1 is as follows:

  400913:       f3 0f 11 07             movss  %xmm0,(%rdi)
  400917:       f3 0f 10 48 34          movss  0x34(%rax),%xmm1
  40091c:       f3 0f 59 c9             mulss  %xmm1,%xmm1
  400920:       f3 0f 58 c8             addss  %xmm0,%xmm1
  400924:       f3 0f 11 0f             movss  %xmm1,(%rdi)
  400928:       f3 0f 10 40 38          movss  0x38(%rax),%xmm0
  40092d:       f3 0f 59 c0             mulss  %xmm0,%xmm0
  400931:       f3 0f 58 c1             addss  %xmm1,%xmm0
  400935:       f3 0f 11 07             movss  %xmm0,(%rdi)
  400939:       f3 0f 10 48 3c          movss  0x3c(%rax),%xmm1
As you can see, it stores dst[y] back to memory on every iteration. With a function definition of:

  void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])

the disassembly becomes completely different. However, the speed of the end result did not really change that much.
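To make the array-parameter form of restrict concrete, here is a rough sketch. The loop body is an assumption about what a sum-of-squares over rows looks like, not the code from the pastebin, and the ROWS/COLS values and both function names are hypothetical. The second function shows a different technique, a local accumulator, which asks for only one store to dst[y] per row without relying on restrict at all.

```c
#include <assert.h>

enum { ROWS = 4, COLS = 8 };  /* hypothetical sizes, not from the thread */

/* restrict inside the outermost [] qualifies the pointer the array
   parameter decays to: dst is float *restrict, src is
   float (*restrict)[COLS]. */
void sum_of_squares_restrict(float dst[restrict ROWS],
                             float src[restrict ROWS][COLS])
{
    assert(dst && src);  /* added sanity check */
    for (int y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (int x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}

/* Alternative without restrict: accumulate in a local variable and
   write dst[y] once per row. */
void sum_of_squares_local(float dst[ROWS], float src[ROWS][COLS])
{
    assert(dst && src);  /* added sanity check */
    for (int y = 0; y < ROWS; y++) {
        float sum = 0.0f;
        for (int x = 0; x < COLS; x++)
            sum += src[y][x] * src[y][x];
        dst[y] = sum;
    }
}
```

Comparing objdump -d output for the two is a quick way to check whether the per-iteration store to dst[y] is still there on a given compiler.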

Could you throw objdump -d of the best icc output to pastebin? I'm interested to see what kind of code it produces.

icc -fno-alias -Wall -std=c99 -finline-functions -Ofast -march=native loop-optimization.c -o loop

http://pastebin.com/qjEPy6Y0

Late night here in California. Good night!