by sharpneli 4574 days ago
At least on this ancient version of gcc, restrict helps even with the fixed-size array argument.

Without it, the code for sum_of_squares_1 is as follows:

  400913:       f3 0f 11 07             movss  %xmm0,(%rdi)
  400917:       f3 0f 10 48 34          movss  0x34(%rax),%xmm1
  40091c:       f3 0f 59 c9             mulss  %xmm1,%xmm1
  400920:       f3 0f 58 c8             addss  %xmm0,%xmm1
  400924:       f3 0f 11 0f             movss  %xmm1,(%rdi)
  400928:       f3 0f 10 40 38          movss  0x38(%rax),%xmm0
  40092d:       f3 0f 59 c0             mulss  %xmm0,%xmm0
  400931:       f3 0f 58 c1             addss  %xmm1,%xmm0
  400935:       f3 0f 11 07             movss  %xmm0,(%rdi)
  400939:       f3 0f 10 48 3c          movss  0x3c(%rax),%xmm1

As you can see, it stores dst[y] on every iteration. With the function defined as

  void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])

the disassembly becomes completely different. However, the speed of the end result did not change much.
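
For anyone who wants to reproduce the comparison, here is a minimal sketch of the two signatures side by side. The loop body, the ROWS and COLS values, and the names sum_of_squares_plain and sum_of_squares_restrict are my assumptions for illustration; only the restrict parameter syntax is taken from the definition quoted above, and the real benchmark may differ.

```c
#define ROWS 4   /* assumed size, not taken from the original benchmark */
#define COLS 3   /* assumed size, not taken from the original benchmark */

/* No restrict: the compiler has to assume dst may overlap src, so it
   typically writes dst[y] back to memory after every addition. */
void sum_of_squares_plain(float dst[ROWS], float src[ROWS][COLS])
{
    for (int y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (int x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}

/* With restrict (C99): the caller promises dst and src do not overlap,
   so the compiler is allowed to keep the running sum in a register and
   store it once per row. Whether it does depends on compiler and flags. */
void sum_of_squares_restrict(float dst[restrict ROWS],
                             float src[restrict ROWS][COLS])
{
    for (int y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (int x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}
```

Both functions compute the same result when the arrays do not overlap; compiling with -std=c99 and diffing the objdump -d output of the two is one way to see whether the per-iteration store goes away.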

Could you throw objdump -d of the best icc output to pastebin? I'm interested to see what kind of code it produces.

1 comments

icc -fno-alias -Wall -std=c99 -finline-functions -Ofast -march=native loop-optimization.c -o loop

http://pastebin.com/qjEPy6Y0

Late night here in California. Good night!