by sharpneli
4574 days ago
At least on this ancient version of gcc, restrict helps even with the fixed-size array argument. Without it, the code of sum_of_squares_1 is as follows:
400913: f3 0f 11 07 movss %xmm0,(%rdi)
400917: f3 0f 10 48 34 movss 0x34(%rax),%xmm1
40091c: f3 0f 59 c9 mulss %xmm1,%xmm1
400920: f3 0f 58 c8 addss %xmm0,%xmm1
400924: f3 0f 11 0f movss %xmm1,(%rdi)
400928: f3 0f 10 40 38 movss 0x38(%rax),%xmm0
40092d: f3 0f 59 c0 mulss %xmm0,%xmm0
400931: f3 0f 58 c1 addss %xmm1,%xmm0
400935: f3 0f 11 07 movss %xmm0,(%rdi)
400939: f3 0f 10 48 3c movss 0x3c(%rax),%xmm1
As you can see, it stores dst[y] back to memory on every iteration. With the function defined as:
void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])
The disassembly becomes completely different. However, the speed of the end result did not really change that much.

Could you throw the objdump -d of the best icc output on pastebin? I'm interested to see what kind of code it produces.
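For anyone who wants to reproduce the comparison, here is a minimal sketch of the two variants. The function bodies, the names sum_of_squares_plain and sum_of_squares_restrict, and the ROWS/COLS values are my assumptions, reconstructed from what the disassembly appears to do (accumulate squares into dst[y]); the original source may differ.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
#define ROWS 4
#define COLS 4

/* Without restrict: dst and src may alias, so the compiler may have to
   write dst[y] back to memory after every addition. */
void sum_of_squares_plain(float dst[ROWS], float src[ROWS][COLS])
{
    for (size_t y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (size_t x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}

/* With restrict (C99 array-parameter syntax): the caller promises that
   dst and src do not overlap, so the running sum can stay in a register. */
void sum_of_squares_restrict(float dst[restrict ROWS],
                             float src[restrict ROWS][COLS])
{
    for (size_t y = 0; y < ROWS; y++) {
        dst[y] = 0.0f;
        for (size_t x = 0; x < COLS; x++)
            dst[y] += src[y][x] * src[y][x];
    }
}
```

Compiling both with optimization (e.g. gcc -std=c99 -O2) and comparing the objdump -d output should show whether your compiler version keeps the per-iteration store in the non-restrict variant; results will vary by compiler and version.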
http://pastebin.com/qjEPy6Y0
Late night here in California. Good night!