Any idea what optimization flags were used? It's rather strange that they're not reported. I would be surprised if GCC 4.8 was that far from optimal with -O3.
Do you say this as someone familiar with assembly and GCC? My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly, and that a 2x speedup over GCC is not uncommon.
The original author's code isn't available for this example, but I put together something I think is comparable. I may still have silly bugs, but my initial results on Sandy Bridge are something like:
icc 13.0.1 -O3 -march=native -fno-inline wrong-loop: 1.35 s
icc 13.0.1 -O3 -march=native -fno-inline right-loop: 0.78 s
icc 13.0.1 -O3 -march=native -finline-functions wrong-loop: 0.22 s
icc 13.0.1 -O3 -march=native -finline-functions right-loop: 0.22 s
gcc 4.8.0 -O3 -march=native -fno-inline wrong-loop: 1.38 s
gcc 4.8.0 -O3 -march=native -fno-inline right-loop: 1.14 s
gcc 4.8.0 -O3 -march=native -finline-functions wrong-loop: 1.35 s
gcc 4.8.0 -O3 -march=native -finline-functions right-loop: 1.14 s
There are all sorts of things I might be doing differently (or wrong), but I'm printing out a total-of-totals so I know it's at least going through the loops. It's possible that is a fast-math optimization, but I wouldn't be betting on GCC -O3 to be close to optimal.
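For anyone wanting to reproduce this, here is a rough sketch of the "total-of-totals" idea: accumulate every per-call result into one grand total and use it afterwards, so the compiler cannot throw the loops away as dead code. This is not the code behind the timings above; `N`, `data`, `sum_of_squares` and `total_of_totals` are names made up for illustration, and `clock()` is only a coarse timer.

```c
#include <assert.h>
#include <time.h>

enum { N = 1024 };            /* hypothetical problem size */

static float data[N];

/* Hypothetical kernel under test: plain scalar sum of squares. */
static float sum_of_squares(const float *src, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; i++)
        total += src[i] * src[i];
    return total;
}

/* Runs the kernel `reps` times, stores the elapsed CPU time in *seconds,
 * and returns the sum of all per-call totals. */
double total_of_totals(int reps, double *seconds)
{
    assert(reps >= 0 && seconds != 0);

    for (int i = 0; i < N; i++)
        data[i] = 1.0f;

    clock_t start = clock();
    double grand = 0.0;
    for (int r = 0; r < reps; r++)
        grand += sum_of_squares(data, N);
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    return grand;
}
```

Printing the returned value next to the timing gives a quick sanity check that both loop variants compute the same thing. It does not by itself stop the compiler from hoisting or merging repeated calls, so it is still worth glancing at the disassembly.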
I made a simple test:
void nsum(float **v, float *acc, int n, int vc)
{
    int j, i;
    for (i = 0; i < n; i++)
        for (j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
And then I tested the same function with a different declaration
void nsum(float * restrict * v, float * restrict acc, int n, int vc )
The version without the restrict qualifier ran in 1.01 s; the version with restrict ran in 0.45 s. Both were compiled with identical flags (just -O3) using the ancient gcc 4.4.5. (The vectorizer is enabled by default at -O3 even in this version.)
That's a 2x speedup from a simple change to the pointer declaration.
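To make the comparison concrete, here is a minimal sketch of the two variants side by side. The names `nsum_plain` and `nsum_restrict` are mine, chosen only so both can live in one file; the bodies are identical and only the declarations differ. It assumes a C99 compiler, since `restrict` is a C99 keyword.

```c
#include <assert.h>

/* No restrict: the compiler has to assume that a store to acc[i] might
 * modify some v[j][i], so it must be conservative about keeping the
 * running sum in a register or vectorizing. */
void nsum_plain(float **v, float *acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}

/* With restrict: the caller promises that acc and the rows of v do not
 * overlap, which gives the optimizer more freedom. */
void nsum_restrict(float * restrict * v, float * restrict acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
```

Both functions return the same results as long as the no-overlap promise actually holds; calling the restrict version with overlapping buffers is undefined behavior. How much faster it gets will depend on the compiler and flags.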
Normally I'd use restrict and float pointers, but since I was trying to repeat what the original poster did, I used fixed arrays instead. Because of this, I did not see a difference with 'restrict'. But I might be missing something, or might have messed up with the array indexing. The generated GCC optimized function is 500 instructions long, and thus difficult to scan. I put my untested test code up here: http://pastebin.com/qB0DfkXN
As you can see it stores the dst[y] on each iteration. With function definition of:
void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])
The disassembly becomes completely different. However the speed of the end result did not really change that much.
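For readers unfamiliar with the `restrict` inside the array brackets, here is a guess at what such a function might look like. Only the signature comes from this thread; the `ROWS` and `COLS` values and the loop body are assumptions for illustration, not the code from the pastebin. In a parameter list, `float dst[restrict ROWS]` is just another way of writing `float * restrict dst`.

```c
#include <assert.h>

enum { ROWS = 4, COLS = 8 };  /* hypothetical sizes */

/* Writes the sum of squares of each row of src into dst.
 * The restrict qualifiers promise that dst and src do not overlap. */
void sum_of_squares_1(float dst[restrict ROWS], float src[restrict ROWS][COLS])
{
    assert(dst && src);
    for (int y = 0; y < ROWS; y++) {
        float total = 0.0f;           /* local accumulator: one store per row */
        for (int x = 0; x < COLS; x++)
            total += src[y][x] * src[y][x];
        dst[y] = total;
    }
}
```

Using a local accumulator is one way to avoid writing `dst[y]` on every inner iteration regardless of what the compiler can prove about aliasing.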
Could you throw objdump -d of the best icc output to pastebin? I'm interested to see what kind of code it produces.
> My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly
The problem with inline assembler is that it is almost untouchable by the optimizer. By adding some inline asm, you may inhibit a lot of optimization that could give better perf overall.
For this kind of task it is often a lot better to use intrinsics (e.g. xmmintrin.h for SSE) or compiler extensions such as __attribute__((vector_size(16))). This way you can utilize the CPU features you have available while still allowing the optimizer to do high-level optimizations.
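As a rough illustration of the vector_size route, here is a small sum-of-squares sketch using the GCC/Clang vector extension. The type name `v4sf` and the function name `sum_of_squares_vec` are made up for this example, and it assumes a compiler that supports subscripting vector types (recent GCC and Clang do).

```c
#include <assert.h>
#include <string.h>

/* Four packed floats; the compiler maps this to SSE (or similar) registers. */
typedef float v4sf __attribute__((vector_size(16)));

float sum_of_squares_vec(const float *src, int n)
{
    assert(n >= 0);

    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        v4sf x;
        memcpy(&x, src + i, sizeof x);   /* load without assuming alignment */
        acc += x * x;                    /* element-wise multiply and add */
    }

    /* Horizontal sum of the four lanes. */
    float total = acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        total += src[i] * src[i];

    return total;
}
```

Note that this adds the elements in a different order than a plain scalar loop, so results can differ in the last bits for general floating-point input. The compiler is still free to inline, unroll and schedule this code, which is the advantage over an opaque inline asm block.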
While there is lots to be said for the maintainability of intrinsics, I have found inline assembly to be significantly better for performance. And this is precisely because it inhibits the compiler from blindly performing 'optimizations' in the section of code you've already optimized. This thread offers an example and some numbers: http://software.intel.com/en-us/forums/topic/480004
I was under the impression that the parts of performance-oriented programs which are typically converted to assembly are small, profiled hotspots like very tight loops. As such, I doubt that there's any real performance to be had from the high-level optimizations that intrinsics/extensions make possible for that code.
But I'm certainly no expert in this area, so take my opinion with a large grain of salt.
I get no significant difference with -Ofast for either icc or gcc. My code is still untested and quite possibly buggy, but I put it up here: http://pastebin.com/qB0DfkXN