by nkurz
4575 days ago
Do you say this as someone familiar with assembly and GCC? My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly, and that a 2x speedup over GCC is not uncommon. The original author's code isn't available for this example, but I put together something I think is comparable. I may still have silly bugs, but my initial results on Sandy Bridge are something like:
icc 13.0.1 -O3 -march=native -fno-inline wrong-loop: 1.35 s
icc 13.0.1 -O3 -march=native -fno-inline right-loop: 0.78 s
icc 13.0.1 -O3 -march=native -finline-functions wrong-loop: 0.22 s
icc 13.0.1 -O3 -march=native -finline-functions right-loop: 0.22 s
gcc 4.8.0 -O3 -march=native -fno-inline wrong-loop: 1.38 s
gcc 4.8.0 -O3 -march=native -fno-inline right-loop: 1.14 s
gcc 4.8.0 -O3 -march=native -finline-functions wrong-loop: 1.35 s
gcc 4.8.0 -O3 -march=native -finline-functions right-loop: 1.14 s
There are all sorts of things I might be doing differently (or wrong), but I'm printing out a total-of-totals so I know it's at least going through the loops. It's possible that is a fast-math optimization, but I wouldn't be betting on GCC -O3 to be close to optimal.
|
I made a simple test:

    void nsum(float **v, float *acc, int n, int vc)
    {
        int j, i;
        for (i = 0; i < n; i++)
            for (j = 0; j < vc; j++)
                acc[i] += v[j][i] * v[j][i];
    }

And then I tested the same function with a different declaration:

    void nsum(float * restrict * v, float * restrict acc, int n, int vc)
The version without the restrict qualifier had a 1.01 s runtime. The version with restrict had a 0.45 s runtime. Both were compiled with identical flags (just -O3) using the ancient gcc 4.4.5. (The vectorizer is enabled by default at -O3 even in this version.)
That's a 2x speedup from a simple change to the pointer declarations.
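For anyone who wants to try this themselves, here is a minimal sketch of the two variants side by side. The names nsum_plain and nsum_restrict are mine, made up so both can live in one file; the bodies are the same loop as above, and only the declarations differ. It assumes a C99 (or later) compiler, since restrict is a C99 keyword, and it assumes the caller really does pass non-overlapping buffers, which is the promise restrict makes.

```c
#include <assert.h>

/* Baseline: without restrict the compiler has to assume that a store to
 * acc[i] might modify one of the v[j] rows (or the row pointers), so it
 * is limited in how far it can keep values in registers or vectorize. */
void nsum_plain(float **v, float *acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}

/* Same loop, same declaration style as in the comment above. The
 * restrict qualifiers are a promise from the caller that acc does not
 * overlap the data reached through v. Breaking that promise is
 * undefined behavior, so this is only safe for distinct buffers. */
void nsum_restrict(float * restrict * v, float * restrict acc, int n, int vc)
{
    assert(n >= 0 && vc >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < vc; j++)
            acc[i] += v[j][i] * v[j][i];
}
```

Both functions compute the same sums when the buffers do not overlap, so any timing difference comes from what the compiler is allowed to assume. How large that difference is will depend on the compiler version, the flags, and the array sizes; the numbers quoted above are from gcc 4.4.5 at -O3 and may not carry over elsewhere.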