|
To be precise, they're both "vectorized" in the sense that both versions use SSE vector instructions (and in fact, Clang will even generate AVX instructions for the first version if you use -mavx2). The real difference is the data dependency, which has a massive effect on the CPU's ability to pipeline the operation. For the first version with AVX I get:
$ perf stat ./a.out
[-] Took: 225634 ns.
Performance counter stats for './a.out':
247.34 msec task-clock:u # 0.998 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,009 page-faults:u # 8.122 K/sec
960,495,151 cycles:u # 3.883 GHz
2,125,347,630 instructions:u # 2.21 insn per cycle
62,572,806 branches:u # 252.982 M/sec
3,072 branch-misses:u # 0.00% of all branches
4,764,794,900 slots:u # 19.264 G/sec
2,298,312,834 topdown-retiring:u # 48.2% retiring
37,370,940 topdown-bad-spec:u # 0.8% bad speculation
186,854,701 topdown-fe-bound:u # 3.9% frontend bound
2,242,256,423 topdown-be-bound:u # 47.1% backend bound
0.247734256 seconds time elapsed
0.241338000 seconds user
0.004943000 seconds sys
For the second version with SSE and the data dependency I get:
$ perf stat ./a.out
[-] Took: 955104 ns.
Performance counter stats for './a.out':
975.30 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
2,010 page-faults:u # 2.061 K/sec
4,031,519,362 cycles:u # 4.134 GHz
3,400,341,362 instructions:u # 0.84 insn per cycle
200,073,542 branches:u # 205.140 M/sec
3,192 branch-misses:u # 0.00% of all branches
20,091,613,665 slots:u # 20.600 G/sec
3,110,995,283 topdown-retiring:u # 15.5% retiring
236,371,925 topdown-bad-spec:u # 1.2% bad speculation
236,371,925 topdown-fe-bound:u # 1.2% frontend bound
16,546,034,782 topdown-be-bound:u # 82.2% backend bound
0.975762759 seconds time elapsed
0.967603000 seconds user
0.004937000 seconds sys
As you can see, the first version gets roughly 2.6x better IPC (2.21 vs 0.84) and spends a much smaller share of its pipeline slots backend bound (47.1% vs 82.2%). |
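The code being measured isn't reproduced in this comment, so here is a hypothetical sketch of the kind of data dependency in question (the function names are made up for illustration). The first loop carries a single accumulator through every iteration, so each add has to wait for the previous one. The second splits the work across four independent accumulators, which gives an out-of-order core several chains to overlap.

```c
#include <stddef.h>

/* Hypothetical example: one accumulator, so every add depends on the
   result of the previous add (a single loop-carried dependency chain). */
float sum_single_chain(const float *x, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Hypothetical example: four accumulators that never read each other
   inside the main loop, so the four add chains are independent. */
float sum_four_chains(const float *x, size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Floating-point addition isn't associative, so without something like -ffast-math the compiler generally can't turn the first form into the second on its own, and the two can give slightly different results on arbitrary inputs. How much faster the second form runs depends on the CPU and the compiler flags; the numbers above are the only measurements here.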
No, just because it's using SSE doesn't mean it's vectorized. SSE has both scalar and vector instructions.
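A minimal sketch of that distinction using the standard SSE intrinsics from <xmmintrin.h> (the wrapper function names are made up for illustration, and this assumes an x86 target with SSE). `_mm_add_ss` adds only the lowest lane and copies the other three from its first operand, while `_mm_add_ps` adds all four lanes.

```c
#include <xmmintrin.h>

/* Scalar SSE add: only lane 0 is summed; lanes 1..3 are copied from a. */
void add_scalar_sse(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}

/* Packed SSE add: all four lanes are summed in one operation. */
void add_packed_sse(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Both use xmm registers, so seeing xmm0 in the disassembly isn't enough to tell. Looking at the mnemonic suffix is usually more informative: addss is the scalar single-precision form and addps is the packed one.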