x1 (AoS) vs x2 (SoA): no performance difference
x3 (arrays not in structure, both arrays in loop): slower
x4 (arrays not in structure, one array in loop): faster
My advice is still not to assume that SoA is always faster than AoS without benchmarking.
+ cc -O -o x1 x1.c
+ ./x1
s=1808348672
real 0m11.775s
user 0m3.540s
sys 0m6.592s
+ ./x1
s=1808348672
real 0m5.427s
user 0m2.727s
sys 0m2.682s
+ cc -O -o x2 x2.c
+ ./x2
s=1808348672
real 0m5.185s
user 0m2.296s
sys 0m2.872s
+ ./x2
s=1808348672
real 0m5.140s
user 0m2.273s
sys 0m2.852s
+ cc -O -o x3 x3.c
+ ./x3
s=1808348672
real 0m6.423s
user 0m3.745s
sys 0m2.660s
+ ./x3
s=1808348672
real 0m6.485s
user 0m3.741s
sys 0m2.714s
+ cc -O -o x4 x4.c
+ ./x4
s=1808348672
real 0m4.875s
user 0m2.205s
sys 0m2.651s
+ ./x4
s=1808348672
real 0m4.894s
user 0m2.189s
sys 0m2.684s
The reason you're not seeing much difference is that your struct is very small, just 16 bytes. In the cases where you're using both x and y in the loop (x1 and x2), you can fit 4 of them in a 64-byte cache line and you're not wasting space, since you need both members. In the case where you're only using one of the values (x3), you're wasting half of every cache line, and that shows in the benchmark. With a bigger struct, or a loop that doesn't touch all of the members, you'd see a much bigger difference in performance between SoA and AoS.
You need to add "-march=native" to turn on SIMD at the highest level your processor supports; otherwise most compilers will only use SSE2 by default (the lowest common denominator, as all x86-64 processors have it).