| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by prirun 1172 days ago

Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.

It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.

Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.

  [jim@mbp ~]$ sh -x x
  + cat x1.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m12.078s
  user 0m7.319s
  sys 0m4.363s
  + ./x1
  s=1808348672

  real 0m9.415s
  user 0m6.677s
  sys 0m2.685s
  + cat x2.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      x[i] = i;
      y[i] = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += x[i] + y[i];
    }
    printf("s=%d\n", s);
  }
  + cc -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m9.753s
  user 0m6.713s
  sys 0m2.967s
  + ./x2
  s=1808348672

  real 0m9.642s
  user 0m6.674s
  sys 0m2.902s
  + cat x3.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
    }
    for (i=0; i<NUM; i++) {
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x;
    }
    for (i=0; i<NUM; i++) {
      s += p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m13.844s
  user 0m11.095s
  sys 0m2.700s
  + ./x3
  s=1808348672

  real 0m13.686s
  user 0m11.038s
  sys 0m2.611s
  + cat x4.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++)
      x[i] = i;
    for (i=0; i<NUM; i++)
      y[i] = i;

    s=0;
    for (i=0; i<NUM; i++)
      s += x[i];
    for (i=0; i<NUM; i++)
      s += y[i];

    printf("s=%d\n", s);
  }
  + cc -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m13.530s
  user 0m10.851s
  sys 0m2.633s
  + ./x4
  s=1808348672

  real 0m13.489s
  user 0m10.856s
  sys 0m2.603s

1 comments

lights0123 1172 days ago

It appears that you didn't enable optimizations. That's needed for SIMD, which can only be taken advantage of with contiguously packed data.

link

prirun 1171 days ago

Results with -O are below.

x1 (AoS) vs x2 (SoA): no performance difference x3 (arrays not in structure, both arrays in loop): slower x4 (arrays not in structure, one array in loop): faster

My advice is still not to assume that SoA is always faster than AoS without benchmarking.

  + cc -O -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m11.775s
  user 0m3.540s
  sys 0m6.592s
  + ./x1
  s=1808348672

  real 0m5.427s
  user 0m2.727s
  sys 0m2.682s
  + cc -O -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m5.185s
  user 0m2.296s
  sys 0m2.872s
  + ./x2
  s=1808348672

  real 0m5.140s
  user 0m2.273s
  sys 0m2.852s
  + cc -O -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m6.423s
  user 0m3.745s
  sys 0m2.660s
  + ./x3
  s=1808348672

  real 0m6.485s
  user 0m3.741s
  sys 0m2.714s
  + cc -O -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m4.875s
  user 0m2.205s
  sys 0m2.651s
  + ./x4
  s=1808348672

  real 0m4.894s
  user 0m2.189s
  sys 0m2.684s

link

Narishma 1171 days ago

The reason you're not seeing much difference is that your struct is very small, just 16 bytes. In the cases where you're using both x and y in the loop (x1 and x2), you can fit 4 of them in a cache line and you're not wasting space since you need to use both. In the case you're only using one of the values (x3), you're wasting half a cache line and that shows in the benchmark. If you had a bigger struct and/or where you're not using all the members in the calculation, you'd see a much bigger difference in performance between SoA and AoS.

link

rpep 1171 days ago

You need to add “-march=native” to turn on SIMD at the highest level your processor will support, otherwise it will just use SSE4.1 by default on most compilers (the lowest common denominator, as all x86-64 processors have it).

link