|
> And the computation is not free either, though at sufficiently large sizes the memory accesses should dominate sin and tan, I think.

Looks like I was wrong about this! You might want to retry your experiments with cheaper operations than sin and tan. I wrote a little C benchmark to test this more:

#include <stdio.h>
#include <time.h>
#include <math.h>
extern void sinTanSeparate(double *a, double *b, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = tan(a[i]);
    }
    for (int i = 0; i < n; i++) {
        b[i] = sin(b[i]);
    }
}

extern void sinTanFused(double *a, double *b, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = sin(tan(a[i]));
    }
}

#define N (128 * 1024 * 1024)
#define RUNS 5

double a[N];
double b[N];

int main(void) {
    clock_t start, end;

    printf("will do %d runs over %zu MB of data\n\n",
           RUNS, sizeof a / (1024 * 1024));

    for (int i = 0; i < RUNS; i++) {
        start = clock();
        sinTanSeparate(a, b, N);
        end = clock();
        printf("separate: %f sec\n", ((double) end - start) / CLOCKS_PER_SEC);
    }

    printf("\n");

    for (int i = 0; i < RUNS; i++) {
        start = clock();
        sinTanFused(a, b, N);
        end = clock();
        printf("fused: %f sec\n", ((double) end - start) / CLOCKS_PER_SEC);
    }

    return 0;
}
Compiling this with gcc -O3 gives:

will do 5 runs over 1024 MB of data
separate: 1.461349 sec
separate: 1.020120 sec
separate: 1.019002 sec
separate: 1.019888 sec
separate: 1.018454 sec
fused: 1.014774 sec
fused: 1.014724 sec
fused: 1.013895 sec
fused: 1.016440 sec
fused: 1.013729 sec
So almost no difference, though with enough runs I think the gap would be statistically significant. Interestingly, although C is not JIT compiled, even here there is a "warmup" effect: the first run is noticeably slower. I guess these are initial page faults or something.

But if we now comment out the <math.h> include and instead use some cheap "fake" implementations of sin and tan:

// #include <math.h>
#define tan(x) (x + 1)
#define sin(x) (x + 2)
we get very different behavior:

will do 5 runs over 1024 MB of data
separate: 0.548558 sec
separate: 0.154741 sec
separate: 0.151271 sec
separate: 0.150542 sec
separate: 0.151337 sec
fused: 0.078880 sec
fused: 0.074742 sec
fused: 0.078313 sec
fused: 0.076987 sec
fused: 0.077729 sec
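Here the computation is so cheap that it's really other effects that dominate, and you get a 2x difference. Presumably that is memory traffic: the separate version makes two passes over b where the fused version makes one. |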
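One way to check the "warmup" guess would be to touch every element of both arrays before the first timed run, so that any first-touch page faults happen outside the measurement. The following is only a rough sketch under that assumption; touch_pages, time_kernel, fake_tan and fake_sin are hypothetical helper names that are not part of the benchmark above, and the kernels take a size_t length instead of int.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: store a nonzero value in every element so that each
   page of the array is actually mapped before any timing starts. */
void touch_pages(double *p, size_t n, double value) {
    for (size_t i = 0; i < n; i++) {
        p[i] = value;
    }
}

/* Cheap stand-ins for tan and sin, mirroring the macros in the post. */
double fake_tan(double x) { return x + 1; }
double fake_sin(double x) { return x + 2; }

/* Two passes: b is written in the first loop, then read and rewritten. */
void separate(const double *a, double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        b[i] = fake_tan(a[i]);
    }
    for (size_t i = 0; i < n; i++) {
        b[i] = fake_sin(b[i]);
    }
}

/* One pass: each element of a is read once and b is written once. */
void fused(const double *a, double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        b[i] = fake_sin(fake_tan(a[i]));
    }
}

/* Time a single kernel call in CPU seconds, as reported by clock(). */
double time_kernel(void (*kernel)(const double *, double *, size_t),
                   const double *a, double *b, size_t n) {
    clock_t start = clock();
    kernel(a, b, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To use it, call touch_pages on both arrays once at the top of main, then time each kernel with time_kernel as before. If the first "separate" run then lands in line with the later ones, that would support the page-fault explanation; if it stays slow, something else is going on.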