|
|
|
|
|
by manholio
1481 days ago
|
|
Seems like you can have the cake and eat it too, by manually parallelizing the code to something like this: double A4 = A+A+A+A;
double Z = 3A+B;
double Y1 = C;
double Y2 = A+B+C;
int i;
// ... setup unroll when LEN is odd...
for(i=0; i<LEN; i++) {
data[i] = Y1;
data[++i] = Y2;
Y1 += Z;
Y2 += Z;
Z += A4;
}
Probably not entirely functional as written, but you get the idea: unroll the loop so that the data dependent paths can each be done in parallel. For the machine being considered, a 4 step unroll should achieve maximum performance, but of course, you get all the fun things that come with hard-coding the architecture in your software. |
|