by mainland
4815 days ago
First author here. The dot product example was compiled with GCC 4.7.2 -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption of Figure 5. What compiler options would you have suggested? The Haskell version only used SSE instructions, not AVX; this should have been made clear in the paper. The more complex examples are in Section 5.2; see Figure 8. Granted, we would have liked to have done more, but deadlines are deadlines...
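For anyone who wants to try those flags, here is a minimal sketch of the kind of scalar loop they would be applied to. It is not the code from the paper; the function name dot and the use of restrict are my own assumptions. With -ffast-math GCC is allowed to reassociate the floating-point sum, which is what lets it vectorize the reduction, and -O3 already turns on -ftree-vectorize.

```c
#include <assert.h>
#include <stddef.h>

/* Plain scalar dot product; vectorization is left entirely to the compiler.
 * Hypothetical stand-in for the benchmark kernel, not the paper's source.
 * Build (assumed): gcc -std=c99 -O3 -msse4.2 -ffast-math -funroll-loops -c dot.c
 */
float dot(const float *restrict a, const float *restrict b, size_t n)
{
    assert(a != NULL && b != NULL);

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];   /* a reduction: needs -ffast-math to be vectorized */
    return sum;
}
```

Compiling with -S and looking for mulps/addps in the loop body is a quick way to check whether the loop was actually vectorized on a given GCC version.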
|
There are some other slight CPU-related inaccuracies in the paper. Prefetching is mentioned repeatedly, even though its effect is negligible with a perfectly linear memory access pattern, which the hardware prefetcher already handles well; unaligned loads are described as a performance hit, but they are essentially free on the test processor (Core i7-2600K, Sandy Bridge).
Matrix multiplication would perhaps be a better example to show the power of clever prefetching.
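To make the prefetching point concrete, here is a rough sketch of what explicit software prefetching looks like on a linear dot product, using GCC's __builtin_prefetch. The function name dot_prefetch and the distance PREFETCH_AHEAD are made up for illustration and are not tuned values. The claim above is that on a purely sequential walk like this one the hints buy next to nothing; timing this against the plain loop on your own machine is the way to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in elements; an untuned, illustrative value. */
#define PREFETCH_AHEAD 64

/* Dot product with explicit prefetch hints (GCC/Clang builtin).
 * The access pattern is perfectly linear, so the hints are expected
 * to be redundant with what the hardware prefetcher does on its own.
 */
float dot_prefetch(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Only hint addresses that are still inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3); /* read, keep in cache */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

The hints never change the result, only (potentially) the timing, which is why a blocked matrix multiplication with a non-trivial access pattern would be a more telling test case.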