by carterschonwald
4614 days ago
Actually, in the specific example I'm thinking about, I'm talking about memory locality being the performance difference (in this case, array layout for matrix multiplication). The naive, obvious "dot product" matrix multiply of two row-major matrices is 100-1000x slower than somewhat fancier layouts, and even simply transposing the right-hand matrix can make a significant difference, let alone fancier things. Often the biggest throughput bottleneck for CPU-bound algorithms in a numerical setting is the quality of the memory locality (because the CPU can chew through data faster than you can feed it). It's actually really hard to get C / C++ to help you write code with suitably fancy layouts that are easy to use. Amusingly, I also think most auto-vectorization approaches to SIMD miss the best way to use SIMD registers! I've actually got some cute small matrix kernels where, by using the AVX SIMD registers as an "L0" cache, I get a 1.5x perf boost!
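To make the layout point concrete, here's a rough sketch in C of the two simplest variants mentioned above: the naive row-major "dot product" multiply, and the same thing with the right-hand matrix transposed first. The function names, the square `n x n` shape, and the caller-supplied scratch buffer `bt` are all just assumptions for illustration, not code from any real kernel, and how much the second version helps depends heavily on matrix size and hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Naive "dot product" multiply of two n x n row-major matrices: c = a * b.
   The inner loop reads b with stride n, so consecutive reads of b are
   n doubles apart in memory. */
void matmul_naive(size_t n, const double *a, const double *b, double *c) {
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Same product, but b is first copied into bt in transposed order.
   bt is hypothetical caller-provided scratch space holding n * n doubles.
   After the copy, both inner-loop reads walk memory with unit stride. */
void matmul_transposed(size_t n, const double *a, const double *b,
                       double *bt, double *c) {
    assert(a && b && bt && c);
    for (size_t k = 0; k < n; k++) {
        for (size_t j = 0; j < n; j++) {
            bt[j * n + k] = b[k * n + j];
        }
    }
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Both functions add up the same terms in the same order, so they should give identical results; the only thing that changes is the order in which memory gets touched. The fancier layouts (blocking/tiling and so on) take the same idea further but aren't shown here.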
|
Still, I don't see the connection to Haskell. Can you elaborate?