| HN Mirror

You forget that if you can do the whole weight update in a single shot operation that the data doesn't have to go through the cache multiple times, and at least on the problem sizes I am working on FP bandwidth isn't what kills it, but the memory bandwidth. Back to the NN example: If you can do the matrix multiply and the application of the delta weights in a single loop iteration you get much better cache behavior.

Another thing about code generation, I am also using a hacked version of Eigen as well in a project I'm working on that can do the tanh and derivative of the tanh so the NN activations go quite abit faster since you can generate vectorized code for the whole calculation that will visit the memory location exactly once. While true the calculation of the weight updates is the most time spent, I saw 3-4x speedup in the activation code doing it in a single operation due to better memory access patterns and less loop iterations. Better memory access patterns can also have synergistic effects on other code because there is less cache pollution happening. By being fast and loose and introducing a few other copies of the matrix data in my case, my performance falls off a cliff when it no longer fits in the cpu cache nicely. 10x difference in the particular case I am remembering.

As always performance is part art, part science and perhaps it won't matter as much for the general case, but for my specific implementation and my matrix sizes Eigen has made a measurable difference for me compared to other solutions.