It looks like you dropped the 'f' when creating the result array while converting from my benchmarking code: https://github.com/treo/benchmarking_nd4j/blob/master/src/ma...
The difference is rather large with the newer versions of ND4J. The following gists do not contain the measurements I took for Neanderthal, but they do contain the numbers I got for ND4J.
Without f ordering: https://gist.github.com/treo/1fab39f213da26255cf4f75e383ff90...
With f ordering: https://gist.github.com/treo/94fe92c9417b5c8b24baa12924a35b0...
As you can see, something happened between the 0.4 release (I took that as the comparison point, since that was when I last ran my own benchmarks) and the 0.9.1 release that introduced additional overhead. Originally I planned to publish my own write-up on this, but I first wanted to find out what happened there. Given that ND4J is mainly used inside DL4J, where the matrices are usually rather large, the overhead I've observed for tiny multiplications isn't necessarily that bad, as the newer version performs much better on larger matrices.
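For readers unsure what the 'f' refers to: it selects column-major (Fortran-style) memory layout, as opposed to 'c' for row-major. Here is a minimal plain-Java sketch of the two layouts. It does not use ND4J at all; the class and method names (`OrderingSketch`, `offsetC`, `offsetF`, `flatten`) are made up for illustration. BLAS routines conventionally work on column-major data, so one plausible explanation for the overhead, which I have not confirmed, is that a result array in 'c' order forces extra conversion work.

```java
import java.util.Arrays;

// Hypothetical illustration of 'c' vs 'f' ordering; not ND4J code.
public class OrderingSketch {

    // Row-major ('c'): consecutive elements of a row are adjacent in memory.
    static int offsetC(int row, int col, int rows, int cols) {
        return row * cols + col;
    }

    // Column-major ('f'): consecutive elements of a column are adjacent in memory.
    static int offsetF(int row, int col, int rows, int cols) {
        return col * rows + row;
    }

    // Copy a 2D matrix into a flat buffer using the requested ordering.
    static double[] flatten(double[][] m, char order) {
        int rows = m.length;
        int cols = m[0].length;
        double[] out = new double[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int idx = (order == 'f')
                        ? offsetF(r, c, rows, cols)
                        : offsetC(r, c, rows, cols);
                out[idx] = m[r][c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2, 3}, {4, 5, 6}};
        System.out.println(Arrays.toString(flatten(m, 'c'))); // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
        System.out.println(Arrays.toString(flatten(m, 'f'))); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
    }
}
```

The logical matrix is the same in both cases; only the position of each element in the underlying buffer changes, which is why the ordering can matter for speed without changing results.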
Although, in my defense, the option in question is very poorly documented. I found the ND4J tutorial page where it's mentioned, and even after re-reading the sentence multiple times, I still can't connect its description to what it actually seems to do. The page also doesn't mention that it affects computation speed.
Anyway, I'm looking forward to reading your detailed analysis, and especially seeing your Neanderthal numbers.