|
I did some benchmarking on this once upon a time[1] and it turns out that it's actually really easy to write something that is only ~5x slower than BLAS[2] (until you get to matrices that don't fit in RAM):

          (4, 4, 4)  (5, 5, 5)  (32, 32, 32)  (33, 33, 33)  (256, 256, 256)  (257, 257, 257)  (512, 512, 512)  (513, 513, 513)  (1024, 1024, 1024)  (1025, 1025, 1025)
--------  ---------  ---------  ------------  ------------  ---------------  ---------------  ---------------  ---------------  ------------------  ------------------
:naive    0.0        0.0        1.3e-5        2.0e-5        0.0114           0.0133           0.0942           0.106            3.25                2.39
:tiled    0.0        0.0        2.7e-5        2.2e-5        0.0139           0.0121           0.154            0.101            1.25                0.888
:fastf77  0.0        0.0        8.0e-6        8.5e-6        0.00543          0.00563          0.0426           0.0445           0.437               0.448
:blas     4.5e-6     4.0e-6     1.9e-5        2.1e-5        0.000972         0.00109          0.00712          0.00744          0.0582              0.0607
(Units are seconds per multiplication.)

Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).

[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...

[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.
|
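
For readers unsure what the :naive and :tiled rows are likely to mean, here is a minimal sketch in Python. It is not the code from the gist: the function names `matmul_naive` and `matmul_tiled`, the list-of-lists layout, and the default tile size of 32 are all assumptions made for illustration, and pure Python will be far slower than any timing in the table. It only shows the difference in loop structure between the two approaches.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            total = 0.0
            for p in range(k):
                total += a[i][p] * b[p][j]
            c[i][j] = total
    return c


def matmul_tiled(a, b, tile=32):
    """Same arithmetic, visited in tile x tile blocks (tile size is an assumption)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Outer loops walk over blocks; min() handles sizes that are not a
    # multiple of the tile size (e.g. 33 or 257).
    for i0 in range(0, m, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, m)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ip * row_b[j]
    return c


# Small sanity check on a 2x3 by 3x2 product, with a tile size that
# does not divide the dimensions evenly.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
assert matmul_naive(a, b) == matmul_tiled(a, b, tile=2)
```

The idea behind tiling is that each block of the operands gets reused while it is still in cache, which would be consistent with :tiled only pulling ahead of :naive at the largest sizes in the table.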