I did some benchmarking on this a while back[1], and it turns out to be really easy to write something that is only ~5x slower than BLAS[2] (until you get to matrices that don't fit in RAM):

           (4, 4, 4) (5, 5, 5) (32, 32, 32) (33, 33, 33) (256, 256, 256) (257, 257, 257) (512, 512, 512) (513, 513, 513) (1024, 1024, 1024) (1025, 1025, 1025)
  –––––––– ––––––––– ––––––––– –––––––––––– –––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– –––––––––––––––––– ––––––––––––––––––
  :naive         0.0       0.0       1.3e-5       2.0e-5          0.0114          0.0133          0.0942           0.106               3.25               2.39
  :tiled         0.0       0.0       2.7e-5       2.2e-5          0.0139          0.0121           0.154           0.101               1.25              0.888
  :fastf77       0.0       0.0       8.0e-6       8.5e-6         0.00543         0.00563          0.0426          0.0445              0.437              0.448
  :blas       4.5e-6    4.0e-6       1.9e-5       2.1e-5        0.000972         0.00109         0.00712         0.00744             0.0582             0.0607

(Units are seconds per multiplication.)
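
For anyone unfamiliar with the row labels, here is a rough sketch of what "naive" and "tiled" multiplication usually mean. This is not the code from the gist[1], which may well differ. The function names and the `tile` parameter are made up for illustration, and Python is used only to show the loop structure, so don't expect timings anything like the table.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_tiled(a, b, tile=32):
    """Same arithmetic, but visited one tile x tile block at a time.

    In a compiled language the point is to keep the working block in cache;
    the tile size of 32 here is an arbitrary placeholder, not a tuned value.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # min(...) handles sizes that are not a multiple of the tile.
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]
                        row_b = b[k]
                        for j in range(j0, min(j0 + tile, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

Both functions do exactly the same multiply-adds; only the order in which memory is touched changes.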

Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).
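
To put numbers on "within an order of magnitude", you can divide each row of the table by the :blas row. A quick sketch using only the power-of-two columns from 256 up, with the figures copied from the table above (the `times` dict and `slowdown` helper are just names picked for this example):

```python
# Seconds per multiplication, copied from the table above.
times = {
    "naive":   {256: 0.0114,   512: 0.0942,  1024: 3.25},
    "tiled":   {256: 0.0139,   512: 0.154,   1024: 1.25},
    "fastf77": {256: 0.00543,  512: 0.0426,  1024: 0.437},
    "blas":    {256: 0.000972, 512: 0.00712, 1024: 0.0582},
}


def slowdown(method, size):
    """How many times slower than BLAS a method is at a given size."""
    return times[method][size] / times["blas"][size]


for method in ("naive", "tiled", "fastf77"):
    print(method, [round(slowdown(method, n), 1) for n in (256, 512, 1024)])
```

On these figures :fastf77 comes out at roughly 5.6x to 7.5x slower than :blas, while :naive is over 50x slower at 1024.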

[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...

[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.