Hacker News new | ask | show | jobs
by kmbriedis 2320 days ago
But it can be 20x
1 comments

I actually did some benchmarking on this once upon a time[1] and it turns out that it's actually really easy to write something that is only ~5x slower[2] (until you get to matrices that don't fit in RAM):

           (4, 4, 4) (5, 5, 5) (32, 32, 32) (33, 33, 33) (256, 256, 256) (257, 257, 257) (512, 512, 512) (513, 513, 513) (1024, 1024, 1024) (1025, 1025, 1025)
  –––––––– ––––––––– ––––––––– –––––––––––– –––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– –––––––––––––––––– ––––––––––––––––––
  :naive         0.0       0.0       1.3e-5       2.0e-5          0.0114          0.0133          0.0942           0.106               3.25               2.39
  :tiled         0.0       0.0       2.7e-5       2.2e-5          0.0139          0.0121           0.154           0.101               1.25              0.888
  :fastf77       0.0       0.0       8.0e-6       8.5e-6         0.00543         0.00563          0.0426          0.0445              0.437              0.448
  :blas       4.5e-6    4.0e-6       1.9e-5       2.1e-5        0.000972         0.00109         0.00712         0.00744             0.0582             0.0607
(Units are seconds per multiplication.)

Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).

[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...

[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.

Why is (1025, 1025, 1025) so much faster than (1024, 1024, 1024)?
My guess that it's happening mostly due cache conflicts. With 1024 for a simplified L1 with 32kb you can fit exactly 8 lines of the inner dimmension in the cache, which means that (0,8,0) would have the same cache location as (0, 0, 0), which is bad for tiling