We compared against several Frequent Directions variants, Fast Johnson–Lindenstrauss, some other hashing-based methods, and a bunch of other approximate matrix multiplication approaches. We had to omit some of them from the results section, though, because they were 10-100x worse than exact matrix products and ruined the plots. More info in Appendix E5.

As for being single-threaded, there's a simple answer and a subtle one. The simple answer is that we consider the core subroutine a single thread would run in a multithreaded context, not how to do the multithreading. These are basically orthogonal issues, since matrix multiplication is embarrassingly parallel and our method would parallelize just like any other. More details in Appendix E2.
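To make the "orthogonal issues" point concrete, here is a minimal sketch of how any single-threaded kernel gets parallelized by splitting the rows of the left matrix. It is not code from the paper: `matmul_rows` and `parallel_matmul` are hypothetical names, the kernel is a plain exact product standing in for whatever per-thread routine you like, and pure-Python threads won't actually give a speedup because of the GIL. The point is only the structure: each worker owns a disjoint block of output rows and never talks to the others.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a_rows, b):
    """Single-threaded kernel (stand-in): multiply a block of rows of A by B."""
    cols = list(zip(*b))  # columns of B, computed once per block
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def parallel_matmul(a, b, num_workers=4):
    """Split A into contiguous row blocks and run the same kernel on each.

    The blocks are independent, so no synchronization is needed beyond
    collecting the results in order.
    """
    block = max(1, -(-len(a) // num_workers))  # ceiling division
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(lambda rows: matmul_rows(rows, b), blocks))
    # Concatenate the per-block outputs back into one matrix.
    return [row for part in parts for row in part]
```

Swapping `matmul_rows` for a different per-thread kernel leaves `parallel_matmul` unchanged, which is why the per-thread subroutine can be evaluated on its own.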

The subtler answer is that we could do even better in the multithreaded context if the encoding step were fused with whatever earlier code produces the larger matrix. This follows from matrix products becoming memory-bandwidth-bound in the limit of sufficient cores, combined with our approach reducing the size of the matrix by a huge amount.
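A rough way to see why fusion would help is to count bytes moved through main memory for the larger matrix. The sketch below is a toy model under assumptions of mine, not a measurement or anything from the paper: `traffic_bytes` and all of its parameters are hypothetical, it ignores caches beyond "fused means the full matrix never hits memory", and it ignores the other operand entirely.

```python
def traffic_bytes(rows, cols, elem_bytes, encoded_bytes_per_row, fused):
    """Main-memory bytes moved for the large matrix before the product step.

    Toy model: counts only the large matrix and its encoded form.
    """
    full = rows * cols * elem_bytes
    encoded = rows * encoded_bytes_per_row
    if fused:
        # The producer encodes each row while it is still in cache or
        # registers, so only the encoded form is ever written out.
        return encoded
    # Unfused: the producer writes the full matrix, the encoder reads it
    # back, then writes the encoded form.
    return 2 * full + encoded


# Illustrative numbers only; the sizes are made up.
unfused = traffic_bytes(1000, 512, 4, 16, fused=False)
fused = traffic_bytes(1000, 512, 4, 16, fused=True)
```

When memory bandwidth is the bottleneck, time scales roughly with bytes moved, so the smaller the encoded form is relative to the full matrix, the more the fused variant wins in this model.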