Hacker News new | ask | show | jobs
by lightcatcher 2841 days ago
Deep learning mostly uses matrices with largest dimension <1000. The size of the matrix directly corresponds to the number of input and output units to to a fully connected layer.

Even a 2000x2000 matrix is relatively small as far as non O(N^3) matrix multiplication algorithms go. The faster asymptotic algorithms have larger fixed costs and do not map as nicely to current CPU/GPU architectures. Additionally, the people capable of writing high performance matrix multiplication mostly haven't spent time on the non-N^3 algorithms. I believe every common implementation (MKL, cuBLAS, Eigen, and various other BLAS's) all use the N^3 algorithm with optimizations more focused on computer architecture (cache and register blocking, limiting instruction dependencies, SIMD) than on algorithmic cleverness.

The only attempt that I know of to use a non-N^3 algorithm for practical matrix multiplication is this 2016 paper[0] called Strassens' reloaded. Figure 5 shows that their implementation of Strassen's outperforms MKL on a single core on about a (2K, 2K, 2K) matrix multiplication problem (look at upper right part of figure). However, you're rarely multiplying on a single core. Figure 6 shows that for a 10 core system, Strassen's becomes marginally faster with square matrices of size 4K, and only becomes significantly faster at size 8K. Although these results are super cool in my opinion, they're mostly not applicable to deep learning in it's current incarnation. Finally, Strassen's algorithm is one of the simplest non-N^3 matmul algorithms (known since 1969, complexity O(n^2.8)) and I believe it has much lower fixed costs than something more recent that has complexity more like O(n^2.37).

[0] Strassen's Reloaded paper: https://www.cs.utexas.edu/~jianyu/papers/sc16.pdf

1 comments

Thanks for the links! I had never actually seen it quantified for the size matrix you needed. Just "large" which is always relative. :)

And I also hadn't looked closely enough to see if amount of training data influenced matrix size. Makes sense that it would only influence a single dimension.

The typical neural net matrix multiplication is N_EXAMPLES X N_FEATURES_IN multiplied with N_FEATURES_IN X N_FEATURES_OUT.

The output feature count is completely independent of the data size, and input feature count is only dependent on the dimensionality of the data (not the number of points), and that's only in the first layer of the network. Even with datasets with huge number of examples, the net usually only trains on a small "minibatch" of examples at a time, typically somewhere between 16 and 1024. This minibatch size is the algorithmic N_EXAMPLES. Given these numbers, the typical neural net matrix multiplication is probably something like (32, 256) x (256, 128). This is not nearly large enough for non-N^3 tmatmul algorithms o accelerate things.