|
|
|
|
|
by lightcatcher
2841 days ago
|
|
Deep learning mostly uses matrices with largest dimension <1000. The size of the matrix directly corresponds to the number of input and output units to to a fully connected layer. Even a 2000x2000 matrix is relatively small as far as non O(N^3) matrix multiplication algorithms go. The faster asymptotic algorithms have larger fixed costs and do not map as nicely to current CPU/GPU architectures. Additionally, the people capable of writing high performance matrix multiplication mostly haven't spent time on the non-N^3 algorithms. I believe every common implementation (MKL, cuBLAS, Eigen, and various other BLAS's) all use the N^3 algorithm with optimizations more focused on computer architecture (cache and register blocking, limiting instruction dependencies, SIMD) than on algorithmic cleverness. The only attempt that I know of to use a non-N^3 algorithm for practical matrix multiplication is this 2016 paper[0] called Strassens' reloaded. Figure 5 shows that their implementation of Strassen's outperforms MKL on a single core on about a (2K, 2K, 2K) matrix multiplication problem (look at upper right part of figure). However, you're rarely multiplying on a single core. Figure 6 shows that for a 10 core system, Strassen's becomes marginally faster with square matrices of size 4K, and only becomes significantly faster at size 8K. Although these results are super cool in my opinion, they're mostly not applicable to deep learning in it's current incarnation.
Finally, Strassen's algorithm is one of the simplest non-N^3 matmul algorithms (known since 1969, complexity O(n^2.8)) and I believe it has much lower fixed costs than something more recent that has complexity more like O(n^2.37). [0] Strassen's Reloaded paper: https://www.cs.utexas.edu/~jianyu/papers/sc16.pdf |
|
And I also hadn't looked closely enough to see if amount of training data influenced matrix size. Makes sense that it would only influence a single dimension.