by sdenton4
206 days ago
A somewhat more beautiful matmul for neural networks is given by the Monarch paper: https://arxiv.org/abs/2204.00595

Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a GPU/TPU we have a lot more parallelism available, so the sweet spot for the block size may be larger than 2x2...

We can also mix low-rank, block-diagonal, and residual connections to get the best of both worlds: x' = L@x + B@x + x, where L is low-rank and B is block-diagonal. The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)
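To make the x' = L@x + B@x + x structure concrete, here is a minimal NumPy sketch. It is not the Monarch parameterization from the paper, just the low-rank plus block-diagonal plus residual mix described above. The sizes (d=1024, rank 32, 32x32 blocks) and the names `init_params`, `structured_matmul`, and `param_count` are illustrative assumptions, picked so that the parameter saving lands near the ~90% figure.

```python
import numpy as np

def init_params(d, rank, block, seed=0):
    """Random parameters for a low-rank + block-diagonal layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d)
    return {
        # L = U @ V is a d x d matrix of rank at most `rank`
        "U": rng.normal(size=(d, rank)) * scale,
        "V": rng.normal(size=(rank, d)) * scale,
        # B is block-diagonal: d // block independent (block x block) matrices
        "blocks": rng.normal(size=(d // block, block, block)) * scale,
    }

def structured_matmul(params, x):
    """Compute L@x + B@x + x without ever building a dense d x d matrix."""
    U, V, blocks = params["U"], params["V"], params["blocks"]
    n_blocks, block, _ = blocks.shape
    # 'broadcast' work: project down to `rank` dims, then back up to d
    low_rank = U @ (V @ x)
    # 'local' work: each block only sees its own slice of x
    local = np.einsum("kij,kj->ki", blocks, x.reshape(n_blocks, block)).reshape(-1)
    # residual connection
    return low_rank + local + x

def param_count(params):
    return sum(p.size for p in params.values())

d, rank, block = 1024, 32, 32
params = init_params(d, rank, block)
x = np.ones(d)
y = structured_matmul(params, x)

dense_params = d * d
saving = 1 - param_count(params) / dense_params
print(f"structured: {param_count(params)} params, dense: {dense_params}, saving: {saving:.1%}")
```

With these assumed sizes the low-rank part costs 2*d*rank parameters and the block-diagonal part costs d*block, against d*d for the dense matrix. Whether quality actually holds up at that ratio depends on the model and task, so treat the numbers as arithmetic rather than a benchmark.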
|
There's a lot of opportunity here. Just because matrix multiplication makes for a beautiful mathematical building block, and a very reasonable one to build high-level ML logic on, doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.
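The FFT mentioned above is a handy worked case of computing the same matmul in a different order. Below is a small NumPy sketch of the radix-2 Cooley-Tukey factorization, which writes the dense DFT matrix as a product of sparse factors. Each butterfly factor has only two nonzeros per row, which is the "2x2 block" structure once rows and columns are suitably permuted. The helper names (`butterfly`, `even_odd_perm`, `fft_factors`) and the size n=16 are assumptions for illustration, and the explicit matrices are built only to check the algebra; a real FFT never materializes them.

```python
from functools import reduce
import numpy as np

def butterfly(n):
    """Combine two half-size DFTs: [[I, D], [I, -D]] with twiddle factors on D."""
    half = n // 2
    twiddles = np.exp(-2j * np.pi * np.arange(half) / n)
    I, D = np.eye(half), np.diag(twiddles)
    return np.block([[I, D], [I, -D]])

def even_odd_perm(n):
    """Permutation matrix that moves even-indexed entries first, then odd ones."""
    return np.eye(n)[np.r_[0:n:2, 1:n:2]]

def fft_factors(n):
    """Sparse factors whose left-to-right product is the n x n DFT matrix.

    Uses F_n = butterfly(n) @ (I_2 kron F_{n/2}) @ even_odd_perm(n), n a power of 2.
    """
    if n == 1:
        return []
    lifted = [np.kron(np.eye(2), m) for m in fft_factors(n // 2)]
    return [butterfly(n)] + lifted + [even_odd_perm(n)]

n = 16
factors = fft_factors(n)
dft_from_factors = reduce(np.matmul, factors, np.eye(n))
dft_dense = np.fft.fft(np.eye(n))  # the DFT matrix itself

max_nonzeros_per_row = max(int(np.count_nonzero(m, axis=1).max()) for m in factors)
total_nonzeros = sum(int(np.count_nonzero(m)) for m in factors)
print(np.allclose(dft_from_factors, dft_dense), max_nonzeros_per_row, total_nonzeros)
```

The product is the same linear map as the dense matrix, but applying the factors one at a time touches far fewer entries, and the gap widens as n grows. Swapping the 2x2 butterflies for larger dense blocks is the direction the block-size remark above points at.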
I'm quite curious if this is being used in practice at scale, or whether it's still in the lab at the moment!