|
|
|
|
|
by sfpotter
206 days ago
|
|
I think this sentence: > But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board. is the most interesting one. A man hammering a nail into a board can be both beautiful and elegant! If you've ever seen someone effortlessly hammer nail after nail into wood without having to think hardly at all about what they're doing, you've seen a master craftsman at work. Speaking as a numerical analyst, I'd say a well multiplied matrix is much the same. There is much that goes into how deftly a matrix might be multiplied. And just as someone can hammer a nail poorly, so too can a matrix be multiplied poorly. I would say the matrices being multiplied in service of training LLMs are not a particularly beautiful example of what matrix multiplication has to offer. The fast Fourier transform viewed as a sparse matrix factorization of the DFT and its concomitant properties of numerical stability might be a better candidate. |
|
Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a G/TPU, we have a lot more parallelism available, so the sweet spot for size of the blocks may be larger than 2x2...
We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:
x' = (L@x + B@x + x)
The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)