Hacker News new | ask | show | jobs
by sdenton4 206 days ago
A somewhat more beautiful matmul for neural networks is given by the Monarch paper: https://arxiv.org/abs/2204.00595

Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a G/TPU, we have a lot more parallelism available, so the sweet spot for size of the blocks may be larger than 2x2...

We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:

x' = (L@x + B@x + x)

The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)

2 comments

https://hazyresearch.stanford.edu/blog/2023-12-11-truly-subq... is an approachable overview of the Monarch approach, for those interested!

There's a lot of opportunity here. Just because matrix multiplication makes for a beautiful mathematical building block, and a very reasonable one to build high-level ML logic on, doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.

I'm quite curious if this is being used in practice at scale, or whether it's still in the lab at the moment!

> doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.

I think this touches on something fundamental. As a stand-alone operation matmul is ugly because it's arbitrary. In other words.. if the goal was just to entangle values, there's a bunch of ways to do it, so why this particular way landing on ae+bg etc? You kind of need algebra/geometry to justify matmul this way, which makes it obviously useful, but now it's still ugly, exactly because you had to invoke this other stuff.

Compare that situation to algebra and geometry themselves, which in a real sense don't need each other. Or to things like logic, sets, categories, processes, numbers, knots, games, etc where you can build up piles of stuff based on it in a whole rich universe before you need to appeal to much that is "outside". And in those universes operations would be defined mostly in ways that were more like "natural" or "necessary" without anything feeling arbitrary.

Traditional matmul is beautiful in the sense of "connections across and between", where all the particulars do become necessary. For those that prefer a certain amount of abstract perfection / platonism / etc or those with a taste for foundations though, it's understandable if it's not that appealing. This is related to, but not the same as the pure vs applied split.

Do low rank/block diagonal matrices come up in LLMs often? What about banded or block tridiagonal? Intuitively banded matrices seem like they ought to be good at encoding things about the world… everything is connected but not randomly so.
Yep! Think of LORA for network fine tuning. Monarch (linked above) uses lots of block diagonality. These ideas also make flash attention flash.

I haven't seen banded matrices as much, though (with weight sharing) they're just convolutions. One nice feature of block diagonality is that you can express it as batched matrix multiplication, reusing all the existing matmul kernels.