Hacker News new | ask | show | jobs
by sfpotter 206 days ago
I think this sentence:

> But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board.

is the most interesting one.

A man hammering a nail into a board can be both beautiful and elegant! If you've ever seen someone effortlessly hammer nail after nail into wood without having to think hardly at all about what they're doing, you've seen a master craftsman at work. Speaking as a numerical analyst, I'd say a well multiplied matrix is much the same. There is much that goes into how deftly a matrix might be multiplied. And just as someone can hammer a nail poorly, so too can a matrix be multiplied poorly. I would say the matrices being multiplied in service of training LLMs are not a particularly beautiful example of what matrix multiplication has to offer. The fast Fourier transform viewed as a sparse matrix factorization of the DFT and its concomitant properties of numerical stability might be a better candidate.

3 comments

A somewhat more beautiful matmul for neural networks is given by the Monarch paper: https://arxiv.org/abs/2204.00595

Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a G/TPU, we have a lot more parallelism available, so the sweet spot for size of the blocks may be larger than 2x2...

We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:

x' = (L@x + B@x + x)

The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)

https://hazyresearch.stanford.edu/blog/2023-12-11-truly-subq... is an approachable overview of the Monarch approach, for those interested!

There's a lot of opportunity here. Just because matrix multiplication makes for a beautiful mathematical building block, and a very reasonable one to build high-level ML logic on, doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.

I'm quite curious if this is being used in practice at scale, or whether it's still in the lab at the moment!

> doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.

I think this touches on something fundamental. As a stand-alone operation matmul is ugly because it's arbitrary. In other words.. if the goal was just to entangle values, there's a bunch of ways to do it, so why this particular way landing on ae+bg etc? You kind of need algebra/geometry to justify matmul this way, which makes it obviously useful, but now it's still ugly, exactly because you had to invoke this other stuff.

Compare that situation to algebra and geometry themselves, which in a real sense don't need each other. Or to things like logic, sets, categories, processes, numbers, knots, games, etc where you can build up piles of stuff based on it in a whole rich universe before you need to appeal to much that is "outside". And in those universes operations would be defined mostly in ways that were more like "natural" or "necessary" without anything feeling arbitrary.

Traditional matmul is beautiful in the sense of "connections across and between", where all the particulars do become necessary. For those that prefer a certain amount of abstract perfection / platonism / etc or those with a taste for foundations though, it's understandable if it's not that appealing. This is related to, but not the same as the pure vs applied split.

Do low rank/block diagonal matrices come up in LLMs often? What about banded or block tridiagonal? Intuitively banded matrices seem like they ought to be good at encoding things about the world… everything is connected but not randomly so.
Yep! Think of LORA for network fine tuning. Monarch (linked above) uses lots of block diagonality. These ideas also make flash attention flash.

I haven't seen banded matrices as much, though (with weight sharing) they're just convolutions. One nice feature of block diagonality is that you can express it as batched matrix multiplication, reusing all the existing matmul kernels.

https://www.youtube.com/watch?v=Ruf-cLr2PZ8 I always think of this when thinking about the gracefulness of a hammer.
Turns out it’s the skill of the person handling the hammer that matters most. Enlightening! Appreciate the link!
Upholstery tacks are sold sterilized for people holding them in their mouths. Now I wonder the same about drywall nails!

Thanks for the link; that is absolutely masterful work.

Wow, thank you for this gem!!!
Yes!
>The fast Fourier transform viewed as a sparse matrix factorization of the DFT

Riffing further on the Fourier connection: are you planning to explore the link between matmul and differentiation?

Using the "Pauli-Z" matrix that you introduced without a straightforward motivation, eg.

(I took it that you intended it to be a "backyard instance" of "dual numbers")