Hacker News
by sdenton4 661 days ago
So, the thing is that linear algebra operations are very cheap already... you just need a lot of them. Any other 'cheap' method is going to have a similar problem: if the unit is small and not terribly expressive, you need a whole lot of them. But it will be compounded by the fact that we don't have decades of investment in making these new atomic operations as fast and cheap as possible.

A good take-away from the Wolfram writeup is that you can do machine learning on any pile of atoms you've got lying around, so you might as well do it on whatever you've got the best tooling for - right now this is silicon doing fixed-point linear algebra operations, by a long shot.

3 comments

My take is that the neural network is a bit of a red herring -- people poked around in brains to see what was going on and noticed a network structure with many apparently simple computing nodes. So they tried making similar structures in software and quickly discovered they could do some interesting things. But it may turn out that the neural network was just nature's best implementation for "field programmable matrix manipulation". You can implement the functionality in other ways, not resembling neural networks.
I think the point of Wolfram's essay is that you don't need the base unit of computation to be a dot product.
Sort of, yes. But if the existing thing were "the cheapest", quantization wouldn't exist.
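For anyone who hasn't seen quantization spelled out, here is a rough sketch of the idea: store weights as 8-bit integers plus one float scale, instead of 32-bit floats. This is a toy symmetric per-tensor scheme in plain Python; the function names and the sample weights are made up for illustration, and real libraries do this with more care (per-channel scales, calibration, and so on).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> ints in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor so we never divide by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the error by about scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The trade is 4x less memory per weight for a rounding error of at most about half a quantization step, which is presumably why it exists even though float matmuls are already "cheap".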

It depends on what your constraint is! So if you're memory constrained (or don't have a GPU), a bunch of 1-bit atoms with operations that are very fast on a CPU might be better.
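To make the "1-bit atoms" idea concrete, here is a minimal sketch of one well-known trick, assuming the values are restricted to +1/-1 and packed one per bit: the dot product reduces to an XOR plus a popcount. The helper names are hypothetical, and Python ints stand in for machine words; on real hardware the same thing maps to a couple of cheap CPU instructions per 64 elements.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int, one bit per element (bit set means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as bit-packed ints."""
    # XOR marks the positions where the signs differ; each such position
    # contributes -1 to the dot product, every other position contributes +1.
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


# Hypothetical example vectors, for illustration only.
a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))
packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
```

Here `packed` and `reference` agree, but the packed version never multiplies anything, and it needs one bit per element instead of 32.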

I haven't thought very deeply about whether it's provably faster to do gradient descent on 32 bits vs 8, but it probably always is. What's the next step to speed up training?

But to your point - that is how I feel about graph NNs vs. transformers or fully connected networks (GPUs are so good at transformers and fully connected NNs that even when a different structure makes sense, we don't have the hardware to make it worthwhile... unless grok makes it cheap??)
Perhaps; in a lot of cases the architecture barely matters. Transformers took a lot of extra tricks to get working well; the ConvNeXt paper showed that applying those same tricks to convolutional networks can fully close the gap.

https://arxiv.org/abs/2201.03545