> Isn't this an old idea?
So are neural networks. So is attention.What's your point? Sometimes things need to be retried. Sometimes there are small subtle details that make or break an idea. So what's the point of acting dismissively? If an old idea that didn't work now works, then what's the problem? Besides, progress is typically iterative, not through leaps and bounds. The vast majority of things that look like leaps are just because we don't see the steps between. The reason I'm saying this is because that sentiment is often used to pass over working solutions and slows down their progress. So even if unintentional it should cause us to rethink how we respond. Otherwise we end up with such silly claims like "Einstein just used Tensors" and "Nash just used topology". In some sense these are accurate, but they are too high level descriptions (and these are real dismissals. Which again, so what? If it works, it works?). Why is "novelty" even a question? Novelty is only ever in the eyes of the beholder. > What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
Honestly, I do not know, but I'll give you my best read on it.1) Scale: Don't underestimate the importance of this. While I don't think scale is all you need, it certainly is a critical factor. 2) Different optimization: I may be missing something, but it looks like they are using a different optimizer. They mention that they're using the muon optimizer constraining to a Stiefel manifold. Neither of those things are unique on their own, but is their combination? This is where I'm uncertain because such a thing would be easy to miss. Maybe someone did and was unsuccessful with it. Maybe someone did, but was not at scale. Maybe someone did, it worked, and just nobody noticed (that happens a lot!). So I think this is quite similar to how 99% of progress and breakthroughs are made: putting together ideas that seem unrelated and inventing some glue to generalize the process. At a high level this always looks like you're just putting existing things together, but that glue is really hard to make. And to continue that analogy, if we do a good enough job gluing things together then to anyone but an expert it'll look like there is no glue. It can be surprisingly difficult to tell if something is glued, welded, mated, milled, printed, or whatever. It usually takes a very keen eye to determine the answer non-destructively. |
> I figured out how to solve manifold Muon in the square case late last year, but I was unable to solve the full rectangular case and thus posed the problem as an open problem on the Modula docs. Jianlin Su solved the problem this summer
It sounds like the generalisation of projected gradient decent to "Muon" is what they're focusing on, but the derivation is all about the retraction map on the Stiefel manifold? I think I'm missing some background here.