by brunoalano
105 days ago
I've been experimenting with Muon on GNNs to see whether orthogonalizing the updates helps with the usual depth problems. In my runs, the shallow 2-layer setting looked mostly the same as AdamW. The more interesting case was moderate depth: at around 8 layers, Muon was noticeably more stable and gave better final results. I also saw a fairly large robustness gap under feature noise and edge dropout. The writeup focuses on the spectral side of the story: singular values, conditioning, and why the effect seems to show up more in deeper message-passing stacks than in the standard shallow benchmark regime. I included the negative results too: Muon is slower per epoch, it doesn't win everywhere, and at very large depth the optimizer alone is not enough.
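For anyone unfamiliar with what "orthogonalizing updates" means here, this is a minimal NumPy sketch of the idea, not the code from the writeup. It uses the classic cubic Newton-Schulz iteration to push every singular value of an update matrix toward 1. Public Muon implementations typically use a tuned higher-order polynomial for only a few steps, so treat this as an illustration of the spectral effect only. The names `orthogonalize_update` and `condition_number` are hypothetical helpers made up for this example.

```python
import numpy as np


def orthogonalize_update(grad, steps=30, eps=1e-7):
    """Approximate the nearest semi-orthogonal matrix to `grad`.

    Assumption: classic cubic Newton-Schulz, X <- 1.5 X - 0.5 (X X^T) X.
    This is an illustrative stand-in, not Muon's exact update rule.
    """
    x = np.asarray(grad, dtype=np.float64)
    # Frobenius normalization keeps all singular values in (0, 1],
    # which is inside the convergence region of the iteration.
    x = x / (np.linalg.norm(x) + eps)
    # Work on the wide orientation so x @ x.T is the smaller Gram matrix.
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def condition_number(m):
    """Ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(m, compute_uv=False)
    return s.max() / s.min()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.standard_normal((16, 8))  # stand-in for a layer's gradient
    ortho = orthogonalize_update(raw)
    print("condition number before:", condition_number(raw))
    print("condition number after: ", condition_number(ortho))
```

The point of the comparison is that the raw update has a spread of singular values, while the orthogonalized one has a condition number close to 1, which is the kind of quantity the spectral discussion is about.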