by lopuhin 853 days ago
If convergence were a matter of luck, the picture would look completely different, like white noise, but it clearly has a well-defined structure.

The reason for the high learning rate is that they used full-batch training (see the first cell in https://colab.research.google.com/github/Sohl-Dickstein/frac...), and when batch sizes are large, learning rates can typically be large as well. Plus, as others said, it's more of a toy problem; it would be hard to get such detail on anything non-toy.
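To make the batch-size point concrete, here is a rough sketch (not the notebook's code) that measures how much a minibatch gradient estimate varies at different batch sizes. The 1-D linear regression problem and the names `make_data`, `batch_grad` and `grad_spread` are made up for illustration. It only shows that gradient noise shrinks as the batch grows, which is one common intuition for why larger batches tend to tolerate larger learning rates.

```python
import random
import statistics


def make_data(n=256, seed=0):
    """Hypothetical toy dataset: y = 3x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def batch_grad(w, xs, ys, idx):
    """Gradient of the mean of 0.5 * (w*x - y)^2 w.r.t. w over the batch idx."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)


def grad_spread(batch_size, w=0.0, trials=200, seed=1):
    """Std. dev. of the gradient estimate across randomly drawn batches."""
    xs, ys = make_data()
    rng = random.Random(seed)
    grads = []
    for _ in range(trials):
        # Sample without replacement; batch_size == len(xs) is the full batch.
        idx = rng.sample(range(len(xs)), batch_size)
        grads.append(batch_grad(w, xs, ys, idx))
    return statistics.pstdev(grads)


if __name__ == "__main__":
    for bs in (1, 32, 256):
        print(bs, grad_spread(bs))
```

With the full batch, every draw contains the same samples, so the spread is zero up to floating-point rounding: the update direction is deterministic, which fits the clean structure in the plots. How large a learning rate is actually usable still depends on the loss surface, so treat this as intuition only.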