by baq 755 days ago
My takeaway from the paper is that you can guide training by adding or switching to a more difficult loss function once the model has got the basics right. It looks like they never trained far enough to reach the overfitting/grokking regime, so maybe there's more to discover further down the training alley.
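
To make the "add or switch to a harder loss later" idea concrete, here is a rough sketch in plain Python. It is not the paper's method: the two losses (`mse`, `max_abs_error`), the `switch_step` threshold, and the `harder_weight` knob are all hypothetical choices of mine, picked only to show the shape of such a schedule.

```python
def mse(pred, target):
    """Mean squared error: the 'easy' base objective (illustrative choice)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def max_abs_error(pred, target):
    """Worst-case absolute error: stands in for a 'harder' objective (assumption)."""
    return max(abs(p - t) for p, t in zip(pred, target))


def scheduled_loss(step, pred, target, switch_step=1000, harder_weight=1.0):
    """Return the base loss before switch_step, then add the harder term.

    switch_step and harder_weight are hypothetical hyperparameters,
    not values taken from the paper.
    """
    base = mse(pred, target)
    if step < switch_step:
        return base
    return base + harder_weight * max_abs_error(pred, target)


if __name__ == "__main__":
    pred, target = [1.0, 2.0], [1.0, 4.0]
    print(scheduled_loss(0, pred, target))     # early in training: base loss only
    print(scheduled_loss(1000, pred, target))  # after the switch: base + harder term
```

In a real training loop you would compute this on tensors and probably ramp `harder_weight` up gradually instead of flipping it on at a single step, but that is a design choice rather than something the paper prescribes as far as I can tell.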