by jampekka 450 days ago
I've been wondering how training the huge models works in practice. If a single optimization run costs millions, they probably can't just run a grid search over hyperparameters.

Run a hyperparameter grid on small models of several different sizes to find out how the optimal values change as you scale up (the "scaling laws"). Then extrapolate those trends to predict the settings and performance at even larger scales, do a single large run, and hope that your predictions aren't too far off.
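
For the extrapolation step, here is a rough sketch of what that can look like, assuming the trend is a power law in model size (a common assumption, not a given). All numbers are synthetic and the names `fit_power_law`, `predict`, `best_lr` and `best_loss` are made up for illustration; in practice the arrays would hold the best result from each small-model grid.

```python
import numpy as np

def fit_power_law(sizes, values):
    """Fit values ~ a * sizes**b with a least-squares line in log-log space."""
    # polyfit returns coefficients highest degree first: [slope, intercept]
    b, log_a = np.polyfit(np.log(sizes), np.log(values), deg=1)
    return np.exp(log_a), b

def predict(a, b, size):
    """Evaluate the fitted power law at a (possibly much larger) size."""
    return a * size ** b

# Parameter counts of the small models that each got a full grid search.
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])

# Synthetic stand-ins for the grid winners: the best learning rate found
# at each size, and the loss it reached. Real data would be noisy.
best_lr = 0.5 * sizes ** -0.3
best_loss = 20.0 * sizes ** -0.1

lr_a, lr_b = fit_power_law(sizes, best_lr)
loss_a, loss_b = fit_power_law(sizes, best_loss)

# Extrapolate two orders of magnitude past the largest small model.
target_size = 1e11
print("predicted learning rate:", predict(lr_a, lr_b, target_size))
print("predicted loss:", predict(loss_a, loss_b, target_size))
```

The fit is only as good as the assumption that the trend keeps holding outside the measured range, which is the "hope" part.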