I've been wondering how the training process of the huge models works in practice. If an optimization run costs millions, they probably don't just run a grid of hyperparameters.
Run a grid of hyperparameters for small models of different sizes to find out how the optimal values change as you scale up (the "scaling laws"), then extrapolate to predict performance at even larger scales, then do a single large run and hope that your predictions aren't too far off.