Hacker News new | ask | show | jobs
by KuriousCat 793 days ago
Does the paper assume uniform settings through out the training phase? Or is it the bound no matter what training strategy is used given the dataset?
1 comments

They only experimented with different cosine learning rate decay schedules, but found results consistent across these, as well as across two different types of experiment where they either varied number of training tokens for a given model size, or varied model size for a given number of training FLOPs.