
by HarHarVeryFunny 793 days ago

The Chinchilla paper is about how to design/train a model to optimize use of computing power, which we can equate to cost (FLOPs cost dollars).

The question that Chinchilla tries to answer is: for a given training budget (which you can think of as dollars or FLOPs), what is the optimal trade-off between model size and quantity of training data to get the best-performing model? Do you build a larger model and train it on less data, or build a smaller one and train it on more data?
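To make the trade-off concrete, here is a rough sketch. It assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D), plus the often-quoted rule of thumb of roughly 20 training tokens per parameter for a compute-optimal model. Both are approximations, and the function name and default ratio are my own illustrative choices.

```python
import math

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a training FLOP budget into (n_params, n_tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N.
    Both are rough rules of thumb, not exact laws.
    """
    # Substituting D = r * N into C = 6 * N * D gives C = 6 * r * N**2.
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    n, d = compute_optimal_split(5.76e23)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

With a budget of 5.76e23 FLOPs this gives on the order of 7e10 parameters and 1.4e12 tokens. Under these assumptions, doubling the budget grows both model size and token count by a factor of sqrt(2) rather than putting everything into a bigger model.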

However, another consideration is minimizing the total lifetime cost of the model: training cost + inference cost. You could train a smaller model for longer (costing more up front) to reach a given level of performance with a model that is cheaper at inference, or vice versa. Each projected lifetime inference volume gives a different answer.
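A back-of-the-envelope sketch of that lifetime-cost argument, under loudly stated assumptions: a parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta with constants close to the fit reported in the Chinchilla paper (treat them as illustrative), training cost of about 6·N·D FLOPs, and inference cost of about 2·N FLOPs per token. The function names and the search grid are hypothetical.

```python
# Illustrative constants for L(N, D) = E + A / N**ALPHA + B / D**BETA.
# Approximate published fit; treat as an assumption, not ground truth.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(n_params, target_loss):
    """Training tokens required for a model of n_params to reach target_loss."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None  # this model is too small to ever reach the target
    return (B / gap) ** (1.0 / BETA)

def lifetime_flops(n_params, target_loss, inference_tokens):
    """Training FLOPs (~6*N*D) plus lifetime inference FLOPs (~2*N per token)."""
    d = tokens_needed(n_params, target_loss)
    if d is None:
        return None
    return 6.0 * n_params * d + 2.0 * n_params * inference_tokens

def cheapest_model_size(target_loss, inference_tokens):
    """Scan model sizes from 1e8 to 1e12 parameters on a log grid."""
    best = None
    for i in range(401):
        n = 10 ** (8 + 4 * i / 400)
        cost = lifetime_flops(n, target_loss, inference_tokens)
        if cost is not None and (best is None or cost < best[1]):
            best = (n, cost)
    return best

if __name__ == "__main__":
    for volume in (0.0, 1e12, 1e13):
        n, cost = cheapest_model_size(2.0, volume)
        print(f"inference tokens {volume:.0e}: best N ~ {n:.3g}, FLOPs ~ {cost:.3g}")
```

With zero inference volume the search recovers the training-compute-optimal size for that loss. As the projected inference volume grows, the cheapest model gets smaller and is trained on more tokens, which is the effect described above.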

It's not that Chinchilla-optimal models stopped making sense, but rather that this sort of consideration has people willing to pump more money (tokens) into smaller models to reduce inference cost for that level of capability.

1 comments

KuriousCat 793 days ago

Does the paper assume uniform settings throughout the training phase? Or is it a bound that holds no matter what training strategy is used, given the dataset?


HarHarVeryFunny 792 days ago

They only experimented with different cosine learning rate decay schedules, but found the results were consistent across them. The results were also consistent across two different types of experiment: one where they varied the number of training tokens for a given model size, and one where they varied the model size for a given number of training FLOPs.
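For anyone unfamiliar, a cosine decay schedule looks roughly like the sketch below. The peak and final learning rates here are placeholder values I picked, not the paper's settings. The knob that matters for comparing schedules is the cycle length, i.e. over how many steps the decay is spread.

```python
import math

def cosine_lr(step, cycle_steps, lr_max=2e-4, lr_min=2e-5):
    """Cosine decay from lr_max to lr_min over cycle_steps, then held at lr_min.

    lr_max and lr_min are placeholder values for illustration only.
    """
    progress = min(step, cycle_steps) / cycle_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    # Same step, two different cycle lengths: the shorter cycle has already
    # fully decayed, while the longer one is only halfway down.
    print(cosine_lr(500, cycle_steps=500))
    print(cosine_lr(500, cycle_steps=1000))
```

So a run that stops at step 500 under a 1000-step cycle ends with a much higher learning rate than one whose cycle was set to 500 steps, which is why the cycle length is the setting worth varying.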
