There are three ways to train:

- best score
  - don't care about efficiency (GPT3, GPT4)
- best score for a fixed quantity of compute at training time
  - good for PhDs and people who make proofs-of-concept (Chinchilla)
- best score for a fixed quantity of compute at inference time
  - good for people who run inference on their models at scale (LLaMA, ChatGPT turbo)

The article didn't mention the LLaMA scaling laws, where we use more than 20 tokens per weight: about 142 tokens per weight for LLaMA 7B (1T tokens over 7B weights).

> The objective of the scaling laws from Chinchilla is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Chinchilla recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

What we care about is the best model we could run on our own hardware, not how efficient its training was; training cost doesn't cost us users anything.
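For anyone who wants to check the tokens-per-weight numbers, here is a rough back-of-the-envelope sketch in Python. The model and dataset sizes are the rounded figures from the quote (10B weights on 200B tokens, 7B weights on 1T tokens). The FLOP estimates are my own assumption, not from the article or the quote: they use the common rule of thumb of about 6 FLOPs per weight per training token and about 2 FLOPs per weight per generated token. The function names are made up for illustration. This only compares compute and says nothing about which model scores better.

```python
def tokens_per_weight(tokens: float, weights: float) -> float:
    """Training tokens seen per model weight."""
    return tokens / weights


def training_flops(weights: float, tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per weight per token (assumption)."""
    return 6 * weights * tokens


def inference_flops_per_token(weights: float) -> float:
    """Rule-of-thumb inference cost: ~2 FLOPs per weight per token (assumption)."""
    return 2 * weights


# Rounded sizes taken from the quoted passage.
chinchilla_style = {"weights": 10e9, "tokens": 200e9}
llama_7b = {"weights": 7e9, "tokens": 1e12}

for name, cfg in [("10B on 200B", chinchilla_style), ("7B on 1T", llama_7b)]:
    ratio = tokens_per_weight(cfg["tokens"], cfg["weights"])
    train = training_flops(cfg["weights"], cfg["tokens"])
    infer = inference_flops_per_token(cfg["weights"])
    print(f"{name}: {ratio:.1f} tokens/weight, "
          f"{train:.2e} training FLOPs, {infer:.2e} FLOPs per generated token")
```

Under these assumptions the 7B run costs about 3.5x more compute to train but about 0.7x as much per generated token, which is the trade-off the quote is describing.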
Training cost is a huge (but diffuse) cost because it limits which organizations can train a model. Such concentration of power greatly affects users.