| For models like BERT, where you just want to extract an embedding, the naive way to reach full utilization at inference time is to: (1) tokenize the inputs on CPU, (2) sort the inputs by length, (3) batch inputs of similar length and pad each batch to a uniform length, and (4) pass the batches through the model so that a single model instance processes many inputs in parallel. For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every generated token. (Stopping criteria may also differ, but that’s another tangent.) Every generated token attends to every previous token, both the context (or “prompt”) and the previously generated tokens (important for self-consistency), so in the vanilla case the total attention cost is quadratic in sequence length. Models are large, often spanning multiple machines, and later layers depend on the outputs of earlier ones, meaning inference has to be pipelined. The naive approach would be to have a single transaction processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, if you want to run something like Codex or ChatGPT for millions of users with low-latency inference, you’d need thousands of GPUs preloaded with models, and each transaction would take a highly variable amount of time. If a model spans n machines, you’d achieve at most 1/n utilization, because each shard has to remain loaded while the others process. And if you want to do pipeline parallelism as in PipeDream, you’d have to deal with attention caches, since you don’t want to recompute every previous state at each step. |
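
As a rough illustration of the BERT-style batching steps described at the start of this section, here is a minimal sketch in plain Python. It is not any particular library's API: `toy_tokenize`, `make_batches`, and `PAD_ID` are hypothetical names, the whitespace tokenizer stands in for a real one, and the final step (running the model on each batch) is left out.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding id for this sketch; real tokenizers define their own


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Stand-in whitespace tokenizer (hypothetical); ids start at 1 so 0 stays free for padding."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def make_batches(texts: List[str], batch_size: int) -> List[dict]:
    """Tokenize, sort by length, group neighbours, and pad each group to its own max length."""
    vocab: Dict[str, int] = {}

    # Step 1: tokenize on CPU.
    encoded = [toy_tokenize(t, vocab) for t in texts]

    # Step 2: sort by length, keeping original positions so outputs can be un-shuffled later.
    order = sorted(range(len(encoded)), key=lambda i: len(encoded[i]))

    # Step 3: batch inputs of similar length and pad to the longest item in each batch.
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(encoded[i]) for i in idx)
        batches.append({
            "indices": idx,  # positions in the original `texts` list
            "input_ids": [encoded[i] + [PAD_ID] * (width - len(encoded[i])) for i in idx],
            "attention_mask": [[1] * len(encoded[i]) + [0] * (width - len(encoded[i])) for i in idx],
        })

    # Step 4 (not shown): feed each batch to the model and scatter results back via "indices".
    return batches


if __name__ == "__main__":
    demo = ["a b c d e f", "a b", "a b c", "a b c d e", "a"]
    for batch in make_batches(demo, batch_size=2):
        print(batch["indices"], batch["input_ids"])
```

Sorting first means each batch is padded only to the length of its own longest member, so less compute is spent on padding tokens than when batching in arrival order. This trick does not carry over directly to decoder models, where sequences in a batch grow by one token per step and finish at different times.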