by minimaxir
1014 days ago
LLMs are GPU compute-bound. If you run inference at batch_size = 1 on a model like Llama 2 7B on a "cheap" GPU like a T4 or an L4, it'll use about 100% of the compute, which means you get no benefit from batching. The exception is the A100, which does not use 100% of GPU compute at that batch size, so you do benefit from batching, but it is hella expensive. The economics are not simple, and in most cases "just use the ChatGPT API" is the most cost-effective option anyway. A smaller 1.1B model (which would likely not be compute-bound) with performance similar to a 7B model may tip the scales.
From what I understand, they are severely memory-bandwidth-bound at a GPU batch size of 1. Even llama.cpp is fairly RAM-speed-bound on a CPU, which has much less compute than a GPU.
It's just that batching is quite inefficient without an implementation like this: https://www.anyscale.com/blog/continuous-batching-llm-infere...
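
A rough back-of-envelope sketch of the bandwidth argument, in Python. The hardware figures are assumptions (approximate T4 spec-sheet numbers, not measurements), and the model ignores the KV cache, activations, and kernel overhead. It only compares the time to stream fp16 weights from memory once per decode step with the time to do roughly 2 FLOPs per parameter per token.

```python
# Back-of-envelope roofline estimate for single-stream decoding.
# All hardware numbers are approximate, assumed figures, not measurements.

PARAMS = 7e9            # Llama 2 7B parameter count (approximate)
BYTES_PER_PARAM = 2     # fp16 weights
MEM_BW = 320e9          # assumed T4 memory bandwidth, bytes/s
PEAK_FLOPS = 65e12      # assumed T4 fp16 tensor-core peak, FLOP/s


def decode_step_times(batch_size: int) -> tuple[float, float]:
    """Return (memory_time, compute_time) in seconds for one decode step."""
    # Every step streams all weights from memory once, regardless of batch size.
    memory_time = PARAMS * BYTES_PER_PARAM / MEM_BW
    # Roughly 2 FLOPs (multiply + add) per parameter per generated token.
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS
    return memory_time, compute_time


def break_even_batch() -> float:
    """Batch size at which compute time catches up with memory time."""
    return PEAK_FLOPS * BYTES_PER_PARAM / (2 * MEM_BW)


mem, comp = decode_step_times(1)
print(f"batch 1: memory {mem * 1e3:.1f} ms, compute {comp * 1e3:.2f} ms")
print(f"break-even batch size: {break_even_batch():.0f}")
```

Under these assumptions the weight-streaming time at batch size 1 is a couple of orders of magnitude larger than the compute time, which is why extra sequences in a batch should be nearly free until the batch gets large. Real utilization numbers will differ, since peak FLOPS are rarely reached in practice.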