by mlyle
310 days ago
An A100 probably manages 2-4k tokens/second on a 20B model with batched inference; scale the number of A100s to the throughput you need. You don't really need the RAM here. If you can accept fewer tokens/second, you could do it much more cheaply with consumer graphics cards. Even on an A100, the batching sweet spot is not going to give you 1k tokens/second per process. Of course, you could go up to H100...
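
For anyone who wants to plug in their own numbers, here is a rough back-of-the-envelope sketch of that arithmetic. The 2-4k tokens/second range is the estimate from the comment; the target throughput and batch size are made-up placeholders, and the even split across the batch is a simplifying assumption, not a measurement.

```python
import math

# Rough aggregate throughput for one A100 on a 20B model with batching,
# taken from the estimate above (tokens/second). Not a benchmark.
A100_TOKENS_PER_SEC_LOW = 2_000
A100_TOKENS_PER_SEC_HIGH = 4_000


def gpus_needed(target_tokens_per_sec, per_gpu_tokens_per_sec):
    """GPUs required to reach a target aggregate throughput, rounded up."""
    if target_tokens_per_sec <= 0 or per_gpu_tokens_per_sec <= 0:
        raise ValueError("throughputs must be positive")
    return math.ceil(target_tokens_per_sec / per_gpu_tokens_per_sec)


def per_stream_tokens_per_sec(per_gpu_tokens_per_sec, batch_size):
    """Tokens/second each request sees, assuming the batch shares the GPU evenly."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return per_gpu_tokens_per_sec / batch_size


if __name__ == "__main__":
    target = 50_000   # hypothetical aggregate target, tokens/second
    batch_size = 16   # hypothetical number of concurrent requests per GPU

    print(gpus_needed(target, A100_TOKENS_PER_SEC_HIGH))  # 13 (optimistic)
    print(gpus_needed(target, A100_TOKENS_PER_SEC_LOW))   # 25 (pessimistic)
    print(per_stream_tokens_per_sec(A100_TOKENS_PER_SEC_HIGH, batch_size))  # 250.0
```

With those placeholder numbers, each stream lands well under 1k tokens/second even at the optimistic end, which is the point about the batching sweet spot.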