by mlyle 310 days ago
An A100 is probably 2-4k tokens/second on a 20B model with batched inference.

Multiply the number of A100s as needed.
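As a rough back-of-the-envelope sketch of that multiplication, using the 20 processes at 1k tokens/second each from the question and the 2-4k tokens/second per A100 guess above: the function name `gpus_needed` is made up for illustration, and the per-GPU figures are estimates, not benchmarks. It only covers aggregate throughput and says nothing about whether a single stream can actually reach its target rate.

```python
import math

def gpus_needed(streams: int, tokens_per_stream: float, tokens_per_gpu: float) -> int:
    """GPUs required to cover the aggregate token rate, rounded up.

    Hypothetical helper: assumes throughput adds up linearly across GPUs
    and that batching keeps every GPU at its estimated rate.
    """
    aggregate = streams * tokens_per_stream  # total tokens/second across all streams
    return math.ceil(aggregate / tokens_per_gpu)

# 20 processes at 1k tokens/second each = 20k tokens/second aggregate.
optimistic = gpus_needed(20, 1000, 4000)   # assuming 4k tokens/second per A100
pessimistic = gpus_needed(20, 1000, 2000)  # assuming 2k tokens/second per A100
print(optimistic, pessimistic)  # prints "5 10"
```

So on those guesses you would be looking at somewhere between 5 and 10 A100s for the aggregate load.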

Here, you don't really need the RAM. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.

Even with an A100, the sweet spot in batching is not going to give you 1k tokens/second per process. Of course, you could go up to an H100...

1 comments

You can batch only if you have distinct chats running in parallel,
> > if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each)