by coder543
899 days ago
> These data center targeted GPUs can only output that many tokens per second for large batches.

No… my RTX 3090 can output 130 tokens per second with Mistral at batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral. At larger batch sizes, the token rate would be enormous. Microsoft’s high-performing Phi-2 model breaks 200 tokens per second at batch size 1 on my RTX 3090. TinyLlama-1.1B reaches 350 tokens per second, though its usefulness may be questionable. We’re just used to datacenter GPUs being used for much larger models, which are much slower and cannot fit on today’s phones.

Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that in practice, because large batches result in much higher combined throughput.
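
For anyone who wants to sanity-check numbers like these, here is a rough back-of-the-envelope sketch. It assumes batch-size-1 decoding is limited by reading every weight from GPU memory once per token, which is why faster memory helps. The 936 GB/s figure is the RTX 3090's spec-sheet memory bandwidth. The roughly 7.2B parameter count for Mistral and the 4-bit quantization (0.5 bytes per parameter) are my assumptions for illustration, not details of the setup measured above, and `compute_cap` is a hypothetical ceiling rather than a measured value.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             params_billions: float,
                             bytes_per_param: float) -> float:
    """Rough upper bound on batch-size-1 decode speed.

    Assumes each generated token requires reading all weights once and
    ignores KV-cache traffic, kernel overhead, and compute limits.
    """
    weights_gb = params_billions * bytes_per_param  # GB read per token
    return bandwidth_gb_s / weights_gb


def batched_tokens_per_second(single_rate: float,
                              batch_size: int,
                              compute_cap: float) -> float:
    """Combined throughput when one weight read serves the whole batch.

    compute_cap is a hypothetical compute-bound ceiling in tokens/s.
    """
    return min(single_rate * batch_size, compute_cap)


if __name__ == "__main__":
    # Assumed: ~7.2B parameters at 4-bit quantization on an RTX 3090.
    bound = decode_tokens_per_second(936.0, 7.2, 0.5)
    print(f"batch-1 upper bound: {bound:.0f} tokens/s")

    # Hypothetical cap, chosen only to show where batching stops scaling.
    for batch in (1, 8, 64):
        total = batched_tokens_per_second(bound, batch, compute_cap=5000.0)
        print(f"batch {batch}: about {total:.0f} tokens/s combined")
```

Under those assumptions the estimate lands above the 130 tokens per second measured on the 3090, which is what you would expect from an upper bound. The second function is the batching argument in miniature: per-session speed stays roughly flat while combined throughput grows with batch size until compute becomes the limit.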