|
|
|
|
|
by Const-me
899 days ago
|
|
These data center targeted GPUs can only output that many tokens per second for large batches. These tokens are shared between hundreds or even thousands of users concurrently accessing the same server. That’s why despite these GPUs deliver very high throughput in tokens/second, responses do not appear instantly, and individual users observe non-trivial latency. Another interesting consequence, running these ML models with batch size = 1 (when running on end-user computers or phones) is practically guaranteed to bottleneck on memory. Computation performance or tensor cores are irrelevant for the use case, the only number which matters is memory bandwidth. For example, I’ve tested my Mistral implementation on desktop with nVidia 1080Ti versus laptop with Radeon Vega 7 inside Ryzen 5 5600U. The performance difference between them is close to 10x, because memory: 484 GB/second for GDDR5X in the desktop versus 50 GB/second for dual-channel DDR4-3200 in the laptop. This is despite theoretical compute performance only differs by the factor of 6.6, the numbers are 10.6 versus 1.6 TFlops. |
|
No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.
At larger batch sizes, the token rate would be enormous.
Microsoft’s high performing Phi-2 model breaks 200 tokens per second on batch size 1 on my RTX 3090. TinyLlama-1.1B is 350 tokens per second, though its usefulness may be questionable.
We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.