by coder543
900 days ago
What does “snappier” even mean in this context? The latency of connecting to a server over most network connections isn’t really noticeable for text generation. If a server with a beefy datacenter-class GPU were running the same Mistral you can run on your phone, it would be spitting out hundreds of tokens per second, and most responses would appear on your screen before you blink. There is no expectation that phones will ever be comparable in performance for LLMs. Mistral runs at a decent clip on phones, but we’re talking like 11 tokens per second, not hundreds. In practice, server-based models tend to be only slightly faster than Mistral on my phone because they’re usually running much larger, much more accurate/useful models, which currently can’t fit onto phones. Running models locally is not motivated by performance, unless you’re somewhere without reliable internet.
|
That’s why, even though these GPUs deliver very high throughput in tokens/second, responses do not appear instantly and individual users observe non-trivial latency.
Another interesting consequence: running these ML models with batch size = 1 (as happens on end-user computers or phones) is practically guaranteed to bottleneck on memory. Compute performance and tensor cores are irrelevant for this use case; the only number that matters is memory bandwidth.
For example, I’ve tested my Mistral implementation on a desktop with an Nvidia 1080 Ti versus a laptop with the Radeon Vega 7 inside a Ryzen 5 5600U. The performance difference between them is close to 10x, and the reason is memory bandwidth: 484 GB/second for GDDR5X in the desktop versus 50 GB/second for dual-channel DDR4-3200 in the laptop. Theoretical compute performance, by contrast, differs only by a factor of 6.6 (10.6 versus 1.6 TFlops).
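To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It uses the bandwidth and TFlops figures quoted above and assumes that, at batch size 1, every model weight is read from memory once per generated token, so bandwidth divided by model size gives an upper bound on tokens/second. The function name and the 4 GB model size are hypothetical, picked only for illustration (roughly a 7B-parameter model with 4-bit weights); they are not measurements.

```python
def bandwidth_bound_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on batch-size-1 decode speed.

    Assumes each generated token requires reading all weights once,
    and ignores compute time, caches, and KV-cache traffic.
    """
    return bandwidth_gb_s / model_size_gb


# Figures quoted in the comment above.
DESKTOP_BANDWIDTH_GB_S = 484.0  # GDDR5X on the 1080 Ti
LAPTOP_BANDWIDTH_GB_S = 50.0    # dual-channel DDR4-3200
DESKTOP_TFLOPS = 10.6
LAPTOP_TFLOPS = 1.6

# Hypothetical model size, NOT from the comment: assumed for illustration only.
ASSUMED_MODEL_SIZE_GB = 4.0

bandwidth_ratio = DESKTOP_BANDWIDTH_GB_S / LAPTOP_BANDWIDTH_GB_S
compute_ratio = DESKTOP_TFLOPS / LAPTOP_TFLOPS

desktop_bound = bandwidth_bound_tokens_per_second(DESKTOP_BANDWIDTH_GB_S, ASSUMED_MODEL_SIZE_GB)
laptop_bound = bandwidth_bound_tokens_per_second(LAPTOP_BANDWIDTH_GB_S, ASSUMED_MODEL_SIZE_GB)

print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
print(f"compute ratio:   {compute_ratio:.2f}x")
print(f"desktop upper bound: {desktop_bound:.1f} tokens/s")
print(f"laptop upper bound:  {laptop_bound:.1f} tokens/s")
```

The bandwidth ratio comes out near 9.7x, which matches the observed "close to 10x" far better than the 6.6x compute ratio does. The absolute tokens/second bounds depend entirely on the assumed model size, so treat them as an order-of-magnitude guide only.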