by dnnssl2 927 days ago
If you were to serve this on a datacenter server, is the client-to-server network roundtrip the slowest part of inference? Curious whether it would be faster to run this on cloud GPUs, with better hardware but farther away, or locally with worse hardware.

Surprisingly, no. Part of this is that text generation is really expensive. Unlike traditional ML inference (like with ResNets), you don't just pass your data through your model once. You need to pass it through over and over again (once for each token you generate).

So, in practice, a full "text completion request" can often take on the order of seconds, which dwarfs the client <-> server roundtrip.
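
A rough back-of-the-envelope sketch of that comparison, in Python. Every number here (per-token latency, roundtrip time, token count) is an illustrative assumption, not a measurement, and the function name is made up for this example:

```python
def completion_latency_s(num_tokens, per_token_s, round_trip_s):
    """Wall-clock time for one completion: a single network roundtrip
    plus one sequential forward pass per generated token."""
    return round_trip_s + num_tokens * per_token_s

# Hypothetical setup: a fast remote GPU behind a 100 ms roundtrip
# versus slower local hardware with no network hop at all.
ROUND_TRIP_S = 0.1
remote = completion_latency_s(num_tokens=200, per_token_s=0.02, round_trip_s=ROUND_TRIP_S)
local = completion_latency_s(num_tokens=200, per_token_s=0.05, round_trip_s=0.0)

# Fraction of the remote request that is spent on the network.
network_share = ROUND_TRIP_S / remote

print(f"remote: {remote:.2f} s, local: {local:.2f} s, network share: {network_share:.1%}")
```

With these made-up numbers the roundtrip is a few percent of the remote request, so the per-token cost decides which option is faster. Real numbers will vary a lot with model size, hardware, and output length.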

Is this still the case for sliding window attention/streaming LLMs, where you have a fixed-length attention window rather than an ever-growing sequence of tokens with quadratic scaling? You even get better performance due to purposely downsampling non-meaningful attention sink tokens.

I cover it a bit in the blog post, but unless you have a really long context length (like 32k+), your primary computational cost doesn't come from attention but rather from loading your weights from VRAM into registers.
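
One way to sanity-check that claim is to compare how many bytes have to be read per generated token: the weights versus the KV cache that attention reads. This is only a sketch under loud assumptions: a hypothetical 7B-parameter model with 32 layers and hidden size 4096, fp16 everywhere, batch size 1, no grouped-query attention, and 1 TB/s of memory bandwidth. None of these figures come from the blog post.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Bytes read to stream every weight once (fp16 by default)."""
    return num_params * bytes_per_param

def kv_cache_bytes(context_len, num_layers, d_model, bytes_per_value=2):
    """Bytes of cached keys and values read by attention for one new token.
    The factor 2 covers keys plus values; assumes full multi-head attention."""
    return 2 * num_layers * d_model * bytes_per_value * context_len

def ms_to_read(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on time (ms) to read num_bytes at the given bandwidth."""
    return 1000 * num_bytes / bandwidth_bytes_per_s

# Hypothetical model and hardware (assumptions, see above).
NUM_PARAMS = 7_000_000_000
NUM_LAYERS = 32
D_MODEL = 4096
BANDWIDTH = 1e12  # bytes per second

weights = weight_bytes(NUM_PARAMS)
print(f"weights: {weights / 1e9:.1f} GB, {ms_to_read(weights, BANDWIDTH):.1f} ms per token")

for ctx in (2048, 32768):
    kv = kv_cache_bytes(ctx, NUM_LAYERS, D_MODEL)
    print(f"context {ctx}: KV cache {kv / 1e9:.1f} GB, {ms_to_read(kv, BANDWIDTH):.1f} ms per token")
```

Under these assumptions the weights dominate the per-token memory traffic at a 2k context, and the KV cache only overtakes them somewhere around 32k, which is consistent with the rule of thumb above. A fixed attention window would cap the KV cache term but leave the weight-loading term untouched.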

I mean, practically speaking, completions from, say, ChatGPT or Claude take seconds to finish :)