| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by miki123211 281 days ago

Loading a model takes at least a few seconds, usually more, depending on model size, disk / network speed and a bunch of other factors.

If you're using an efficient inference engine like VLLM, you're adding compilation into the mix, and not all of that is fully cached yet.

If that kind of latency isn't acceptable to you, you have to keep the models loaded.

This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.

3 comments

cnr 281 days ago

Let's say, then, that it's not so much "dumb and wasteful" as "energy inefficient". In fact, this can be quite wise in a modern world full of surveillance-as-a-business and "us-east-1 disasters"

link

cnr 281 days ago

Can you elaborate the last statement? Don't quite understand why loading local LLM to GPU RAM, using it for the job and then "ejecting" is "dumb and wasteful" idea?

link

carderne 281 days ago

Layman understanding:

Because as a function of hardware and electricity costs, a “cloud” GPU will be many times more efficient per output token. You aren’t loading/offloading models and don’t have any parts of the GPU waiting for input. Everything is fully saturated always.

link

OJFord 281 days ago

I believe GP means it still to be connected to 'if this kind of latency is unacceptable to you' - i.e. you can't load/use/unload, you have to keep it in RAM all the time.

In that case it's massively increasing your memory requirement not just to the peak the model needs, but to + whatever the other biggest use might be that'll be inherently concurrent with it.

link

behnamoh 281 days ago

> This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.

Local models are never a dumb idea. The only time it's dumb to use them in an enterprise is if the infra is Mac Studio with M3 Ultra because pp time is terrible.

link