by bee_rider
235 days ago
I’m slightly confused as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use? I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed? (I’m not an AI guy, though; actually I’m used to asking SLURM for new nodes with every run I do!)

If you're using an efficient inference engine like vLLM, you're adding compilation into the mix, and not all of that is fully cached yet.
If that kind of latency isn't acceptable to you, you have to keep the models loaded.
This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.
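
To make the trade-off concrete, here's a toy sketch of the "free it after a couple of idle minutes" idea from the question. It is not how vLLM or any real scheduler works; `IdleUnloadingModel`, `load`, `idle_seconds` and `clock` are names made up for illustration, and `load` stands in for whatever slow step (reading weights, compilation) a real setup would pay.

```python
import time
from typing import Any, Callable, Optional


class IdleUnloadingModel:
    """Keeps a model loaded while it is in use; frees it after idle_seconds."""

    def __init__(self, load: Callable[[], Any], idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load                  # hypothetical slow loader
        self._idle_seconds = idle_seconds
        self._clock = clock                # injectable so the idle window can be tested
        self._model: Optional[Any] = None
        self._last_used = 0.0
        self.cold_starts = 0               # how many times the load cost was paid

    def evict_if_idle(self) -> None:
        """Drop the model once the idle window has passed."""
        idle_for = self._clock() - self._last_used
        if self._model is not None and idle_for > self._idle_seconds:
            self._model = None

    def get(self) -> Any:
        """Return the model, reloading it first if it was evicted."""
        self.evict_if_idle()
        if self._model is None:
            self._model = self._load()     # the request that lands here eats the latency
            self.cold_starts += 1
        self._last_used = self._clock()
        return self._model
```

Every request that arrives after the idle window pays a cold start, so a short timeout saves GPU time at the cost of latency, and a long one does the reverse.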