by miki123211
234 days ago
Loading a model takes at least a few seconds, usually more, depending on model size, disk or network speed, and a bunch of other factors. If you're using an efficient inference engine like vLLM, you're adding compilation into the mix, and not all of that is fully cached yet. If that kind of latency isn't acceptable to you, you have to keep the models loaded. This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.
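
To make the "keep the models loaded" point concrete, here's a rough Python sketch. It's a toy, not how vLLM or any real engine does it: `get_model` is a hypothetical loader, the `time.sleep` stands in for the multi-second weight load and compilation, and the returned dict is a placeholder for a real model object. All it shows is that the first request pays the cold-load cost and later ones don't, as long as the process keeps the model resident.

```python
import time
from functools import lru_cache

LOAD_CALLS = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=1)  # keep exactly one model resident in this process
def get_model(name: str) -> dict:
    """Hypothetical loader; stands in for reading weights and compiling."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    time.sleep(0.05)  # placeholder for a cold load that really takes seconds
    return {"name": name}  # placeholder for a real model object


def timed_request(name: str) -> float:
    """Return how long it takes to get a usable model for one request."""
    start = time.perf_counter()
    get_model(name)
    return time.perf_counter() - start


cold = timed_request("demo-model")  # pays the load cost
warm = timed_request("demo-model")  # served from the resident copy
print(f"cold: {cold:.3f}s, warm: {warm:.6f}s, loads: {LOAD_CALLS}")
```

The catch is the one from the comment: the resident copy holds its memory whether or not requests are coming in, so this only pays off if there's enough traffic to keep it busy.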