by jychang
236 days ago
They definitely won't idle out. If they did, it would take up to about 60 seconds to load the model back into VRAM, depending on the model. That's an eternity for a request. I highly doubt they will time out any model they serve.
|
Let's say 10 GPUs are in use. You keep another 3 with the model loaded. If demand increases slowly, you slowly increase your headroom. If demand increases rapidly, you increase it rapidly as well.
The correct way to do this is more complicated, and you should model it on your usage history, but if you have sufficient headroom then very few should be left idle. Remember that these models handle requests in batches.
If they don't time out models, they're throwing money down the drain. Though that wouldn't be uncommon.
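To make the headroom idea concrete, here is a rough sketch in Python. It is only an illustration of the "10 in use, keep 3 more loaded" rule, not how any provider actually does it. The function name `target_loaded_gpus`, the 30% default, and the "match the last interval's growth" rule are all assumptions made up for this example; a real system would fit these to usage history.

```python
def target_loaded_gpus(in_use, prev_in_use, headroom_pct=30, min_spare=1):
    """Return how many GPUs should have the model loaded.

    in_use       -- GPUs currently serving requests
    prev_in_use  -- GPUs serving requests at the previous check
    headroom_pct -- spare capacity as a percentage of in_use (assumed value)
    min_spare    -- never keep fewer than this many warm spares (assumed value)
    """
    # Slow growth: keep a fixed fraction of spare GPUs, rounded up
    # using integer arithmetic to avoid floating-point surprises.
    base_spare = (in_use * headroom_pct + 99) // 100

    # Rapid growth: keep at least as many spares as demand grew by
    # since the last check, so a repeat of that jump is already covered.
    growth = max(0, in_use - prev_in_use)

    spare = max(min_spare, base_spare, growth)
    return in_use + spare


# Steady demand: 10 in use, 3 kept warm.
print(target_loaded_gpus(10, 10))
# Demand jumped from 4 to 10: headroom grows to match the jump.
print(target_loaded_gpus(10, 4))
```

Anything loaded beyond the returned target is a candidate for unloading, which is where the timeout comes in.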