by svachalek
237 days ago
Models take a lot of VRAM, which is tightly coupled to the GPU, so yeah, the GPU is basically sitting there with the model loaded, waiting for use. They probably do idle out, but even a few minutes of idle time per load is a lot of waste--possibly the full 82% mentioned. In this case they optimized by letting each GPU load multiple models and sharing the load out by token.
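To put rough numbers on the idle-time point, here is a back-of-the-envelope sketch. The function name `idle_waste_fraction` and the 40 s / 180 s figures are made up for illustration, not taken from the article; they only show how a short burst of work followed by a few minutes of idle timeout could land near a figure of that size.

```python
def idle_waste_fraction(busy_seconds: float, idle_seconds: float) -> float:
    """Fraction of the time a model occupies a GPU that is spent idle."""
    total = busy_seconds + idle_seconds
    if total <= 0:
        raise ValueError("total time must be positive")
    return idle_seconds / total


# Hypothetical numbers: 40 s of actual generation, then a 180 s idle
# timeout before the model is unloaded from VRAM.
waste = idle_waste_fraction(busy_seconds=40, idle_seconds=180)
print(f"{waste:.0%} of the reserved GPU time was idle")  # prints "82% ..."
```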
That's an eternity for a request. I highly doubt they will time out any model they serve.