Alibaba Cloud claims to reduce Nvidia GPU use for serving unpopular models by 82% (emphasis mine)

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.
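As a quick sanity check, the 82% figure does follow from the two GPU counts quoted above. The variable names are mine; only the two numbers come from the article.

```python
# GPU counts quoted above: before and after the change.
gpus_before = 1192
gpus_after = 213

# Fraction of GPUs no longer needed for those requests.
reduction = (gpus_before - gpus_after) / gpus_before
print(f"{reduction:.1%}")  # 82.1%
```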
I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?
(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)
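To make the "free it after a couple of minutes" idea concrete, here is a minimal sketch of an idle-timeout policy. It is purely illustrative and not how Alibaba's system works: `GpuPool`, `handle_request`, `release_idle`, and the 120-second timeout are all hypothetical names and values I made up, and a real scheduler would also have to weigh model load time against the cost of holding a GPU.

```python
from dataclasses import dataclass, field

IDLE_TIMEOUT_S = 120.0  # assumed value for "a couple of minutes"


@dataclass
class GpuPool:
    """Hypothetical pool: one GPU per loaded model, freed when idle too long."""

    idle_timeout_s: float = IDLE_TIMEOUT_S
    # Maps model name -> timestamp (seconds) of its most recent request.
    last_used: dict = field(default_factory=dict)

    def handle_request(self, model: str, now: float) -> bool:
        """Record a request; return True if the model had to be loaded (cold start)."""
        cold_start = model not in self.last_used
        self.last_used[model] = now
        return cold_start

    def release_idle(self, now: float) -> list:
        """Free every model that has been idle longer than the timeout."""
        idle = [m for m, t in self.last_used.items() if now - t > self.idle_timeout_s]
        for m in idle:
            del self.last_used[m]
        return idle


pool = GpuPool()
pool.handle_request("popular", now=0.0)
pool.handle_request("rare", now=0.0)
pool.handle_request("popular", now=100.0)

freed = pool.release_idle(now=130.0)             # only "rare" has been idle > 120 s
reloaded = pool.handle_request("rare", now=140.0)  # next request pays the load cost again
print(freed, reloaded)
```

The trade-off shows up in the last call: freeing the GPU saves hardware, but the next request for the rare model is a cold start.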