by hinkley
236 days ago
That is a confusing coincidence, but no.

> Reserving full GPU instances for these models leads to allocating 17.7% of our GPUs to serve only 1.35% of requests

> Deployment results show that Aegaeon reduces the number of GPUs required for serving these models from 1,192 to 213, highlighting an 82% GPU resource saving.

So 82.3% of their GPUs were serving 98.65% of all traffic. If they reduced the cluster size, about 96.3% of their GPUs now serve that 98.65% of traffic. If they reallocated the freed GPUs, which is more likely, then 96.8% of their GPUs serve it, or around 18% more capacity for popular requests on the same hardware.
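
For anyone who wants to check the arithmetic, here is a rough Python sketch. It assumes the 1,192 GPUs in the second quote are the same pool as the 17.7% in the first quote, which the quotes do not state outright. The function `gpu_shares` and its parameter names are made up for illustration.

```python
def gpu_shares(unpopular_share=0.177, gpus_before=1192, gpus_after=213):
    """Estimate what share of GPUs serves the popular models.

    Assumption (not stated in the quotes): the `gpus_before` GPUs are
    exactly the `unpopular_share` fraction of the whole cluster.
    """
    total = gpus_before / unpopular_share   # implied cluster size
    popular = total - gpus_before           # GPUs serving popular models
    freed = gpus_before - gpus_after        # GPUs saved on rarely used models

    return {
        # Share of GPUs that served popular models before the change.
        "before": popular / total,
        # Case 1: the freed GPUs are removed, so the cluster shrinks.
        "shrink": popular / (total - freed),
        # Case 2: the freed GPUs are handed to the popular models.
        "reallocate": (popular + freed) / total,
        # Extra capacity for popular models in case 2, relative to before.
        "capacity_gain": freed / popular,
    }


if __name__ == "__main__":
    for name, value in gpu_shares().items():
        print(f"{name}: {value:.1%}")
```

The absolute cluster size cancels out of every ratio, so only the 17.7% share and the 1,192 to 213 reduction matter. Rounding the inputs earlier (for example using 82% instead of 213/1,192) can shift the last digit.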