

by hinkley 236 days ago

That is a confusing coincidence, but no.

> Reserving full GPU instances for these models leads to allocating 17.7% of our GPUs to serve only 1.35% of requests

> Deployment results show that Aegaeon reduces the number of GPUs required for serving these models from 1,192 to 213, highlighting an 82% GPU resource saving.

So about 82% of their GPUs were serving 98.6% of all traffic. If they shrank the cluster by the saved GPUs, roughly 96.3% of the remaining GPUs now serve that 98.6% of traffic. If they reallocated the freed GPUs instead, which is more likely, then about 96.8% of their GPUs serve 98.6% of all requests, or around 18% more capacity for popular requests on the same hardware.
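The arithmetic can be checked with a short sketch. It assumes the quoted 17.7% is a share of the whole fleet and that the 1,192 to 213 reduction applies proportionally to that share; the function and variable names are made up for illustration and are not from the paper.

```python
def popular_gpu_shares(tail_share, gpus_before, gpus_after):
    """Share of GPUs serving popular models under each scenario.

    tail_share: fraction of the fleet reserved for rarely requested models
    gpus_before / gpus_after: GPUs needed for those models before and after
    Assumption: the reduction scales the tail share proportionally.
    """
    popular = 1.0 - tail_share                       # fleet share serving popular models today
    tail_after = tail_share * gpus_after / gpus_before  # tail share after the saving

    # Scenario 1: the freed GPUs are removed, so the cluster shrinks.
    shrunk = popular / (popular + tail_after)

    # Scenario 2: the freed GPUs are reassigned to popular models.
    reallocated = 1.0 - tail_after

    # Relative capacity gain for popular models in scenario 2.
    gain = reallocated / popular - 1.0
    return popular, shrunk, reallocated, gain


# Figures taken from the two quoted passages.
popular, shrunk, reallocated, gain = popular_gpu_shares(0.177, 1192, 213)
print(f"popular share today:       {popular:.1%}")
print(f"popular share if shrunk:   {shrunk:.1%}")
print(f"popular share if realloc.: {reallocated:.1%}")
print(f"relative capacity gain:    {gain:.1%}")
```

Under those assumptions the results land within rounding of the percentages above; the gain is relative to the current popular-model capacity, not a percentage-point difference.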