by stygiansonic
787 days ago
This is probably using their excess capacity, which doesn't necessarily mean their GPUs are idle. For LLMs and other large models, the dominant cost is the memory operations needed to load each layer's weights during the forward pass. This is why inference at batch size 1 is extremely wasteful: you pay the full memory-op cost but don't use enough compute FLOPs to justify it. You want a batch size high enough that the ratio of compute to memory ops is close to the ratio of compute throughput to memory bandwidth for that GPU. This is usually done by batching together multiple user requests. At times of low usage there is excess capacity because the batch size is below this “optimal” point. So they can slot in these “relaxed SLA” requests for little marginal increase in resource usage on their end. Basically, you keep a queue of these requests and use it to “top off” your batch size when you can.

Edit: also, you may not be able to reach the optimal batch size depending on when requests arrive, e.g. you don't want to wait forever to fill up a batch. So again, having a queue of outstanding/delayed requests to serve lets you smooth things out and increase compute utilization.
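To make the batching argument a bit more concrete, here is a rough Python sketch of the two ideas: estimating the batch size at which a forward pass stops being memory-bound, and topping off a batch from a relaxed-SLA queue. Everything here is an assumption for illustration: the function names, the simplified cost model (each weight read once per pass, 2 FLOPs per weight per request, no KV cache or activation traffic), and the round hardware numbers, which are hypothetical and not the specs of any real GPU or of OpenAI's actual scheduler.

```python
from collections import deque


def break_even_batch_size(peak_flops, mem_bandwidth_bytes, bytes_per_param=2):
    """Batch size where a weight-bound forward pass becomes compute-bound.

    Simplified model (an assumption, not a real profiler):
      - each parameter is read from memory once per forward pass
      - each parameter costs 2 FLOPs (multiply + add) per request in the batch
    """
    # FLOPs the hardware can do per byte it can load.
    hw_flops_per_byte = peak_flops / mem_bandwidth_bytes
    # FLOPs one request actually does per byte of weights loaded.
    request_flops_per_byte = 2 / bytes_per_param
    return hw_flops_per_byte / request_flops_per_byte


def build_batch(interactive, relaxed, target_batch_size):
    """Fill a batch with interactive requests first, then top off from the
    relaxed-SLA queue so spare slots are not wasted."""
    batch = []
    while interactive and len(batch) < target_batch_size:
        batch.append(interactive.popleft())
    while relaxed and len(batch) < target_batch_size:
        batch.append(relaxed.popleft())
    return batch


# Hypothetical GPU: 1e15 FLOP/s peak, 2e12 bytes/s memory bandwidth, fp16 weights.
target = break_even_batch_size(1e15, 2e12)  # 500.0 under this model

# Quiet period: only 2 live requests, so 2 of the 4 slots go to queued work.
interactive = deque(["u1", "u2"])
relaxed = deque(["b1", "b2", "b3"])
batch = build_batch(interactive, relaxed, target_batch_size=4)
print(target, batch)  # 500.0 ['u1', 'u2', 'b1', 'b2']
```

Under this toy model, any batch below the break-even size leaves compute on the table, which is exactly the gap the relaxed queue fills. A real serving stack would also have to account for sequence lengths, KV cache memory, and a timeout so interactive requests never wait on the batch to fill.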
I think that, beyond optimizing batch size, massive training clusters tend to benefit from scheduled maintenance periods where everything gets fixed at once, as opposed to rolling fixes (since you either need everything to be working or you need to restart the training window). If OpenAI could interleave batch inference with training-specific hardware downtime, like interconnect maintenance, it would be another basically free source of GPU FLOPs.
[1] https://www.youtube.com/watch?v=PeKMEXUrlq4