by stygiansonic 787 days ago
This is probably using their excess capacity, which doesn’t necessarily mean their GPUs are idle.

For LLMs and other large models, the dominant cost is the memory ops needed to load each layer’s weights during the forward pass. This is why doing inference at batch size 1 is extremely wasteful: you pay the full mem ops cost and don’t use enough compute FLOPs to justify it.

You want a batch size high enough that the workload’s compute:mem ops ratio is close to the ratio for that GPU. This is usually done by batching together multiple user requests.
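
To make the ratio argument concrete, here is a rough back-of-the-envelope sketch. It assumes a weights-only model of cost (about 2 FLOPs per parameter per sequence in the batch, with each parameter loaded from memory once per forward pass) and ignores KV cache and activation traffic. The GPU figures in the usage note are made-up round numbers, not the specs of any real part.

```python
def arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights loaded (weights-only approximation)."""
    # Assumption: one multiply and one add per parameter for each sequence in the batch.
    flops_per_param = 2.0 * batch_size
    return flops_per_param / bytes_per_param


def break_even_batch_size(peak_flops: float, mem_bandwidth: float,
                          bytes_per_param: float = 2.0) -> float:
    """Batch size at which the workload's FLOPs:bytes ratio matches the GPU's."""
    gpu_ratio = peak_flops / mem_bandwidth  # FLOPs the GPU can do per byte it can load
    return gpu_ratio * bytes_per_param / 2.0


def compute_utilization(batch_size: int, peak_flops: float, mem_bandwidth: float,
                        bytes_per_param: float = 2.0) -> float:
    """Fraction of peak compute usable while the forward pass is memory-bound."""
    gpu_ratio = peak_flops / mem_bandwidth
    return min(1.0, arithmetic_intensity(batch_size, bytes_per_param) / gpu_ratio)
```

With a hypothetical GPU at 1e15 FLOP/s and 2e12 bytes/s of memory bandwidth and 2-byte weights, `break_even_batch_size(1e15, 2e12)` comes out to 500, while `compute_utilization(1, 1e15, 2e12)` is about 0.2%, which is the "extremely wasteful" case at batch size 1.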

At times of low usage there is excess capacity because the batch size is below the level needed to hit this “optimal” ratio. So they can slot in these “relaxed SLA” requests for little marginal increase in resource usage on their end.

Basically have a queue of these requests that you use to “top off” your batch size when you can.

Edit: also, you may not be able to reach the optimal batch size depending on when the requests arrive, e.g. you don’t want to wait forever to fill up a batch. So again, having a queue of outstanding/delayed requests to serve allows for smoothing things out and increasing compute utilization.
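
The “top off” idea can be sketched as a tiny batch builder. This is only an illustration of the scheduling policy described above, not how any provider actually implements it; `build_batch` and the two queue names are hypothetical.

```python
from collections import deque


def build_batch(interactive: deque, relaxed: deque, target_batch_size: int) -> list:
    """Fill a batch with latency-sensitive requests first, then top off from the relaxed-SLA queue."""
    batch = []
    # Interactive requests always get priority so their latency is unaffected.
    while interactive and len(batch) < target_batch_size:
        batch.append(interactive.popleft())
    # Any remaining slots are filled from the backlog of delayed requests.
    while relaxed and len(batch) < target_batch_size:
        batch.append(relaxed.popleft())
    return batch


interactive_queue = deque(["chat-1", "chat-2"])
relaxed_queue = deque(["batch-1", "batch-2", "batch-3"])
batch = build_batch(interactive_queue, relaxed_queue, target_batch_size=4)
```

Here the two interactive requests leave two free slots, so two relaxed requests ride along in the same forward pass and the third waits for the next batch. The server never has to wait for more interactive traffic to arrive, because the backlog is always available to fill the gap.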

4 comments

This awesome talk [1] from OpenAI covers this topic quite a bit. One useful takeaway is that GPU compute is basically static: gone are the days of autoscaling, as there is nothing to autoscale to.

I think that beyond optimizing batch size, massive training clusters tend to benefit from scheduled maintenance periods where everything gets fixed at once, rather than rolling fixes (as you either need everything to be working or you need to restart the training window). If OpenAI could interleave batch inference with training-specific hardware downtime, like interconnect maintenance, it would be another basically free source of GPU FLOPS.

[1] https://www.youtube.com/watch?v=PeKMEXUrlq4

Hey there, this comment is super insightful. I'd love to talk to you about this some more, but you don't have contact info. If you're interested my contact info is in my profile.
Thanks - I added my contact info (I don’t comment a lot on HN, mostly just read) but will drop you a line
I wonder if they have a clear hardware separation between each of the API, ChatGPT, their lower-scale experiments and their large scale (e.g. GPT5) training hardware. Or is everything just a big blob of hardware, that dynamically gets allocated to jobs depending on demand?

Hardware demand is so high that having GPUs idling is a massive waste, but you also want separation between dev, test and prod environments, so it’s not obvious what to do.

Yeah this makes sense. I do wonder though how it changes the dynamics around provisioned capacity, if at all.
It reduces the need. If they can get non-latency-sensitive users onto this API, then they only need to be provisioned to support their max interactive query load (ChatGPT) rather than peak API load, which can be arbitrarily high (however fast the program generating the load can run). The lower pricing should move users across quite fast, and the higher efficiency will free up hardware and reduce the rate at which they need to grow it.
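
A toy calculation of that provisioning argument, with entirely invented load numbers: `required_capacity` is a hypothetical helper, and it makes the simplifying assumption that deferrable work can run in any time slot of the window.

```python
def required_capacity(interactive: list, deferrable: list, defer: bool) -> float:
    """Capacity needed to serve both load traces over one completion window."""
    if not defer:
        # Everything is served on arrival, so provision for the combined peak.
        return max(i + d for i, d in zip(interactive, deferrable))
    # Deferrable work is served from slack, so start from the interactive peak.
    capacity = max(interactive)
    slack = sum(capacity - i for i in interactive)
    backlog = sum(deferrable)
    if slack < backlog:
        # Not enough slack: raise capacity evenly until the backlog fits.
        capacity += (backlog - slack) / len(interactive)
    return capacity


interactive_load = [10, 40, 100, 30]  # made-up units of work per time slot
api_load = [20, 20, 50, 10]           # made-up deferrable API work per time slot
on_arrival = required_capacity(interactive_load, api_load, defer=False)
with_deferral = required_capacity(interactive_load, api_load, defer=True)
```

In this example serving everything on arrival needs 150 units of capacity, while deferring the API work into off-peak slack needs only the interactive peak of 100.
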
That's the way it seems to me as well. Curious too about the business implications. My guess is that they wanted to bite the bullet and commit to provisioned capacity but wanted to do so in a way that didn't require massive overprovisioning for API requests.
They're well beyond that point now I guess. MS has been building whole datacenters just for OpenAI.