by cavisne 790 days ago
This awesome talk [1] from OpenAI covers this topic quite a bit. One useful takeaway is that GPU compute is basically static: gone are the days of autoscaling, as there is nothing to autoscale to.

I think that, beyond optimizing batch size, massive training clusters tend to benefit from scheduled maintenance periods where everything gets fixed at once, rather than rolling fixes (since you either need everything to be working or you need to restart the training window). If OpenAI could interleave batch inference with training-specific hardware downtime, such as interconnect maintenance, it would be another basically free source of GPU FLOPS.
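
To make the idea concrete, here is a rough sketch of what that interleaving could look like. Everything in it is hypothetical: the `Window` and `Job` types, the `fill_windows` helper, and all the numbers are made up for illustration and say nothing about how OpenAI actually schedules work. It assumes that during an interconnect maintenance window each GPU is still usable on its own, so batch inference jobs that need no cross-node communication can run while training is paused.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    """Hypothetical maintenance window: interconnect is down, GPUs sit idle."""
    name: str
    gpu_hours: float  # idle GPU-hours available in this window
    jobs: list = field(default_factory=list)


@dataclass
class Job:
    """Hypothetical batch inference job that needs no cross-node interconnect."""
    name: str
    gpu_hours: float


def fill_windows(windows, jobs):
    """Greedy first-fit-decreasing packing.

    Places the largest jobs first, each into the first window that still has
    enough idle GPU-hours. Returns the names of jobs that did not fit.
    """
    remaining = {w.name: w.gpu_hours for w in windows}
    unplaced = []
    for job in sorted(jobs, key=lambda j: j.gpu_hours, reverse=True):
        for w in windows:
            if remaining[w.name] >= job.gpu_hours:
                w.jobs.append(job.name)
                remaining[w.name] -= job.gpu_hours
                break
        else:
            # No window had room for this job.
            unplaced.append(job.name)
    return unplaced


# Made-up example data.
windows = [Window("interconnect-maint-a", 100.0), Window("interconnect-maint-b", 40.0)]
jobs = [Job("embeddings", 60.0), Job("evals", 45.0), Job("summaries", 30.0), Job("labels", 20.0)]
unplaced = fill_windows(windows, jobs)
```

Jobs that do not fit would simply wait for the next window. A greedy packer like this is only a stand-in; a real scheduler would also have to handle preemption when maintenance finishes early.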

[1] https://www.youtube.com/watch?v=PeKMEXUrlq4