by dekhn 1604 days ago
Training a state-of-the-art model typically involves keeping a very large computer running at near 100% power load, roughly 10 MW.

The actual limits on DL models (and on any simulation or optimization) are power density, the speed of light, and the maximum amount of power you can deliver to the site. The speed of light limits how long your cables can be while still doing collective reductions, and power density limits how much compute you can fit per unit volume. One could imagine a fully liquid-cooled supercomputer at 100 MW (located near a very reliable and large power source) with optical fiber interconnect. This would completely change the state of the art in large models overnight.
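To put a rough number on the speed-of-light point, here is a back-of-the-envelope sketch in Python. The refractive index of 1.47 is my assumption for typical silica fiber, and the function names are made up for illustration. It only counts propagation delay and ignores switch, serialization, and software overhead, so real latencies will be higher.

```python
# Propagation delay over optical fiber: a sketch under stated assumptions.
C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum (exact by definition)
FIBER_REFRACTIVE_INDEX = 1.47    # assumption: typical silica fiber


def one_way_delay_ns(cable_length_m: float,
                     refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """One-way propagation delay in nanoseconds for a fiber of the given length."""
    speed_in_fiber = C_VACUUM_M_PER_S / refractive_index
    return cable_length_m / speed_in_fiber * 1e9


def max_cable_length_m(latency_budget_ns: float,
                       refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Longest fiber run whose one-way propagation delay fits in the budget."""
    speed_in_fiber = C_VACUUM_M_PER_S / refractive_index
    return latency_budget_ns * 1e-9 * speed_in_fiber


if __name__ == "__main__":
    for length_m in (1, 10, 100, 1000):
        print(f"{length_m:>5} m of fiber: {one_way_delay_ns(length_m):8.1f} ns one way")
```

That works out to roughly 5 ns per meter, so a reduction step that has to cross a large machine room pays hundreds of nanoseconds per hop before any hardware or software overhead is counted.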

1 comment

All true.

I cannot cite a source here, but it is generally believed that the effective GPU utilization in AI training clusters that are nominally "100% utilized" is quite poor, around 23%-26%, due to data movement, non-essential serial execution, and scheduling issues. So at least for now there is low-hanging fruit for getting more performance out of the capital expense.
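To illustrate why that counts as low-hanging fruit, a small sketch of the arithmetic. The 50% target is purely hypothetical and picked for illustration, and `speedup_from_utilization` is a made-up helper. It assumes throughput scales linearly with effective utilization on the same hardware.

```python
# Throughput gain from raising effective utilization on fixed hardware.
def speedup_from_utilization(current: float, target: float) -> float:
    """Ratio of useful throughput at `target` utilization vs `current`.

    Both arguments are fractions in (0, 1]. Assumes throughput is
    proportional to effective utilization (an idealization).
    """
    if not (0 < current <= 1 and 0 < target <= 1):
        raise ValueError("utilization must be a fraction in (0, 1]")
    return target / current


if __name__ == "__main__":
    hypothetical_target = 0.50  # assumption, not a figure from the thread
    for current in (0.23, 0.26):
        gain = speedup_from_utilization(current, hypothetical_target)
        print(f"{current:.0%} -> {hypothetical_target:.0%}: {gain:.2f}x useful work")
```

Under those assumptions, merely getting to half of peak would roughly double the useful work from the same CAPEX and power budget.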

Long term, though, DL clusters are basically CAPEX and energy limited.

IMHO, for now, return on the investment is not really a limiting factor, but it will become one once the shine is off the field.