by foobiekr
1604 days ago
All true. I cannot cite a source here, but it is generally believed that the effective GPU utilization in AI training clusters that are nominally "100% utilized" is quite poor, around 23%-26%, because of data movement, non-essential serial execution, and scheduling issues. So at least for now there is low-hanging fruit for getting more performance out of the capital already spent. Long term, though, DL clusters are basically limited by CAPEX and energy. IMHO, return on investment is not really a limiting factor for now, but it will become one once the shine is off the field.