| A couple of interesting factors are pushing liquid cooling into the commercial market (it has always been prominent in the TOP500), and here is why I think this time it’s different. First, we are designing smaller transistors, which leak more and thus increase chip thermal design power (TDP). We are also using bigger chips and chiplets, which increases TDP further. GPU TDP went from 200 W to 1 kW in the past 5 years, and even CPUs are now close to 500 W.
At 25k a pop, avoiding throttling due to heat is a must, and that is becoming incredibly hard with air cooling because air is an amazing insulator. Second, networking is extremely expensive, with top-end InfiniBand cable costing close to $1,000 a foot. This means you want to cram your CPUs/GPUs as close together as possible, keeping connections electrical where you can and minimizing optical cable lengths. This also decreases latency and increases cluster performance. On a 50,000-cable deployment, the savings can be … significant. Third, we are now seeing the emergence of foundation models that require government-lab levels of interconnected compute. This is a new paradigm in the private sector, and these clusters consume power in the 20-30 MW range. Going from a PUE of 1.2 to 1.1 saves power equal to about 10% of the IT load, which brings huge savings on the power bill plus important environmental benefits. It also becomes much easier to recover heat. That’s why we’re seeing major announcements at OCP, e.g. Meta going the liquid cooling way. I personally believe that’s where things are going, and we are building a facility to optimize for the above points: https://www.qscale.com/ |
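For anyone who wants to put rough numbers on the PUE point, here is a back-of-the-envelope sketch. It assumes the 20-30 MW figure is IT load rather than total facility draw, that the cluster runs flat out all year, and it uses a made-up electricity price of $50/MWh purely as a placeholder; the function name and parameters are illustrative, not from any real tool.

```python
HOURS_PER_YEAR = 8760  # assumes a constant load all year


def pue_savings(it_load_mw, pue_before, pue_after, price_per_mwh):
    """Estimate the effect of a PUE improvement.

    PUE = total facility power / IT power, so facility power is
    it_load_mw * pue. All inputs here are assumptions for illustration.
    """
    # Power no longer spent on cooling and other overhead
    saved_mw = it_load_mw * (pue_before - pue_after)
    # Energy saved over a full year at constant load
    saved_mwh = saved_mw * HOURS_PER_YEAR
    # Savings as a share of the original total facility power
    share_of_facility = saved_mw / (it_load_mw * pue_before)
    return {
        "saved_mw": saved_mw,
        "saved_mwh_per_year": saved_mwh,
        "saved_cost_per_year": saved_mwh * price_per_mwh,
        "share_of_facility": share_of_facility,
    }


if __name__ == "__main__":
    # 25 MW is the midpoint of the 20-30 MW range; $50/MWh is hypothetical
    result = pue_savings(it_load_mw=25, pue_before=1.2, pue_after=1.1,
                         price_per_mwh=50)
    for key, value in result.items():
        print(f"{key}: {value:,.3f}")
```

Under those assumptions the saving is about 2.5 MW, which is 10% of the IT load but closer to 8% of the original facility draw. Plug in your own load and power price; the dollar figure scales linearly with both.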
Wait... wat? I thought InfiniBand just used fibre!