HN Mirror

by karmasimida 1612 days ago

> Meta’s AI supercomputer houses 6,080 Nvidia graphics-processing units ..... By mid-summer, when the AI Research SuperCluster is fully built, it will house some 16,000 GPUs

Honestly ... this is a lot of GPUs ... but is it the biggest...?

> Model training is done with mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer powered by 560 DGX A100 servers networked with HDR InfiniBand in a full fat tree configuration. Each DGX A100 has eight NVIDIA A100 80GB Tensor Core GPUs

So Nvidia used 4480 GPUs to train Megatron-Turing NLG 530B for example.
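
The 4480 figure follows directly from the numbers in the quote. As a quick sanity check, here is a minimal Python sketch; the variable names are illustrative, and the Meta figures are the ones from the first quote above.

```python
# GPU counts taken from the two quotes above.
DGX_SERVERS = 560        # DGX A100 servers in Selene
GPUS_PER_DGX = 8         # A100 GPUs per DGX A100

selene_gpus = DGX_SERVERS * GPUS_PER_DGX
print(selene_gpus)       # 4480

# Meta's AI Research SuperCluster, today and when fully built.
rsc_now = 6080
rsc_planned = 16000

# Ratios by raw GPU count only; this says nothing about achieved throughput.
print(rsc_now / selene_gpus)
print(rsc_planned / selene_gpus)
```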

3 comments

vl 1612 days ago

Honestly, this single GPU-based install is child's play compared to Google's multiple TPU exaflop supercomputers with hyper-cube optical interconnects. Google's ML setups allow synchronous weight updates on thousand+ TPUs...

rawtxapp 1612 days ago

TPUs are amazing, but in my experience, debugging issues with them can be a bit tricky. Since Nvidia's GPUs are more commonplace (especially outside GCP), you can find a lot more information when you get stuck, they're also more battle-tested, etc.

6gvONxR4sf7o 1612 days ago

For what it's worth, JAX is helpful to me here. You can drop out of the jit to debug it as if it were NumPy.

Of course that assumes your issues aren't with the jit itself or inside pmap, etc. That shit's hard.
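
One way to "drop out of the jit" is the `jax.disable_jit()` context manager. This is a minimal sketch, not necessarily the workflow the comment has in mind; `scaled_norm` is a made-up function for illustration.

```python
import jax
import jax.numpy as jnp


@jax.jit
def scaled_norm(x):
    y = x * 2.0
    # Under jit this print would show an abstract tracer.
    # With jit disabled, `y` is a concrete array you can inspect.
    print("intermediate:", y)
    return jnp.sqrt(jnp.sum(y ** 2))


x = jnp.array([3.0, 4.0])

# Inside this block the jitted function runs eagerly, op by op,
# so print statements and a debugger see real values.
with jax.disable_jit():
    result = scaled_norm(x)

print(result)
```

As the caveat above says, this only helps when the bug is in your own numerical code rather than in the compilation step or inside `pmap`.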

alex_sf 1612 days ago

Tbh I thought I was being trolled with 'hyper-cube optical interconnects'.

vl 1612 days ago

Actually, you are right, I mistyped. Although hypercube interconnects exist and were used, for example, in the AS/400, the system in question uses a hypertorus topology.

cm2012 1612 days ago

For what it's worth, for attention-based advertising (YouTube and display, not search), FB targeting blows Google out of the water. Not sure why, but it's consistent across brands.

mchusma 1612 days ago

I have seen this myself, but I'm unsure if it's just an "ad quality" thing. For example, I can target exact placements on YouTube for my exact niche, and broad Facebook matching will outperform. I have tried YouTube and display for months with nothing within an order of magnitude as effective as Facebook.

dekhn 1612 days ago

For TPUv3 it's a 2D torus, not a hyper-cube, right? Not sure if the TPUv4 topology is externally published, but IIRC hypercubes are basically never used any more.

vl 1612 days ago

I mistyped: one version is a 2D torus, the next is a 3D torus, aka hypertorus.
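
To make the torus vs. hypercube distinction concrete, here is a rough Python sketch that lists a node's direct neighbors in a k-dimensional torus. The function name and grid sizes are hypothetical and are not real TPU pod dimensions.

```python
from typing import List, Tuple


def torus_neighbors(coord: Tuple[int, ...], dims: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Direct neighbors of `coord` in a torus with the given side lengths."""
    if len(coord) != len(dims):
        raise ValueError("coord and dims must have the same number of axes")
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            # Wraparound links are what turn a plain mesh into a torus.
            nxt[axis] = (coord[axis] + step) % size
            neighbors.add(tuple(nxt))
    # An axis of size 1 wraps a node onto itself; drop that self-link.
    neighbors.discard(tuple(coord))
    return sorted(neighbors)


# 2D torus: every node has 4 neighbors, including the wraparound ones.
print(torus_neighbors((0, 0), (4, 4)))   # [(0, 1), (0, 3), (1, 0), (3, 0)]

# 3D torus: every node has 6 neighbors.
print(len(torus_neighbors((1, 1, 1), (4, 4, 4))))   # 6
```

With side length 2 on every axis, the +1 and -1 steps land on the same node, so each node has one neighbor per dimension, which is the hypercube case.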

sailingparrot 1612 days ago

At 16k it will definitely be the biggest.

As for today, Nvidia has the very slightly smaller cluster that you outlined at ~5k, Microsoft has a few of roughly that size, and Microsoft also built a 10k GPU cluster for OpenAI 2 years ago, but those are V100 GPUs.

So, is 6k A100 "bigger" than 10k V100? Depends on exactly how you use them: in a perfect usage scenario yes, slightly. In real life, maybe not.

dekhn 1612 days ago

Systems like this are designed to reach nearly peak performance (i.e. # of flops per processing element * # of processing elements), explicitly by making a network that won't block or increase latency for the common expensive operations (allreduce, all-to-all), at the expense of greatly increased cost.
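
The peak formula in parentheses can be sketched in a few lines of Python. The per-GPU figure is an assumption taken from NVIDIA's A100 datasheet (312 TFLOPS dense FP16/BF16 on tensor cores), and the `utilization` parameter and the 50% value are hypothetical knobs for illustration, not measurements.

```python
def aggregate_peak_eflops(n_gpus: int, per_gpu_tflops: float, utilization: float = 1.0) -> float:
    """Peak = flops per processing element * number of elements, in exaflops."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    # 1 EFLOPS = 1e6 TFLOPS
    return n_gpus * per_gpu_tflops * utilization / 1e6


# Assumption: A100 datasheet figure for dense FP16/BF16 tensor-core math.
A100_DENSE_FP16_TFLOPS = 312.0

current = aggregate_peak_eflops(6080, A100_DENSE_FP16_TFLOPS)
planned = aggregate_peak_eflops(16000, A100_DENSE_FP16_TFLOPS)
print(f"6,080 GPUs:  {current:.2f} EFLOPS peak")
print(f"16,000 GPUs: {planned:.2f} EFLOPS peak")

# Hypothetical 50% utilization, to show how far achieved numbers can sit from peak.
print(f"at 50%: {aggregate_peak_eflops(6080, A100_DENSE_FP16_TFLOPS, 0.5):.2f} EFLOPS")
```

This is a theoretical ceiling only. What a real training job sustains depends on the model, the precision, and how well the code keeps the GPUs fed.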

The point of making this machine is to have a lot of A100s going at the same time, and that will unblock some small set of researchers who are working on time-sensitive competitive research projects by giving them a slight throughput and latency advantage on the largest problems. The vast majority of users would be better served by a small number of cheaper, slower GPUs that they had exclusive access to for the longest time period they could afford to wait.

sailingparrot 1612 days ago

> Systems like this are designed to reach nearly peak performance

The system certainly is. The code running on that system generally isn't. Pulling 100% of the FLOPS the GPUs are able to provide is quite hard.

And my point was that it also depends on the specific models you are training. Are you training a transformer model in FP32 precision? Then yes, 6K A100s will blow away 10K V100s. Are you training a ConvNet in FP16? Then no, 10K V100s will perform better.

The GPUs have different architectures. You have to use the model architecture best suited to the A100 to achieve the speedup marketed by Nvidia, which is presumably the number FB is using to claim that their 6k GPU cluster is bigger than OpenAI's 10K one.

buildbot 1612 days ago

Probably not, as Azure was at 10K last year: https://blogs.microsoft.com/ai/openai-azure-supercomputer/
