by chollida1 930 days ago
I understand that the H100 is NVIDIA's leading-edge chip, but can someone let me know if 10K is considered to be a big cluster?

I've never worked inside one of the leading edge AI companies like OpenAI, Google, Microsoft or Meta.

Is this comparable to what they would work with?

My first guess is that it seems much smaller. And if you are running many parallel training jobs then you are getting about 1,000 chips at most to work with.

Or is this about what the leading competitors are working with?

Azure, for one, seems to have orders of magnitude more chips at their disposal.


10k H100 chips is considered a very large cluster. The third-fastest supercomputer in the world is Microsoft’s Eagle, with 14k H100s: https://www.top500.org/lists/top500/2023/11/
Ah, gotcha, so it's the fact that it's 10,000 chips in one dedicated cluster that makes it large, as opposed to Azure, which has an order of magnitude more GPUs but rents many of those out.
High performance on a single task requires simultaneous computation and communication between nodes. If there's high latency between nodes, such as between nodes in different data centers, the communication costs can't be masked by computation.
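
To put rough numbers on that, here is a minimal sketch using the textbook ring all-reduce cost model. Every number in it (node count, gradient size, link bandwidth, latencies, compute time) is an assumption picked for illustration, not something from this thread, and real clusters use hierarchical collectives rather than one flat ring.

```python
def ring_allreduce_seconds(num_nodes, message_bytes, bandwidth_bytes_per_s, latency_s):
    """Textbook ring all-reduce estimate: 2*(n-1) steps, each moving message/n bytes."""
    steps = 2 * (num_nodes - 1)
    per_step_bytes = message_bytes / num_nodes
    return steps * (latency_s + per_step_bytes / bandwidth_bytes_per_s)


def step_seconds(compute_s, comm_s):
    """With perfect overlap, a training step takes as long as the slower of the two."""
    return max(compute_s, comm_s)


# Assumed, illustrative values only.
NODES = 1024
GRAD_BYTES = 2e9      # 2 GB of gradients exchanged per step
BANDWIDTH = 50e9      # 50 GB/s per link, i.e. roughly 400 Gbit/s
COMPUTE_S = 0.5       # compute time per step

same_dc = ring_allreduce_seconds(NODES, GRAD_BYTES, BANDWIDTH, latency_s=5e-6)
cross_dc = ring_allreduce_seconds(NODES, GRAD_BYTES, BANDWIDTH, latency_s=20e-3)

print(f"same datacenter:  comm {same_dc:.3f}s, step {step_seconds(COMPUTE_S, same_dc):.3f}s")
print(f"cross datacenter: comm {cross_dc:.3f}s, step {step_seconds(COMPUTE_S, cross_dc):.3f}s")
```

With microsecond latency the communication fits under the compute time and is hidden. With tens of milliseconds per hop the latency term dominates and the step time is set by the network.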
I guess Azure's are spread out too, with higher latency between worldwide datacentres.
I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)

That said, these GPUs aren't just the GPUs. They are whole chassis: huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc...

None of it is easy.

Out of interest, what did you use all that compute for?
ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted, not many people on the planet have run that much compute.

I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even same batch!), can produce wildly different performance results.
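
For readers wondering what per-GPU tuning looks like in principle, here is a toy sketch. It is not the method described above: the benchmark function, the single memory-clock knob, and the clock values are all hypothetical stand-ins, and a real setup would measure actual hardware through the driver.

```python
import random


def simulated_hashrate(device_seed, mem_clock_mhz):
    """Hypothetical stand-in for a real benchmark: each device peaks at a different clock."""
    rng = random.Random(device_seed)
    best_clock = rng.randint(1900, 2200)  # per-device sweet spot (made up)
    return 30.0 - ((mem_clock_mhz - best_clock) / 100.0) ** 2


def tune_device(device_seed, candidate_clocks):
    """Try every candidate setting and keep the one with the best measured score."""
    return max(candidate_clocks, key=lambda c: simulated_hashrate(device_seed, c))


CANDIDATES = range(1800, 2301, 25)  # assumed search range, in MHz
DEFAULT_CLOCK = 2000                # assumed one-size-fits-all setting

tuned = {dev: tune_device(dev, CANDIDATES) for dev in range(8)}
for dev, clock in tuned.items():
    gain = simulated_hashrate(dev, clock) - simulated_hashrate(dev, DEFAULT_CLOCK)
    print(f"gpu{dev}: tuned clock {clock} MHz, gain over default {gain:.2f}")
```

The point of the sketch is only that identical-looking devices can end up with different best settings, so a single fleet-wide default leaves performance on the table.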

Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (rx470-rx580) and the rest of the hardware wasn't that advanced, like it is in AI, so none of it was transferred.

I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.

Make a vid. Or a blog post, at least. Please :)
Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.
Perhaps reach out to a YouTube channel or podcast that could be interested?
Did you manage to recoup the investment?
Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.

Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality to happen.

Classified I imagine.
H100-based DGX/HGX doesn't use 800 Gbit (it doesn't have the PCIe bandwidth); it uses 400 Gbit per GPU.
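
A quick back-of-the-envelope check on the PCIe point, as a sketch. It assumes each NIC hangs off a PCIe Gen5 x16 link (32 GT/s per lane, 128b/130b encoding) and ignores packet and protocol overhead, so the usable figure is somewhat lower than what this prints.

```python
def pcie_gbit_per_s(gt_per_s_per_lane, lanes, encoding_efficiency):
    """Raw link rate per direction after line encoding, in Gbit/s."""
    return gt_per_s_per_lane * lanes * encoding_efficiency


PCIE_GEN5_GT_S = 32.0            # PCIe 5.0 transfer rate per lane
ENCODING_128B_130B = 128 / 130   # PCIe 3.0+ line encoding

# Assumption: one NIC per Gen5 x16 link.
slot_gbit = pcie_gbit_per_s(PCIE_GEN5_GT_S, 16, ENCODING_128B_130B)


def fits_in_slot(nic_gbit):
    """True if a NIC of this line rate can be fed by the assumed x16 link."""
    return nic_gbit <= slot_gbit


print(f"Gen5 x16: {slot_gbit:.1f} Gbit/s per direction")
print("400G NIC fits:", fits_in_slot(400))
print("800G NIC fits:", fits_in_slot(800))
```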
I was talking about between nodes. We're planning on bonding 2x400G NICs to get that 800G between nodes.

That said, the latest 4th-gen NVLink is 900 GB/s per GPU (gigabytes, not gigabits)...

https://www.nvidia.com/en-us/data-center/nvlink/

But unless you're sleeping with Jensen, you're not going to see it for 52 weeks+ after you order it.

Between the GPUs you already have 3.2 Tbit/s, plus the 2x400 separately. Pretty sweet.

Our lead time hasn't been horrible actually, but I work for a pretty big corp

It is amazing to me how it is all about who you know. We just got a higher-level contact and magically NVIDIA NICs just appeared in our BOM.
This is a big cluster, definitely large enough to pretrain 100B+ parameter LLMs in months. Source - I work at Databricks in the ML platform.
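
A rough way to sanity-check that kind of claim is the common 6 * parameters * tokens approximation for training FLOPs. The sketch below is only an estimate under stated assumptions: the token count (2T), the per-GPU peak (about 989 TFLOPS dense BF16 for an H100 SXM), and the 40% utilization are illustrative values, not figures from this thread.

```python
SECONDS_PER_DAY = 86_400


def training_days(params, tokens, gpus, peak_flops_per_gpu, utilization):
    """Estimate wall-clock days using the 6 * N * D training-FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    cluster_flops_per_s = gpus * peak_flops_per_gpu * utilization
    return total_flops / cluster_flops_per_s / SECONDS_PER_DAY


# All values below are assumptions chosen for illustration.
days = training_days(
    params=100e9,               # 100B-parameter model
    tokens=2e12,                # 2T training tokens
    gpus=10_000,
    peak_flops_per_gpu=989e12,  # approximate H100 dense BF16 peak
    utilization=0.4,            # assumed fraction of peak actually achieved
)
print(f"estimated training time: {days:.1f} days")
```

The estimate scales linearly with model size and token count, and it leaves out restarts, evaluation, and failed runs, so real schedules are longer than the raw number.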

I don’t know much about AV processing; that’s highly customized to only a few customers, but I’d expect it to also have very large computational requirements for video processing and reinforcement learning.

The most powerful listed supercomputer has 37,888 Radeon GPUs, so in the same order of magnitude.
Interesting choice of words... I take it you work for OpenAI? :) How large is their/'your' cluster? Probably the biggest in the world by now..
Parent is almost certainly talking about Frontier, the supercomputer with the US Department of Energy[0].

[0] - https://top500.org/system/180047/

Yes, that's "listed".. I'm curious how big the "unlisted" cluster is.
Unfortunately no, but there are almost certainly clusters in the hands of private companies and government organizations that would prefer not to advertise their capabilities.
Last I heard, the estimate was that NVIDIA would build 550k units in 2023, so 2% of all production — especially as at least six others (your four plus Apple and at least one intelligence agency) will be of similar size by themselves — is certainly non-negligible.
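
The arithmetic behind those shares, as a quick sketch. The 550k production estimate and the 10k cluster size come from the thread; treating each of the six other buyers as exactly 10k GPUs is a simplifying assumption.

```python
ANNUAL_UNITS = 550_000   # estimated 2023 production, from the comment above
CLUSTER_GPUS = 10_000    # the cluster under discussion
SIMILAR_CLUSTERS = 7     # this cluster plus six others, assumed equal in size


def share(gpus, total_units):
    """Fraction of total production taken by a given number of GPUs."""
    return gpus / total_units


one_cluster = share(CLUSTER_GPUS, ANNUAL_UNITS)
all_similar = share(SIMILAR_CLUSTERS * CLUSTER_GPUS, ANNUAL_UNITS)

print(f"one 10k cluster: {one_cluster:.1%} of production")
print(f"seven such clusters: {all_similar:.1%} of production")
```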
550k H100s? Who is buying these? They are hella expensive and China isn't allowed to have them.
Other than the ~12% I just estimated, lots of large-but-not-famous places will be buying ~1k, and small places will be buying tens to hundreds, and quite a lot of AI bubble money will be invested in startups that claim they only need one.

Probably some scientific modelling that can be done on these, so I bet some universities and private labs will be buying them. NASA, SpaceX, RocketLab, Helion, etc.

There's also probably a lot of AAA game studios and art studios for movies etc. who are each buying dozens of these graphics processing units for… graphics :P

Government agencies.
The Big Cloud
It's a small cluster the size of a large cluster.