

by chollida1 981 days ago

I understand that the H100 is NVIDIA's leading-edge chip, but can someone let me know if 10K is considered to be a big cluster?

I've never worked inside one of the leading-edge AI companies like OpenAI, Google, Microsoft or Meta.

Is this comparable to what they would work with?

My first guess is that it seems much smaller. And if you are running many parallel training jobs then you are getting about 1,000 chips at most to work with.

Or is this about what the leading competitors are working with?

Azure, for one, seems to have orders of magnitude more chips at their disposal.

6 comments

jeffreyames 981 days ago

10k H100 chips is considered a very large cluster. The third fastest supercomputer in the world is Microsoft’s Eagle, with 14k H100s: https://www.top500.org/lists/top500/2023/11/

chollida1 981 days ago

Ah, gotcha, so it's the fact that it's 10,000 chips in one dedicated cluster that makes it large, as opposed to Azure, which has an order of magnitude more GPUs but rents many of those out.

jeffreyames 981 days ago

High performance on a single task requires simultaneous computation and communication between nodes. If there's high latency between nodes, such as between nodes in different data centers, the communication costs can't be masked by computation.
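The point about masking communication with computation can be illustrated with a toy model. This is only a sketch: it assumes ideal overlap (a step takes as long as the slower of compute and communication), and the timings are hypothetical numbers chosen for illustration, not measurements from any real cluster.

```python
def step_time(compute_s: float, comm_s: float, overlap: bool) -> float:
    """Time for one training step under a deliberately simple model."""
    if overlap:
        # Ideal overlap: communication runs concurrently with computation,
        # so the step takes as long as the slower of the two.
        return max(compute_s, comm_s)
    # No overlap: communication only starts once computation has finished.
    return compute_s + comm_s


compute_s = 0.100  # hypothetical compute time per step, in seconds

# Hypothetical communication times for a fast local fabric and a slow
# link between data centers.
for label, comm_s in [("same cluster", 0.020), ("cross-datacenter", 0.300)]:
    total = step_time(compute_s, comm_s, overlap=True)
    exposed = total - compute_s  # communication time that could not be hidden
    print(f"{label}: step={total:.3f}s, exposed communication={exposed:.3f}s")
```

In this model the communication is fully hidden only while it is shorter than the compute time. Once it is longer, the excess shows up directly in the step time.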

rightbyte 981 days ago

I guess Azure's are spread out too. Latency is higher across worldwide datacentres.

latchkey 981 days ago

I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)

That said, these GPUs aren't just the GPUs. They are whole chassis. They are huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc.

None of it is easy.

LysPJ 981 days ago

Out of interest, what did you use all that compute for?

latchkey 981 days ago

ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted, not many people on the planet have run that much compute.

I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even the same batch!) can produce wildly different performance results.

Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (RX 470 to RX 580) and the rest of the hardware wasn't that advanced, like it is in AI, so none of it was transferred.

I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.

alecco 980 days ago

Make a vid. Or a blog post, at least. Please :)

latchkey 980 days ago

Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.

alecco 977 days ago

Perhaps reach out to a YouTube channel or podcast that could be interested?

ushakov 981 days ago

Did you manage to recoup the investment?

latchkey 981 days ago

Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.

Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality to happen.

peteradio 981 days ago

Classified I imagine.

_zoltan_ 981 days ago

H100-based DGX/HGX doesn't use 800 Gbit (it doesn't have the PCIe bandwidth); it uses 400 Gbit per GPU.

latchkey 981 days ago

I was talking about between nodes. We're planning on bonding 2x400G NICs to get that 800G between nodes.

That said, the latest 4th-gen NVLink is 900 GB/s...

https://www.nvidia.com/en-us/data-center/nvlink/

But unless you're sleeping with Jensen, you're not going to see it for 52 weeks+ after you order it.

_zoltan_ 977 days ago

Between the GPUs you already have 3.2 Tbit/s, plus the 2x400 separately. Pretty sweet.

Our lead time hasn't been horrible actually, but I work for a pretty big corp.

latchkey 977 days ago

It is amazing to me how it is all about who you know. We just got a higher-level contact and magically NVIDIA NICs just appeared in our BOM.

joshhart 981 days ago

This is a big cluster, definitely large enough to pretrain 100B+ parameter LLMs in months. Source - I work at Databricks in the ML platform.

I don’t know much about AV processing; that’s highly customized to only a few customers, but I’d expect it to also have very large computational requirements to do video processing and reinforcement learning.
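One rough way to sanity-check a claim like this is the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below uses it. Every input in the usage example is an assumption made for illustration (token count, a round per-GPU peak of 1e15 FLOP/s, and 40% utilization); none of them comes from the thread.

```python
def training_days(params: float, tokens: float, gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Estimate wall-clock training days using the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6 * params * tokens  # approximate total training compute
    sustained_flops_per_s = gpus * peak_flops_per_gpu * utilization
    seconds = total_flops / sustained_flops_per_s
    return seconds / 86_400  # seconds per day


# All values below are assumptions for illustration only.
days = training_days(
    params=100e9,             # 100B parameters
    tokens=2e12,              # assumed 2T training tokens
    gpus=10_000,
    peak_flops_per_gpu=1e15,  # assumed round figure, not a quoted spec
    utilization=0.4,          # assumed fraction of peak actually sustained
)
print(f"estimated training time: {days:.1f} days")
```

The estimate scales linearly with both parameter count and token count, and inversely with utilization, so larger models, more data, or a less efficient run push the figure up proportionally.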

kcb 981 days ago

The most powerful listed supercomputer has 37,888 Radeon GPUs, so in the same order of magnitude.

jbverschoor 981 days ago

Interesting choice of words... I take it you work for OpenAI? :) How large is their/'your' cluster? Probably the biggest in the world by now..

kkielhofner 981 days ago

Parent is almost certainly talking about Frontier, the supercomputer with the US Department of Energy[0].

[0] - https://top500.org/system/180047/

jbverschoor 981 days ago

Yes, that's "listed".. I'm curious how big the "unlisted" cluster is.

kcb 981 days ago

Unfortunately no, but there are almost certainly clusters in the hands of private companies and government organizations that would prefer not to advertise their capabilities.

ben_w 981 days ago

Last I heard, the estimate was that NVIDIA would build 550k units in 2023, so 2% of all production — especially as at least six others (your four plus Apple and at least one intelligence agency) will be of similar size by themselves — is certainly non-negligible.
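The share arithmetic here can be checked in a few lines. This sketch takes the 550k production estimate and the 10k cluster size from the thread as given, and assumes (as the comment does) that the "at least six others" each run a cluster of about the same size.

```python
def share(gpus: int, total: int) -> float:
    """Fraction of total production taken up by `gpus` units."""
    return gpus / total


cluster_gpus = 10_000   # cluster size discussed in the thread
units_2023 = 550_000    # production estimate quoted in the comment
similar_buyers = 7      # this cluster plus "at least six others" (assumed equal size)

one_cluster = share(cluster_gpus, units_2023)
all_clusters = share(cluster_gpus * similar_buyers, units_2023)

print(f"one cluster: {one_cluster:.1%} of production")
print(f"{similar_buyers} clusters: {all_clusters:.1%} of production")
```

A single 10k cluster comes out a little under 2% of the estimated production, and seven such clusters a little under 13%.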

2OEH8eoCRo0 981 days ago

550k H100s? Who is buying these? They are hella expensive and China isn't allowed to have them.

ben_w 981 days ago

Other than the ~12% I just estimated, lots of large-but-not-famous places will be buying ~1k, and small places will be buying tens to hundreds, and quite a lot of AI bubble money will be invested in startups that claim they only need one.

Probably some scientific modelling that can be done on these, so I bet some universities and private labs will be buying them. NASA, SpaceX, RocketLab, Helion, etc.

There are also probably a lot of AAA game studios and art studios for movies etc. who are each buying dozens of these graphics processing units for… graphics :P

alecco 980 days ago

Government agencies.

ushakov 981 days ago

The Big Cloud

xvilka 981 days ago

It's a small cluster the size of a large cluster.