HN Mirror

by latchkey 982 days ago

I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)

That said, these GPUs aren't just the GPUs. They are whole chassis: huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc.

None of it is easy.

2 comments

LysPJ 982 days ago

Out of interest, what did you use all that compute for?

latchkey 982 days ago

ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted; not many people on the planet have run that much compute.

I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even same batch!), can produce wildly different performance results.
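As a rough illustration of per-GPU tuning (not the commenter's actual tooling, which isn't described in the thread), the sketch below sweeps two hypothetical knobs, core and memory clock, on each card and keeps whichever combination scores best. `autotune`, `fake_benchmark`, `tune_fleet`, and the clock ranges are all invented for the example; a real setup would replace `fake_benchmark` with an actual hashrate or throughput measurement.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Settings = Tuple[int, int]  # (core_mhz, mem_mhz): hypothetical knobs


def autotune(gpu_id: int,
             benchmark: Callable[[int, Settings], float],
             core_clocks: Iterable[int],
             mem_clocks: Iterable[int]) -> Tuple[Settings, float]:
    """Try every knob combination on one GPU and keep the best-scoring one."""
    best_settings, best_score = None, float("-inf")
    for settings in product(core_clocks, mem_clocks):
        score = benchmark(gpu_id, settings)
        if score > best_score:
            best_settings, best_score = settings, score
    return best_settings, best_score


def fake_benchmark(gpu_id: int, settings: Settings) -> float:
    """Stand-in for a real measurement: each card has its own sweet spot."""
    core, mem = settings
    sweet_core = 1100 + 50 * (gpu_id % 3)
    sweet_mem = 1900 + 50 * (gpu_id % 4)
    return 30.0 - abs(core - sweet_core) / 100 - abs(mem - sweet_mem) / 50


def tune_fleet(gpu_ids: Iterable[int]) -> Dict[int, Settings]:
    """Tune each GPU independently, since identical models can differ."""
    cores = range(1000, 1301, 50)
    mems = range(1800, 2101, 50)
    return {g: autotune(g, fake_benchmark, cores, mems)[0] for g in gpu_ids}
```

With the simulated cards above, `tune_fleet(range(4))` ends up with a different setting for each GPU, which is the "snowflake" effect in miniature.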

Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (RX 470 to RX 580) and the rest of the hardware wasn't as advanced as what AI needs, so none of it carried over.

I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.

alecco 981 days ago

Make a vid. Or a blog post, at least. Please :)

latchkey 981 days ago

Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.

alecco 978 days ago

Perhaps reach out to a YouTube channel or podcast that could be interested?

ushakov 981 days ago

Did you manage to recoup the investment?

latchkey 981 days ago

Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.

Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality to happen.

peteradio 982 days ago

Classified I imagine.

_zoltan_ 981 days ago

H100-based DGX/HGX doesn't use 800 Gbit per GPU (it doesn't have the PCIe bandwidth); it uses 400 Gbit per GPU.
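A back-of-the-envelope check on the PCIe point, assuming each NIC sits on a PCIe 5.0 x16 link (the thread doesn't state the slot configuration) and ignoring packet-level protocol overhead. The function name and the 8-GPU node size are assumptions made for the illustration.

```python
def pcie_gbit_per_s(gt_per_lane: float, lanes: int) -> float:
    """One-direction PCIe bandwidth in Gbit/s after 128b/130b line encoding."""
    return gt_per_lane * lanes * 128 / 130


# Assumption: PCIe 5.0 (32 GT/s per lane), x16 link per NIC.
link = pcie_gbit_per_s(32, 16)

can_feed_400g = link >= 400  # does the link keep up with a 400G port?
can_feed_800g = link >= 800  # ...and with an 800G port?

# Assumption: 8 GPUs per node, one 400G port each.
node_total_gbit = 8 * 400
```

Under those assumptions the link comes out a little over 500 Gbit/s, enough for a 400G port but not an 800G one.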

latchkey 981 days ago

I was talking about between nodes. We're planning on bonding 2x400G NICs to get that 800G between nodes.

That said, the latest 4th-gen NVLink is 900 GB/s per GPU...

https://www.nvidia.com/en-us/data-center/nvlink/

But unless you're sleeping with Jensen, you're not going to see it for 52 weeks+ after you order it.

_zoltan_ 978 days ago

Between the GPUs you already have 3.2 Tbit/s, plus the 2x400G separately. Pretty sweet.

Our lead time hasn't been horrible actually, but I work for a pretty big corp.

latchkey 977 days ago

It is amazing to me how it is all about who you know. We just got a higher-level contact and magically NVIDIA NICs just appeared in our BOM.
