Hacker News
by latchkey 930 days ago
I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)

That said, these GPUs aren't just the GPUs. They are whole chassis: huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc.

None of it is easy.

2 comments

Out of interest, what did you use all that compute for?
ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted; not many people on the planet have run that much compute.

I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even same batch!), can produce wildly different performance results.
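For readers wondering what per-GPU autotuning could look like in outline, here is a minimal sketch. It is not the method described above (which isn't documented here and apparently needed a driver change). The knob names, value ranges, and the `simulated_benchmark` function are all hypothetical stand-ins; a real setup would apply settings through whatever the driver exposes and measure actual hashrate and power draw.

```python
import random
from itertools import product

# Hypothetical knob ranges; real values depend on the card, driver, and workload.
CORE_MHZ = [1100, 1150, 1200]
MEM_MHZ = [1900, 2000, 2100]
POWER_LIMIT_W = [90, 100, 110]


def simulated_benchmark(gpu_id, core, mem, power):
    """Stand-in for a real measurement; returns (hashrate, watts).

    Each GPU gets its own "sweet spot", seeded by its id, to mimic
    cards of the same model behaving differently.
    """
    rng = random.Random(gpu_id)
    best_core = rng.choice(CORE_MHZ)
    best_mem = rng.choice(MEM_MHZ)
    hashrate = 30.0 - abs(core - best_core) / 100.0 - abs(mem - best_mem) / 50.0
    return hashrate, float(power)


def autotune(gpu_id, benchmark=simulated_benchmark):
    """Grid-search the knobs for one GPU, maximizing hashrate per watt."""
    best_score, best_settings = None, None
    for core, mem, power in product(CORE_MHZ, MEM_MHZ, POWER_LIMIT_W):
        rate, watts = benchmark(gpu_id, core, mem, power)
        score = rate / watts
        if best_score is None or score > best_score:
            best_score = score
            best_settings = {"core_mhz": core, "mem_mhz": mem, "power_w": power}
    return best_settings


def autotune_fleet(gpu_ids, benchmark=simulated_benchmark):
    """Tune every GPU independently, since each one is a snowflake."""
    return {gpu_id: autotune(gpu_id, benchmark) for gpu_id in gpu_ids}


if __name__ == "__main__":
    for gpu_id, settings in autotune_fleet(range(4)).items():
        print(gpu_id, settings)
```

An exhaustive grid search is the simplest choice for a sketch; at fleet scale the search would presumably need to be much smarter about how many benchmark runs each card gets.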

Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (RX 470 to RX 580) and the rest of the hardware wasn't as advanced as what AI uses, so none of it was transferred.

I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.

Make a vid. Or a blog post, at least. Please :)
Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.
Perhaps reach out to a YouTube channel or podcast that could be interested?
Did you manage to recoup the investment?
Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.

Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality to happen.

Classified, I imagine.
H100-based DGX/HGX doesn't use 800 Gbit (it doesn't have the PCIe bandwidth); it's using 400 Gbit per GPU.
I was talking about between nodes. We're planning on bonding 2x400G NICs to get that 800G between nodes.
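The arithmetic behind this exchange can be sketched roughly as follows, assuming each NIC sits on a PCIe Gen5 x16 link (32 GT/s per lane, 128b/130b encoding). The function names are made up for illustration, and the figures ignore protocol overhead, so real throughput will be lower.

```python
# Back-of-the-envelope link budget; assumes PCIe Gen5 x16 per NIC.
PCIE_GEN5_GT_PER_LANE = 32       # GT/s per lane
PCIE_LANES = 16
PCIE_ENCODING = 128 / 130        # 128b/130b line encoding


def pcie_gbit_per_s(gt_per_lane, lanes, encoding):
    """Usable bandwidth of a PCIe link in Gbit/s, one direction."""
    return gt_per_lane * lanes * encoding


def nic_fits(nic_gbit, slot_gbit):
    """True if a NIC's line rate fits within the slot's bandwidth."""
    return nic_gbit <= slot_gbit


slot = pcie_gbit_per_s(PCIE_GEN5_GT_PER_LANE, PCIE_LANES, PCIE_ENCODING)
bonded_gbit = 2 * 400            # two 400G NICs, each on its own slot

if __name__ == "__main__":
    print(f"Gen5 x16 slot: {slot:.1f} Gbit/s")
    print("400G NIC fits:", nic_fits(400, slot))
    print("800G NIC fits:", nic_fits(800, slot))
    print("2x400G bonded:", bonded_gbit, "Gbit/s")
```

So a single Gen5 x16 slot tops out a little over 500 Gbit/s, which is why 800G between nodes means two 400G NICs on separate slots rather than one 800G NIC.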

That said, the latest 4th-gen NVLink is 900 GB/s per GPU (gigabytes, not gigabits)...

https://www.nvidia.com/en-us/data-center/nvlink/

But unless you're sleeping with Jensen, you're not going to see it for 52 weeks+ after you order it.

Between the GPUs you already have 3.2 Tbit/s, plus the 2x400 separately. Pretty sweet.

Our lead time hasn't been horrible actually, but I work for a pretty big corp.

It is amazing to me how it is all about who you know. We just got a higher-level contact and magically NVIDIA NICs just appeared in our BOM.