by latchkey 779 days ago
7 separate data centers all around the US.

12 GPU/host. 130,000 of that kind. ~10,833 hosts.
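
The host count follows directly from the two figures given. A quick check of that arithmetic (variable names are illustrative only):

```python
# Figures quoted in the comment above.
gpus_total = 130_000
gpus_per_host = 12

# 130,000 / 12 is not a whole number, hence the "~" in the comment.
hosts = gpus_total / gpus_per_host
print(round(hosts))  # 10833
```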

The ASRock BC-250s we deployed had 12 individual blades each, and those were all PXE booted. We deployed 20,000 of those blades across 2 data centers. This was a massive feat of engineering, especially during COVID, when I couldn't even access the machines directly. Built a whole dashboard to monitor it all too.

I know, I can't believe we did it either, but we did. Software automation was king. I built a single binary that ran on each individual host and knew how to self-configure and optimize everything, idempotently. Even distributing upgrades to the binary was a neat challenge that I solved perfectly, in very creative ways.
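
To illustrate what "idempotently" means for a per-host agent like this, here is a minimal sketch in Python. It is not the binary described above; the function name `ensure_file` and the config path and contents in the usage note are hypothetical. The idea is that each step checks current state first and only acts when it differs from the desired state, so running the agent repeatedly is safe.

```python
import os
import tempfile


def ensure_file(path: str, desired: str) -> bool:
    """Make the file at `path` contain `desired`.

    Returns True if a change was made, False if the file was already
    correct. Hypothetical helper, for illustration only.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == desired:
                return False  # already in the desired state: do nothing
    except FileNotFoundError:
        pass  # file is missing, so it needs to be created

    # Write to a temp file in the same directory, then rename over the
    # target, so a crash mid-write never leaves a half-written config.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(desired)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```

Calling `ensure_file("/tmp/gpu.conf", "power_limit=150\n")` twice in a row changes the file on the first call and is a no-op on the second, which is the property that lets the same agent run on every boot or on a timer.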

Today, we are starting much smaller. Literally from zero/scratch. Given the cost of the MI300X, I doubt we will ever get to 150k GPUs; that's an absurd amount of money. But who knows.


But who did the wiring? Even with blades, which consolidate much of the cabling, there's still a tremendous amount of work to build the interconnect. On typical large systems I've seen a small team of 3-5 guys working for weeks or more to wire a modest DC.

We'd hire the initial deployment out to temporary contractors. It just took a few weeks to get a large deployment out. The hard part was that the 12 GPUs needed to be inserted at the DC, which took a bunch of effort. Once it was done, we generally had 1-2 people on the ground in the data centers to deal with break/fix work, either contractors or staff supplied by the DC.

For this venture, again, we are starting small, so we are just flying to the DC and doing it ourselves. There are also staff there that are technical enough to swap stuff out when we need it. The plan will be to just hire one of their staff as our own.

I don't think we will make it for this next deployment due to time constraints, but ideally in the near future we will go full L11: assemble and ship out full racks at the manufacturer/VAR, bolt 'em down, wire them up, and they're ready to go. That is my dream... we will see if we get there. L11 is hard because a single missing cable can hold up an entire shipment.

I just realized we had this same conversation on HN before. IIRC I said last time and I'll repeat: if you say that you set up 15K GPUs with 2 people, and I ask who did the wiring, and you say an external company came in and spent a few weeks wiring the network for you, then you can't say that 2 people set up 15K GPUs. You're trying to externalize real costs (both time and money).

I understand your dream (having pursued similar ideas) but I think you have to be realistic about the effort required, especially when you add picky customers to the mix.

Now you're nitpicking two days later, which is fine. Sure, you got me there!

Two people hired some temporary workers and asked them to perform a task to get us up and running, which lasted a few days, out of years of operation.

¯\_(ツ)_/¯