|
|
|
|
|
by dekhn
782 days ago
|
|
150K GPU with two people... presumably, 8 GPU/host, you had close to 20K servers. I can't really see how that's achievable with only two people, given the time to install hardware, maintain it, deal with outages and planned maintainence and testing, etc. Note: I worked at Google and interfaced with hwops so I have some real-world experience to compare to. Building a 150K GPU system without a well-understood customer base seems a bit crazy to me. You will either become a hyperscale, serve a niche, or go out of business, I fear. |
|
12 GPU/host. 130,000 of that kind. ~10,833 hosts.
The ASRock BC-250's we deployed were 12 individual blades and those were all PXE booted. We deployed 20,000 of those blades across 2 data centers. This was a massive feat of engineering, especially during covid where I couldn't even access the machine directly. Built a whole dashboard to monitor it all too.
I know, I can't believe we did it either, but we did. Software automation was king. I built a single binary that ran on each individual host and knew how to self configure / optimize everything. Idempotently. Even distributing upgrades to the binary was a neat challenge that I solved perfectly, in very creative ways.
Today, we are starting much smaller. Literally from zero/scratch. Given the cost of MI300x, I doubt we will ever get to 150k GPUs, that's an absurd amount of money, but who knows.