| HN Mirror



	by Workaccount2 782 days ago
	For those curious, Cheyenne is a supercomputer from 2016/2017 that debuted at the 20th spot on the TOP500 list of supercomputers. It was decommissioned in 2023 after the pandemic led to a two-year extension of its operation. It has a peak compute of 5.34 petaflops, 313 TB of memory, and gobbles 1.7 MW.


observationist 782 days ago

In comparison, 18 A100 GPUs would have 5.6 petaflops and 1.4 TB VRAM, consuming 5.6 kW.

The speed of processing and interconnect is orders of magnitude faster for an A100 cluster. One 8-GPU pod server will cost around $200k, so around $600k more or less beats the supercomputer's performance (the prices I'm finding seem wildly variable; please correct me if I'm wrong).

mk_stjames 782 days ago

The Cheyenne numbers are 5.34 petaflops of *FP64*.

The 5.6 PF you quote for 18 A100s would be in BF16. Not comparable.

The A100 can only do 9.746 TFLOPS in FP64.

So you would need 548 A100s to match the FP64 performance of the Cheyenne.
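The arithmetic behind that count can be sketched as below. This is only a rough check under stated assumptions: the 5.34 PFLOPS and 9.746 TFLOPS figures come from the thread, while the 312 TFLOPS BF16 peak per A100 is an assumed value, chosen because it is consistent with the "5.6 petaflops for 18 GPUs" claim above. The function name `gpus_needed` is made up for illustration.

```python
import math

CHEYENNE_PEAK_TFLOPS = 5340.0  # 5.34 PFLOPS FP64, as quoted in the thread
A100_FP64_TFLOPS = 9.746       # non-tensor FP64 peak, as quoted in the thread
A100_BF16_TFLOPS = 312.0       # ASSUMPTION: BF16 tensor-core peak per A100


def gpus_needed(target_tflops: float, per_gpu_tflops: float) -> int:
    """Smallest whole number of GPUs whose combined peak meets the target."""
    return math.ceil(target_tflops / per_gpu_tflops)


fp64_count = gpus_needed(CHEYENNE_PEAK_TFLOPS, A100_FP64_TFLOPS)
bf16_count = gpus_needed(CHEYENNE_PEAK_TFLOPS, A100_BF16_TFLOPS)

print(f"A100s to match Cheyenne at FP64 peak: {fp64_count}")
print(f"A100s to match Cheyenne at BF16 peak: {bf16_count}")
```

Peak numbers like these say nothing about sustained throughput on real workloads, so treat the result as a precision-for-precision comparison only.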

observationist 782 days ago

Thanks, glad you guys caught that. We could be generous and allow the tensor core TFLOPS, since you'd more than likely be using A100 pods for something CUDA-optimized. In that case, at 19.5 TFLOPS FP64 peak per GPU, roughly 267 would be needed, or 34 pods, at $6.8 million, with 21.76 TB VRAM and 81 kW power consumption.

Double those for raw FP64.
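For anyone who wants to reproduce the pod totals, here is a small sketch starting from the 34-pod figure. The per-GPU values of 80 GB VRAM and 300 W are assumptions (they are not stated in the comment, but they line up with the 21.76 TB and roughly 81 kW totals), and the $200k pod price is the rough estimate quoted earlier in the thread.

```python
PODS = 34                     # pod count from the comment
GPUS_PER_POD = 8              # "8 GPU pod server" from the thread
POD_PRICE_USD = 200_000       # rough price quoted earlier in the thread
TENSOR_FP64_TFLOPS = 19.5     # FP64 tensor-core peak per A100, from the comment
VRAM_PER_GPU_GB = 80          # ASSUMPTION: 80 GB A100 variant
WATTS_PER_GPU = 300           # ASSUMPTION: per-GPU power draw, GPUs only

gpus = PODS * GPUS_PER_POD
cost_usd = PODS * POD_PRICE_USD
vram_tb = gpus * VRAM_PER_GPU_GB / 1000          # decimal terabytes
power_kw = gpus * WATTS_PER_GPU / 1000           # excludes hosts and cooling
peak_pflops = gpus * TENSOR_FP64_TFLOPS / 1000   # aggregate FP64 tensor peak

print(f"{gpus} GPUs, ${cost_usd:,}, {vram_tb:.2f} TB VRAM, "
      f"{power_kw:.1f} kW, {peak_pflops:.3f} PFLOPS FP64 (tensor)")
```

Under these assumptions, 34 pods land slightly below Cheyenne's 5.34 PFLOPS peak, so the comparison is a near match rather than a clear win.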

latchkey 782 days ago

AMD MI300X is 163.4 TFLOPS in FP64.

You'd need 33 of them, which would also have 6,336 GB of memory.

I'll have way more than that in my next purchase order.

It is really fun to build a super computer.
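A quick check of the MI300X numbers, as a sketch only: the 163.4 TFLOPS figure is the one quoted in the comment, and the 192 GB of HBM per GPU is an assumption that is not stated in the thread. Under that assumption, the 6,336 figure corresponds to gigabytes, or about 6.3 TB.

```python
import math

CHEYENNE_PEAK_TFLOPS = 5340.0  # 5.34 PFLOPS FP64, as quoted in the thread
MI300X_FP64_TFLOPS = 163.4     # per-GPU FP64 peak, as quoted in the comment
MI300X_HBM_GB = 192            # ASSUMPTION: HBM capacity per MI300X

# Smallest whole number of GPUs whose combined peak meets Cheyenne's peak.
mi300x_count = math.ceil(CHEYENNE_PEAK_TFLOPS / MI300X_FP64_TFLOPS)
total_hbm_gb = mi300x_count * MI300X_HBM_GB
combined_tflops = mi300x_count * MI300X_FP64_TFLOPS

print(f"{mi300x_count} GPUs, {combined_tflops:.1f} TFLOPS FP64 peak, "
      f"{total_hbm_gb:,} GB HBM ({total_hbm_gb / 1000:.3f} TB)")
```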

mk_stjames 782 days ago

I'm an amateur, but I have code that I think could probably dispatch threads pretty efficiently on the Cheyenne through its management system, simply because it's all distributed Xeons. If I can run it on my personal 80-core cluster, I could have gotten it to run on Cheyenne back then.

But hitting the roofline on those AMD GPGPUs? I'd probably get nowhere fucking close.

That is the thing that Cheyenne was built for: people doing CFD research with x86 code that was already nicely parallelized via OpenMPI or what have you.

latchkey 782 days ago

It is wild how much compute has grown.

I put dual Epyc 9754s into my first box of MI300X.

That's 256 cores + 8x MI300X, in a single box.

Agreed, it is a great solution for CFD, which is definitely one workload I'd love to host.

dekhn 782 days ago

I used to build small clusters and use supercomputers, and I can't imagine it's fun to build a supercomputer. It requires a massive infrastructure and a significant employee base, and individual component failures can take down entire jobs. Finding enough jobs to keep the system loaded 24/7 while also keeping the interconnect (which was 15-20% of the total system cost) busy, and finding the folks who can write such jobs, is not easy. Even then, other systems will be constantly nipping at your heels with newer/cheaper/smaller/faster/cooler hardware.

latchkey 782 days ago

Thanks for the feedback. You make a lot of good points. I've built a 150,000 GPU system previously, but it was lower-end hardware. It was a lot of fun to make it run smoothly, with its own challenges.

It doesn't take a lot of employees; we did the above with essentially two technical people. Those same two are working on this business.

Finding workloads/jobs is definitely going to be an interesting adventure. That said, the need for compute isn't going away. By offering hard-to-get hardware at reasonable rates and contract lengths, I believe we are in a good position on that front, but time will tell.

We are only buying the best of the best that we can get today. The plan is to continuously cycle out older hardware and to avoid picking one vendor over another. This should help us keep pace with other systems.

dekhn 782 days ago

150K GPUs with two people... presumably, at 8 GPUs/host, you had close to 20K servers.

I can't really see how that's achievable with only two people, given the time to install hardware, maintain it, deal with outages and planned maintenance and testing, etc. Note: I worked at Google and interfaced with hwops, so I have some real-world experience to compare to.

Building a 150K GPU system without a well-understood customer base seems a bit crazy to me. You will either become a hyperscaler, serve a niche, or go out of business, I fear.

nickpsecurity 782 days ago

Also, supercomputers usually use general-purpose nodes supported by many standard tools, multiple methods of parallelization, and (for open standards) maybe multiple vendors. I imagine this one is much more flexible than A100s.

jjtheblunt 782 days ago

also, comparing SIMD with Cheyenne is misleading

martinpw 782 days ago

The supercomputer flops are FP64. The A100 stats you are using are FP16.

jeffbee 782 days ago

It's fine. We will simply run weather forecast in BF16 mode and hallucinate the weather.

dgacmu 782 days ago

Introducing our next supercomputer, Peyote.

adgjlsfhk1 782 days ago

weather forecasting is actually moving to reduced precision. none of the input data is known to more than a few digits, and it's a chaotic system, so the numerical error is usually dominated by the modeling and spatial discretization error

CamperBob2 781 days ago

A black Sharpie marker is even cheaper...

Netcob 782 days ago

Aw man... I was going to use it for my homelab but that's 1,696,320 W more than I can supply. Well... maybe if I use two plugs instead of one...

DonHopkins 782 days ago

Bet it runs warm. The cat will love sitting on it.

buescher 782 days ago

It was at #160 in 2023 when it was decommissioned.