Hacker News
by colechristensen 1608 days ago
What is the difference between “a supercomputer” and “a bunch of racks of computers”?

The actual difference between the two has diminished considerably compared to years past, and seems to come down more to how a collection of computers is used than to what it is.

5 comments

The big remaining one appears to be an unusually high-speed interconnect (InfiniBand, etc.).
Yep: heterogeneous multi-GPU fleets mixing high-RAM GPUs (40-80 GB on each A100) in multi-GPU nodes with smaller ones (e.g., ~12-16 GB T4s), with crazy interconnects locally (NVLink) and across nodes. Storage gets fun as well, like parallel SSD arrays for 100 GB/s+ combined per node. Then whatever legacy + hybrid CPU stuff. E.g., for stuff like PCIe, new generations with ~10x the bandwidth you'd see in a gamer box, and like 1-2 per GPU. It varies a lot for, say, log mining vs. NN training, and even between different NNs. E.g., graph NNs end up needing a more balanced CPU side.

Saturating a box with 500+ GB GPU RAM is fun. Only our gov users ask us for help on that typically: most of our users are commercial nowadays, but with much smaller/scaled down GPU rigs. I think that'll change as the fintechs keep improving and software gets easier, but they are still not there (outside of niches). Working on it :)

(If you like writing shaders, we are hiring :D )

> What is the difference between “a supercomputer” and “a bunch of racks of computers”?

In addition to the other responses, I like pointing people to this talk[1] by Jeff Hammond for a comprehensive answer to this question (you can skip to the 11:15 timestamp).

[1] https://uchicago.hosted.panopto.com/Panopto/Pages/Embed.aspx...

That talk is from 2009 though. Nowadays companies regularly run jobs on commercial data centers which can include thousands of GPU cores, Infiniband networking and other specialized equipment. One can make a pretty valid case that we are approaching the ability to make an ad-hoc supercomputer for yourself from the GCP console.
This talk was from April 2021. 2009 is the year he got his PhD[2].

[2] https://jeffhammond.github.io/

It's all about the distributed filesystems made from big arrays of fast disks, and the massive I/O backplane to the storage system and between nodes.
This is a shared memory cluster. That is, there is some level of RDMA over a networking fabric.
I’d say mainly networking bandwidth.
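
To put rough numbers on the bandwidth point, here is a minimal sketch of the textbook ring all-reduce cost model, which is one common way to estimate how long a tightly coupled job spends synchronizing data across nodes. The function name, the link speeds, and the latencies are all illustrative assumptions, not measurements of any real system mentioned in this thread.

```python
def ring_allreduce_seconds(num_nodes, message_bytes, link_bytes_per_s, latency_s):
    """Estimate ring all-reduce time with the standard latency + bandwidth model.

    A ring all-reduce takes 2 * (num_nodes - 1) steps (a reduce-scatter phase
    followed by an all-gather phase), and each step moves one chunk of
    message_bytes / num_nodes over a single link.
    """
    if num_nodes < 2:
        return 0.0  # nothing to exchange with a single node
    steps = 2 * (num_nodes - 1)
    chunk_bytes = message_bytes / num_nodes
    return steps * (latency_s + chunk_bytes / link_bytes_per_s)


GIGABIT = 1e9 / 8  # bytes per second in 1 Gb/s

# Assumed, illustrative link figures: (bandwidth in bytes/s, per-step latency in s).
links = {
    "10 Gb/s commodity link (assumed)": (10 * GIGABIT, 50e-6),
    "200 Gb/s HPC-class link (assumed)": (200 * GIGABIT, 2e-6),
}

message_bytes = 1e9  # 1 GB of data to reduce, e.g. a gradient buffer
for name, (bandwidth, latency) in links.items():
    seconds = ring_allreduce_seconds(64, message_bytes, bandwidth, latency)
    print(f"{name}: {seconds:.3f} s per all-reduce across 64 nodes")
```

Under these assumptions the slower link costs on the order of 20x more time per synchronization, which is the kind of gap that separates "a bunch of racks" from a machine that can run one tightly coupled job. Real systems differ in topology and collective algorithms, so treat this as a back-of-the-envelope model only.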