|
TL;DR - It does come down mainly to the network, but in far more interesting ways than is apparent from some of the answers here - and also the nature of the HPC software ecosystem that co-evolved with supercomputing for the last 30+ years. This community has pioneered several key ideas in large scale computing that seem to be at risk in the world of cheap, lease-able compute. In scientific computing (usually where you see them), the primary workload is simulation/modeling of natural phenomena. The nature of this workload is that the more parallelism that is available, the bigger/more fine-grained a simulation can run, and hence the better it can approximate reality (as defined by the scientific models which are being simulated). Examples of this are fluid dynamics, multi-particle physics, molecular dynamics, etc. The big push with these types of workloads is to be able to get efficient parallel performance at scale - so it isnt about just the # of cores, PB of disk or TB of DRAM, but whether the software and underlying hardware work well together at scale to exploit the available aggregate compute. So the network matters, not just raw bandwidth but things like latency of remote memory access and the topology itself - for example, the Cray XCs going to Azure allow for a programming model (PGAS) that allows for large, scalable global memory views where a program can view the total memory of a set of nodes as a single address space. Underneath, the hardware and software work together to bound latency, do adaptive per-packet routing and ensure reliability - all at the level of 10s of thousands of nodes. In a real sense, the network is the (super)computer - the old Sun slogan. Where else is this useful? Well, look at deep learning - the new hotness in parallel computing these days - they are all realizing that it's amazing to run on GPUs, but once you have large enough problems (which the big guys do), you end up having to figure out how to efficiency get a bunch of GPUs to efficiently communicate during data parallel training (that efficient parallelism thing). This happens to map to a relatively simple set of communication patterns (e.g. AllReduce) that is a small subset of the kinds that the HPC community has solved for - so it's interesting that many deep learning engineers are starting to see the value of things like RDMA and frameworks like MPI (Baidu, Uber, MSFT and Amazon for starters). Interestingly though, the word supercomputing is being co-opted by the very companies that you're positioning as the alternative - the Google TPU Cloud is a specialized incarnation of a typical supercomputing architecture. Sundar Pichai refers to Google as being a 'supercomputer in your pocket'. |
I really love the detailed and expansive responses that some questions generate. Hopefully, they are of interest to more than just myself. I've been retired since 2007, so it is living vicariously through you folks - as I absolutely don't have to make these choices anymore.
A part of me wants to take on some big project, just to get back into it. I actually miss working. Go figure?
Thanks!