by thelastestate 1920 days ago
My understanding is that they're mostly Fortran programs linked together with Unix scripts and run on HPC systems. Could the models run in a more distributed way, such as on a high-quality grid computing setup? Lastly, what's the best way to find and learn more about the models?
3 comments

Switching to any sort of commercial grid or cloud computing setup is complicated by the fact that climate models depend critically on the fast, low-latency interconnects (e.g., InfiniBand) of a proper HPC system to achieve good performance at scale. The communication is usually coordinated with hand-written message passing via MPI, directly in the relevant top-level Fortran (or C/C++) program.

There are some other (i.e., “embarrassingly parallel”) scientific computing problems where a higher-latency distributed setup would be fine, but in climate models, as in any finite-element model, each grid cell needs to be able to “talk to” its neighbors at each timestep, which leads to quite a lot of inter-process communication.
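To make the "talk to its neighbors" point concrete, here is a minimal single-process sketch of a halo exchange for a 1D diffusion problem. It is not how any particular climate model is written: the function names, the diffusion coefficient `alpha`, and the zero-gradient boundary are all assumptions made for illustration, and plain Python lists stand in for MPI ranks. In a real model each boundary value copied below would be an MPI message between processes.

```python
# Single-process sketch of a halo exchange; each chunk stands in for one MPI rank.
# All names and parameters here are hypothetical, chosen only for illustration.

def step(cells, left, right, alpha=0.1):
    """One explicit diffusion step on `cells`, given one halo value on each side."""
    padded = [left] + cells + [right]
    return [
        padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

def run_global(field, nsteps):
    """Reference run on the undivided domain (zero-gradient boundaries)."""
    for _ in range(nsteps):
        field = step(field, field[0], field[-1])
    return field

def run_decomposed(field, nranks, nsteps):
    """Same computation split across `nranks` chunks, exchanging halos every step.

    Assumes len(field) is divisible by nranks.
    Returns the reassembled field and the number of halo messages "sent".
    """
    size = len(field) // nranks
    chunks = [field[r * size:(r + 1) * size] for r in range(nranks)]
    messages = 0
    for _ in range(nsteps):
        halos = []
        for r in range(nranks):
            if r > 0:
                left = chunks[r - 1][-1]   # would be a receive from rank r-1
                messages += 1
            else:
                left = chunks[r][0]        # physical boundary, no message
            if r < nranks - 1:
                right = chunks[r + 1][0]   # would be a receive from rank r+1
                messages += 1
            else:
                right = chunks[r][-1]      # physical boundary, no message
            halos.append((left, right))
        # Every chunk must have its halos before it can advance one step.
        chunks = [step(c, l, rt) for c, (l, rt) in zip(chunks, halos)]
    return [x for c in chunks for x in c], messages

if __name__ == "__main__":
    field = [0.0] * 16
    field[8] = 1.0
    result, messages = run_decomposed(field, nranks=4, nsteps=10)
    print("halo messages over 10 steps:", messages)
```

The decomposed run only reproduces the single-domain result because the halos are refreshed before every step, so no chunk can run ahead of its neighbors. That per-step synchronization is where interconnect latency enters.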

Yes, they run in the cloud, see e.g. https://cloudrun.co (disclaimer: my side-business), and others have been doing it for a few years now. On dedicated, shared-memory nodes, it's no different from HPC performance-wise. It can even be better, because cloud instances tend to have later-generation CPUs, whereas large HPC systems are typically updated only every ~5 years or so. But for distributed-memory parallel runs (multi-node), latency increases considerably on commodity clouds, which kills parallel scaling for models. Fortunately, major providers (AWS, GCP, Azure) have recently started offering low-latency interconnects for some of their VMs, so this problem will soon go away as well.
Indeed, basically, though you may lose out from the lack of direct access to the hardware. But it's typically expensive. Do AWS and GCP actually have RDMA fabrics now? The AWS "low latency" offering of a year or so ago had latency similar to what I once got with 1GbE.
It is difficult to run true HPC software like this as a 'grid': high-speed, low-latency communication (with MPI) is required.
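A rough way to see why latency matters so much is a toy cost model for one timestep: compute time shrinks as ranks are added, but the per-step message cost does not. This is only a back-of-the-envelope sketch. Every number below (single-rank step time, messages per step, message size, bandwidth, and the two latencies) is an assumed value, not a measurement of any real interconnect or model, and the message size is held fixed as the rank count grows, which is a simplification.

```python
# Toy per-timestep cost model; all parameter values are assumptions for illustration.

def step_time(nranks, latency_s, total_compute_s=0.01, msgs_per_step=4,
              bytes_per_msg=1000, bandwidth_bytes_per_s=1e9):
    """Estimated wall time of one timestep on `nranks` ranks."""
    compute = total_compute_s / nranks  # perfectly divisible work (optimistic)
    if nranks == 1:
        return compute                  # a single rank exchanges no halos
    per_message = latency_s + bytes_per_msg / bandwidth_bytes_per_s
    return compute + msgs_per_step * per_message

def speedup(nranks, latency_s, **kwargs):
    """Speedup over a single rank under the same assumptions."""
    return step_time(1, latency_s, **kwargs) / step_time(nranks, latency_s, **kwargs)

if __name__ == "__main__":
    for label, latency in [("assumed low-latency fabric, 2 us", 2e-6),
                           ("assumed commodity network, 50 us", 50e-6)]:
        print(label, "-> speedup on 1000 ranks:", round(speedup(1000, latency), 1))
```

Under these assumptions the communication term dominates once the per-rank compute time drops toward the message latency, so adding ranks on a high-latency network yields far less speedup than the same rank count on a low-latency one.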