by ajdecon 2958 days ago
I’ve worked in a few different settings on large-scale scientific computing. For those applications:

- Not cost-efficient at large scale. When you expect and plan to run thousands of nodes at near 100% CPU and memory usage for years at a time, running a machine room can still be less expensive.

- Specialized hardware not available in public clouds, e.g., very low latency networks configured in an optimal topology.

- Lack of control over hardware upgrade schedule. E.g., a cloud probably won’t give you those shiny new GPUs as early as you can shove them in your own servers.

The balance is shifting in many of these areas, and there’s plenty of scientific computing that can use a public cloud now. But I still wouldn’t use it for problems that are both highly CPU-intensive and require low latency networks, especially if I have long-term workloads.
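To make the cost point concrete, here is a rough back-of-the-envelope sketch. It assumes on-prem cost per node is a fixed yearly amount regardless of load, while cloud cost scales with the hours actually used. The function name and the numbers are hypothetical placeholders, not real prices.

```python
HOURS_PER_YEAR = 24 * 365


def break_even_utilization(cloud_rate_per_node_hour, onprem_cost_per_node_year):
    """Fraction of the year a node must be busy before on-prem becomes cheaper.

    Assumes on-prem cost is fixed per node-year and cloud cost is purely
    pay-per-hour. Both inputs are hypothetical and must share a currency.
    """
    yearly_cloud_cost_at_full_load = cloud_rate_per_node_hour * HOURS_PER_YEAR
    return onprem_cost_per_node_year / yearly_cloud_cost_at_full_load


# Placeholder numbers purely for illustration, not actual pricing.
utilization = break_even_utilization(
    cloud_rate_per_node_hour=1.0,
    onprem_cost_per_node_year=4380.0,
)
print(f"On-prem wins above {utilization:.0%} utilization")
```

With sustained near-100% usage you sit well above any such break-even point, which is the case described above. Real comparisons would also need staff, power, and hardware refresh costs.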


(Mostly academic) high performance computing (HPC) has clearly different needs from what typical cloud computing services can provide. Setting up and operating a medium-size (~1k nodes, ~25k cores) university computing centre in Europe costs on the order of 1 MEur per year, not to mention the large national centers with 10-100k nodes and 100k to 1M cores. At this level of computing it is quite sensible to do it in-house, especially if the engineering challenges are welcome scientific research topics in their own right (such as energy-efficient HPC, research on distributed file systems or job queueing systems, use of accelerator cards).

By the way, science already has a kind of computing cloud of its own: we call it https://en.wikipedia.org/wiki/Grid_computing

I've (unfortunately) run scientific computing applications on aws. The experience is awful.

1 - aws is very very very expensive for sustained load.

2 - aws offers highly variable performance characteristics, both cpu and networking. A best practice after creating a set of ec2 machines is to immediately spend 10 minutes perf testing them and dropping the slow ones, whether the slowness is in cpu or network.

2a - machines in aws that didn't start slow may become slow, particularly for networking. What you really want for many applications is a dedicated rack with very high speed TOR switches. You do not get this in AWS.

3 - Designing ML applications for variable tradeoffs between cpu and network is extremely ugly. Detecting and dealing with network links that can suddenly become extremely slow is awful.
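The "perf test and drop the slow ones" step from point 2 could look roughly like this. It is a minimal sketch that assumes you already have one benchmark score per machine (higher is better). The `drop_slow_nodes` helper, the node names, and the 0.8 tolerance are all hypothetical, and running the benchmark itself is out of scope here.

```python
from statistics import median


def drop_slow_nodes(scores, tolerance=0.8):
    """Split nodes into (keep, drop) by score relative to the fleet median.

    scores: dict mapping node id -> benchmark throughput (higher is better).
    tolerance: hypothetical cutoff; a node is kept if its score is at
    least tolerance * median score.
    """
    if not scores:
        return [], []
    cutoff = tolerance * median(scores.values())
    keep = sorted(node for node, s in scores.items() if s >= cutoff)
    drop = sorted(node for node, s in scores.items() if s < cutoff)
    return keep, drop


# Made-up benchmark results for four freshly launched machines.
scores = {"node-a": 9.8, "node-b": 10.1, "node-c": 4.2, "node-d": 9.9}
keep, drop = drop_slow_nodes(scores)
print("keep:", keep)
print("drop:", drop)
```

Comparing against the median rather than a fixed number avoids hard-coding an expected score per instance type. Because of 2a, the same check would need to be rerun periodically, not just at launch.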