by aschleck 1348 days ago
This is sort of a confusing article because it assumes the premise of "you have a fixed hardware profile" and then argues within that context ("Most scientific computing runs on queues. These queues can be months long for the biggest supercomputers."). Of course if you're getting 100% utilization then you'll find better raw pricing (and this article conveniently leaves out staffing costs), but this model misses one of the most powerful parts of cloud providers: autoscaling. Why would you want to waste scientist time by making them wait in a queue when you can instead autoscale as high as needed? Giving scientists a tight iteration loop will likely be the biggest cost reduction and also the biggest benefit. And if you want that same loop on-prem, then you need to provision for the peak load, which drives your utilization down and makes on-prem far less cost effective.
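To make the utilization argument concrete, here is a rough sketch of the arithmetic. The prices are invented for illustration and the function names are hypothetical; the only idea taken from the comment is that on-prem hardware is paid for whether or not it is busy, so its effective price rises as utilization falls.

```python
def cost_per_useful_core_hour(hourly_cost_per_core: float, utilization: float) -> float:
    """Effective cost of each core-hour that actually does work.

    utilization is the average fraction of time the core is busy, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_cost_per_core / utilization


def break_even_utilization(on_prem_cost: float, cloud_cost: float) -> float:
    """Utilization below which pay-per-use cloud becomes cheaper than on-prem."""
    return on_prem_cost / cloud_cost


# Hypothetical prices in dollars per core-hour (assumptions, not real quotes).
ON_PREM_COST = 0.02  # paid around the clock, busy or idle
CLOUD_COST = 0.05    # paid only while the core is in use

if __name__ == "__main__":
    for utilization in (1.0, 0.5, 0.25):
        effective = cost_per_useful_core_hour(ON_PREM_COST, utilization)
        print(f"on-prem at {utilization:.0%} utilization: ${effective:.3f} per useful core-hour")
    print(f"cloud (pay per use): ${CLOUD_COST:.3f} per useful core-hour")
```

With these made-up numbers, on-prem wins at full utilization but loses once average utilization drops below 40%, which is the effect of provisioning for peak load. Staffing costs and queue wait time are not modeled here.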

For fast-moving researchers who are blocked by a queue, cloud computing still makes sense. I guess I wasn't clear enough in the last section about how I still use AWS for startup-scale computational biology. My scientific computing startup (trytoolchest.com) is 100% built on top of AWS.

Most scientific computing still happens on supercomputers in slower-moving academic or big co settings. That's the group for whom cloud computing – or at least running everything on the cloud – doesn't make sense.

Another service that runs on AWS is CodeOcean. It looks like Toolchest is oriented toward facilitating execution of specific packages rather than organization and execution like CodeOcean. Is that a fair summary?

https://codeocean.com/explore

Yep, that's right! Toolchest focuses on compute, deploying and optimizing popular scientific computing packages.
Generally, scientists aren't blocked while they are waiting on a computational queue. The results of a computation are needed eventually, but there is lots of other work that can be done that doesn't depend on a specific calculation.
It's good to learn how not to be blocked on long-running calculations.

On the other hand, if transitioning to a bursty cloud model means you can do your full run in hours instead of weeks, that has real impact on how many iterations you can do and often does appreciably affect velocity.

It can, if you have the technical ability to write code that can leverage the scale-out that most bursty-cloud solutions entail. Coding for clustering can be pretty challenging, and I would generally recommend that a user target a single large system with a job that takes a week rather than trying to adapt that job to a clustered solution of 100 smaller systems that can complete it in 8 hours.
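For a sense of scale, the numbers in that comparison (one week on a single system versus 8 hours on 100 smaller systems) work out as follows. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it ignores that the small systems differ in size and price from the large one, as well as the developer time needed to port the code.

```python
def compare_runs(single_hours: float, cluster_hours: float, cluster_nodes: int) -> dict:
    """Compare wall-clock time and total machine-hours for the two setups."""
    return {
        "wall_clock_speedup": single_hours / cluster_hours,
        "single_machine_hours": single_hours,
        "cluster_machine_hours": cluster_hours * cluster_nodes,
    }


if __name__ == "__main__":
    # One week = 7 * 24 = 168 hours on the single large system.
    result = compare_runs(single_hours=7 * 24, cluster_hours=8, cluster_nodes=100)
    for name, value in result.items():
        print(f"{name}: {value}")
```

The clustered run returns results 21 times sooner but consumes 800 machine-hours of small systems, so whether it is worth it depends on what those systems cost and on how long the port takes.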
This is a big part of it. In my lab, I have a lot of grad students who are computational scientists, not computer scientists. The time it would take them to optimize code far exceeds the time it takes to throw together a quick-and-dirty job array on Slurm and then go back to working on the introduction of the paper, or catching up on the literature, or any one of a dozen other things.
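For readers who have not used one, a quick-and-dirty job array can be as small as the sketch below. It assumes the script is submitted with something like `sbatch --array=0-9`, so that Slurm sets `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` for each task. The input file names and the `my_share` helper are hypothetical, and the "processing" is just a print.

```python
import os


def my_share(items: list, task_id: int, task_count: int) -> list:
    """Return the items this array task should handle (round-robin split)."""
    return items[task_id::task_count]


def main() -> None:
    # Slurm sets these for each array task; the defaults let the script
    # run unchanged on a laptop as a single task.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

    # Hypothetical input list; in practice this might come from a directory listing.
    inputs = [f"sample_{i:03d}.dat" for i in range(10)]

    for path in my_share(inputs, task_id, task_count):
        # Placeholder for the real per-file computation.
        print(f"task {task_id}: would process {path}")


if __name__ == "__main__":
    main()
```

Each task runs the same unmodified serial code on its own slice of the inputs, which is why this approach needs no clustering logic. The split assumes task IDs start at 0.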
Grad student here, I can attest to that.