| HN Mirror

Because a cluster doesn't work like a normal computer.

Your users don't see the nodes. They submit jobs and wait for their turn in the cluster. A sophisticated resource planner / job scheduler tries to empty the queue while optimizing job placement so the system usage can be maximized as much as possible.

Also, users' jobs work in under their own users. You need to isolate them. Giving them access to docker or any root level container engine is completely removing UNIX user security and isolation model and running in Windows95 mode. This also compromises system security since everyone is practically root at that point. Singularity is user-mode and its usage is increasing but then comes the next point.

Performance and hardware access is critical in HPC. GPU and special HBAs like Infiniband requires direct access from processes to run at their maximum performance or work at all. GPU access is much more important than containerizing workloads. Docker GPU is here because nVidia wanted to containerize AI workloads on DGX/HGX systems. These technologies are maturing on HPC now.

In performance front, consider the following: If main loop of your computation loses a second due to these abstractions, considering this loops run thousand times per core on many nodes, lost productivity is eye-watering. My simple application computes 1.7 million integrations per second per core. So, for working on long problems, increasing this number is critical.

Last but not the least, some of the applications run on these systems are developed for 20 years now. So, these applications are not some simple code bases which are extremely tidy and neat. You can't know/guess how these applications behave before running them inside a container. As I've said, you need to be able to trust what you have too. So, we scientists and HPC administrators tend to walk slowly but surely.

Doing my job properly on the HPC side means my cluster works with utmost efficiency and bulletproof user isolation so people can trust the validity of their results and integrity of their privacy. Doing my job properly on the development side means that my code builds with minimum effort and with maximum performance on systems I support. HPC software is not a single service which works like a normal container workload. We need to evolve our software to run with minimum problems with containers and containers should evolve to accommodate our workloads, workflows and meet our other needs.

The cutting edge technology doesn't solve every problem with same elegance. Also we're not a set of lazy academics or sysadmins just because our systems work more traditionally.