by pama 58 days ago
HPC centers never loved the inefficiencies of anything virtualized (VMs or containers of any kind, really), so the shell hacks of module enabled a (limited, but workable) level of reproducibility that was sufficiently composable and usable by researchers who understood the shell. I am not going to defend this Tcl hack any further, but I can see how it was the path of least resistance when people tried to stay close to the bare metal of their large clusters while keeping some level of sanity. Slurm is a more defensible choice, but I agree that these tools are from a different era of compute. I grew to love and hate these tools, but they are definitely an acquired taste: like a durian, not like an apple.

Your CentOS 6 references made me chuckle :-)

2 comments

I promise you that the main reason HPC is behind on virtualization is not the little bit of overhead. There are a dozen other inefficiencies in the average HPC workload that are more significant.

Most centers don't even have good real-time observability systems to diagnose systemic inefficiencies, leaving application/workload profiling purely up to user-space.

The HP in HPC has really been watered down over the last couple decades, and "IT for computational research" would be a more accurate name. You can do genuinely high-performance computing there, but you'll be an outlier.

It's a mixture of legacy and reality.

For one, the assumption has been that you have dedicated use of all the nodes and the communication network. It would kill your performance if the local node's CPU scheduler kept your HPC program from being active when messages came in from its peer tasks on other nodes, since parallel jobs are ultimately limited by the critical-path latency of cross-node communication.

It's only on the most "embarrassingly parallel" end of the spectrum where you can tolerate a bunch of virtualization and non-determinism, because the tasks communicate so infrequently or via such asynchronous mechanisms that they don't really impact the throughput of the whole job if they are asleep at random times.
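
To put rough numbers on the critical-path point, here is a toy simulation. It is only a sketch under made-up assumptions: the function name simulate, the stall probability, and the stall length are all hypothetical, and it models nothing about any real scheduler or interconnect. Each rank occasionally loses time to a stall; the tightly coupled job pays for the slowest rank at every step, while the independent job only pays for the slowest rank overall.

```python
import random


def simulate(num_ranks, num_steps, base=1.0, stall_prob=0.01, stall=5.0, seed=0):
    """Return (coupled, independent) wall-clock times in arbitrary units.

    All parameters are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    # times[s][r] is the time rank r spends on step s, including a rare stall
    # (e.g. the rank was descheduled when it should have been running).
    times = [
        [base + (stall if rng.random() < stall_prob else 0.0)
         for _ in range(num_ranks)]
        for _ in range(num_steps)
    ]
    # Tightly coupled: every step ends with a barrier, so each step costs
    # as much as the slowest rank in that step.
    coupled = sum(max(step) for step in times)
    # Embarrassingly parallel: ranks never wait on each other, so the job
    # ends when the rank with the largest total finishes.
    independent = max(
        sum(step[r] for step in times) for r in range(num_ranks)
    )
    return coupled, independent


if __name__ == "__main__":
    for ranks in (8, 64, 512):
        coupled, independent = simulate(ranks, num_steps=200)
        print(f"{ranks:4d} ranks: coupled={coupled:.1f} independent={independent:.1f}")
```

The coupled time can never be lower than the independent time (a sum of per-step maxima is at least the maximum of per-rank sums), and in this model the gap tends to widen with rank count, because the chance that at least one rank stalls in any given step goes up.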

But HPC systems were also very "unique". It wasn't all Linux, but a dozen different vendors' Unix variants with very different personalities. And for the bleeding-edge systems, each deployment was practically its own dialect of that vendor's OS. Running a job was like cross-compiling to a one-of-a-kind target. There was no generic platform where you could expect to build an app once and ship it around to whichever supercomputer was available.

Agreed on all points and this captures the history well.

Containers are an OS sandboxing/namespacing primitive; they don't involve any overhead on their own. Any overhead depends on what's inside the container besides the single deployed binary.

What you say is true after the container starts. Typical HPC codes are tuned to raw hardware, so they assume full ownership of the hardware anyway. When HPC was developing 30 years ago, we didn't have clean ways to avoid overheads in the regime of 10k nodes. Instead we got parallel filesystems, caching, and the shell with module, which technically did the job for reproducible runs at a huge human cost.