| Many HPC jobs aren’t simulations that are CPU bound. In fact, most of the jobs on the clusters I’ve used have been single-node jobs (so technically HTC, but that term is rarely used). I do genomics work and my jobs tend to be bursty. They may use a lot of CPU initially, but the second half of the job is writing results. This takes only one core, but still the max amount of memory. Or, I can have jobs that are CPU light, but need the max amount of memory for only a fraction of their walltime. Here is an example for you. Let’s say I’m processing a genome sequencing experiment. This requires about 8 different steps between preprocessing the data, alignment, post-filtering, QC stat collection, etc. These are large input files, so my jobs end up being IO bound. If I were reading and writing at each step, it would add days of time to the pipeline. Instead, what we do is read the data once, and pipe the data from program to program. But each program has different CPU and memory requirements, and we need to reserve the $MAX of those requirements for the whole job. As the data moves through the pipeline though, we eventually end up with max utilization for only a portion of the walltime. If I optimize for efficient walltime, I leave CPUs and memory idle for a large portion of the job. People also tend to prefer managing fewer jobs. So instead of splitting a job into multiple dependent submissions that are tailored to each program, people will write a single bash script that runs, but is not efficient. Many times, these patterns are difficult to predict and you can’t submit a job that says I need 20 cores for an hour, but only 2 for the last two hours. It’s difficult to balance utilization vs walltime. No one likes waiting, but there is usually little incentive to have high utilization rates. Sometimes the deciding factor is total walltime. Sometimes it’s execution complexity.
This is the problem this group is trying to solve - dynamically adapting the scheduler to know when a job isn’t going to use its full allocation of resources. I’m not sure there is a good way to do it. HPC users are concerned only with getting their jobs done fast. HPC admins want to see resources used efficiently. This is a classic pipelining problem: do you optimize for individual task time or overall system throughput? I think the only way to really do this well is to make HPC jobs a market system where resources cost money to the users. When money is involved, people are incentivized to optimize their workloads. But that’s rarely the case for large HPC clusters and I’d personally hate it if I had to deal with an HPC processing budget. In lieu of this, a common way to handle this lack of efficiency is to do “fair share” scheduling. This means that a user’s prior workload is taken into account when prioritizing their queue position. So, if I did a lot of work last week, jobs from a user who didn’t run anything last week would get a priority boost over mine. This doesn’t address the utilization efficiency directly, but it does make access to the cluster seem more “fair”. |
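To put rough numbers on the example in the quoted comment (20 cores for an hour, then 2 cores for two more hours), here is a minimal sketch. It assumes a static allocation that holds the peak core count for the entire walltime; the `utilization` function and the phase list are only illustrative, not anything a real scheduler exposes.

```python
# Reserved vs. used core-hours for a job whose CPU demand changes by phase.
# Assumption: the scheduler holds the peak core count for the whole walltime.

def utilization(phases):
    """phases: list of (cores_in_use, hours). Returns (reserved, used, fraction)."""
    walltime = sum(hours for _, hours in phases)
    peak = max(cores for cores, _ in phases)
    reserved = peak * walltime  # core-hours held by the static allocation
    used = sum(cores * hours for cores, hours in phases)  # core-hours doing work
    return reserved, used, used / reserved


# The quoted example: 20 cores for 1 hour, then 2 cores for 2 hours.
phases = [(20, 1), (2, 2)]
reserved, used, frac = utilization(phases)
print(f"reserved={reserved} core-hours, used={used} core-hours, utilization={frac:.0%}")
```

Under that assumption the job reserves 60 core-hours and uses 24, so well over half of the reservation sits idle even though the walltime is as short as it can be.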
I got the same impression: there aren't many incentives for efficiency in HPC. Users want results and don't have the time or skill to invest in rewriting the software stack to be as efficient as possible.
My intuition is that this problem could be solved at the application/framework layer. Instead of launching jobs, which are glorified bash scripts, the application software itself would control the resource allocation logic dynamically.
That would solve the problem of load balancing based on your actual compute needs, but what happens if a program requests resources in a loop? At least current schedulers know the requested resources in advance and can stall huge allocations.
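Here is a toy sketch of that loop concern, and of one guard that might keep it in check. Everything in it is hypothetical: the `Broker` class, its `request`/`release` methods, and the per-job cap are names I made up for illustration and do not correspond to any real scheduler or framework API.

```python
# Toy model of application-driven allocation (all names are hypothetical).
# Jobs ask for cores on demand; a per-job ceiling bounds runaway requests.

class Broker:
    """Hands out cores on demand from a fixed pool, with a per-job ceiling."""

    def __init__(self, total_cores, per_job_cap):
        self.free = total_cores
        self.per_job_cap = per_job_cap
        self.held = {}  # job id -> cores currently held

    def request(self, job, cores):
        """Grant the request only if it fits both the pool and the job's ceiling."""
        held = self.held.get(job, 0)
        if cores > self.free or held + cores > self.per_job_cap:
            return False
        self.free -= cores
        self.held[job] = held + cores
        return True

    def release(self, job, cores):
        """Return cores to the pool, never more than the job actually holds."""
        cores = min(cores, self.held.get(job, 0))
        self.held[job] = self.held.get(job, 0) - cores
        self.free += cores


broker = Broker(total_cores=64, per_job_cap=20)

# A runaway job asking for 4 cores in a loop is cut off at its ceiling.
granted = 0
for _ in range(100):
    if not broker.request("runaway", 4):
        break
    granted += 4

print(granted, broker.free)
```

Without the ceiling the loop would drain the whole pool. With it, though, the job is back to declaring a maximum up front, which is roughly the information current schedulers already ask for.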