|
|
|
|
|
by FattiMei
19 days ago
|
|
HPC student here, still learning. If I understood your problem statement, users of the cluster reserve resources far greater than what was needed for their computation. They fear that if the allocated resources are not enough, then their program will crash and lose partial results. Can you give an example of typical execution on the cluster? Is it a problem of number of hours allocated or number of compute cores? If I'm running a PDE simulation, and I allocate n machines I want to use all of them, so there is no risk of idle machines. It's not trivial to estimate a priori the amount of time required for my simulation to complete, so I overestimate. But when the simulation is complete (even before the deadline), the resources get freed and can be used right away for another job Maybe the problem is when many users are greedy. Also MPI simulations are difficult (if not impossible, correct me) to change dynamically: when a simulation is started with that number of ranks, I can't add new ranks at will if the resources are available Thank you for the patience for everyone that answers |
|
I do genomics work and my jobs tend to be bursty. They may use a lot of CPU initially, but the second half of the job is writing results. This takes only one core, but still the max amount of memory. Or, I can have jobs that are CPU light, but need the max amount of memory for only a fraction of their wall-time.
Here is an example for you. Let’s say I’m processing a genome sequencing experiment. This requires about 8 different steps between preprocessing the data, alignment, post filtering, QC stat collection, etc. These are large input files, so my jobs end up being IO bound. If I were reading and writing at each step, it would add days of time to the pipeline. Instead, what we do is read the data once, and pipe the data from program to program. But each program has different CPU and memory requirements. We need to reserve the $MAX requirements for each. As the data moves through the pipeline though, we eventually end up with max utilization for only a portion of the walltime. If I optimize for efficient walltime, I leave CPUs and memory idle for a large portion of the job.
People also tend to like to manage fewer jobs. So instead of splitting a job into multiple dependent submissions that are tailored to each program, people will write a bash script that runs, but is not efficient.
Many times, these patterns are difficult to predict and you can’t submit a job that says I need 20 cores for an hour, but only 2 for the last two hours. It’s difficult to balance utilization vs wall-time. No one likes waiting, but there is usually little incentive to have high utilization rates. And sometimes the balance is total walltime. Sometimes it’s execution complexity.
This is the problem this group is trying to solve - dynamically adapting the scheduler to know when a job isn’t going to use its full allocation of resources. I’m not sure there is a good way to do it. HPC users are concerned only with getting their jobs done fast. HPC admins want to see resources used efficiently. This is a classic pipelining problem: do you optimize for individual task time or overall system throughput?
I think the only way to really do this well is to make HPC jobs a market system where resources cost money to the users. When money is involved, people are incentivized to optimize their workloads. But that’s rarely the case for large HPC clusters and I’d personally hate it if I had to deal with a HPC processing budget.
In lieu of this, a common way to handle this lack of efficiency is to do “fair share” scheduling. This means that a users prior work load is taken into account when prioritizing their queue position. So, if I did a lot of work last week, jobs for a user that didn’t run jobs last week would get a priority boost over me. This doesn’t address the utilization efficiency directly, but it does make access to the cluster seem more “fair”.