| HN Mirror


by FattiMei 19 days ago

HPC student here, still learning. If I understood your problem statement correctly, users of the cluster reserve far more resources than their computation needs. They fear that if the allocated resources are not enough, their program will crash and lose its partial results.

Can you give an example of typical execution on the cluster? Is it a problem of number of hours allocated or number of compute cores?

If I'm running a PDE simulation and I allocate n machines, I want to use all of them, so there is no risk of idle machines. It's not trivial to estimate a priori how long my simulation will take, so I overestimate. But when the simulation completes (even before the deadline), the resources are freed and can be used right away for another job.

Maybe the problem arises when many users are greedy. Also, MPI simulations are difficult (if not impossible, correct me if I'm wrong) to resize dynamically: once a simulation has started with a given number of ranks, I can't add new ranks at will, even if more resources become available.

Thanks to everyone who answers for your patience.


mbreese 18 days ago

Many HPC jobs aren’t simulations that are CPU bound. In fact, most of the jobs on the clusters I’ve used have been single-node jobs (so technically HTC, but that term is rarely used).

I do genomics work and my jobs tend to be bursty. They may use a lot of CPU initially, but the second half of the job is writing results. This takes only one core, but still the max amount of memory. Or, I can have jobs that are CPU light, but need the max amount of memory for only a fraction of their wall-time.

Here is an example for you. Let’s say I’m processing a genome sequencing experiment. This requires about 8 different steps between preprocessing the data, alignment, post filtering, QC stat collection, etc. These are large input files, so my jobs end up being IO bound. If I were reading and writing at each step, it would add days of time to the pipeline. Instead, what we do is read the data once, and pipe the data from program to program. But each program has different CPU and memory requirements. We need to reserve the $MAX requirements for each. As the data moves through the pipeline though, we eventually end up with max utilization for only a portion of the walltime. If I optimize for efficient walltime, I leave CPUs and memory idle for a large portion of the job.

People also tend to prefer managing fewer jobs. So instead of splitting a job into multiple dependent submissions, each tailored to one program, people will write a single bash script that runs the whole thing, even though it is not efficient.

Many times, these patterns are difficult to predict, and you can’t submit a job that says "I need 20 cores for an hour, but only 2 for the last two hours." It’s difficult to balance utilization against wall-time. No one likes waiting, but there is usually little incentive to have high utilization rates. Sometimes the deciding factor is total walltime; sometimes it’s execution complexity.
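To put numbers on the 20-core example, here is a minimal Python sketch. It assumes the scheduler only accepts one fixed core count for the whole job, so the reservation is the peak core count times the total walltime. The `phases` list and the `utilization` helper are illustrative names, not part of any scheduler.

```python
# Hypothetical usage profile: (cores actually busy, hours) for each phase of the job.
phases = [(20, 1.0), (2, 2.0)]


def utilization(phases):
    """Return (reserved, used, fraction) in core-hours for a fixed-size reservation."""
    peak_cores = max(cores for cores, _ in phases)
    walltime = sum(hours for _, hours in phases)
    reserved = peak_cores * walltime  # what the job has to request up front
    used = sum(cores * hours for cores, hours in phases)  # what it really consumes
    return reserved, used, used / reserved


reserved, used, fraction = utilization(phases)
print(f"reserved={reserved:.0f} core-h, used={used:.0f} core-h, utilization={fraction:.0%}")
# prints: reserved=60 core-h, used=24 core-h, utilization=40%
```

Under these assumptions the job holds 60 core-hours but only uses 24, so 60% of the reservation sits idle.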

This is the problem this group is trying to solve - dynamically adapting the scheduler to know when a job isn’t going to use its full allocation of resources. I’m not sure there is a good way to do it. HPC users are concerned only with getting their jobs done fast. HPC admins want to see resources used efficiently. This is a classic pipelining problem: do you optimize for individual task time or overall system throughput?

I think the only way to really do this well is to make HPC jobs a market system where resources cost money to the users. When money is involved, people are incentivized to optimize their workloads. But that’s rarely the case for large HPC clusters and I’d personally hate it if I had to deal with an HPC processing budget.

In lieu of this, a common way to handle this lack of efficiency is to do “fair share” scheduling. This means that a user’s prior workload is taken into account when prioritizing their queue position. So, if I did a lot of work last week, jobs from a user who didn’t run anything last week would get a priority boost over mine. This doesn’t address utilization efficiency directly, but it does make access to the cluster seem more “fair”.
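As a rough illustration of the idea, here is a Python sketch of a fair-share factor. It is not any particular scheduler's algorithm: the exponential decay, the 7-day half-life, the `share` normalization, and the function names are all assumptions made for the example.

```python
def decayed_usage(usage_by_days_ago, half_life_days=7.0):
    """Sum past usage in core-hours, halving its weight every half_life_days."""
    return sum(
        core_hours * 0.5 ** (days_ago / half_life_days)
        for days_ago, core_hours in usage_by_days_ago.items()
    )


def fair_share_factor(usage_by_days_ago, share=500.0, half_life_days=7.0):
    """Map decayed usage to a factor in (0, 1]; heavier recent use gives a lower factor.

    `share` is a hypothetical number of core-hours the user is entitled to.
    """
    usage = decayed_usage(usage_by_days_ago, half_life_days)
    return 2.0 ** (-usage / share)


# Keys are "days ago", values are core-hours consumed on that day (made-up numbers).
heavy_user = {7: 1000.0}  # ran a lot of work last week
idle_user = {}            # ran nothing recently

print(fair_share_factor(heavy_user))  # 0.5
print(fair_share_factor(idle_user))   # 1.0
```

A scheduler could multiply a job's base priority by this factor, so the idle user's jobs move ahead of the heavy user's in the queue.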


FattiMei 18 days ago

> I think the only way to really do this well is to make HPC jobs a market system where resources cost money to the users. When money is involved, people are incentivized to optimize their workloads. But that’s rarely the case for large HPC clusters and I’d personally hate it if I had to deal with an HPC processing budget.

I got that impression too: there aren't many incentives in HPC. Users want results and don't have the time or skill to invest in rewriting the software stack to be as efficient as possible.

An intuition that I have is that this problem could be solved at the application/framework layer. Instead of launching jobs, which are glorified bash scripts, control the logic of the resource allocation dynamically from the application software.

That would solve the problem of load balancing based on your actual compute needs, but what happens if a program requests resources in a loop? At least current schedulers know the requested resources in advance and can stall huge allocations.


mbreese 15 days ago

I think it’s a question of cost/benefit.

For the researcher, optimizing can take a lot of extra time and effort (and skill they might not have). An unoptimized job that takes four days to run is still faster than taking a week to optimize the code so that it runs in one day.

For the researcher, the main limit is time. In many places the cost of the HPC hardware isn’t passed on to them, so their main pressure is time. And running code is generally faster than optimizing code.

(Unless you’re running a week-long analysis thousands of times.)
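The break-even point for the four-day example can be sketched in a few lines of Python. It assumes "a week" means 7 days of optimization effort and that runs happen one after another; the function names are just for illustration.

```python
import math


def total_days(runs, days_per_run, optimization_days=0.0):
    """Total elapsed days for sequential runs, plus any up-front optimization time."""
    return optimization_days + runs * days_per_run


def break_even_runs(optimization_days, slow_days, fast_days):
    """Smallest number of runs for which optimizing first is strictly faster overall."""
    saved_per_run = slow_days - fast_days
    if saved_per_run <= 0:
        raise ValueError("the optimized run must be faster than the unoptimized one")
    return math.floor(optimization_days / saved_per_run) + 1


# One run: just running the slow code wins (4 days vs 7 + 1 = 8 days).
print(total_days(1, 4.0), total_days(1, 1.0, optimization_days=7.0))  # 4.0 8.0

# Optimizing only pays off from the third run onward (12 days vs 10 days).
print(break_even_runs(7.0, 4.0, 1.0))  # 3
```

So for a one-off analysis the unoptimized job wins, and the balance only tips once the same job is run several times.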

Thinking of this as an allocation problem for the application to manage is an interesting approach. But the program would need to be able to model its resource requirements from start to end, and know roughly how long each step will take. This sounds like a variant of the halting problem, but instead of predicting when a program will end, it’s predicting when it will need more resources.
