by FattiMei
23 days ago
> I think the only way to really do this well is to make HPC jobs a market system where resources cost money to the users. When money is involved, people are incentivized to optimize their workloads. But that’s rarely the case for large HPC clusters and I’d personally hate it if I had to deal with a HPC processing budget.

I got that impression too: there aren't many incentives in HPC. Users want results and don't have the time or skill to invest in rewriting the software stack to be as efficient as possible.

An intuition I have is that this problem could be solved at the application/framework layer. Instead of launching jobs, which are glorified bash scripts, control the resource allocation logic dynamically from the application software. That would solve the problem of load balancing based on your actual compute needs, but what happens if a program requests resources in a loop? At least the current schedulers know the requested resources in advance and can stall huge allocations.
For the researcher, it can take a lot of extra time and effort (and skill) that they might not have. An unoptimized job that takes four days to run is still faster than taking a week to optimize the code to run in one day.
For the researcher, the main limit is time. In many places the cost of the HPC hardware isn’t passed on to them, so their main pressure is time. And running code is generally faster than optimizing code.
(Unless you’re running a week-long analysis thousands of times)
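To put rough numbers on that trade-off, here's a quick back-of-the-envelope sketch. The function name and the simplifying assumptions (runs happen one after another, optimization is a one-time cost, hardware is free to the user) are mine, not anything a real scheduler provides.

```python
import math

def break_even_runs(unoptimized_days, optimized_days, optimization_days):
    """Smallest number of sequential runs at which optimizing first
    finishes strictly sooner than just running the unoptimized code.

    Hypothetical helper for illustration only. Returns None when the
    "optimized" version saves no time per run.
    """
    saved_per_run = unoptimized_days - optimized_days
    if saved_per_run <= 0:
        return None
    # Optimizing wins once n * saved_per_run > optimization_days.
    return math.floor(optimization_days / saved_per_run) + 1

# The numbers from above: 4-day job, 1 day after a week of optimization.
print(break_even_runs(4, 1, 7))
```

With those numbers, a single run is 4 days unoptimized versus 8 days (7 + 1) if you optimize first, so not optimizing wins. The balance only tips once the same job has to be run several times.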
Thinking of this as an allocation problem for the application to manage is an interesting approach. But the program would need to be able to model its resource requirements from start to end, and know roughly how long each step will take. This sounds like a variant of the halting problem, but instead of predicting when a program will end, it’s predicting when it will need more resources.