by wickberg 644 days ago
At the node level, these systems are usually aiming for around 90-95% allocated. Note that, compared to most "cloud" applications, getting there usually involves a number of tricks at the system scheduling level.

At some point, in order to allocate a 1000-node job all at once, all 1000 nodes need to be briefly unoccupied just before it starts, and that can introduce some unavoidable gaps in system usage. Tuning the "backfill" scheduling part of the workload manager can help reduce those gaps, and a healthy mix of smaller single-node short-duration work alongside bigger multi-day multi-thousand-node jobs helps keep the machine busy.
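
To make the gap concrete, here is a toy sketch of the idea in Python. It is not how any real workload manager implements backfill. The names (`Job`, `drain_gap`, `backfill`) and the greedy placement rule are made up for illustration, and it assumes every small job needs exactly one node and a known wall time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    hours: int  # requested wall time, in hours


def drain_gap(free_times):
    """Return (start, idle) for a job that needs every node at once.

    free_times[i] is the hour at which node i finishes its current work.
    The big job can only start once the last node is free, so every
    other node sits idle until then.
    """
    start = max(free_times)
    idle = sum(start - t for t in free_times)
    return start, idle


def backfill(free_times, queue):
    """Greedily fit single-node jobs into the gap without delaying the big job."""
    start, _ = drain_gap(free_times)
    avail = list(free_times)  # hour at which each node next becomes idle
    placed = []
    for job in queue:
        if job.nodes != 1:
            continue  # toy model: only single-node jobs are backfilled
        # Try the node that becomes idle earliest (largest remaining gap).
        i = min(range(len(avail)), key=lambda k: avail[k])
        if avail[i] + job.hours <= start:
            placed.append((job.name, i, avail[i]))
            avail[i] += job.hours
    idle = sum(start - t for t in avail)
    return placed, idle


free_times = [0, 2, 5, 8]  # four nodes draining at different hours
queue = [Job("a", 1, 3), Job("b", 1, 4), Job("c", 1, 6), Job("d", 1, 2)]

start, idle_before = drain_gap(free_times)
placed, idle_after = backfill(free_times, queue)
print(start, idle_before, idle_after)
print([name for name, _, _ in placed])
```

With these made-up numbers the big job starts at hour 8, and draining alone wastes 17 node-hours. Backfilling jobs a, b and d brings that down to 8, while job c is skipped because it would run past the reserved start. That is why a steady supply of short single-node jobs matters: only work that fits inside the gap can be used to fill it.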