by mattpallissard 1242 days ago
I went to implement memory/CPU limits in the HPC batch scheduler we used. It turned out to have a fatal flaw: it counted cached memory as used, so servers would slowly go idle as the cache filled up.
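To make the accounting problem concrete, here is a minimal Python sketch. It assumes the node's memory figures come from `/proc/meminfo`-style fields; the post does not say how the scheduler actually measured memory, and the sample numbers below are invented for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_kb(info, count_cache_as_used):
    """Memory in use, with or without buffers and cache counted as used."""
    used = info["MemTotal"] - info["MemFree"]
    if not count_cache_as_used:
        # Buffers and cache can be reclaimed, so they should not block new jobs.
        used -= info.get("Buffers", 0) + info.get("Cached", 0)
    return used


# Invented sample: a 64 GB node whose memory is mostly file cache.
SAMPLE = """\
MemTotal:       65536000 kB
MemFree:         2048000 kB
Buffers:          512000 kB
Cached:         52000000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("cache counted as used:", used_kb(info, True), "kB")
    print("cache excluded:       ", used_kb(info, False), "kB")
```

With the cache counted as used, this node looks almost full even though most of that memory is reclaimable, which is how nodes end up looking too busy to accept jobs.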

I piggybacked some code on the job validation interface to attach the user's requests to the job's environment variables. Then I wrote a daemon to run on the compute nodes that walked the job's process group/tree, grabbed the environment variables[1], and managed cgroups.
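A rough Python sketch of what such a daemon's core could look like, not the original code. Everything specific here is an assumption: the variable names `JOB_MEM_LIMIT_BYTES` and `JOB_CPU_CORES`, the `/sys/fs/cgroup/batch` parent group, and the use of cgroup v2 files (`memory.max`, `cpu.max`, `cgroup.procs`). The `parse_environ` helper handles the NUL-separated format of `/proc/<pid>/environ` as a stand-in; per the footnote, the real values came from a read-only copy kept by the batch system.

```python
import os

# Hypothetical variable names; the post does not say what the real ones were.
MEM_VAR = "JOB_MEM_LIMIT_BYTES"
CPU_VAR = "JOB_CPU_CORES"
CGROUP_ROOT = "/sys/fs/cgroup/batch"  # assumed cgroup v2 parent group


def parse_environ(raw):
    """Decode a NUL-separated KEY=VALUE block (the /proc/<pid>/environ format)."""
    env = {}
    for item in raw.split(b"\0"):
        key, sep, value = item.partition(b"=")
        if sep and key:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def parse_ppid(stat_line):
    """Return the parent PID from a /proc/<pid>/stat line."""
    # The command name is in parentheses and may contain spaces or ')',
    # so split after the last ')': the fields that follow are state, ppid, ...
    return int(stat_line.rsplit(")", 1)[1].split()[1])


def snapshot_parents(proc="/proc"):
    """Build a {pid: ppid} map for every process currently visible."""
    parents = {}
    for name in os.listdir(proc):
        if name.isdigit():
            try:
                with open(os.path.join(proc, name, "stat"), errors="replace") as f:
                    parents[int(name)] = parse_ppid(f.read())
            except OSError:
                pass  # process exited while we were scanning
    return parents


def process_tree(root_pid, parents):
    """Return root_pid plus all of its descendants, sorted."""
    children = {}
    for pid, ppid in parents.items():
        children.setdefault(ppid, []).append(pid)
    seen, stack = set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid not in seen:
            seen.add(pid)
            stack.extend(children.get(pid, []))
    return sorted(seen)


def limits_from_env(env):
    """Translate the job's request into cgroup v2 file contents."""
    settings = {}
    if MEM_VAR in env:
        settings["memory.max"] = str(int(env[MEM_VAR]))
    if CPU_VAR in env:
        period = 100000  # microseconds
        settings["cpu.max"] = f"{int(float(env[CPU_VAR]) * period)} {period}"
    return settings


def apply_limits(job_id, pids, settings):
    """Create the job's cgroup, write its limits, and move the PIDs into it."""
    group = os.path.join(CGROUP_ROOT, str(job_id))
    os.makedirs(group, exist_ok=True)
    for filename, value in settings.items():
        with open(os.path.join(group, filename), "w") as f:
            f.write(value)
    for pid in pids:
        try:
            with open(os.path.join(group, "cgroup.procs"), "w") as f:
                f.write(str(pid))  # one PID per write
        except OSError:
            pass  # process already gone
```

A polling loop would call `snapshot_parents()`, compute `process_tree()` for each job's top-level PID, and pass the result to `apply_limits()`. Running that last step needs root and a cgroup v2 hierarchy with the memory and cpu controllers enabled for the parent group.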

Super quick and dirty, but it worked well enough that we kept it in place for years, even though the bug had long since been addressed.

[1]: It was safe to use environment variables because they were stored read-only on disk by the batch system itself.

1 comments

What scheduler? Super interesting hack.
Univa Grid Engine, formerly Sun Grid Engine.