by mattpallissard
1242 days ago
Went to implement memory/CPU limits in the HPC batch scheduler we used. It turned out it had a fatal flaw: it counted cached memory as used, so servers would slowly become idle as the page cache filled up. I piggybacked some code on the job validation interface to attach the users' resource requests to the job's environment variables. Then I wrote a daemon to run on the compute nodes that walked each job's process group/tree, grabbed the environment variables[1], and managed cgroups. Super quick and dirty, but it worked well enough that we kept it in place for years, even though the bug had long since been fixed.

[1]: It was safe to use environment variables because the batch system itself stored them read-only on disk.
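A minimal Python sketch of what such a node-side daemon might do, assuming a Linux node with cgroup v2 mounted at `/sys/fs/cgroup` (the original may well have used cgroup v1). The variable names `JOB_MEM_LIMIT_BYTES` and `JOB_CPU_COUNT`, the `/sys/fs/cgroup/jobs` parent group, and every function name here are hypothetical. For brevity it reads `/proc/<pid>/environ` directly, whereas the daemon described above trusted the read-only copy the batch system kept on disk rather than anything the job's processes could modify.

```python
from pathlib import Path

# Hypothetical names the validation hook might inject; not from the original.
MEM_VAR = "JOB_MEM_LIMIT_BYTES"
CPU_VAR = "JOB_CPU_COUNT"


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE format of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_ppid(stat: str) -> int:
    """Extract the parent PID from the contents of /proc/<pid>/stat."""
    # comm (field 2) may contain spaces or parens, so split after the last ')'.
    fields = stat[stat.rindex(")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] is the ppid


def snapshot_parents(proc: Path = Path("/proc")) -> dict[int, int]:
    """Map each live PID to its parent PID."""
    parents = {}
    for entry in proc.iterdir():
        if entry.name.isdigit():
            try:
                parents[int(entry.name)] = read_ppid((entry / "stat").read_text())
            except (OSError, ValueError):
                continue  # process exited mid-scan or stat was unreadable
    return parents


def descendants(parent_of: dict[int, int], root: int) -> set[int]:
    """Return root plus every PID whose ancestor chain reaches root."""
    children: dict[int, list[int]] = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = set(), [root]
    while stack:
        pid = stack.pop()
        if pid in found:
            continue
        found.add(pid)
        stack.extend(children.get(pid, []))
    return found


def apply_limits(cgroup_dir: Path, env: dict[str, str], pids: set[int]) -> None:
    """Write cgroup v2 limits from the job's env and move its PIDs into the group."""
    cgroup_dir.mkdir(parents=True, exist_ok=True)
    if MEM_VAR in env:
        (cgroup_dir / "memory.max").write_text(env[MEM_VAR])  # bytes
    if CPU_VAR in env:
        period = 100_000  # microseconds
        quota = int(env[CPU_VAR]) * period  # N CPUs' worth of time per period
        (cgroup_dir / "cpu.max").write_text(f"{quota} {period}")
    for pid in sorted(pids):
        # cgroupfs accepts one PID per write to cgroup.procs.
        (cgroup_dir / "cgroup.procs").write_text(str(pid))


def enforce(job_id: str, root_pid: int,
            cgroup_root: Path = Path("/sys/fs/cgroup/jobs")) -> None:
    """Read the job's requests, walk its process tree, and confine it."""
    env = parse_environ(Path(f"/proc/{root_pid}/environ").read_bytes())
    pids = descendants(snapshot_parents(), root_pid)
    apply_limits(cgroup_root / job_id, env, pids)
```

On a real node this would run as root, e.g. `enforce("job42", 12345)`, with the `memory` and `cpu` controllers enabled in the parent group's `cgroup.subtree_control`. Processes forked after the move inherit the cgroup automatically, though a periodic rescan would catch any spawned mid-scan.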