by ec109685 2038 days ago
Another hidden issue is that as a container gets close to running out of memory, it furiously drops read-only pages from memory, only to need to read some of them back into memory moments later.

This pathological swapping behavior can impact other workloads on the system.

cgroups2 has better protections against this behavior.

3 comments

How can you use cgroups2 for this? I know that BSD resource limits are basically useless for this, as they only let you limit virtual memory, not RSS.

EarlyOOM [1] is a configurable daemon that kills processes early enough to (hopefully) prevent thrashing. I'm using it on my Linux desktops (it has proven to catch my own programs' runaway memory usage before it risks locking up the development machine), but it may also be useful on servers. It logs to syslog but also can be configured to run a program on kill events.

[1] https://github.com/rfjakob/earlyoom, https://launchpad.net/ubuntu/+source/earlyoom, https://packages.debian.org/search?keywords=earlyoom
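To make the idea concrete, here is a minimal Python sketch of the kind of check such a daemon could perform: parse `/proc/meminfo` and compare `MemAvailable` against a threshold. This is not EarlyOOM's actual code; the function names and the 10% threshold are assumptions chosen for illustration, and the sketch ignores swap entirely.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_percent(meminfo):
    """Percentage of total RAM that the kernel reports as available."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def should_act(meminfo, min_available_pct=10.0):
    """True once available memory drops to or below the threshold.

    The 10% default is a hypothetical value, not a recommendation.
    """
    return available_percent(meminfo) <= min_available_pct


if __name__ == "__main__":
    # On Linux, read the live values; a real daemon would poll this in a loop
    # and then pick a process to signal, which is deliberately left out here.
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"available: {available_percent(info):.1f}%, act: {should_act(info)}")
```

The point of acting on `MemAvailable` rather than waiting for allocation failures is to step in while the machine is still responsive.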

(Why would a user space OOM killer be necessary if the kernel has better information about the state of the world? I don't know the details, but my interpretation is that because people disliked OOM killing, the kernel devs made the kernel OOM killer trigger so late that it is largely useless. If that's true and thus a social problem, maybe it needs to be solved on that level, too.)

BTW in my experience, Linux 2.2 used to handle out of memory situations much more gracefully than any later kernel version.

memory.min will ensure it doesn't try to reclaim memory once it's a lost cause: https://lwn.net/Articles/752423/
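For anyone wondering what setting this looks like in practice, here is a rough Python sketch. It assumes a cgroup v2 hierarchy in which each cgroup is a directory exposing the `memory.min` and `memory.max` interface files; the helper name `set_memory_protection` and the example path in the usage note are made up for illustration.

```python
from pathlib import Path


def set_memory_protection(cgroup_dir, min_bytes, max_bytes=None):
    """Write memory.min (and optionally memory.max) for one cgroup v2 directory.

    cgroup_dir is assumed to be an existing cgroup directory with the memory
    controller enabled; writing to it normally requires root.
    """
    cgroup = Path(cgroup_dir)
    # memory.min: amount of memory the kernel should not reclaim from this cgroup.
    (cgroup / "memory.min").write_text(f"{min_bytes}\n")
    if max_bytes is not None:
        # memory.max: hard upper limit on the cgroup's memory usage.
        (cgroup / "memory.max").write_text(f"{max_bytes}\n")
```

Usage would be something like `set_memory_protection("/sys/fs/cgroup/myservice", 256 * 1024 * 1024)`, where `myservice` is a hypothetical cgroup name. Whether the protection actually takes effect also depends on the settings of the parent cgroups.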
Is there an issue on cgroups2 adoption for Kubernetes somewhere?

Thank you!

I think the issue is that your nodes have swap. Why would you have swap on container nodes? IMO, the idea with container management is to get predictable resources. If you have 8 GB on a node, you know that the containers get 8 GB. You might not be able to tell exactly how it is divided, depending on the configuration, but you know that once they collectively use 8 GB, that's it. Swap is going to mess things up really badly, in ways you can't even predict.

Even without swap enabled or any explicit memory mapping, read-only pages from executables (code, read-only data) are mapped into the process’ address space and may be evicted. Unless you explicitly lock them into RAM, they still behave somewhat like swapped memory, except that the pages don’t need to be written back.
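As a sketch of what "explicitly lock those into RAM" can mean, a process can call `mlockall(2)` on itself. The Python below does this through `ctypes`. The flag values are an assumption taken from the common Linux headers (x86 and ARM) and differ on some architectures, and the helper name `lock_process_memory` is made up for this example. The call usually fails without `CAP_IPC_LOCK` or a sufficiently high `RLIMIT_MEMLOCK`.

```python
import ctypes
import ctypes.util
import os

# Assumption: flag values as defined on Linux x86/ARM; other platforms may differ.
MCL_CURRENT = 1  # lock all pages currently mapped
MCL_FUTURE = 2   # also lock pages mapped later


def lock_process_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Try to pin this process's pages in RAM; return True on success."""
    libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
    if libc.mlockall(flags) != 0:
        err = ctypes.get_errno()
        print(f"mlockall failed: {os.strerror(err)}")
        return False
    return True
```

Locked pages count against the container's memory like any others, so this trades eviction for a larger resident footprint.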