Another hidden issue is that as a container gets close to running out of memory, it furiously drops read-only pages from memory, only to need to read some of them back in moments later.
This pathological swapping behavior can impact other workloads on the system.
cgroups2 has better protections against this behavior.
How can you use cgroups2 for this? I know that BSD resource limits are basically useless for this, as they only let you limit virtual memory, not RSS.
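For what it's worth, here is a minimal sketch of one way to watch for this kind of thrashing under cgroup v2, assuming a kernel with PSI enabled. As far as I know, each cgroup exposes a `memory.pressure` file with `some` and `full` lines, and `memory.high` can be set as a throttling limit below the hard `memory.max`. The function names, the 10% threshold, and the `/sys/fs/cgroup/mygroup` path are hypothetical, picked only for illustration.

```python
from pathlib import Path


def parse_pressure(text):
    """Parse PSI text like 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        # Each field is key=value; averages are percentages, total is microseconds.
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def is_thrashing(pressure, threshold=10.0):
    """True if all tasks were stalled on memory for >= threshold % of the last 10s.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return pressure.get("full", {}).get("avg10", 0.0) >= threshold


def read_cgroup_pressure(cgroup_dir):
    """Read memory.pressure from a cgroup v2 directory (path is an assumption)."""
    return parse_pressure((Path(cgroup_dir) / "memory.pressure").read_text())


if __name__ == "__main__":
    # Hypothetical cgroup path; adjust to wherever your container's cgroup lives.
    print(is_thrashing(read_cgroup_pressure("/sys/fs/cgroup/mygroup")))
```

The parsing is kept separate from the file read so it can be checked against sample text without a real cgroup.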
EarlyOOM [1] is a configurable daemon that kills processes early enough to (hopefully) prevent thrashing. I'm using it on my Linux desktops (it has proven to catch my own programs' runaway memory usage before it risks locking up the development machine), but it may also be useful on servers. It logs to syslog but also can be configured to run a program on kill events.
(Why would a user space OOM killer be necessary if the kernel has better information about the state of the world? I don't know the details, but my interpretation is that because people disliked OOM killing, the kernel devs made the kernel OOM killer trigger so late that it is largely useless. If that's true and thus a social problem, maybe it needs to be solved on that level, too.)
BTW in my experience, Linux 2.2 used to handle out of memory situations much more gracefully than any later kernel version.
I think the issue is that your nodes have swap. Why would you have swap on container nodes? IMO, the idea with container management is to get predictability with resources. If you have 8 GB on a node, you know that the containers get 8 GB. You might not be able to tell exactly how, depending on how it's configured, but you know that once they collectively use 8 GB, that's it. Swap is going to mess things up really badly in ways you can't even predict.
Even without swap enabled or any explicit memory mapping, read-only pages from executables (code, read-only data) are mapped into the process’ address space and may be evicted. Unless you explicitly lock those into RAM, they still behave somewhat like swapped memory does, except the pages don’t need to be written back.
Would it have been sufficient to alert on high memory usage? It might be reasonable to set an alert at, say, 70% RSS, as long as the pod does not pass this threshold and die before a metric can be sampled.
That "no such file or directory" error looks like it comes from building a dynamic executable on Debian and trying to run it on Alpine.
As for the first question: that wouldn't be enough. AFAIK mmap-ed pages are part of RSS, and it's quite usual for them to use up everything up to the memory limit (databases kind of rely on this 'feature'). None of that would provoke an OOMKill.
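To illustrate why a raw usage threshold is misleading here, below is a rough sketch that subtracts easily reclaimable file pages before comparing against the limit. I believe this is similar in spirit to the "working set" figure that Kubernetes tooling reports, but treat that as an assumption. The function names are made up; the `inactive_file` key is the one found in cgroup v2's `memory.stat`.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines as found in a cgroup memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def working_set_fraction(current_bytes, stat_text, limit_bytes):
    """Fraction of the limit in use after discounting inactive file pages.

    Inactive file-backed pages can be dropped cheaply, so counting them
    makes a page-cache-heavy workload look closer to OOM than it is.
    """
    inactive_file = parse_memory_stat(stat_text).get("inactive_file", 0)
    working_set = max(current_bytes - inactive_file, 0)
    return working_set / limit_bytes
```

A container sitting at 90% of its limit in raw usage could come out at 50% here, which is the case an alert at 70% raw usage would flag for no good reason.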
For the second comment: I used the images the author has published on Docker Hub. Maybe there would have been a way to make it work, but if you take a look at the amount of code in missing-container-metrics, you will realise that I spent less time writing that than I would have spent debugging someone else's Docker build and Go code that is not really maintained.
I mean, that's fine if you're OK with 30% wasted memory... We just recently had to tune some JVM and monitoring settings because we set the initial and max heap size to around 90% of memory. There's very little else going on.
If a sub-process gets OOMKilled and the container doesn't die, then it's most likely that the parent process didn't handle that scenario. In which case the health-check wouldn't cover that issue.
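As a sketch of what "handling that scenario" could look like in a parent process: on POSIX, Python's `subprocess` reports a child killed by a signal as a negative return code, so a SIGKILL shows up as `-9`. Note that SIGKILL alone does not prove the OOM killer was responsible. The function names and the exit-code policy below are hypothetical.

```python
import signal
import subprocess
import sys


def killed_by_sigkill(returncode):
    """True if a subprocess return code indicates death by SIGKILL (POSIX)."""
    return returncode == -signal.SIGKILL


def supervise(cmd):
    """Run cmd once and return the exit status the parent should use.

    If the child was SIGKILLed (possibly by the OOM killer), the parent
    fails loudly instead of limping on without its worker.
    """
    returncode = subprocess.run(cmd).returncode
    if killed_by_sigkill(returncode):
        print("child was SIGKILLed, possibly OOM; exiting", file=sys.stderr)
        return 1
    return returncode


if __name__ == "__main__":
    sys.exit(supervise(sys.argv[1:]))
```

Exiting the parent lets the container die, so the scheduler sees the failure instead of a half-working container that still passes its health check.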
It's not clear to me what happens after the OOM. Does the init process restart the daemon? I would argue that it shouldn't.
If the process stops responding to a healthcheck, it's the scheduler's responsibility (k8s in this case) to handle it. Crashes in this case should be handled in a similar way, whether it's due to OOM or a bug.
The hate on K8s is misplaced in this case. You should redirect it to containers. Other container orchestration mechanisms will encounter similar, if not the same, issues.
Running out of memory vs wasting memory is also an issue on serverless although somewhat less extreme. Part of the issue is no one knows how much memory their code might use, and we don’t have frameworks that adapt to available memory much. Resource constrained computing is hard.
k8s, unfortunately, is the only way to maintain sanity if you want to maintain a multi-cloud environment. I don't relish the idea of duplicating functionality by maintaining code across AWS Lambdas and Azure Functions.
While I acknowledge you probably have solved for your use-case... I can't help but hardcore LOL at your somewhat terse perspective! Dude... K8S just got mature!