I can't wait for swap support in k8s (at the pod level).
I've got a bunch of burst-y workloads that are not easy to predict, and when they're running at their peak, they're doing important stuff that I'd rather not have terminated. Over-provisioning is one way to handle it, but then I risk OOM-ing the entire node. Throwing more memory at it is another solution, but then we're paying a ton of money to let memory sit around unused.
This article is a little confusing, so I just want to clarify something for the audience. It makes it sound like OOM killing is asynchronous, but it is not. The OOM killer kicks in as soon as you try to realize more memory than your cgroup's limit. The kernel will first attempt to reclaim memory and if that fails it will kill something. There isn't some grace period during which your cgroup can skate along over its limit.
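If you want to see that reclaim-then-kill sequence on your own pods, here is a rough sketch (assuming cgroup v2, where each cgroup exposes a memory.events file of "key value" lines). The function names are mine, and the cgroup directory you pass in depends on your node and container runtime. As I understand the kernel docs, "max" counts how often usage was about to go over the limit (which is when reclaim is attempted) and "oom_kill" counts processes that actually got killed.

```python
from pathlib import Path


def parse_memory_events(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a simple "name count" pair.
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def read_memory_events(cgroup_dir: str) -> dict:
    """Read memory.events from a cgroup directory (path is an assumption)."""
    return parse_memory_events((Path(cgroup_dir) / "memory.events").read_text())


if __name__ == "__main__":
    # Sample contents, not taken from a real system.
    sample = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"
    events = parse_memory_events(sample)
    print("limit hits:", events["max"], "kills:", events["oom_kill"])
```

A "max" count that keeps climbing while "oom_kill" stays flat would suggest reclaim is still succeeding; once "oom_kill" moves, something has already been killed.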
Another area to consider is kernel memory accounting in the cgroup. Kernel memory for sockets and the like can get counted against the cgroup / Kubernetes pod. So you shouldn't give 100% of the memory to the application if it needs to communicate or is busy on the network.
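To get a feel for how much of a pod's charge is kernel-side rather than the application itself, something like this sketch could help. It assumes cgroup v2's memory.stat format ("key bytes" per line). Which keys to treat as kernel-side is my own choice for illustration, and the set of available keys varies by kernel version.

```python
# Keys counted as kernel-side memory. This selection is an assumption made
# for illustration, not an official list. "slab" already includes both
# slab_reclaimable and slab_unreclaimable, so those are not added again.
KERNEL_SIDE_KEYS = ("sock", "slab", "kernel_stack", "pagetables", "percpu")


def parse_memory_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        value = value.strip()
        if key and value.isdigit():
            stats[key] = int(value)
    return stats


def kernel_side_bytes(text: str) -> int:
    """Sum the bytes charged under the keys in KERNEL_SIDE_KEYS."""
    stats = parse_memory_stat(text)
    return sum(stats.get(key, 0) for key in KERNEL_SIDE_KEYS)


if __name__ == "__main__":
    # Sample contents, not taken from a real system.
    sample = "anon 1000\nfile 500\nsock 200\nslab 300\nkernel_stack 50\n"
    print(kernel_side_bytes(sample))
```

Whatever that number looks like at peak network load is roughly the headroom you'd want to leave between the application's own memory budget and the pod limit.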
It's also possible to boot with kmem accounting disabled, and I recommend it. Yes, it makes the accounting approximate, but kmem accounting is fundamentally unfair. Random cgroups get victimized by owning random slabs, and kernel reclaim is a mess of bugs.
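For anyone wanting to check whether a node was actually booted that way: the switch I mean is the cgroup.memory=nokmem kernel parameter, and a small sketch like this can look for it in the boot command line. The function name is made up, and note that whether the option still has any effect depends on your kernel version.

```python
from pathlib import Path


def kmem_accounting_disabled(cmdline: str) -> bool:
    """Return True if cgroup.memory= on the command line includes nokmem."""
    for arg in cmdline.split():
        if arg.startswith("cgroup.memory="):
            # The parameter takes a comma-separated list, e.g. nokmem,nosocket.
            options = arg.split("=", 1)[1].split(",")
            if "nokmem" in options:
                return True
    return False


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    print(kmem_accounting_disabled(Path("/proc/cmdline").read_text()))
```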
True, it is harder when you need to maximize resource utilization. The k8s scheduler did what we requested, but Seastar and memory allocation in Redpanda showed us (via OOMs) that the pod sandbox has some overhead.