|
|
|
|
|
by Tuna-Fish
181 days ago
|
|
The fundamental problem is that your machine is running software from a thousand different projects or libraries just to provide the basic system, and most of them do not handle allocation failure gracefully. If program A allocates too much memory and overcommit is off, that doesn't necessarily mean that A gets an allocation failure. It might also mean that code in library B in background process C gets the failure, and fails in a way that puts the system in a state that's not easily recoverable, and is possibly very different every time it happens. For cleanly surfacing errors, overcommit=2 is a bad choice. For most servers, it's much better to leave overcommit on, but make the OOM killer always target your primary service/container, using oom-score-adj, and/or memory.oom.group to take out the whole cgroup. This way, you get to cleanly combine your OOM condition handling with the general failure case and can restart everything from a known foundation, instead of trying to soldier on while possibly lacking some piece of support infrastructure that is necessary but usually invisible. |
|