by lazyguy 2514 days ago
Yep and it's about as good as just picking a random process and killing it.

It's awesome when you run out of memory and you try to log in only to have it kill sshd.

3 comments

A classic from [1]:

> An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

[1] https://lwn.net/Articles/104185/

Fortunately, engineers have invented a way to attach a Strolling Wheelbarrow After Plane, where you can stash the sleeping passengers without ejecting them out of the plane entirely. This has the unpleasant side effect of slowing down the journey for everyone when passengers wake up out of order (and God forbid everyone wakes up at the same time), though.
What I still do not understand is why people continue to turn a blind eye to this instead of switching to SmartOS. I just don't get it.
How does Solaris/SmartOS handle that situation?
It doesn't get in that situation, because malloc() can return null on Solaris (i.e. it never¹ overcommits).

While in general I think this is vastly better than the somewhat insane Linux OOM killer, you can get in awkward situations where you can't start any processes (including a root shell) because you're out of memory.

I rather like the FreeBSD solution to this, which is to not overcommit, but after a certain number of allocation failures it kills the process using the most memory. This prevents situations where you can't start any processes.

There's no one-size-fits-all solution to handling low memory conditions, but the Linux solution manages to almost never do what you want, which is kind of impressive in a way.

¹ I seem to recall hearing somewhere that you can allow allocations to overcommit on a per-application basis on later versions of Solaris, but don't quote me on this.

> FreeBSD solution to this, which is to not overcommit

Where did this myth come from? Did y'all just assume that the vm.overcommit sysctl actually makes sense and zero means "no overcommit"? :)

https://news.ycombinator.com/item?id=20623919

But indeed, FreeBSD's OOM killer kills the largest process, which makes more sense in most scenarios than Linux's "badness" scoring.

Huh, I had no idea it worked like that. That's bizarre.
Running sshd as an on-demand (Type=socket) service would probably work better, since then the sshd process would be new and thus have a better heuristic score. It also wouldn't be tying up memory sitting unused in the meantime.

systemd still seems to run it (Type=notify) with the -D option all the time though, at least on the systems I can check.

Dropbear is configured by default as a Type=socket service though.

This is sort of just kicking the problem down the road. Your idea actually might work for (presumably) low-volume use of ssh, but what about the next important service? When does the work-around to a papered-over work-around to a virtual problem that is supposed to just be RAM-backed or handled at

  ptr = malloc(42);
  if(!ptr) exit_error();
end?
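
For anyone who wants the call-site handling spelled out, here is a minimal sketch. The helper name `make_buffer` is hypothetical (it is not from the thread); it reports allocation failure to the caller instead of assuming success. Note that on a system that overcommits, the `malloc` check may pass anyway and the trouble only shows up once the pages are actually touched, which is the complaint above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate a zeroed buffer and report failure to the
 * caller. Returns 0 on success, -1 if malloc() returned NULL. */
int make_buffer(size_t size, unsigned char **out)
{
    assert(out != NULL);

    /* malloc(0) may legitimately return NULL, so ask for at least one byte. */
    size_t n = size ? size : 1;

    unsigned char *ptr = malloc(n);
    if (ptr == NULL) {
        *out = NULL;
        return -1;          /* the caller decides how to degrade */
    }

    /* Writing to the buffer is what forces the kernel to back it with
     * real memory on an overcommitting system. */
    memset(ptr, 0, n);
    *out = ptr;
    return 0;
}
```

A caller would check the return value and free the buffer when done, e.g. `if (make_buffer(42, &p) != 0) { /* handle it */ }`.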
Well, there probably needs to be a way to override the heuristic at least, sort of a 'this process is important, don't auto-kill it if trying to find memory'.
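
For what it's worth, Linux does expose a per-process knob along these lines: `/proc/<pid>/oom_score_adj`, which takes values from -1000 (never pick this process) to 1000. Below is a rough sketch of setting it from C. The function name `set_oom_score_adj` and the idea of passing the path in as a parameter are my own choices for illustration, and lowering the value generally needs privileges (CAP_SYS_RESOURCE), so treat this as a sketch rather than a recipe.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write an OOM score adjustment to `path`.
 * On Linux the real target would be "/proc/self/oom_score_adj".
 * Returns 0 on success, -1 on an out-of-range value or I/O error. */
int set_oom_score_adj(const char *path, int value)
{
    assert(path != NULL);

    /* The kernel accepts -1000..1000; reject anything else up front. */
    if (value < -1000 || value > 1000) {
        return -1;
    }

    FILE *f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }

    int ok = fprintf(f, "%d\n", value) > 0;

    /* stdio buffers the write, so a permission error from procfs may
     * only surface when the stream is flushed by fclose(). */
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Something like `set_oom_score_adj("/proc/self/oom_score_adj", -1000)` early in a daemon's startup would be the intended use, assuming the process is allowed to lower its own score.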

As for ssh specifically, I rarely ssh into my desktop machine, but I keep sshd running for just this kind of situation where I might need to try and rescue a swamped machine. So in most cases low-volume sshd use is exactly what is called for.

If you're running into the memory purge of doom on a server that's probably a whole different nightmare scenario.

malloc returning NULL has been a broken assumption for a long time though, and that isn't going to change afaik.
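
Whether `malloc` can realistically return NULL on a given Linux box depends on the `vm.overcommit_memory` sysctl: 0 is the default heuristic, 1 always overcommits, and 2 is strict accounting, where allocations can actually fail. A small sketch for checking which mode a machine is in follows; the function name `read_overcommit_mode` is made up for this example, and it simply returns -1 where the file is missing (e.g. on non-Linux systems).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read the overcommit mode from `path`.
 * On Linux the real path is "/proc/sys/vm/overcommit_memory".
 * Returns the mode (0, 1 or 2 on current kernels), or -1 if the file
 * cannot be opened or parsed. */
int read_overcommit_mode(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1) {
        mode = -1;      /* unexpected contents */
    }
    fclose(f);
    return mode;
}
```

Only in mode 2 does the classic `if (!ptr)` check have much chance of firing for ordinary allocation sizes.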

I blame the specific algorithm for that, not the basic concept. Nothing with less than 10MB of memory use should ever get killed unless you're in some kind of fork bomb.