by thwarted 3757 days ago
Poettering says that PID 1 has special requirements. One of these is killing "zombie" processes that have been abandoned by their calling session. This is a real problem for Docker since the application runs as PID 1 and does not handle the zombie processes. For example, containers running the Oracle database can end up with thousands of zombie processes.

Why does Poettering keep claiming this when he's the one who submitted the patch that adds the PR_SET_CHILD_SUBREAPER prctl(2) [0] functionality?

[0] http://man7.org/linux/man-pages/man2/prctl.2.html

2 comments

That doesn't have anything to do with Poettering's quote.

PR_SET_CHILD_SUBREAPER moves the ownership of an orphaned process to whichever process was selected rather than the default PID1, and that only works for descendants of the subreaper.

The problem pointed out by the quote is that normal software doesn't go around checking if it has zombie children and waiting on them, so in a container with random software S set as PID1 and creating subprocesses, zombies may accumulate until resources are exhausted[0].
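
To make the accumulation concrete, here is a minimal Python sketch (Linux only, since it reads /proc) of a parent that forks a child and never waits on it. The helper names `make_zombie`, `proc_state` and `wait_for_state` are made up for illustration; nothing here comes from Docker or the article.

```python
import os
import time


def proc_state(pid: int) -> str:
    """Return the one-letter state of a process from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie() -> int:
    """Fork a child that exits at once; the parent deliberately does not wait."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    return pid  # parent: returns without calling waitpid()


def wait_for_state(pid: int, wanted: str, timeout: float = 5.0) -> bool:
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc_state(pid) == wanted:
            return True
        time.sleep(0.01)
    return False
```

After `pid = make_zombie()`, `wait_for_state(pid, "Z")` should report the child stuck in state Z, and it stays that way until something calls `os.waitpid(pid, 0)`. An application running as PID1 that never makes that call keeps one such entry per exited child or adopted orphan.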

PR_SET_CHILD_SUBREAPER is a way to cause that problem on a system with a proper init (or to test that your init works properly without needing to boot into it)

It's not a new observation: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zomb...

Previous HN discussion: https://news.ycombinator.com/item?id=8916785

[0] by default the limit is 32k processes after which the kernel will simply refuse to create new ones

Yes it does. He's claiming that systemd should manage the container processes as pid1, because systemd will then clean up the zombies. But anything that reaps zombies can be pid1 -- systemd isn't special in this regard. And even if you did use something that didn't reap zombies as pid1, you could leverage PR_SET_CHILD_SUBREAPER as some other non-pid1 process to grab zombies for descendants it spawns.

If you do use PR_SET_CHILD_SUBREAPER, then you need to reap whatever gets reparented to you; if you don't do this then the process table will eventually fill up with zombies. He is correct that few programs do that, but there's nothing that requires that to be done by pid1 if all the processes within the container are spawned by something that provides that functionality and uses PR_SET_CHILD_SUBREAPER.
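
A rough Python sketch of what that looks like, assuming Linux 3.4 or later and calling prctl(2) through ctypes, since Python has no built-in wrapper for it. The value 36 is PR_SET_CHILD_SUBREAPER from <linux/prctl.h>; `become_subreaper` and `orphan_grandchild` are illustrative names, not part of any library.

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper() -> None:
    """Mark the calling process as a subreaper for its descendants."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def orphan_grandchild() -> int:
    """Double-fork so the grandchild is orphaned; return the grandchild's PID."""
    r, w = os.pipe()
    middle = os.fork()
    if middle == 0:
        try:
            os.close(r)
            grandchild = os.fork()
            if grandchild == 0:
                os.close(w)
                time.sleep(0.2)  # outlive the middle process
            else:
                os.write(w, str(grandchild).encode())
        finally:
            os._exit(0)
    os.close(w)
    with os.fdopen(r) as f:
        grandchild_pid = int(f.read())
    os.waitpid(middle, 0)  # reap the middle process
    return grandchild_pid
```

After `become_subreaper()`, a call like `os.waitpid(orphan_grandchild(), 0)` succeeds because the orphan was reparented to the caller. Without the prctl call the orphan goes to PID1 instead, and the same `waitpid` raises `ChildProcessError`. Either way, whoever adopts the orphan still has to do the waiting.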

> Yes it does.

No, it still doesn't, sorry.

> He's claiming that systemd should manage the container processes as pid1, because systemd will then clean up the zombies.

The part that's quoted only notes that PID1 is responsible for reaping orphaned zombies, that Random P. Application Process most likely doesn't do that, and that it causes problems.

> But anything that reaps zombies can be pid1 -- systemd isn't special in this regard.

The part you've quoted doesn't try to claim otherwise.

> And even if you did use something that didn't reap zombies as pid1, you could leverage PR_SET_CHILD_SUBREAPER as some other non-pid1 process to grab zombies for descendants it spawns.

That's a completely inane claim. The whole point of the article is the issue of people starting their application process as PID1. What are you suggesting, that applications should be modified to spawn an init which would use PR_SET_CHILD_SUBREAPER and to which they would delegate spawning subprocesses? That's utter lunacy. Have some decency and regard for basic sanity and the context in which the quote appears.

> If you do use PR_SET_CHILD_SUBREAPER, then you need to reap whatever gets reparented to you; if you don't do this then the process table will eventually fill up with zombies. He is correct that few programs do that, but there's nothing that requires that to be done by pid1 if all the processes within the container are spawned by something that provides that functionality and uses PR_SET_CHILD_SUBREAPER.

Are you just making up that hare-brained bullshit on the spot so that you don't have to admit your original comment was wrong?

What's the point of spawning a broken PID1 just so you can spawn a process using PR_SET_CHILD_SUBREAPER and doing the actual reaping correctly? Just spawn that as PID1 in the first place FFS.

This is true, but what if the thing that spins up the actual container process sets this?

What do you mean "the thing that spins up the actual container"? The root process for the container? It's already PID1. The external process creating the container? It's sitting outside the container and "below" PID1, what could that do that'd be of any use?

I guess he's saying that you can't just take any random binary and run it in a Docker container, because if that binary spawns a lot of children but does not wait for them, then you'll have a lot of zombies.

Docker could run a minimal pid1 in each container to address this. Though if this had been a big issue I guess this would have been already fixed.

Naturally, a proof of concept of the problem would be great. (Let's say a Dockerfile.)

It has been a reasonably big issue. E.g. I kept seeing zombies with Consul for a while until we realised that every single Consul Docker container on Dockerhub just had Consul run as pid 1 in the container (this is a while ago, no idea if that's still the case), without realising that Consul health checks then could end up as zombies if you weren't very careful about how you wrote them (e.g. typical example: Spawning curl from a shell script, with a timeout on the health check that was shorter than any timeouts on the curl requests).

It's usually fairly simple to fix (e.g. for Consul above, I raised it with the Consul guys and they said they'd look at adding waiting on children to it as a precaution - it's just a couple of lines -, but people building containers could also introduce a minimal init, or you can write your health checks to guard against it), but it happens all over the place, and people are often unaware and so not on the lookout for it and it may not be immediately obvious.

The reason I raised it as an issue for Consul, for example, even though it wasn't really their fault, but an issue with the containers, is that people need to be aware of the problem when packaging the containers, need to be aware that a given application may spawn children, and that they may not wait for them. Even a lot of people aware of the zombie issue end up packaging software that they didn't realise was spawning child processes that could end up as zombies (in this case, it took running it in a container without a proper pid 1, using health checks which not everyone will do, and writing the health checks in a particular way in order to notice the effects).

Thankfully there are a number of tiny little inits. E.g. there's suckless sinit [1], Tini [2], and here's a tiny little proof of concept Go init [3] I wrote (though frankly, suckless or Tini compiled with musl will give you a much smaller binary), as what little you actually need to do is trivial.

[1] http://git.suckless.org/sinit

[2] https://github.com/krallin/tini

[3] https://gist.github.com/vidarh/91a110792c86d6c3bb41
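
For a sense of how little such an init has to do, here is a rough Python sketch of the core loop. It is not the code of sinit, Tini or the linked Go gist; `run_as_init` is a made-up name, and signal forwarding, which a real init also needs, is left out to keep the reaping logic visible.

```python
import os


def run_as_init(argv: list) -> int:
    """Run argv as a child and reap everything until that child exits.

    Returns the child's exit code, or 128 + signal number if it was killed.
    """
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replace the child with the real program
        finally:
            os._exit(127)  # only reached if exec failed

    while True:
        # Blocks until any child exits. As PID1 this also returns orphans
        # that the kernel has reparented to us, which is the reaping step.
        pid, status = os.waitpid(-1, 0)
        if pid == child:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)
```

The idea would be to make something like `run_as_init(["my-app", "--some-flag"])` (a hypothetical command) the container entrypoint, so the application is no longer PID1. In practice one of the small inits above is the better choice.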

Seeing how even the trivial pid1 "scripts" solve the problem, it's truly baffling why Docker doesn't have a --with-reaper flag.

Also thanks for the Consul example, makes it much-much easier to see the issue and argue for a general solution. (So not every random app/project/service/daemon has to implement pid1 functionality.)

> Seeing how even the trivial pid1 "scripts" solve the problem, it's truly baffling why Docker doesn't have a --with-reaper flag.

That doesn't fix the issue, since you need to know about the issue and accept that it exists; at that point you can just as easily use one of the micro-inits available.

The alternative is to enable it by default, but now you've broken BC for the weirdo who actually expects orphan processes to be adopted by the root process they're starting.

Yes, the problem is that we would need to change the default behavior of Docker, which many people and scripts expect to be stable. It's a case of interface stability vs. functionality improvement. So far interface stability has won. I personally think it would be better to change the default, but anything that breaks an interface, even a subtle implicit one, has the burden of arguing a solution, thinking through migration issues, submitting patches... So far I have seen a lot of drive-by criticisms and dismissal of the need to even discuss the tradeoff (see for example this lovely fellow: https://lwn.net/Articles/677419/). But I have not seen anyone stepping up to do the work.

We all pick our battles - including me of course!

I'm not worried about that. When operations people have problems they are rather quick to search and try solutions. But baking it into the Dockerfile is much more portable and automatic (from the operations point of view).

Also see https://github.com/Yelp/dumb-init, which is a 20K statically built executable, perfect for resource-constrained containers that have to deal with reaping of arbitrary children.

Just to clarify: even with a proper init, if a process spawns children and doesn't wait on them, you still have zombies until the parent either dies (allowing init to inherit the zombie, at which point it waits on it), or the parent waits. This is the reason behind the double-fork trick.

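
For anyone who hasn't seen the double-fork trick, a minimal Python sketch follows. `spawn_detached` is an illustrative name, and the sketch assumes whatever ends up adopting the orphan (init or a subreaper) actually reaps it.

```python
import os


def spawn_detached(work) -> None:
    """Run work() in a grandchild so the caller never has to wait on it."""
    middle = os.fork()
    if middle == 0:
        try:
            # Intermediate process: fork the real worker, then exit at once.
            if os.fork() == 0:
                work()  # grandchild: do the actual job
        finally:
            os._exit(0)  # both the intermediate process and the worker end here
    # The caller reaps only the short-lived intermediate process. The worker
    # is now an orphan, so its exit status goes to init (or a subreaper).
    os.waitpid(middle, 0)
```

The caller pays for one quick `waitpid` and is never the parent of the long-running worker, so it cannot leave that worker behind as a zombie. It only moves the reaping duty to PID1, which is why a PID1 that doesn't wait is still a problem.
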
See also the article linked earlier in the comments: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zomb...

Supervisord is the officially blessed solution:

https://docs.docker.com/engine/admin/using_supervisord/

supervisord specifically documents that it's not an init and shouldn't be used as an init, that's the second paragraph of its home page: http://supervisord.org/index.html?highlight=init

That is literally not at all what that is suggesting.