Hacker News new | ask | show | jobs
by thayne 5 days ago
With large enough processes, like say a server JVM process that uses 10s of GBs of RAM, even just copying the page tables for CoW can be slow. And unless you have aggressive overcommit settings you can get an OOM on fork, even if you're just going to exec something small.

vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.

1 comments

vfork() helps a LOT. The restrictions on what you can do on the child-side of vfork() are pretty much the same ones as for fork() + you must not do anything to damage the stack frame of the vfork() caller (i.e., you can't return).
> the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions

That's a lot more restrictive. You can't use local variables, or call any functions other than _exit or execve. On linux specifically, I _think_ those restrictions are more relaxed and you can call async-signal-safe functions, however I'm not entirely clear on how relaxed that is, and as far as I understand that isn't portable.

But some of that is nonsense and incorrect. You can very much use local variables, and you'll find tons of vfork()-using code that does that and calls plenty of async-signal-safe functions.

The real restrictions are:

  - you can't damage the function call frame
    of the caller of vfork(), thus you can't
    return from it

  - you may only call async-signal-safe
    functions on the child side of vfork()
That's basically it. Yes, you'll want to call execve(2) or _exit(2) before long, but there is no time limit as to that, it's just that the whole point of calling vfork() is to make it real cheap to spawn a process, which means ultimately calling execve(2), with _exit(2) being what you do if it execve(2) fails (e.g., because ENOENT).

There is a ton of vfork()-using code that adheres to these real restrictions and has been working fine for decades. That includes several posix_spawn() implementations, the C shell, etc.

I demand evidence that this part: "the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork()" is remotely true. That evidence must be of the form of bug reports that were accepted and which stand to scrutiny.

I've never found any such evidence. Have you?

Meanwhile I have a proof by existence that vfork() is safe used much more liberally than you say it may be used.

> You can't use local variables, or call any functions other than _exit or execve.

There are other async-signal-safe functions, and they get used routinely by posix_spawn() and other code to do child-side setup before execve(2), including: I/O redirection, process group setup, signal handling changes, etc.