by rom1v 7 days ago
Related to the discussion: "A fork() in the road": https://www.microsoft.com/en-us/research/wp-content/uploads/...

> ABSTRACT

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.

> As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.

7 comments

> The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program. The original implementation worked by swapping out the forking program to disk on a fork() call. Then, at the moment the program was swapped out but control had not returned, the process table entry was duplicated and adjusted so that there were now two processes, one in memory and one swapped out. The one in memory then got control, and could do an exec() call.

This allowed large programs to run on small PDP-11 machines. It was needed back in the era of really expensive memory. That's why.
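For anyone who hasn't written the pattern by hand, here is roughly what fork followed by exec looks like from user space today. This is a minimal sketch in Python, whose `os` module wraps the same system calls. The helper name `run_child` and the exit code 127 for a failed exec are choices made for illustration, not anything from the paper.

```python
import os
import sys

def run_child(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the new program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed (e.g. ENOENT): leave at once, skipping parent cleanup.
            os._exit(127)
    # Parent: wait for the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)

if __name__ == "__main__":
    print(run_child([sys.executable, "-c", "raise SystemExit(3)"]))
```

Between `os.fork()` returning and `os.execvp()` running, the child is a full copy of the parent, which is the window the rest of the thread argues about.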

QNX had an interesting approach. Program loading isn't in the OS at all. There's "fork", but program loading is in a library. It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.

This comment starts with a no, but agrees with the parent...
I think fork() is more of a PDP-7 mistake than a PDP-11 mistake. On the original UNIX system, memory was so limited that the only sane partitioning was to write the running program's memory image to disk, then reuse the running image as the child. An immediate consequence is the UNIX I/O model, where disk I/O is always synchronous (can't swap processes while waiting for disk I/O because swapping processes requires disk I/O). Anyway, as soon as the UNIX group got a PDP-11, the model broke down, because they had enough memory for multiple processes, but fork() didn't allow them to run concurrently, because their first PDP-11 didn't have an MMU. So they whined until they got one with an MMU instead of fixing their broken design.
> It was needed back in the era of really expensive memory.

Well, it seems we are back in an era with really expensive memory.

That’s funny, but most new cpus today have more L3 cache than those computers had memory and disk space combined.
The QNX approach is also pretty much how the dynamic linker loads shared libraries today in Linux.

“An era of really expensive memory”. That sounds familiar…

I think GP was saying that in QNX the spawning process was responsible for dynamically linking its child process before running it. With Linux, I think it's the spawned process taking care of its own dynamic linking.
On QNX the process spawning is done by sending a message to the userspace process manager, which creates a new process table entry and queues up its initial thread. When its initial thread gets a timeslice its entry point may be the dynamic loader (as specified in the PT_INTERP segment) which then does all the dynamic linking as the spawned process or it might be some other entry point like with a statically-linked executable.

So on QNX, the spawned process does all the dynamic linking. The spawning process just sends an asynchronous message to the process manager and then gets on with things in a very deterministic manner as befitting a hard realtime system.

It is almost as if you agree with the authors ..

"In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability"

(But thanks for the good explanation)

> It links to a .so file which reads the executable header, allocates memory, loads the program, gets it ready to run, and starts it. The program loader runs in user space and is unprivileged. This is probably the right way to do it.

aiui this is what exec does, the problem outlined here is the split between process creation (expensive, kernel space, has to be done each time even if spawning the same process "template" repeatedly) and loading (cheap and in userspace).

> > The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design.

> No, it was done that way so that you could launch a program that was too big to fit in memory with the parent program.

Ironically vfork() is even better in this regard. I wish Unix had only ever had vfork().

Don’t pretty much all OSes implement process startup in userspace? On macOS, the kernel creates a process with an image of dyld and points it at dyld_start, which actually takes care of parsing the Mach-O header. I assumed ld.so does the same job on Linux.
Nope, the kernel can load static ELF binaries. ld.so is only needed for dynamically linked binaries, and in fact many Go applications (for example, as they're statically linked) ship as containers with nothing but the single binary.
You can do this on macOS too, if you're willing to break all forward/backward compatibility and make direct syscalls you can have a purely static binary. Without the LC_LOAD_DYLINKER command on the mach-o binary the kernel should just jump to the entrypoint based on LC_UNIXTHREAD. (This may no longer work on arm machines though if they actually trap on direct syscalls not through libSystem, similar to the BSDs)
Thanks. I completely forgot about static binaries.
Of course ld-linux itself is an ELF binary. The kernel loads it.
Not only is it an ELF binary, but it is ironically a static ELF binary.
Yes, it can all be done in userspace. When the "fork in the road" paper came up a while back someone linked to an example. https://grugq.github.io/docs/ul_exec.txt
But why is having a pair of separate independent operations, fork and exec, required to achieve this? A single fexec call could be implemented to work in the way you describe, no?
Fork isn't necessary for this, you could just exec directly?
Cygwin's fork() is similar to what you describe for QNX.
It's a fairly widespread idea for architectures that try to move things out of kernel mode. The Hurd does program image file loading in userspace, too, in its exec server(s).

The tricky part is setting up the initial process. The way out for that is static linking and re-use of the fact that the operating system kernel loader has to understand and be able to load (at least a small subset of) program image file formats too.

It is somewhat interesting that the most widely used "big" OS that doesn't use fork, i.e. Windows, has dog slow process creation...

I agree that there should be non-fork primitives, I'm just not that sure that performance is the best argument.

The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1, and you need overcommit.

Now these decisions aren't objectively bad, but they have significant trade-offs and it's probably not a good idea that they're forced simply because we use fork()+exec() for process creation.

CoW is probably a good idea whether you use fork or not. Or rather, fork is probably a better option than just exec exactly because it can benefit from CoW.

At least on systems with virtual addressing. If you want to go into physical addressing, then yes, maybe it's a problem. But Linux will never touch anything with physical addressing, so I don't see what people are complaining about.

CoW is probably a good idea regardless, yeah. Overcommit is more questionable. Regardless, both ought to be argued based on their own merits. It's unfortunate that both are necessary as a consequence of fork().
I don't think fork() mandates overcommit. OpenBSD doesn't seem to even allow overcommit or have an OOM killer, memory allocations that exceed available capacity fail immediately even if the memory is not touched.
Let's say you have 1GB RAM. You're running program that occupies 600 MB. Now this program wants to launch second small program that occupies 1 MB.

You're doing fork + exec.

If you're overcommiting, fork will not reserve another 600 MB, and exec immediately after fork will cause total system usage to be 601 MB.

If you're not overcommiting, that fork will fail, because total memory consumption will be 1200 MB which is more than 1GB. That somewhat restricts program design.
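The arithmetic above can be written as a toy model. It is only bookkeeping under stated assumptions: it ignores shared read-only pages, swap, and kernel memory, treats 1GB as 1024 MB, and the function name is made up.

```python
def fork_exec_commit_mb(parent_mb, child_mb, overcommit):
    """Return (peak, final) committed memory in MB across fork then exec.

    Toy model: with strict accounting, fork must reserve a full second copy
    of the parent's memory; with overcommit, nothing extra is reserved until
    pages are actually written.
    """
    at_fork = parent_mb if overcommit else 2 * parent_mb
    after_exec = parent_mb + child_mb
    return max(at_fork, after_exec), after_exec

RAM_MB = 1024  # assumption: "1GB" read as 1024 MB

peak_oc, final_oc = fork_exec_commit_mb(600, 1, overcommit=True)
peak_strict, final_strict = fork_exec_commit_mb(600, 1, overcommit=False)

fits_with_overcommit = peak_oc <= RAM_MB          # 601 MB peak
fits_without_overcommit = peak_strict <= RAM_MB   # 1200 MB peak at fork time
```

Both cases settle at 601 MB after exec. Only the transient reservation at fork time differs, and that is what makes the strict case fail.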

> The problem with fork isn't really that it's slow. The problem is that if you want it to be not-slow, it locks you into a bunch of OS design decisions: you more or less need a memory subsystem where all writable pages are refcounted and copy-on-write when the refcount is bigger than 1

It may not be slow, but for the common case where fork is almost immediately followed by exec in the process where fork returns zero, fork increases those refcounts and exec almost immediately decreases them again (and does typically unnecessary checks whether refcounts became zero). A combined fork/exec syscall can avoid that work.

On the other hand, a sufficiently powerful combined fork/exec call has to have a lot of parameters that it has to check (whether to inherit open pipes, open files, setting the working directory, etc), and that slows it down.

That can be avoided by having multiple variants of combined fork/exec calls, but you would need lots of them to cover all combinations of flags.

I expect either approach should be faster than having fork, then exec as separate calls, especially when the process calling fork has many resources allocated.
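POSIX already has one combined call of this kind, posix_spawn, where the "lot of parameters" show up as file actions and attributes. A minimal sketch using Python's `os.posix_spawn` wrapper follows. The helper name `spawn_capture` is made up, and whether the underlying libc uses fork, vfork, or something else internally is platform-dependent and not shown here.

```python
import os
import sys

def spawn_capture(argv):
    """Spawn argv with stdout redirected into a pipe; return (exit_code, output)."""
    read_fd, write_fd = os.pipe()
    # File actions run in the child, in order, before the new program starts.
    # The pipe fds from os.pipe() are close-on-exec, so only the dup2'd copy
    # on fd 1 survives into the new program.
    file_actions = [(os.POSIX_SPAWN_DUP2, write_fd, 1)]
    # posix_spawn takes a full path; it does not search PATH.
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:          # EOF: the child closed its stdout
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return code, b"".join(chunks)

if __name__ == "__main__":
    print(spawn_capture([sys.executable, "-c", "print('hi')"]))
```

Everything the child needs changed is declared up front instead of being done by arbitrary code running between fork and exec. That is the trade-off being weighed here: less flexibility, but no copy of the parent to set up and tear down.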

Another possible design is instead of forking the current process, you create a new empty process, then the parent calls syscalls to set up the new process, and eventually call exec on the child process. That does mean you either need new syscalls for that, or adapt existing syscalls to take a pidfd as an argument. That also solves some other problems with fork/exec where the default is to inherit a lot of things you probably don't want. With this, you can opt in to inheritance instead of having to opt out.

Or you could create a hybrid between a thread and a process, where it still uses the parent's memory space (unlike fork), but has its own stack (unlike vfork), and is in its own process (unlike a thread). I think this is technically possible on linux, but there isn't a readily available interface for it. Although it seems like posix_spawn could be implemented that way...

> you create a new empty process, then the parent calls syscalls to set up the new process ...

That does seem like a much better design to me. But I wonder if that was considered way back at the dawn of computing and rejected for good reason?

> I think this is technically possible on linux, but there isn't a readily available interface for it.

Yes there is, see `man clone`. POSIX and glibc are quite different from the kernel in this regard. AFAIK under linux there are just threads of execution that might or might not share various namespaces and memory mappings. That said, the kernel does place a few artificial restrictions on what combinations are allowed in order to (as I understand it) guard against the unintended exercise of entirely untested combinations that serve no known practical purpose.

The practical problem is that if you start doing as you please with the various namespaces and mappings you quickly become incompatible with glibc and by extension most likely the majority of the dynamic libraries available on your system.

https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c...

Though I want a posix_spawn-as-a-system-call approach as well / instead of that.

I remember reading about an OS where processes weren’t basic building blocks. Instead it had a syscall to create an address space and to create a thread in an address space.

Create a thread in your own address space, and your process becomes multi-threaded. Create an address space, load some code in it, and create a thread there, and you fork/exec-ed.

In my memory, that OS was MACH, but Google doesn’t confirm that for me.

Syscalls aren’t all that cheap either.
io_uring taught us that if syscalls are expensive, queue them up in a buffer with one syscall to transfer the thread to the os to process it. So, queue up the new process mutations in a buffer with a single syscall to process all of them in a batch. This model should have replaced repetitive syscalls across the kernel years ago.
This is true, but these methods don't increase the number of syscalls you need to make.
In addition to what you said: forking from a process running on multiple cores is slow once you have to mark all pages as read-only and shoot this out to all cores. TLB synchronization is super expensive. Unix originally didn't support threads (want concurrency? just fork!) but with modern multicore that's clearly unsustainable.
With large enough processes, like say a server JVM process that uses 10s of GBs of RAM, even just copying the page tables for CoW can be slow. And unless you have aggressive overcommit settings you can get an OOM on fork, even if you're just going to exec something small.

vfork helps a little, but it has a lot of restrictions on what you can do before the exec, and on unix that's basically the only place you can do things like close files, change signal masks, drop privileges or set up seccomp, etc.

vfork() helps a LOT. The restrictions on what you can do on the child-side of vfork() are pretty much the same ones as for fork() + you must not do anything to damage the stack frame of the vfork() caller (i.e., you can't return).
> the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions

That's a lot more restrictive. You can't use local variables, or call any functions other than _exit or execve. On linux specifically, I _think_ those restrictions are more relaxed and you can call async-signal-safe functions, however I'm not entirely clear on how relaxed that is, and as far as I understand that isn't portable.

But some of that is nonsense and incorrect. You can very much use local variables, and you'll find tons of vfork()-using code that does that and calls plenty of async-signal-safe functions.

The real restrictions are:

  - you can't damage the function call frame
    of the caller of vfork(), thus you can't
    return from it

  - you may only call async-signal-safe
    functions on the child side of vfork()

That's basically it. Yes, you'll want to call execve(2) or _exit(2) before long, but there is no time limit as to that, it's just that the whole point of calling vfork() is to make it real cheap to spawn a process, which means ultimately calling execve(2), with _exit(2) being what you do if execve(2) fails (e.g., because of ENOENT).

There is a ton of vfork()-using code that adheres to these real restrictions and has been working fine for decades. That includes several posix_spawn() implementations, the C shell, etc.

I demand evidence that this part: "the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork()" is remotely true. That evidence must be of the form of bug reports that were accepted and which stand to scrutiny.

I've never found any such evidence. Have you?

Meanwhile I have a proof by existence that vfork() is safe used much more liberally than you say it may be used.

> You can't use local variables, or call any functions other than _exit or execve.

There are other async-signal-safe functions, and they get used routinely by posix_spawn() and other code to do child-side setup before execve(2), including: I/O redirection, process group setup, signal handling changes, etc.

Didn't he just say that fork turns out to be comparatively faster to the non-fork samples we get? Ie Linux spawns processes faster than Microsoft's kernels?
Didn't I just say that "the problem with fork isn't really that it's slow"? It's all the other OS design choices it forces on you if you want it to be fast.
Right, you did. I somehow misread your comment.
We don't have any broadly used non-fork samples. Windows, macOS, and Linux all have fork. So the presence of fork can't be the reason for the performance difference.

(Windows's fork is called ZwCreateProcess)

MacOS has posix_spawn. See https://developer.apple.com/library/archive/documentation/Sy... (yes, that’s an iOS man page. MacOS has the call, too, but I couldn’t find the man page online and it looks identical to me)

I don’t know how they implemented it, though. Under the hood, it could do the equivalent of a fork/exec pair.

XNU is open source; here’s a link into the middle of the implementation, after it’s copied all the necessary attributes of the parent into the new process structure: https://github.com/apple-oss-distributions/xnu/blob/f6217f89...
XNU's posix_spawn implementation is not fork/exec-based. It does roughly what the API suggests it would do.
NtCreateProcess does not implement a forking model. It is analogous to posix_spawn.
If you pass null for the section handle, it shares pages with the calling process, thus implementing a forking model. Or at least the parts of a forking model that some people erroneously believe are responsible for performance differences.
The nice thing about fork+exec is that it's simple and flexible.

To avoid the problems, see roc's comment under the article. Esp use of a zygote process.

One os level thing that is interesting to me is if it would be possible/wise to make an OS based on (concurrent) garbage collection.
How else does consistency work, then?

Only being half facetious here. Maybe you or someone else really has a better take.

What do you mean by consistency here?
Solaris and Windows NT both have fork() and strict accounting by default.
> The problem with fork isn't really that it's slow.

Did someone suggest that it was?

anarazel's comment focuses entirely on performance, indicating that they have an impression that the discussion about why fork is bad is about performance. I'm not entirely sure where this impression came from, as it's not mentioned in rom1v's quote nor a point in the linked paper, "A fork() in the road".
Because that OS's best practice is to use threads.

Traditionally Windows applications that create processes all the time come from UNIX heritage.

Contrary to UNIX, Windows NT was designed with threads first mentality, from the get go.

While on UNIX they were added after fact, and to this day there are gotchas mixing posix threads with signals, fork and exec.

A more accurate way to describe this is that Windows' (NT onward) core execution context model is a bunch of threads that by default share memory, whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

Both systems are implemented using threads as the execution context, but in Unix, the history means that you fork+exec most of the time, resulting in two tasks that do not share memory any more. By contrast, on Windows (NT onward) the common case when creating a new execution context is to create a thread that shares memory with others in its process.

Both systems allow the easy use of the other's core abstraction. On Unix, you can either code like it's 1986 and use fork without exec, or use clone(2) or any of its higher level abstractions like pthreads.

You're right that POSIX semantics get tangled when using threads.

That's actually less accurate, not more. It's a post-hoc revision that conflates Unix with Linux.

The Unix model was invented over a decade before the idea of multithreading percolated into mainstream operating systems at all.

The reason that Windows NT started as it did, was that OS/2 had come out in 1987, with kernel threads, and the idea of multithreading had taken root. SunOS 5 gained threading, too.

Windows NT applications development began with threading available as a mechanism from the start, and with a lot of people in the IBM/Microsoft world already knowing about its use in applications development from OS/2.

Whereas with the Unices it came in more gradually, as the applications had often already been designed. The whole libthread versus libpthread thing made things interesting on SunOS for a few years, too. As did the first attempt (LinuxThreads) at providing threads on Linux.

PaulDavisThe1st is saying that the Unix pattern of forking a process (and not calling exec) was an early form of multi-threading (or multi-processing), but unlike threads in NT and later pthreads, they didn't share memory and communication between them required some form of IPC.
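A small sketch of that fork-without-exec style, assuming a POSIX system. The child gets its own copy of the parent's memory, so its writes are invisible to the parent and the result has to come back over explicit IPC, here a pipe. The function name is made up for illustration.

```python
import os

def fork_is_isolated():
    """Fork without exec; return (parent's value, value reported by child)."""
    counter = [0]
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: modifies its own copy-on-write copy of `counter`.
        os.close(read_fd)
        counter[0] += 41
        # The parent cannot see that write, so report it over the pipe.
        os.write(write_fd, str(counter[0]).encode())
        os._exit(0)
    # Parent: its `counter` is untouched by the child's increment.
    os.close(write_fd)
    reported = int(os.read(read_fd, 64).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return counter[0], reported

if __name__ == "__main__":
    print(fork_is_isolated())  # (0, 41)
```

With a thread instead of a fork, the same increment would land in the shared `counter` and no pipe would be needed.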
Yep, absolutely correct. It was true at the lowest level (the semantics of fork) and it was true at the app/platform design level: in Windows you used threads inside a process, on Unix you used multiple communicating processes.

This obviously changed as pthreads came into being, and at this point, I suspect that the typical use for threads-sharing-memory and threads-not-sharing-memory is the same on most platforms.

A reminder that the task_t data structure describes threads and processes not just in Linux, but earlier Unixen also.

Well, Windows before NT isn't the same design as Windows 16 bit, it only shares the name for all practical purposes, and has more influence from OS/2 than Windows 16 bit.

Which is why I took the effort to explicitly refer to Windows NT on my comment, already expecting some traditional answers from UNIX folks.

Also due to historical reasons POSIX threads are the outcome of every UNIX going their own way implementing threads, finally coming to an agreement years later, with all the pluses and minuses of relying on POSIX for portable code.

> whereas Unixen have a core task context model of a bunch of threads that by default do not share memory.

How are those not simply child processes? I don't understand your use of the word 'threads' here.

Does the Unix world not distinguish between threads and processes? In Win32, threads exist within processes, and you can create new threads or child processes.

They are child processes.

Second answer: Linux doesn't differentiate between threads and processes. It has a "thread group ID" that serves a small number of purposes, and the rest of the difference is just whether the threads happen to share the same address space.
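You can see the two kinds of ID from user space. A minimal sketch: on Linux, `os.getpid()` reports the thread group ID, which every thread in the process shares, while `threading.get_native_id()` (Python 3.8+) reports the per-task ID the kernel schedules. The helper name is made up, and other platforms assign native thread IDs differently.

```python
import os
import threading

def ids_in_worker_thread():
    """Return (pid, native_thread_id) as seen from a freshly started thread."""
    result = {}

    def worker():
        # Same address space as the caller, so writing to `result` just works.
        result["ids"] = (os.getpid(), threading.get_native_id())

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["ids"]

if __name__ == "__main__":
    print("caller:", (os.getpid(), threading.get_native_id()))
    print("worker:", ids_in_worker_thread())
```

The first element matches between caller and worker (shared thread group), and the second differs (separate kernel tasks).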

Actually on Windows a process is a thread with additional information.

The unit of execution is the thread.

On the UNIX world it depends on which UNIX you are talking about.

Linux has a similar model to Windows NT nowadays, hence clone() as key primitive.

Other UNIXes have different approaches.

I worked on the kernel of DEC Ultrix, Mach/BSD and a couple of other early Unixen. The approach in all the ones I worked on was broadly the same.
POSIX threads having problems with signals is, imho, mostly the problem with signals in general. They are pretty poorly designed: https://lwn.net/Articles/414618/
The problem is that threads are not fault boundaries but processes are. So they're not interchangeable when you care about resilience and misbehaving code.
True, but on Windows the approach is then to use COM servers, which have a faster IPC model, and can even serve multiple clients, depending on how the apartment space is configured.
"Faster IPC model" than what? Faster than writing to and reading from a pipe? Faster than POSIX shared memory?
Than UNIX fork/exec model, or calling into Create Process all the time.

Windows has a more rich set of IPC stuff than POSIX, especially since it has a microkernel like design.

If you are going to say it is everything on the same memory space anyway, it isn't.

Optional on Windows 10, and enforced on Windows 11, Hyper-V is always running, and several components including kernel and driver modules are sandboxed into their little worlds.

Several additional sandboxing changes were announced at BUILD.

That's like comparing apples and oranges. When tooling is tied to a platform, you're adding in the entire platform to the comparison.

Mozilla implemented an alternative to COM, called XPCOM. XP here means cross platform. Perhaps you could compare against that to take the platform out of the equation.

A one-shot process is easier to build and reason about than an event-driven server (speaking as someone who has written plenty of both).
If you want the isolation features of a separate process, you can’t substitute it with a single multithreaded COM server process.

.NET tried this with app domains, which are now deprecated.

App Domains were in process, which isn't what I am talking about with outproc COM.

Also App Domains are partially back in .NET Core, isolation features aren't there, but code unloading is, via AssemblyLoadContext.

the only difference between a thread and a process on linux is how many structures they share. the function is identical.
Agreed, however not all UNIXes are like Linux.
Windows was designed with threads-first mentality because on pre-386 machines you don't have viable process memory protection, so your tasks share memory by necessity. This is not a great argument.
Windows NT was never designed with pre-386 machines in mind. That was the territory of the old DOS+Windows. Windows NT from the get-go was for machines with page-based virtual memory.

* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...

WinNT 3.5 was a solid offering.
This is not true. NT never had fork, was always based on the assumption of an MMU and Dave Cutler was a well known fork hater in the 80s long before this paper came out and made it cool to be so. By the time Windows 95 was out, the baseline was 386 with an MMU. CreateThread was initially designed for NT in 1993 though (which didn’t support pre-386 CPUs).
As mentioned elsewhere on this page, Windows NT had fork from the start. Vide NtCreateProcess and what happens if an image file is not explicitly supplied.

* https://computernewb.com/~lily/files/Documents/NTDesignWorkb...

NtCreateProcess was not a public Windows API. NT was flexible, that’s not what was being discussed, which should have been clear from the context.
NtCreateProcess doesn’t accept an image file parameter.
NT performed unnatural acts to implement fork semantics for the POSIX subsystem.
NT was designed to be platform-agnostic, and its original target was the DEC Alpha. Its process model owes nothing to pre-386 CPUs. The WinAPI CreateProcess function is a layer atop NtCreateProcess, so that is where the pre-386 heritage lives. But even the WinAPI process model changed significantly with 32-bit Windows.
No.

https://en.wikipedia.org/wiki/Windows_NT#Development

Windows NT was developed on various different CPUs before the Alpha was a thing. When it was released in 1993, it was released for three CPUs: IA-32, MIPS, and Alpha.

Sorry, I had conflated Windows NT development with development of 64-bit Windows as told by Raymond Chen: https://learn.microsoft.com/en-us/previous-versions/technet-...

Raymond also says elsewhere that most WinNT engineers did development on i386, but doesn’t explicitly say what time period he is describing: https://devblogs.microsoft.com/oldnewthing/20250513-00/?p=11...

Windows NT!

Misread on purpose to make a point?

I suspect it's a long tail sort of thing; it mostly doesn't matter except when it really matters. It's interesting that the stated motivation for the patch is in the context of agentic tools spawning subcommands. There's some related prior art in this area where the payoffs could be much greater, like fuzzing: https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf is an example. It would be very interesting to see this patch applied to e.g. AFL++
That's not the reason for the performance difference. Windows does have a fork primitive (ZwCreateProcess) and it's still slower than Linux's equivalent.
Again, NtCreateProcess does not implement fork(). The fundamental characteristic of fork is that the child is an exact replica of the parent, down to the instruction pointer. Windows does not have a way to create a process object with such a configuration.

Also, using the Zw prefix doesn’t make you look more knowledgeable, it makes you look like you’re trying way too hard to borrow credibility.

Okay but people don't claim that copying the instruction pointer (a single machine register) is the reason for any speed difference. They claim it's due to the memory sharing. And that's easily disproven since you can share pages, just like on Linux, simply by passing null for the section handle, yet there's still a performance difference.

Why does it matter which prefix I used? They both point to the same routine so my point applies either way.

It's a completely uncontroversial fact that NT does implement fork(). Turn to page 183 of Helen Custer's "Inside Windows NT" and you will read about it.
This paper is great and I also really like one of its references [29] as it goes into some more subtle parts of scalable interfaces, including fork. It's a gem IMO: The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors https://people.csail.mit.edu/nickolai/papers/clements-sc.pdf
Discussion at the time:

https://news.ycombinator.com/item?id=19621799 - A fork() in the road (2019-04-10, 178 comments)

Discussed also in 2021: https://news.ycombinator.com/item?id=29709802 (16 comments)
Fork is marvelous for the zygote pattern

Hard to come up with an optimization that is equally efficient and elegant

The zygote pattern[1] is a great optimization to deal with the cost of forking, but IMHO, being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.

I would guess it would be a small difference in measurable performance between zygote and a direct clean spawn, but it's one less trick an application needs to do, and it would be very helpful for libraries that spawn things. Spawning inside a library isn't always a great thing to do, but some things would really benefit from process level isolation.

[1] In case one isn't aware, the zygote pattern involves forking a 'zygote' process during application startup, and having that process do any forks that need to happen during application runtime. This reduces the cost of forking in large applications, because the zygote will have few fds open and use little memory. This lets your large application spawn new processes without delaying the application or the startup of the new processes. Some applications will spawn many zygotes to allow parallelism for spawning at runtime.

You're referring to something else, and maybe I'm using the term "zygote" incorrectly.

In all uses of zygotes that I have seen, here's what's really happening:

- `fork` is being used to reduce the cost of starting a process that has a high start-up cost. So, you start one process, run it through the expensive initialization, and then fork it from there to start new processes.

- To make this even faster, you have a pool of pre-forked processes sit around.

- Having pre-forked processes sitting around ready to be used is not expensive because of the CoW property and the fact that a process that forks and then immediately pauses will not have triggered any significant CoW yet.

So, the zygote optimization you speak of is in practice only meaningful on top of systems that are using an optimization uniquely enabled by `fork` (avoiding process initialization costs by cloning a process), and that zygote optimization is further optimized by another property of `fork` (memory sharing of forked processes that haven't done anything else yet).
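
A rough Python sketch of the first bullet above (pay the initialization once, then fork from the initialized state). `expensive_init` and `run_workers` are hypothetical names, and the table of squares is only a stand-in for real startup work. For brevity the sketch forks on demand instead of keeping a pool of pre-forked, paused children.

```python
import os

def expensive_init():
    # Stand-in for slow startup work (hypothetical): build a large table once.
    return {n: n * n for n in range(200_000)}

def run_workers(inputs):
    table = expensive_init()      # paid once, before any fork
    results = []
    for n in inputs:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: `table` already exists here; its pages are shared
            # copy-on-write with the parent until one side writes to them.
            os.close(r)
            try:
                os.write(w, str(table[n]).encode())
            finally:
                os._exit(0)
        os.close(w)
        with os.fdopen(r) as f:
            results.append(int(f.read()))
        os.waitpid(pid, 0)
    return results
```

One caveat specific to this choice of language: CPython's reference counting writes to object headers, so reading `table` in a child dirties some pages and the sharing is less complete than it would be for the same pattern in C.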

Oh I see. I guess your zygotes have developed more than mine. I think Google may have coined or at least popularized the term zygote for this in Chrome and Android, Chrome documentation [1] says:

> A zygote process is one that listens for spawn requests from a main process and forks itself in response. Generally they are used because forking a process after some expensive setup has been performed can save time and share extra memory pages.

I think reading the first sentence and stopping covers my zygote, but adding the second sentence covers yours. So I think we're both right!

I think both paths are useful. If your children need time to start up and become ready, spawn one that does the startup work, and then it (pre)forks at the ready state to have processes ready to handle requests (your zygote). This does require a traditional fork() to avoid duplication of work.

But if forking is expensive at runtime because you have a million FDs open and a whole lot of memory allocations, spawn spawners before you start doing work (my zygote). This could be unnecessary with an inexpensive way to spawn a new process from a process that has lots of resources in use.

Of course, you can also use my zygotes to spawn your zygotes. Zygoteception.

[1] https://chromium.googlesource.com/chromium/src/+/HEAD/docs/l...

I quite like the idea. I’m using OpenBSD on an oldish laptop, and fork-exec is expensive enough that it conflicts with the USB subsystem. Isochronous transfers have a 1ms realtime requirement, and it seems that the fork-exec system calls hold the giant lock long enough to mess with it (audio stutters).

I’ve not bothered to profile it, but it seems that processes with a lot of mapped pages are the issue (firefox, emacs, …). In the emacs case, the issue shows up when the main process tries to fork-exec; if I start a shell session (with shell-mode or term-mode), it works fine.

> Oh I see. I guess your zygotes have developed more than mine. I think Google may have coined or at least popularized the term zygote for this in Chrome and Android, Chrome documentation [1] says:

Google may have popularized the term, but this approach was already in use by KDE developers in the KDE 2.x timeframe, where it was used as part of a system called kdeinit.

In this scheme, launching KDE apps from a KDE desktop could bypass much of the startup cost of dynamic linking by forking from a long-running kdeinit process (with kdeinit itself deliberately linked to all large dependency libs like Qt and kdelibs), dynamically loading the application logic (stored as a .so) and then launching the app.

This was more to save startup time, given how long it took to dynamically resolve a multitude of C++-based symbols back then; all of that common work came before the app's own main() would ever be called. But it did save a bit of memory as well.

> being able to inexpensively spawn a carefully tailored process regardless of the size and scope of the current process would be better.

It's called clone(2)

Adding on to the sibling: what argument to clone allows me to set the fds of the child? AFAIK, you either share the FD table with the parent, or get a copy of it. If the parent has 1 million FDs open and the child doesn't want most of those, dealing with that has real costs. Many applications that tend to have large numbers of FDs and also fork/exec will mitigate the cost by spawning a process during startup that they can then use to spawn processes during runtime without doing it from the main process; this is a nice mitigation, but it shows a missing interface.
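
For comparison, the closest existing interface where the caller declares the child's fds up front is posix_spawn with file actions. A minimal sketch using Python's `os.posix_spawn`; the helper name `spawn_with_stdout_pipe` is made up for illustration. It relies on Python creating fds as non-inheritable (close-on-exec) by default, so the child ends up with only the standard fds plus what the file actions set up. It does not answer the question above: as far as I know the parent's fd table is still duplicated underneath and then pruned at exec, so the cost for a parent with a million fds remains.

```python
import os
import sys

def spawn_with_stdout_pipe(argv):
    """Run argv (argv[0] must be a path) with its stdout connected to a pipe."""
    r, w = os.pipe()                              # non-inheritable by default
    actions = [(os.POSIX_SPAWN_DUP2, w, 1)]       # child's fd 1 := write end
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=actions)
    os.close(w)                                   # parent keeps only the read end
    with os.fdopen(r) as f:
        out = f.read()                            # EOF once the child exits
    _, status = os.waitpid(pid, 0)
    return out, os.waitstatus_to_exitcode(status)
```

Called as `spawn_with_stdout_pipe([sys.executable, "-c", "print('hi')"])`, it returns the child's output and exit code without the caller ever writing a fork() itself.
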
Which argument to clone starts the process with an empty address space?
That happens with execve(). clone() allows you to not copy the page table prior to the execve() call.
Which argument to clone does that?
The paper explicitly covers this: various memory COW/snapshot mechanisms are probably faster and safer than the zygote pattern. As it stands, getting the zygote pattern correct and safe is something you have to plan for upfront. You can’t retrofit it, which is why the paper mentions it has poor composability. Also, the advantages of the zygote pattern can be overstated: the memory sharing benefit is minimal because the fork has to happen so early, and modern OSes already transparently CoW duplicate pages in the background.
In what sense can you not retrofit the zygote pattern?
I recommend at least skimming the paper as it covers this. But essentially you can’t just inject a call at a random point in code to start being a zygote. It’s something you have to plan up front: the exact point you’re going to fork, and that you’re going to do it at the start of the program, before any threads have started, before any files are open, and before any locks have been acquired. It’s basically all the challenges of invoking fork at arbitrary points in time.

The reason to do a zygote in the first place could be solved with alternative special APIs that are safer and harder to misuse. But we have fork so there’s not as big of a demand despite the warts.
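
To make the "before any locks have been acquired" point concrete, here is a small Python sketch of the hazard; `child_sees_lock_held` is a made-up name. A second thread holds a lock at the moment of fork(). Only the forking thread exists in the child, so the child inherits a locked lock with nobody left to release it. The child uses a non-blocking acquire so the demo reports the problem instead of hanging.

```python
import os
import threading
import warnings

def child_sees_lock_held():
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def worker():
        with lock:
            holding.set()
            release.wait()        # keep the lock until the parent is done

    t = threading.Thread(target=worker)
    t.start()
    holding.wait()                # the worker thread now owns the lock

    r, w = os.pipe()
    with warnings.catch_warnings():
        # Python 3.12+ emits a DeprecationWarning about exactly this hazard.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()           # the worker thread is not copied
    if pid == 0:
        os.close(r)
        try:
            got = lock.acquire(blocking=False)   # a blocking acquire would hang
            os.write(w, b"0" if got else b"1")
        finally:
            os._exit(0)
    os.close(w)
    with os.fdopen(r, "rb") as f:
        held_in_child = f.read() == b"1"
    os.waitpid(pid, 0)
    release.set()
    t.join()
    return held_in_child
```

The same thing happens with locks inside libraries you don't control (allocators, loggers), which is why the fork point has to be chosen before any of that machinery is running.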

Sure, but you can always retrofit a program to fork early on... this is a relatively trivial change. No?
And so easy to make into a bottleneck.

Yes, the zygote pattern makes it easy to turn fork() into a bottleneck; avoiding that requires a lot more discipline and low-level tricks (linker scripts, compiler-specific extensions, custom sections, low-level dependencies on page size that get "fun" on ARM servers).

If you don't, you might wake up with fork() causing latency issues.

Unless you want to create a thread in your zygote. Then it breaks down.

Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.

You can create threads in the zygote. It doesn't "break down", but sure, there's a bit more work.

My trick for that is that the set of threads that I create pre fork have to be suspendable and resumable, preferably lazily (they resume when they are actually needed). So, the zygotes are sitting with those threads suspended. When they become active, they can do work immediately. They might lazily resume those threads as needed.

There are other idioms for this too.

> Raw fork() is terrible. Instead we need a proper primitive to stop and make a snapshot of a process.

Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives.

> My trick for that is that the set of threads that I create pre fork have to be suspendable and resumable

Well, yes. You need to wait for all the threads to park themselves at safepoints. This can work if you control the whole runtime, and you don't use something that creates threads behind your back.

This is actually why I've always been interested in a better fork(): it has a lot of parallels with the stop-the-world pauses needed for GCs.

> Folks have been saying that it's terrible for as long as I can remember. But it's still there, because it's better than the alternatives

I don't think we have alternatives? Except maybe ptrace()?

Ah, my one time on the HN front-page: Fork() is evil; vfork() is goodness; afork() would be better; clone() is stupid (https://news.ycombinator.com/item?id=30502392).
Not sure if fork is outdated or not, but people calling it a “hack” obviously have pretty bad engineering taste.