NtCreateUserProcess supports copy-on-write fork semantics without overcommit. See https://github.com/huntandhackett/process-cloning#cloning-fo... And there's a wrapper, RtlCloneUserProcess, that is called using the same traditional fork pattern.
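For anyone unfamiliar with the "traditional fork pattern" mentioned above: the call returns twice, and the caller branches on the return value. Here's a minimal POSIX sketch in Python, not the NT API itself; `fork_and_wait` is a hypothetical helper name I made up for illustration.

```python
import os

def fork_and_wait(child_exit_code: int = 0) -> int:
    """Fork, let the child exit immediately, and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child branch: fork returned 0. Exit right away without running
        # the parent's cleanup handlers.
        os._exit(child_exit_code)
    # Parent branch: fork returned the child's pid. Reap the child.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Calling `fork_and_wait(3)` in the parent should give back the child's exit code, 3.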

Precise memory accounting and CoW fork aren't intrinsically antagonistic, and the general ability to clone CoW mappings or similar kernel structures is useful beyond fork, which is why NT had all the necessary facilities in the kernel (it's the userspace CRT state that can be tricky, especially in the presence of threads, which is true on Unix systems as well).

The example of forking a process with a giant VM space just to exec some other program is, IMO, a straw man. Processes with such huge RW mappings typically don't fork and exec like that. Nobody architecting an app like PostgreSQL was relying on the ability to easily fork processes for minor tasks or exec utilities from processes already forked for resource-intensive tasks. And when such a thing is desirable, it's easy enough to use the alternatives, like vfork, or architect a controller for spawning subprocesses, or just use threads. Heck, fork existed long before CoW. The expectation that you can and should be able to call fork without any forethought about resource management was a consequence of Linux's popularity.
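To make the "use the alternatives" point concrete, here's a rough sketch of launching a utility with posix_spawn instead of fork followed by exec. On common implementations this avoids duplicating the parent's address space, though that's implementation-dependent. `spawn_and_wait` is a hypothetical helper name, not something from the thread.

```python
import os
import sys

def spawn_and_wait(argv: list[str]) -> int:
    """Run argv[0] as a new process via posix_spawn and return its exit code."""
    # argv[0] must be a path to the executable; the environment is passed as a copy.
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

For example, `spawn_and_wait([sys.executable, "-c", "import sys; sys.exit(7)"])` runs a short-lived child and reports its exit status to the caller.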

Linux embraced overcommit because people wanted to run existing big iron applications like networked databases on tiny PCs with fractions of the memory those applications were written to expect to be able to use. Overcommit was a hack that let you play around with those applications without them immediately falling over, partly because back then such applications often preallocated memory for cache, etc., but would never use all of it when running in an environment like early Linux, which would never see the same high loads and utilization as big iron servers.
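The preallocate-but-never-touch pattern is easy to reproduce. This is a minimal sketch assuming a Unix-like system: it maps a large private anonymous region and writes to only the first page. With overcommit the kernel only has to back the pages that are actually touched, whereas strict accounting would charge the whole mapping up front. `reserve_sparse` is a hypothetical name.

```python
import mmap

def reserve_sparse(size: int) -> mmap.mmap:
    """Map `size` bytes of private anonymous memory and touch only one page."""
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    region[0] = 1  # the only write; every other page stays untouched
    return region
```

Untouched pages read back as zero, so something like a cache that is sized for a big server but barely used on a small box costs very little physical memory in practice.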

Linux could have pivoted in the other direction and pursued strict memory accounting with the ability to expressly overcommit in, e.g., some process subtrees, or to dynamically allocate swap (which in the expected scenario it normally wouldn't have to actually do). But like most userspace developers, the kernel developers found it easier to write code when they could pretend memory was infinite and, when the system hit the wall, just blow up and blame the user. That choice can be defensible for userspace, but it's simply not defensible for a kernel.