by tsimionescu 849 days ago
You're mixing up many entirely different topics in this rant, so it's hard to unpack.

That we use the term "file descriptors" for pointers from userspace to any kernel object, even those that are not files, is unfortunate, but ultimately just a naming quirk. Windows has a better name, "Handle", but the concept is exactly the same.

The OS includes the file system, and file systems include a notion of paths, and relative paths are really useful. So, the OS helps you by automatically resolving relative paths to your current directory, instead of forcing every application to manually keep track of this.
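
To make the relative-path point concrete, here is a minimal sketch in Python (my choice of language; the file name `note.txt` and the helper `demo_relative_path` are made up for illustration). The program never builds an absolute path itself; the kernel resolves the relative name against the process's current directory.

```python
import os
import tempfile

def demo_relative_path():
    """Create a file by relative name and check where it actually landed."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        try:
            os.chdir(workdir)                    # change the per-process current directory
            with open("note.txt", "w") as f:     # relative path, resolved by the OS
                f.write("hello")
            # The file exists inside workdir even though we never named workdir in open().
            return os.path.exists(os.path.join(workdir, "note.txt"))
        finally:
            os.chdir(original_cwd)               # restore before the temp dir is removed
```

Without a kernel-tracked current directory, every program would have to carry `workdir` around and join it onto each path by hand.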

Linux is perhaps the only popular OS whose interface is not defined in C. All syscalls are clearly documented at the assembler level in Linux, and kept backwards compatible. All other popular OSes (Windows, macOS, FreeBSD) have a C lib you have to dynamically link if you expect compatibility.

Even if signals weren't a thing, you'd still have to worry about processor interrupts. There is no such thing as a purely single-threaded program on any gpCPU released in the last 30+ years.

The variety of calls in Linux to handle various kinds of events is unfortunate. Windows has a slightly cleaner interface, though even there it's not ideal. Hopefully io_uring will subsume all of the current use cases.

The numbers after the syscalls are related to the man pages where they are documented. Not all that relevant.

Syscalls are not functions; they are specific APIs that the kernel provides to userspace, defined at the assembler level (you put this value in this register/stack and jump to this address/invoke this CPU interrupt). It is up to your language to wrap syscalls into functions, which may have an entirely different calling convention. A kernel can't provide APIs as language-specific functions, as Python's calling convention is vastly different from Haskell's.
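
As a rough illustration of the wrapping, here is a sketch assuming a Unix-like system where `ctypes.CDLL(None)` exposes the C library already loaded into the process. The helper name `getpid_via_libc` is made up. Python's own `os.getpid()` and the C library's `getpid()` are two differently shaped wrappers that end up asking the kernel the same question.

```python
import ctypes
import os

# Assumption: on a Unix-like system, CDLL(None) gives access to the symbols
# of the C library that the Python interpreter itself is linked against.
libc = ctypes.CDLL(None, use_errno=True)
libc.getpid.restype = ctypes.c_int

def getpid_via_libc():
    """Call the C library's getpid() wrapper directly, bypassing the os module."""
    return libc.getpid()

def getpid_via_python():
    """Call Python's own wrapper, which has a Python calling convention."""
    return os.getpid()
```

Neither function is "the syscall"; each is a language-level wrapper around it.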

"Fork" has many meanings unrelated to cutlery, and the term is used in other places in CS. fork() is also an extraordinarily terrible interface for process creation, for reasons which have nothing to do with its name. I would be happy if one day Linux gets rid of this insanity and adds a CreateProcess syscall that doesn't have to pretend to copy the entire address space of the current process.


> I would be happy if one day Linux gets rid of this insanity and adds a CreateProcess syscall that doesn't have to pretend to copy the entire address space of the current process.

fork() is going to exist forever, but posix_spawn() already exists:

https://linux.die.net/man/3/posix_spawn
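
For reference, a minimal sketch of what using it looks like, via Python's `os.posix_spawn` binding (Python 3.8+). The helper name `spawn_and_capture` is made up. The only setup the child gets is the declarative list of file actions; there is no place to run arbitrary code in the child before the exec.

```python
import os
import sys

def spawn_and_capture(code):
    """Run `code` in a fresh Python interpreter via posix_spawn; return its stdout."""
    read_fd, write_fd = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),   # child's stdout becomes the pipe
        (os.POSIX_SPAWN_CLOSE, read_fd),      # child has no use for the read end
    ]
    pid = os.posix_spawn(
        sys.executable,
        [sys.executable, "-c", code],
        dict(os.environ),
        file_actions=file_actions,
    )
    os.close(write_fd)                        # parent keeps only the read end
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()                # returns at EOF, once the child exits
    os.waitpid(pid, 0)
    return output
```

For example, `spawn_and_capture("print('spawned')")` hands back whatever the child wrote to its stdout.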

I think clone(), or better yet clone3(), is closer to what I had in mind, as posix_spawn() is not a syscall, it's just a utility function calling fork()/vfork()/clone() and then exec().

fork+exec is great insofar as it lets you do arbitrarily complex process setup between those syscalls. APIs like posix_spawn are far more restrictive. The issue is the overhead and the restricted post-fork environment in a multi-threaded process. Rather than CreateProcess we need io_uring_spawn[0] + all relevant syscalls ported to io_uring.

https://lwn.net/Articles/908268/

If custom process setup code is so common, a better abstraction would have been a CreateProcess() / StartProcess() pair, where CreateProcess() would return a struct that exposes all the necessary methods to control security, FD behavior, working directory etc, and StartProcess() would take that struct and actually run it.

The setup part needs to be Turing-complete. A simple struct won't do. What you're talking about sounds more like posix_spawn. io_uring_spawn on the other hand would allow that because the parent process could execute the logic between syscall submissions to the newly created process.

What you need is some kind of handle with the ability to do system calls on behalf of another process via that handle. CreateProcess would return this handle, whereas StartProcess would actually use that handle to start execution.

Windows and Fuchsia are two handle-based OSes, and Fuchsia does in fact spawn processes in this manner (Windows just has a family of CreateProcess methods that do both).

This would probably be a cleaner approach that doesn't duplicate all process functionality between syscalls and the Process struct I was thinking of, and it also is far easier to keep backwards compatible.

If you can do syscalls on behalf of another process, sure. But that's more than "just" CreateProcess.

Ultimately, after exec(), a process only inherits a limited number of things from before exec(). Those things should be directly configurable from the parent process before Start(), using the Turing-complete language of the parent process, in my proposal.

If you, say, want to open a socket that will be persisted after exec(), you can open the socket in the original process and pass it in the list of open FDs of the Process object returned by CreateProcess(). After StartProcess() is called, the newly created process should see this open FD as one of its already opened FDs, just like if you had opened it between fork() and exec().
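
A rough sketch of the shape I mean, emulated in userspace on top of `os.posix_spawn`. Everything named here (`ProcessSpec`, `map_fd`, `start_process`, `demo_fd_handoff`) is hypothetical and is not an existing kernel or library API. It uses a pipe rather than a socket to keep the demo short, and it ignores ordering hazards when the mapped descriptors overlap.

```python
import os
import sys

class ProcessSpec:
    """Hypothetical 'CreateProcess' result: a description the parent can edit freely."""

    def __init__(self, path, argv):
        self.path = path
        self.argv = list(argv)
        self.env = dict(os.environ)
        self.fd_map = {}                       # child fd number -> parent fd number

    def map_fd(self, child_fd, parent_fd):
        """Ask for parent_fd to be visible in the child as child_fd."""
        self.fd_map[child_fd] = parent_fd
        return self

def start_process(spec):
    """Hypothetical 'StartProcess': launch the described child and return its pid."""
    actions = [(os.POSIX_SPAWN_DUP2, parent_fd, child_fd)
               for child_fd, parent_fd in spec.fd_map.items()]
    return os.posix_spawn(spec.path, spec.argv, spec.env, file_actions=actions)

def demo_fd_handoff():
    """Open a pipe in the parent and hand its write end to the child."""
    read_fd, write_fd = os.pipe()
    child_fd = max(read_fd, write_fd) + 1      # a number distinct from both pipe ends
    code = f"import os; os.write({child_fd}, b'hello from child')"
    spec = ProcessSpec(sys.executable, [sys.executable, "-c", code])
    spec.map_fd(child_fd, write_fd)            # arbitrary parent-side logic could go here
    pid = start_process(spec)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data
```

The child finds the descriptor already open when it starts, much as if it had been set up between fork() and exec().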

What am I missing? What could you do after fork() in the child that would persist after exec() but not be easily configurable from the outside in principle?

In practice, I can of course imagine that the kernel has all sorts of assumptions about who gets to access certain internal process data structures that it would be very hard to modify today.

The problem is that you end up with two syscalls for every setting. e.g. chroot to change your own root, then newproc_chroot to change the root of a newly spawned process. Start making a list of all the ways you can change process state and you'll realize there are probably hundreds here. Now imagine being the kernel author and having to duplicate all that code.

Having used both, I can easily say that CreateProcess is deficient, and fork/exec is kind of genius in how many things it makes possible. Could Windows fix CreateProcess with more API? Sure, but they didn't, probably because nobody wanted to spend the effort duplicating all their kernel code.

> The problem is that you end up with two syscalls for every setting. e.g. chroot to change your own root, then newproc_chroot to change the root of a newly spawned process.

Or you add just one syscall to get the handle for the current process and then make all of the syscalls like chroot a userspace alias for proc_chroot(curprocess(), new_root).

The thing is not about whether it's theoretically possible to configure it but more about evolving OS APIs and all those APIs having to be mirrored in the process construction API. E.g. all the security/namespacing stuff.

For the syscall raw ASM/libc debate, would it be possible to provide an interface that just does syscalls and separate that from the rest of libc? It would be more inconvenient for people using ASM, but they wouldn't have to conform to libc. I imagine it's a breaking change for everyone, so consider this in a hypothetical OS.

It might be possible, but I don't think you'd gain much. Even today, you can dynamically link to libc but only use its syscall interfaces, not anything else from it, not even malloc(). I think most runtimes for GC languages work like this.

However, this still means that your process will be affected by any memory correctness issues in "libsyscall", in addition to the issues in the kernel itself. Plus, the maintainers of libsyscall would have to write it in a bizarre dialect of C that doesn't use any stdlib functions, which might prove even more error-prone than standard C.

It's perhaps important to note here that the parts of libc that implement syscalls in OpenBSD are not simple syscall-to-C wrappers, they can have quite a bit of code occasionally. And Windows' runtime library is even more complex than that. That's their whole point - they can keep a backwards compatible system interface in spite of significant changes at the syscall layer, probably by doing lots of small pieces of work in userspace to bridge the gap.

I don't think there's any way to mitigate memory safety issues in the syscall wrappers or the kernel; if the very overseer that is depended upon to enforce some degree of security isn't secure, then it can't be relied upon.

I was more so thinking that "libsyscall" would be like libc, so that people can use it as a stable interface as in Windows or OpenBSD. If you were to use both libsyscall and libc, it wouldn't be meaningfully different from linking to all of libc today. It gives somewhat more separation to treat syscalls independently of libc, that's all.

Could you educate me on what's wrong with fork()?

Well, there are two categories of problems.

One is that fork() is by definition a very costly operation (a copy of the entire address space of the current process), and the kernel has to do a lot of work to implement it efficiently (implementing copy-on-write clones of all of the pages of a process). And all of that work is done for nothing in the very, very common case of doing fork() + exec().

The other problem is that the semantics of fork() just fundamentally can't work properly for a multi-threaded process. In any multi-threaded process, if you do fork(), the only thing you can safely do in the child process is to call exec(). Any other call, even a printf() or some path logic, has a very good chance of leading to a deadlock, quite possibly inside malloc() itself.
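
Here is a small sketch of that failure mode, with an ordinary Python `threading.Lock` standing in for something like malloc()'s internal lock. The function name is made up. The child uses a timeout so it reports the problem instead of hanging forever, and recent Python versions may print a DeprecationWarning about forking a multi-threaded process, which is rather the point.

```python
import os
import threading

def child_sees_orphaned_lock():
    """Fork while another thread holds a lock; report whether the child is stuck."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def worker():
        with lock:
            holding.set()
            release.wait()          # keep the lock held across the fork

    thread = threading.Thread(target=worker)
    thread.start()
    holding.wait()                  # from here on the worker thread owns the lock

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, but the lock was copied in
        # its locked state, so no thread exists that could ever release it.
        exit_code = 2
        try:
            acquired = lock.acquire(timeout=0.5)
            exit_code = 0 if acquired else 1
        finally:
            os._exit(exit_code)     # never fall back into the parent's code path

    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 1   # True means the child could not take the lock
```

Without the timeout, the child would simply block forever on that `acquire()`.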

So fork() as a standalone operation is actually an extremely niche utility (duplicating single-threaded processes) that has been made the main way of spawning new processes. Similarly, exec() by itself is an even more niche utility, sometimes useful for "launcher" style processes.

So, instead of achieving an extremely common task (launching some binary file as a new process) using a dedicated system call, Unix chose to define two extremely niche syscalls that you should almost never use individually, but that together can implement this common task, though only with a lot of behind-the-scenes work to make it efficient.

The way the shell works is an essential part of Unix design, and the fork/exec pair suits it very well. In the forked child process before the exec, the shell can execute arbitrary code that inherits all file descriptors and manipulates/redirects them in a certain way.
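
Roughly what a shell does for `cmd > file`, as a minimal Python sketch. The helper names `run_redirected` and `demo_redirect` are made up, and real shells handle many more cases. The interesting part is the code that runs in the child between fork and exec.

```python
import os
import sys
import tempfile

def run_redirected(argv, out_path):
    """Run argv with stdout redirected to out_path; return the exit status."""
    pid = os.fork()
    if pid == 0:
        # Child, between fork and exec: arbitrary code that rearranges descriptors.
        try:
            fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            if fd != 1:
                os.dup2(fd, 1)      # stdout now refers to the file
                os.close(fd)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)           # only reached if the setup or the exec failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

def demo_redirect():
    """Redirect a child interpreter's output into a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as workdir:
        out_path = os.path.join(workdir, "out.txt")
        status = run_redirected([sys.executable, "-c", "print('hi')"], out_path)
        with open(out_path) as f:
            return status, f.read()
```

The launched program needs no knowledge of the redirection; it just writes to descriptor 1.
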
I didn't say it's useless, I said it's a niche utility. There are, what, 10 even slightly commonly used shells?

Compared to the thousands of other programs that spawn processes, I think shells count as a niche use.

If fork() and exec() had been kept as the niche utilities they are in addition to a more commonly used spawn() syscall, fork() and exec() could have remained simple and small, and not needed all the CoW magic and many other complex features at all.

I wonder now if one could emulate fork-shenanigans-exec by:

- clone CLONE_FILES with AF_UNIX socketpair

- send file descriptors through the socket

- do whatever was in shenanigans (you have the socket for communication/RPC/whatever)

- exec

or maybe even easier:

- clone CLONE_FILES with pipe or two

- do whatever was in shenanigans (you have the pipe for communication/RPC/whatever)

- exec
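
The descriptor-passing step of the first variant can be sketched on its own. This assumes Python 3.9+ for `socket.send_fds` / `socket.recv_fds`, keeps both ends of the socketpair in one process for brevity, and does not attempt the clone(CLONE_FILES) part at all. The function name is made up.

```python
import os
import socket

def pass_fd_over_socketpair():
    """Send a pipe's write end across an AF_UNIX socketpair and use the received copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()
    with sender, receiver:
        # The descriptor travels as ancillary data next to a small payload.
        socket.send_fds(sender, [b"fd"], [write_fd])
        msg, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
    os.close(write_fd)                      # the original descriptor is no longer needed
    os.write(fds[0], b"written via the received fd")
    os.close(fds[0])
    with os.fdopen(read_fd, "rb") as reader:
        return msg, reader.read()
```

The receiver gets a new descriptor number that refers to the same open pipe, which is the property the emulation above relies on.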

> One is that fork() is by definition a very costly operation

Isn't process creation much slower on Windows (not using fork) than forking on Unix-likes?

It is, but there are many other differences between Unix and Windows processes than just fork() vs. CreateProcess().

Even disregarding this, the actual fork() syscall has had many hundreds of dev hours poured into it to implement the costly semantics (copy all resources) in an efficient way for the common use case (copy-on-write semantics).

What do people who care a lot about latency do about this? Do they have no choice but to eat the cost of fork()?

They avoid spawning new processes.

Also, fork() interacts very poorly with e.g. CUDA: turns out, you can't fork the GPU itself.

There are basically three uses of fork().

The first is to implement POSIX shells, and that's less because this is a good design and more because shells are a wrapper around the original Unix system calls. Note that if you're designing a scripting language that isn't beholden to compatibility with /bin/sh (especially one that can be portable to OSes that don't have fork()!), then you're liable to not design it in such a way that requires you to use fork().

The second use case is an alternative to threads for parallel processing. And there are some reasons that processes can work better than threads for parallel processing. But fork() has such a bad interaction with multithreaded code [1] that you end up having to choose fork() xor threads. And as threading has become an increasingly important part of modern environments, well, given that xor choice, almost everybody is going to come down on the threads side of the equation.

The final use case, and by far the most common, is to be able to spawn a new process. This means you break up one logical system call (spawn) into two (fork + exec), the first of which semantically requires you to do a lot of work (clone memory state) that you're immediately going to throw away. Even in the case where you want to do more expansive process-twiddling magic before spawning the process, there are better designs (especially if you're willing to commit to a handle-based operating system).

Of the three use cases, one amounts to "backwards compatibility", and the other two amount to "fork() is actively fighting you". That is not the hallmark of a good API.

[1] Think things like "locks are held by threads that don't exist."

https://news.ycombinator.com/item?id=30502392 340 comments

https://news.ycombinator.com/item?id=19621799 180 comments

https://news.ycombinator.com/item?id=22462628 117 comments

https://news.ycombinator.com/item?id=8204007 314 comments

https://news.ycombinator.com/item?id=31739794 135 comments

https://news.ycombinator.com/item?id=16068305 89 comments

That's some 1000+ comments of light background reading about fork(). I think you can do some aggregate sentiment analysis on that and conclude that fork() is not great.