Hacker News new | ask | show | jobs
by wahern 2538 days ago
You said, "avoiding expensive kernel<-->user-space switching", which is wrong. Filesystems are implemented entirely in user space, just not in the same user space processes. Consumers exist in separate processes from the producers--plural, because the underlying block device storage may be managed by processes separate from the processes managing VFS state. Context switches are a necessary part of having separate user space processes, and context switching through kernel space (or at least some protected, privileged context) is necessary in order to authenticate messaging capabilities.

Note that there are ways to minimize the amount of time spent in privileged contexts. Shared memory can be used to pass data directly, but unless you want all your CPUs pegged at 100% utilization the kernel must be involved somehow to optimize IPC polling. In any event, the same strategies can be used for in-kernel VFS services, so it's not a useful distinction.

1 comments

The only reason to use an IPC or go through the kernel is to expose the fs to the rest of the OS. If an app doesn't intend to expose the fs, the entirety of the fs can exist within the app process.

Quoting: > "Unlike many other operating systems, the notion of “mounted filesystems” does not live in a globally accessible table. Instead, the question “what mountpoints exist?” can only be answered on a filesystem-specific basis -- an arbitrary filesystem may not have access to the information about what mountpoints exist elsewhere."

libfs is an abstraction layer around some of the VFS bits. An analogous Unix approach be would be shifting the burden of compact file descriptor allocation (where Unix open(2) must return the lowest numbered free descriptor) to the process rather than the kernel. (IIRC this is also actually done in Fuschia as part of its POSIX personality library.)

Notice in the above that it's implied that that actual filesystem server (e.g. that manages ext4 state on a block device) is in another process altogether. And so for every meaningful open, read, write, and close there's some sort of IPC involved.

A process accessing a block device directly without any IPC is something that can already be done in Unix. For example, you can open a block device in read-write mode directly and mmap it into your VM space. Also, see BSD funopen(3) and GNU fopencookie(3)[1], which is a realization of similar consumer-side state management, except for ISO C stdio FILE handles; it's simpler because ISO C doesn't provide an interface for hierarchical FS management.

There's no denying that Fuschia's approach is more flexible and the C interface architecture more easily adaptable to monolithic, single-process solutions. But it stems from it's microkernel approach which has the side effect of forcing VFS implementation code to be implemented as a reusable library. There's no reason a Unix process couldn't reuse the Linux kernel's VFS and ext4 code directly except that it was written for a monolithic code base that assumes a privileged context. Contrast NetBSD's rump kernel architecture where you can more easily repurpose kernel code for user space solutions; in terms of traditional C library composition, NetBSD's subsystem implementations have always fallen somewhere between Linux and traditional microkernels and so were naturally more adaptable to the rump kernel architecture.

[1] See also the more limited fmemopen and open_memstream interfaces adopted by POSIX.

The catch is, if you want to securely mediate access to a shared resource, you need to have something outside your protection boundary do it, be it a kernel or user-space server.