Hacker News new | ask | show | jobs
by avianlyric 233 days ago
There is no sensible OS API that could support this, because fundamentally memory access is a hardware API. The OS isn’t involved in normal memory reads, because that would be ludicrously inefficient, effectively requiring a syscall for every memory operation, which effectively means a syscall for any operation involving data I.e. all operations.

Memory operations are always synchronous because they’re performed directly as a consequence of CPU instructions. Reading memory that’s been paged out results in the CPU itself detecting that the virtual address isn’t in RAM, and performing a hardware level interrupt. Literally abandoning a CPU instruction mid execution to start executing an entirely separate set of instructions which will hopefully sort out the page fault that just occurred, then kindly ask the CPU to go back and repeat the operation that caused the page fault.

OS is only involved only because it’s the thing that provided the handling instructions for the CPU to execute in the event of a page fault. But it’s not in anyway actually capable of changing how the CPU initially handles the page fault.

Also the current model does allow other threads to continue executing other work while the page fault is handled. The fault is completely localised to individual thread that triggered the fault. The CPU has no concept of the idea that multiple threads running on different physical core are in anyway related to each other. It also wouldn’t make sense to allow the interrupted thread to someone kick off a separate asynchronous operation, because where is it going to execute? The CPU core where the page fault happened is needed to handle the actual page fault, and copy in the needed memory. So even if you could kick off an async operation, there wouldn’t be any available CPU cycles to carry out the operation.

Fundamentally there aren’t any sensible ways to improve on this problem, because the problem only exists due to us pretending that our machines have vastly more memory than they actually do. Which comes with tradeoffs, such as having to pause the CPU and steal CPU time to maintain the illusion.

If people don’t like those tradeoffs, there’s a very simple solution. Put enough memory in your machine to keep your entire working set in memory all the time. Then page faults can never happen.

3 comments

> There is no sensible OS API that could support this, because fundamentally memory access is a hardware API.

Not only is there a sensible OS API that could support this, Linux already implements it; it's the SIGSEGV signal. The default way to respond to a SIGSEGV is by exiting the process with an error, but Linux provides the signal handler with enough information to do something sensible with it. For example, it could map a page into the page frame that was requested, enqueue an asynchronous I/O to fill it, put the current green thread to sleep until the I/O completes, and context-switch to a different green thread.

Invoking a signal handler only has about the same inherent overhead as a system call. But then the signal handler needs another couple of system calls. So on Linux this is over a microsecond in all. That's probably acceptable, but it's slower than just calling pread() and having the kernel switch threads.

Some garbage-collected runtimes do use SIGSEGV handlers on Linux, but I don't know of anything using this technique for user-level virtual memory. It's not a very popular technique in part because, like inotify and epoll, it's nonportable; POSIX doesn't specify that the signal handler gets the arguments it would need, so running on other operating systems requires extra work.

im3w1l also mentions userfaultfd, which is a different nonportable Linux-only interface that can solve the same thing but is, I think, more efficient.

Just to clarify, I think the parent posts are talking about non-failing page faults, ie where the kernel just needs to update the mapping in the MMU after finding the existing page already in memory (minor page fault), or possibly reading it from filesystem/swap (major page fault).

SIGSEGV isn't raised during a typical page fault, only ones that are deemed to be due to invalid reads/writes.

When one of the parents talks about "no good programming model/OS api", they basically mean an async option that gives the power of threads; threading allows concurrency of page faults, so the kernel is able to perform concurrent reads against the underlying storage media.

Off the top of my head, a model I can think of for supporting concurrent mmap reads might involve a function:

  bool hint_read(void *data, size_t length);
When the caller is going to read various parts of an mmapped region, it can call `hint_read` multiple times beforehand to add regions into a queue. When the next page fault happens, instead of only reading the currently accessed page from disk, it can drain the `hint_read` queue for other pages concurrently. The `bool` return indicates whether the queue was full, so the caller stops making useless `hint_read` calls.

I'm not familiar with userfaultfd, so don't know if it relates to this functionality. The mechanism I came up with is still a bit clunky and probably sub-optimal compared to using io_uring or even `readv`, but these are alternatives to mmap.

You’ve actually understood my suggestion - thank you. Unfortunately I think hint_read inherently can’t work because it’s a race condition between the read and how long you access the page. And this race is inherent in any attempted solution that needs to be solved. Signals are also the wrong abstraction mechanism (and are slow and have all sorts of other problems).

You need something more complicated I think, like rseq and futex you have some shared data structure that both understand how to mutate atomically. You could literally use rseq to abort if the page isn’t in memory and then submit an io_uring task to get signaled when it gets paged in again but rseq is a bit too coarse (it’ll trigger on any preemption).

There’s a race condition starvation danger here (it gets evicted between when you get the signal and the sequence completes) but something like this conceptually could maybe be closer to working.

But yes it’s inherently difficult which is why it doesn’t exist but it is higher performance. And yes, this only makes sense for mmap not all allocations so SIGSEGV is irrelevant if looking at today’s kernels.

If you want accessing a particular page to cause a SIGSEGV so your custom fault handler gets invoked, you can just munmap it, converting that access from a "non-failing page fault" into one "deemed to be invalid". Then the mechanism I described would "allow[] concurrency of page faults, so the [userspace threading library] is able to perform concurrent reads against the underlying storage media". As long as you were aggressive enough about unmapping pages that none of your still-mapped pages got swapped out by the kernel. (Or you could use mlock(), maybe.)

I tried implementing your "hint_read" years ago in userspace in a search engine I wrote, by having a "readahead thread" read from pages before the main thread got to them. It made it slower, and I didn't know enough about the kernel to figure out why. I think I could probably make it work now, and Linux's mmap implementation has improved enormously since then, so maybe it would just work right away.

The point about inducing segmentation faults is interesting and sounds like it could work to implement the `hint_read` mechanism. I guess it would mostly be a question of how performant userfaultfd or SIGSEGV handling is. In any case it will be sub-optimal to having it in the kernel's own fault handler, since each userfaultfd read or SIGSEGV callback is already a user-kernel-user switch, and it still needs to perform another system call to do the actual reads, and even more system calls to mmap the bits of memory again.

Presumably having fine-grained mmaps will be another source of overhead. Not to mention that each mmap requires another system call. Instead of a single fault or a single call to `readv`, you're doing many `mmap` calls.

> I tried implementing your "hint_read" years ago in userspace in a search engine I wrote, by having a "readahead thread" read from pages before the main thread got to them.

Yeah, doing it in another thread will also have quite a bit of overhead. You need some sort of synchronisation with the other thread, and ultimately the "readahead" thread will need to induce the disk reads through something other than a page fault to achieve concurrent reads, since within the readahead thread, the page faults are still synchronous, and they don't know what the future page faults will be.

It might help to do `readv` into dummy buffers to force the kernel to load the pages from disk to memory, so the subsequent page faults are minor instead of major. You're still not reducing the number of page faults though, and the total number of mode switches is increased.

Anyway, all of these workarounds are very complicated and will certainly be a lot more overhead than vectored IO, so I would recommend just doing that. The overall point is that using mmap isn't friendly to concurrent reads from disk like io_uring or `readv` is.

Major page faults are basically the same as synchronous read calls, but Golang read calls are asynchronous, so the OS thread can continue doing computation from other Goroutines.

Fundamentally, the benchmarks in this repository are broken because in the mmap case they never read any of the data [0], so there are basically no page faults anyway. With a well-written program, there shouldn't be a reason that mmap would be faster than IO, and vectored IO can obviously be faster in various cases.

[0] Eg, see here where the byte slice is assigned to `_` instead of being used: https://github.com/perbu/mmaps-in-go/blob/7e24f1542f28ef172b...

Inducing segmentation faults is literally how the kernel implements memory mapping, and virtual memory in general, by the way. From the CPU's perspective, that page is unmapped. The kernel gets its equivalent of a SIGSEGV signal (which is a "page fault"=SIGSEGV "interrupt"=signal), checks its own private tables, decides the page is currently on disk, schedules it to be read from disk, does other stuff in the meantime, and when the page has finished being read from disk, it returns from the interrupt.

(It does get even deeper than that: from the CPU's perspective, the interrupt is very brief, just long enough to take note that it happened and avoid switching back to the thread that page-faulted. The rest of the stuff I mentioned, although logically an "interrupt" from the application's perspective, happens with the CPU's "am I handling an interrupt?" flag set to false. This is equivalent to writing a signal handler that sets a flag saying the thread is blocked, edits its own return address so it will return to the scheduler instead of the interrupted code, then calls sigreturn to exit the signal handler.)

There are some differences, including the cross-CPU TLB shootdowns vlovich mentioned.
munmap + signal handling is terrible not least of which that you don’t want to be fucking with the page table in that way as an unmap involves a cross cpu TLB shoot down which is slooow in a “make the entire machine slow” kind of way.
That is correct, although my laptop is only four cores.
Are you reinventing madvise?
I think the model I described is more precise than madvise. I think madvise would usually be called on large sequences of pages, which is why it has `MADV_RANDOM`, `MADV_SEQUENTIAL` etc. You're not specifying which memory/pages are about to be accessed, but the likely access pattern.

If you're just using mmap to read a file from start to finish, then the `hint_read` mechanism is indeed pointless, since multiple `hint_read` calls would do the same thing as a single `madvise(..., MADV_SEQUENTIAL)` call.

The point of `hint_read`, and indeed io_uring or `readv` is the program knows exactly what parts of the file it wants to read first, so it would be best if those are read concurrently, and preferably using a single system call or page fault (ie, one switch to kernel space).

I would expect the `hint_read` function to push to a queue in thread-local storage, so it shouldn't need a switch to kernel space. User/kernel space switches are slow, in the order of a couple of 10s of millions per second. This is why the vDSO exists, and why the libc buffers writes through `fwrite`/`println`/etc, because function calls within userspace can happen at rates of billions per second.

you can do fine grained madvise via io_uring, which indeed uses a queue. But at that point why use mmap at all, just do async reads via io_uring.
The entire point I was trying to make at the beginning of the thread is that mmap gives you memory pages in the page cache that the OS can drop on memory pressure. Io_uring is close on the performance and fine-grained access patterns front. It’s not so good on the system-wide cooperative behavior with memory front and has a higher cost as either you’re still copying it from the page cache into a user buffer (non trivial performance impact vs the read itself) + trashing your CPU caches or you’re doing direct I/O and having to implement a page cache manually (and risks duplicating page data inefficiently in userspace if the same file is accessed by multiple processes.
> There is no sensible OS API that could support this, because fundamentally memory access is a hardware API.

there's nothing magic about demand paging, faulting is one way it can be handled

another could be that the OS could expose the present bit on the PTE to userland, and it has to check it itself, and linux already has asynchronous "please back this virtual address" APIs

> Memory operations are always synchronous because they’re performed directly as a consequence of CPU instructions.

although most CPU instructions may look synchronous they really aren't, the memory controller is quite sophisticated

> Fundamentally there aren’t any sensible ways to improve on this problem, because the problem only exists due to us pretending that our machines have vastly more memory than they actually do. Which comes with tradeoffs, such as having to pause the CPU and steal CPU time to maintain the illusion.

modern demand paging is one possible model that happens to be near universal amongst operating system today

there are many, many other architectures that are possible...

> although most CPU instructions may look synchronous they really aren't, the memory controller is quite sophisticated

I was eliding at lot of details. But my broader point is that from the perspective of the thread being interpreted, the paging process is completely synchronous. Sure advanced x86 CPU maybe be tracking data dependencies between instructions and actively reordering instructions to reduce the impact of the pipeline stalling caused by the page fault. But that’s all low level optimisation that are (or should be) completely invisible to the executing thread.

> there are many, many other architectures that are possible...

I would be curious to see any examples of those alternatives. Demand paging provides a powerful abstraction, and it’s not clear to me how you can sensibly move page management into applications. At a very minimum that would suggest that every programming language would need a memory management runtime capable to predicting possible memory reads ahead of time in a sensible fashion, and triggering its own paging logic.

I think you have a misunderstanding of how disk IO happens. The CPU core sends a command to the disk "I want some this and that data", then the CPU core can go do something else while the disk services that request. From what I read the disk actually puts the data directly into memory by using DMA, without needing to involve the CPU.

So far so good, but then the question is to ensure that the CPU core has something more productive to do then just check "did the data arrive yet?" over and over and coordinating that is where good apis come in.

(Not the person you are replying to.)

There is nothing in the sense of Python async or JS async that the OS thread or OS process in question could usefully do on the CPU until the memory is paged into physical RAM. DMA or no DMA.

The OS process scheduler can run another process or thread. But your program instance will have to wait. That’s the point. It doesn’t matter whether waiting is handled by a busy loop a.k.a. polling or by a second interrupt that wakes the OS thread up again.

That is why Linux calls it uninterruptible sleep.

EDIT: io_uring would of course change your thread from blocking syscalls to non-blocking syscalls. Page faults are not a syscall, as GP pointed out. They are, however, a context-switch to an OS interrupt handler. That is why you have an OS. It provides the software drivers for your CPU, MMU, and disks/storage. Here this is the interrupt handler for a page fault.

What everyone forgets is just how expensive context switches are on modern x86 CPUs. Those 512 bit vector registers fill up a lot of cache lines. That's why async tends to win over processes / threads for many workloads.
(I am the person you are replying to)

It could work like this. "Hey OS I would like to process these pages* are they good to go? If not could you fetch and lock them for me" and then if they are ready you process them knowing it won't fault, and if they are not you do something else and try again later.

It's a sort of hybrid of the mmap and fread paradigms in that there are both explicit read requests but the kernel can also get you data on its own initiative if there are spare resources for it.

* to amortize syscall overhead.

What advantages does that provide over using more OS threads. Ultimately this model is based on the idea that we want our programming runtimes to become increasingly responsible for low level scheduling concerns that have traditionally been handled by the OS scheduler.

I can broadly understand why there may be a desire to go down that path. But I’m not convinced that it would produce meaningful better performance than the current abstractions. Especially if you take a step back as ask the question: is mmap is the right tool to be using in these situations, rather using other tools like io_uring?

To be clear I don’t know the answer to this question. But the complexity of the solutions being suggested to potentially improve the mmap API really make me question if they’re capable of producing meaningful improvements.

It's hard to say on one hand "I use mmap because I don't want fancy APis for every read" and on the other "I want to do something useful on page fault" because you don't want to make every memory read a possible interruption point.
I think you have a misunderstanding of how the OS is signaled about disk I/O being necessary. Most of the post above was discussing that aspect of it, before the OS even sends the command to the disk.