by PaulHoule 1979 days ago
I like mmap and I don't.

It is incompatible with non-blocking I/O since your process will be stopped if it tries to access part of the file that is not mapped -- this isn't a blocking syscall (which you might work around) but rather any attempt to access mapped memory.

I like mmap for tasks like seeking into ZIP files, where you can look at the back 1% of the file, then locate and extract one of the subfiles; the trouble there is that the really fun case is to do this over the network with HTTP (say to solve Python dependencies, to extract the metadata from wheel files) in which case this method doesn't work.
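For illustration, here is a minimal sketch of the "look at the back of the file" trick on a local ZIP. It maps the file, searches backwards for the end-of-central-directory record, and walks the central directory to list member names. The function name `list_names_via_mmap` is made up for this example, and the sketch assumes a plain single-disk archive with UTF-8 or ASCII names (no ZIP64, no encryption).

```python
import mmap
import struct

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory signature
CDIR_SIG = b"PK\x01\x02"  # central-directory file header signature


def list_names_via_mmap(path):
    """Return member names by touching only the tail of the ZIP (hypothetical helper)."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # The EOCD record is 22 bytes plus a comment of at most 65535 bytes,
        # so it must lie within that distance of the end of the file.
        eocd = m.rfind(EOCD_SIG, max(0, len(m) - 22 - 65535))
        if eocd < 0:
            raise ValueError("no end-of-central-directory record found")
        # Fields after the signature: two disk numbers, two entry counts,
        # central directory size, central directory offset, comment length.
        _, _, _, total, _cd_size, cd_offset, _ = struct.unpack_from("<HHHHIIH", m, eocd + 4)
        names, pos = [], cd_offset
        for _ in range(total):
            if m[pos:pos + 4] != CDIR_SIG:
                raise ValueError("corrupt central directory")
            # Name, extra-field and comment lengths sit at byte 28 of each header;
            # the name itself starts at byte 46.
            name_len, extra_len, comment_len = struct.unpack_from("<HHH", m, pos + 28)
            names.append(m[pos + 46:pos + 46 + name_len].decode("utf-8"))
            pos += 46 + name_len + extra_len + comment_len
        return names
```

Only the pages holding the tail and the central directory get touched, which is the appeal. Every one of those slices can also fault and block the thread, which is the complaint.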

5 comments

mmap is great for rapid prototyping. For anything I/O-heavy, it's a mess. You have zero control over how large your I/Os are (you're very much at the mercy of heuristics that are optimized for loading executables), readahead is spotty at best (practical madvise implementation is a mess), async I/O doesn't exist, you can't interleave compression in the page cache, there's no way of handling errors (I/O error = SIGBUS/SIGSEGV), and write ordering is largely inaccessible. Also, you get issues such as page table overhead for very large files, and address space limitations for 32-bit systems.
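To make the contrast concrete, here is a small sketch of the same sequential scan done both ways. The function names are invented for this example. The mmap version can only pass a hint (`madvise`, available in Python 3.8+ on platforms that provide it, hence the guards), while the `pread` version picks its own I/O size and gets an ordinary `OSError` if a read fails. It does not measure anything; it only shows where the control sits.

```python
import mmap
import os
import zlib


def crc_via_mmap(path):
    """CRC32 of a file through a mapping; I/O size is up to the kernel."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        advice = getattr(mmap, "MADV_SEQUENTIAL", None)
        if advice is not None and hasattr(m, "madvise"):
            m.madvise(advice)  # a hint only; the kernel may ignore it
        # Page faults happen inside this call; a failed read surfaces as a signal.
        return zlib.crc32(m)


def crc_via_pread(path, chunk_size=1 << 20):
    """CRC32 of a file through explicit reads of a size we choose."""
    crc, offset = 0, 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            # Each read is an explicit syscall; errors arrive as OSError.
            chunk = os.pread(fd, chunk_size, offset)
            if not chunk:
                return crc
            crc = zlib.crc32(chunk, crc)
            offset += len(chunk)
    finally:
        os.close(fd)
```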

In short, it's a solution that looks so enticing at first, but rapidly costs much more than it's worth. As systems grow more complex, they almost inevitably have to throw out mmap.

> It is incompatible with non-blocking I/O since your process will be stopped if it tries to access part of the file that is not mapped

Yeah, but the same problem occurs in normal memory when the OS has swapped out the page.

So perhaps non-blocking I/O (and cooperative multitasking) is the problem here.

> Yeah, but the same problem occurs in normal memory when the OS has swapped out the page.

I'd argue that swapping is an orthogonal problem which can be solved in a number of ways: disable swap at the OS level, mlock() in the application, maybe others.

mmap is really a bad API for IO — it hides synchronous IO and doesn't produce useful error statuses at access.

> So perhaps non-blocking I/O (and cooperative multitasking) is the problem here.

I'm not sure how non-blocking IO is "the problem." It's something Windows has had forever, and unix-y platforms have wanted for quite a long time. (Long history of poll, epoll, kqueue, aio, and now io_uring.)

> it hides synchronous IO and doesn't produce useful error statuses at access.

You can trap IO errors if necessary. E.g. you can raise signals just like segfaults generate signals.

> I'm not sure how non-blocking IO is "the problem."

The point is that non-blocking IO wants to abstract away the hardware, but the abstraction is leaky. Most programs which use non-blocking IO actually want to implement multitasking without relying on threads. But that turns out to be the wrong approach.

> The point is that non-blocking IO wants to abstract away the hardware, but the abstraction is leaky.

Why do you say it doesn't match hardware? Basically all hardware is asynchronous — submit a request, get a completion interrupt, completion context has some success or failure status. Non-blocking IO is fundamentally a good fit for hardware. It's blocking IO that is a poor abstraction for hardware.

> Most programs which use non-blocking IO actually want to implement multitasking without relying on threads. But that turns out to be the wrong approach.

Why is that the wrong approach? Approximately every high-performance httpd for the last decade or two has used a multitasking, non-blocking network IO model rather than thread-per-request. The overhead of threads is just very high. They would like to use the same model for non-network IO, but Unix and unix-alikes have historically not exposed non-blocking disk IO to applications. io_uring is a step towards a unified non-blocking IO interface for applications, and also very similar to how the operating system interacts with most high-performance devices (i.e., a bunch of queues).
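As a rough illustration of the gap described above: an event-loop program that wants disk reads without stalling the loop usually ends up handing the blocking read to a thread pool. The sketch below does that with asyncio's default executor. The helper names `read_at` and `read_many` are made up for this example, and this is the common workaround, not io_uring.

```python
import asyncio


def _read_blocking(path, offset, size):
    # Ordinary blocking file I/O; runs on a worker thread, not on the event loop.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


async def read_at(path, offset, size):
    """Read `size` bytes at `offset` without blocking the event loop."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor.
    return await loop.run_in_executor(None, _read_blocking, path, offset, size)


async def read_many(path, spans):
    """Issue several (offset, size) reads concurrently; results keep input order."""
    return await asyncio.gather(*(read_at(path, o, s) for o, s in spans))
```

Called as `asyncio.run(read_many(path, [(0, 4096), (1 << 20, 4096)]))`, the loop stays free to service sockets while the reads are in flight, at the cost of the threads the event loop was meant to avoid.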

> Why do you say it doesn't match hardware?

Because the CPU itself can block. In this case on memory access. Most (all?) async software assumes the CPU can't block. A modern CPU has a pipelining mechanism, where parts can simply block, waiting for e.g. memory to return. If you want to handle this all nicely, you have to respect the api of this process which happens to go through the OS. So for example, while waiting for your memory page to be loaded, the OS can run another thread (which it can't in the async case because there isn't any other thread).

A CPU stall on L3 miss (100ns?) is orders of magnitude shorter than the kinds of blocking IO we don't want to wait on (10s-100s of µs even for empty-queue NVMe; slower for everything else).

The OS can't run another thread while fulfilling an mmap page fault because it has to actually do the IO to fill the page while taking that trap. And in the async scenario, CPUs and high speed devices can do clever things like snoop DMAs directly into L3 cache, avoiding your L3 miss scenario as well.

The comparison between L3 miss and mmap faults is apples and oranges.

> the trouble there is that the really fun case is to do this over the network with HTTP (say to solve Python dependencies, to extract the metadata from wheel files) in which case this method doesn't work

If the web server can tell you the total size of the file by responding to a HEAD request, and it supports range requests, then it is possible.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...

Or am I missing something?
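A sketch of how the HEAD-plus-Range approach might look, under some assumptions: the server reports `Content-Length` and answers ranged requests with 206, and the archive is not ZIP64. All three function names are hypothetical. The ZIP logic takes a `fetch(start, end)` callable so it can be exercised without a network; `http_range_fetcher` is one possible way to supply it.

```python
import struct
import urllib.request

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory signature


def http_content_length(url):
    """Total size of the remote file, taken from a HEAD response."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])


def http_range_fetcher(url):
    """Return fetch(start, end) that issues one Range request (end is inclusive)."""
    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:
                raise OSError("server did not honor the Range header")
            return resp.read()
    return fetch


def central_directory_span(total_size, fetch, tail=65535 + 22):
    """Find the ZIP central directory with a single ranged read of the file tail."""
    start = max(0, total_size - tail)
    data = fetch(start, total_size - 1)
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("no end-of-central-directory record in the tail")
    # Skip disk numbers and the per-disk count; keep total entries, size, offset.
    _, _, _, total, cd_size, cd_offset, _ = struct.unpack_from("<HHHHIIH", data, eocd + 4)
    return total, cd_offset, cd_size
```

A second ranged read of `cd_offset` through `cd_offset + cd_size - 1` would then pull the directory itself, so locating one member costs a handful of small requests instead of a full download.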

You can't do this with mmap though, you can't instruct the OS to grab pages via HTTP range requests.

With userfaultfd(), you can. Not necessarily a good idea, though...

Write a fuse layer.

Or a signal handler (but yes, it is overkill).

You are correct, this works. There even is a file system built around this idea: https://github.com/fangfufu/httpdirfs

You use mmap whether you want to or not: the system executes your program by mmaping your executable and jumping into it! You can always take a hard fault at any time because the kernel is allowed to evict your code pages on demand even if you studiously avoid mmap for your data files. And it can do this eviction even if you have swap turned off.

If you want to guarantee that your program doesn't block, you need to use mlockall.

You're not wrong. Applications and libraries that want to be non-blocking should mlock their pages and avoid mmap for further data access. ntpd does this, for example.

After application startup, you can avoid additional mmap.

This is technically true, but the use case we're talking about is programs that are much smaller than their data. Postgres, for instance, is under 50 MB, but is often used to handle databases in the gigabyte or terabyte range. You can mlockall() the binary if you want, but you probably can't actually fit the entire database into RAM even if you wanted to.

Also, when processing a large data file (say you're walking a B-tree or even just doing a search on an unindexed field), the code you're running tends to be a small loop, within the same few pages, so it might not even leave the CPU's cache, let alone get swapped out of RAM, but you need to access a very large amount of data, so it's much more likely the data you want could be swapped out. If you know some things about the data structure (e.g., there's an index or lookup table somewhere you care about, but you're traversing each node once), you can use that to optimize which things are flushed from your cache and which aren't.

Indeed. It's a question of scale: I write programs that can't afford to get blocked behind IO, ever, and at that level, I need to pay attention to things like code paging, and even more esoteric things like synchronous reclaim.

If you're just optimizing stuff generally instead of trying to guarantee invariants, sure, ignore code paging and use direct IO for your own data.

But that's a different order of magnitude problem: control plane vs data plane.

At some point, we could also say that the line fill buffer blocks our programs (more often than we realize).

All of this is accurate, but at different scales.

Also, many systems in 2021 have a lot of RAM and hardly ever swap.

Will the process be stopped, or the thread?

The thread.