by bcrl 241 days ago
If you're doing large amounts of sequential reads from a filesystem, the data is probably not in cache. You only get latency that low if nothing else is stressing the memory subsystem, which is rather unlikely. Real applications have overhead, which is why microbenchmarks like this are useless. Microbenchmarks are not the best first-order estimate for programmers to rely on.

Yes, I went into more detail on those issues in https://news.ycombinator.com/item?id=45689464, but overhead is irrelevant to the issue we were discussing, which is about how long it takes to read 100 bytes from memory. Microbenchmarks are generally exactly the right way to answer that question.
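For anyone who wants to try this kind of measurement, here is a minimal sketch of such a microbenchmark. It is not the benchmark from the linked comment: the function names, the 32 MiB buffer size, and the use of a bulk copy as a stand-in for a sequential read are all assumptions made for illustration. The copy also writes a destination buffer, so it overstates the cost of a pure read.

```python
import time


def sequential_ns_per_byte(size_bytes: int = 32 * 1024 * 1024, repeats: int = 5) -> float:
    """Estimate nanoseconds per byte for one sequential pass over a buffer.

    bytes(src) performs one bulk copy, so every source byte is read once
    with almost no per-byte interpreter overhead.
    """
    # Fill with a non-zero byte so the pages are really allocated and touched.
    src = bytearray(b"\xa5" * size_bytes)
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        copy = bytes(src)
        elapsed = time.perf_counter_ns() - start
        assert len(copy) == size_bytes
        # Keep the fastest run; slower runs mostly reflect outside interference.
        best = elapsed if best is None else min(best, elapsed)
    return best / size_bytes


def time_for_record(ns_per_byte: float, record_size: int = 100) -> float:
    """Scale a per-byte cost to one record (100 bytes in this thread)."""
    return ns_per_byte * record_size


if __name__ == "__main__":
    rate = sequential_ns_per_byte()
    print(f"{rate:.4f} ns/byte, about {time_for_record(rate):.2f} ns per 100-byte record")
```

The result depends heavily on the machine and on what else is running, so treat it as a first-order estimate only.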

Memory subsystem bottlenecks are real, but even in real applications, it's common for the memory subsystem to not be the bottleneck. For example, in this case we're discussing system call overhead, which tends to move the system bottleneck inside the CPU (even though a significant part of that effect is due to L1I cache evictions).

Moreover, even if the memory subsystem is the bottleneck, on the system I was measuring, it will not push the sequential memory access time anywhere close to 1 nanosecond per byte. I just don't have enough cores to oversubscribe the memory bus 30×. (1.5×, I think.) Having such a large ratio of processor speed to RAM interconnect bandwidth is in fact very unusual, because it tends to perform very poorly in some workloads.
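The arithmetic behind that can be sketched as follows. The bandwidth figures are hypothetical placeholders, not measurements from the system in question; they are chosen only so that the 30× and 1.5× ratios mentioned above fall out of the calculation.

```python
def ns_per_byte(bandwidth_gb_per_s: float) -> float:
    """Convert bandwidth in GB/s (10**9 bytes per second) to nanoseconds per byte."""
    return 1.0 / bandwidth_gb_per_s


def oversubscription(bus_gb_per_s: float, per_core_gb_per_s: float, cores: int) -> float:
    """Ratio of total demand to bus capacity; never below 1 (no slowdown)."""
    return max(1.0, per_core_gb_per_s * cores / bus_gb_per_s)


def effective_ns_per_byte(bus_gb_per_s: float, per_core_gb_per_s: float, cores: int) -> float:
    """Per-core sequential cost once the shared bus is saturated."""
    factor = oversubscription(bus_gb_per_s, per_core_gb_per_s, cores)
    return ns_per_byte(per_core_gb_per_s) * factor


def oversubscription_needed(target_ns_per_byte: float, per_core_gb_per_s: float) -> float:
    """How far the bus must be oversubscribed to reach a target per-byte cost."""
    return target_ns_per_byte / ns_per_byte(per_core_gb_per_s)


if __name__ == "__main__":
    # Hypothetical machine: 40 GB/s bus, 2 cores that can each pull 30 GB/s.
    print(oversubscription(40.0, 30.0, 2))       # 1.5
    print(effective_ns_per_byte(40.0, 30.0, 2))  # about 0.05 ns/byte
    print(oversubscription_needed(1.0, 30.0))    # about 30
```

Under those assumed numbers, a fully contended bus still leaves each core at roughly 0.05 ns per byte, a factor of about 20 short of 1 ns per byte.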

If microbenchmarks don't give you a pretty good first-order performance estimate, either you're doing the wrong microbenchmarks or you're completely mistaken about what your application's major bottlenecks are (plural, because in a sequential program you can have multiple "bottlenecks", colloquially, unlike in concurrent systems where you almost always have exactly one bottleneck). Both of these problems do happen often, but the good news is that they're fixable. Giving up on microbenchmarking will not fix them.

If you're bottlenecked on a 100 byte read, the app is probably doing something really stupid, like not using syscalls the way they're supposed to. Buffered I/O has existed from fairly early on in Unix history, and it exists because it is needed to deal with the mismatch between what stupid applications want to do versus the guarantees the kernel has to provide for file I/O.
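A rough sketch of what buffering buys for 100-byte records: the same file is read once with one raw read per record and once through a buffered reader, counting how often the raw file object is asked for data. `CountingFileIO`, the helper names, and the 4096-byte buffer size are made up for this illustration, and the counter tracks raw read requests from Python rather than tracing actual system calls.

```python
import io
import os
import tempfile

RECORD_SIZE = 100


class CountingFileIO(io.FileIO):
    """Raw, unbuffered file that counts how often it is asked for data."""

    def __init__(self, path):
        super().__init__(path, "r")
        self.raw_reads = 0

    def readinto(self, buffer):
        self.raw_reads += 1
        return super().readinto(buffer)


def make_record_file(n_records):
    """Create a temporary file holding n_records fixed-size records."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * (RECORD_SIZE * n_records))
    return path


def read_records_unbuffered(path):
    """One raw read per 100-byte record."""
    raw = CountingFileIO(path)
    buf = bytearray(RECORD_SIZE)
    records = 0
    with raw:
        while raw.readinto(buf) == RECORD_SIZE:
            records += 1
    return records, raw.raw_reads


def read_records_buffered(path, buffer_size=4096):
    """Same record loop, but served from a user-space buffer."""
    raw = CountingFileIO(path)
    records = 0
    with io.BufferedReader(raw, buffer_size=buffer_size) as reader:
        while len(reader.read(RECORD_SIZE)) == RECORD_SIZE:
            records += 1
    return records, raw.raw_reads


if __name__ == "__main__":
    path = make_record_file(100)
    try:
        print("unbuffered:", read_records_unbuffered(path))
        print("buffered:  ", read_records_buffered(path))
    finally:
        os.remove(path)
```

Both loops see the same records; the buffered one should reach the raw file far less often, which is the mismatch buffered I/O exists to absorb.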

The main benefit from the mmap approach is that the fast path then avoids all the code the kernel has to execute, the data structures the kernel has to touch, and everything needed to ensure the correctness of the system. In modern systems that means all kinds of synchronization and serialization of the CPU needed to deal with $randomCPUdataleakoftheweek (pipeline flushes ftw!).

However, real applications need to deal with correctness. For example, a real database is not going to just do 100 byte reads of records. It's going to have to take measures (locks) to ensure the data isn't being written to by another thread.
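As a toy illustration of that point (not how any particular database does it), here is a minimal sketch where every 100-byte record read has to take a lock so it never observes a half-finished write. `RecordStore` and its methods are hypothetical names, and a single `threading.Lock` is the simplest possible choice; real systems use finer-grained schemes.

```python
import threading

RECORD_SIZE = 100


class RecordStore:
    """In-memory array of fixed-size records guarded by one lock."""

    def __init__(self, n_records):
        self._data = bytearray(RECORD_SIZE * n_records)
        self._lock = threading.Lock()

    def write_record(self, index, payload):
        if len(payload) != RECORD_SIZE:
            raise ValueError("payload must be exactly RECORD_SIZE bytes")
        start = index * RECORD_SIZE
        half = RECORD_SIZE // 2
        with self._lock:
            # Deliberately a two-step update: without the lock a reader
            # could see the first half new and the second half old.
            self._data[start:start + half] = payload[:half]
            self._data[start + half:start + RECORD_SIZE] = payload[half:]

    def read_record(self, index):
        start = index * RECORD_SIZE
        with self._lock:
            return bytes(self._data[start:start + RECORD_SIZE])


if __name__ == "__main__":
    store = RecordStore(4)
    store.write_record(1, b"A" * RECORD_SIZE)
    print(store.read_record(1)[:8])
```

The lock acquisition is part of the real cost of each read, and a microbenchmark of a bare 100-byte memory read does not include it.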

Rarely is it just a sequential read of the next 100 bytes from a file.

I'm firmly in the camp that focusing on microbenchmarks like this is frequently a waste of time in the general case. You have to look at the application as a whole first. I've implemented optimizations that looked great in a microbenchmark, but showed absolutely no difference whatsoever at the application level.

Moreover, my main hatred for mmap() as a file I/O mechanism is that it moves the context switches that happen when the data is not present in RAM from somewhere obvious (a read() or pread() system call) to somewhere implicit (reading 100 bytes from memory that happens to be mmap()ed and was passed as a pointer to a function written by some other poor unknowing programmer). Additionally, readahead performance for mmap()ed files when bringing data into RAM is quite a bit slower than for read(), in large part because the application is not giving the kernel a hint (the size argument to the read() syscall) about how much data to bring in (and if everything is sequential as you claim, your code really should know that ahead of time).
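To make the contrast concrete, here is a small sketch of the two styles on a Unix-like system. The function names are invented for this example. With `os.pread` the offset and size are stated at the call site, so any blocking happens at an obvious place. With `mmap` the slice looks like a plain memory access, and the most the code can do is pass an advisory hint such as `MADV_SEQUENTIAL`, which the kernel is free to ignore.

```python
import mmap
import os
import tempfile


def read_with_pread(path, offset, size):
    """Explicit read: the syscall, the offset, and the size are all visible here."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)


def read_with_mmap(path, offset, size, sequential_hint=True):
    """Implicit read: the slice may page-fault and block with no syscall in sight."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise is only a hint and is not available on every platform.
            if sequential_hint and hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            return m[offset:offset + size]


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(bytes(range(256)) * 40)
        assert read_with_pread(path, 4000, 100) == read_with_mmap(path, 4000, 100)
        print("both paths returned the same 100 bytes")
    finally:
        os.remove(path)
```

Both functions return the same bytes; the difference is only in where a cold-cache stall would show up and how much the kernel is told in advance.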

So, sure, your 100 byte read in the ideal case when everything is cached is faster, but warming up the cache is now significantly slower. Is shifting costs that way always the right thing to do? Rarely in my experience.

And if you don't think about it (as there's no obvious pread() syscall anymore), those microseconds and sometimes milliseconds to fault in the page for that 100 byte read will hurt you. It impacts your main event loop, the size of your pool of processes / threads, etc. The programmer needs to think about these things, and the article mentioned none of this. This makes me think that the author is actually quite naive and merely proud in thinking that he discovered the magic Go Faster button without having been burned by the downsides that arise in the Real World from possible overuse of mmap().

Perhaps surprisingly, I agree with your entire comment from beginning to end.

Sometimes mmap can be a real win, though. The poster child for this is probably LMDB. Varnish also does pretty well with mmap, though see my caveat on that in my linked comment.

Varnish was very well done. It's disappointing that with HTTPS-first nowadays there is very little opportunity to make good use of local web caches of web content across browsers / clients. Caches would have been a godsend back in the 1990s when we had to use shared dialup to connect to the internet while using Netscape in a classroom full of computers.

Yeah. But it sees a lot of reverse proxy use, and projects like IPFS are exploring the possibilities of securely-locally-cacheable data.