by ggreer 3939 days ago
Your proposed solution (read from disk on startup and never again) is really a memory-backed data store, not a cache. Caches can miss.

But let's analyze your example. If disk reads take tens of seconds and memory usage is high enough to purge the kernel's disk cache, nothing can save you. Had your process read in everything at the start, it would be using even more memory. Given the same load, one of two things will happen:

1. If you have swap enabled, parts of your process's memory will be swapped out. Accessing "memory" in this case would cause a page fault and tens of seconds of delay.

2. If you have swap disabled, the OOM-killer will reap your process. When it respawns, it's going to read lots of stuff from disk... and disk reads take tens of seconds. Oops.

Even if an application-level data cache improved performance on heavily loaded shared hosts, the added costs of software development and maintenance far exceed the cost of better hardware. Hardware is cheap. Developers are expensive.


Here's an example. You have a 100MB C++ executable that needs 4GB for its own various purposes and 20GB of data that it's serving. The machine has 64GB of memory. If you allocate 24.1GB of memory to the container for this service, disable swap, and mlock the binary and the data files, nothing will go wrong.

On the same machine is a batch process which is reading a 1TB file and writing another 1TB file. If your serving process was reliant on the OS page cache, it would find that its pages were routinely evicted in favor of the batch process.

You're right about swap; that's why only a crank would enable it. The moment at which swap was a reasonable solution was already behind us 20 years ago.

In that example, I'm pretty sure forgoing containers and mlock would result in similar performance while using less memory. Process startup time would also be significantly improved. (If there's such high contention for disk I/O, reading 20GB on startup is going to take a very long time.)

The kernel's page cache eviction strategy is smarter than naïve LRU. On the first read, a page is placed in the inactive file list. If it's read again, it's moved to the active file list. Pages in the inactive file list are purged before the active file list.[1] So large sequential reads may cause disk contention, but they won't massacre the file cache.
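To make the inactive/active distinction concrete, here is a toy simulation. It is only a sketch of the two-list idea described above, not the kernel's actual reclaim code (which also rebalances the lists and can demote active pages). The class names, the capacity, and the workload are made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Naive LRU: every access moves the page to the most-recent end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)
        else:
            self.pages[page] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used

    def __contains__(self, page):
        return page in self.pages


class TwoListCache:
    """Toy model: first read lands on the inactive list, second read promotes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True  # promoted on the second read
        else:
            self.inactive[page] = True
        # Purge from the inactive list first; touch active only if it is empty.
        while len(self.inactive) + len(self.active) > self.capacity:
            victims = self.inactive if self.inactive else self.active
            victims.popitem(last=False)

    def __contains__(self, page):
        return page in self.active or page in self.inactive


def surviving_hot_pages(cache, hot, scan_length):
    """Read the hot set twice, run one big sequential scan, count survivors."""
    for _ in range(2):
        for page in hot:
            cache.access(page)
    for i in range(scan_length):
        cache.access(("scan", i))  # each scanned page is read exactly once
    return sum(1 for page in hot if page in cache)


if __name__ == "__main__":
    hot = list(range(50))
    print("naive LRU:", surviving_hot_pages(LRUCache(100), hot, 10_000))
    print("two-list: ", surviving_hot_pages(TwoListCache(100), hot, 10_000))
```

In this model the scan pages only ever churn through the inactive list, so the hot set that was read twice stays resident, while naive LRU loses all of it.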

This I/O situation isn't uncommon. Consumer systems also have big batch jobs that can pollute file caches: large copies, rsyncs, backup software (Déjà Dup, Time Machine, etc.). They don't solve this with containers, limits, and mlock()ing. Some programs add a couple of calls to posix_fadvise(), using the POSIX_FADV_NOREUSE or POSIX_FADV_DONTNEED flags.[2] But for the most part, doing nothing yields excellent performance. Operating systems are pretty good at their job.

1. https://www.kernel.org/doc/gorman/html/understand/understand...

2. This is handy for applications like bittorrent, where multiple reads of the same page are possible, but caching isn't desired.
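For a batch job like the big copy discussed above, the fadvise() hints might look roughly like this. It is a minimal sketch using Python's os.posix_fadvise wrapper. The function name copy_without_caching, the helper _advise, and the 1 MiB chunk size are invented for the example, and the calls are only advice that the kernel is free to ignore.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def _advise(fd, advice_name):
    """Apply a posix_fadvise hint to the whole file, if the platform has it."""
    if hasattr(os, "posix_fadvise") and hasattr(os, advice_name):
        # offset=0, len=0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, getattr(os, advice_name))


def copy_without_caching(src_path, dst_path):
    """Copy src to dst, then hint that neither file's pages will be reused."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        _advise(src.fileno(), "POSIX_FADV_SEQUENTIAL")
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
        # Dirty pages can't be dropped, so write them out before advising.
        dst.flush()
        os.fsync(dst.fileno())
        _advise(src.fileno(), "POSIX_FADV_DONTNEED")
        _advise(dst.fileno(), "POSIX_FADV_DONTNEED")
    return copied
```

Calling copy_without_caching("big.in", "big.out") (hypothetical paths) behaves like an ordinary copy; the only difference is that the kernel is told it can drop those pages early instead of keeping them around.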

If only O_STREAMING had made it to the kernel... https://lwn.net/Articles/12100/