by ggreer 3940 days ago
I agree with pretty much everything in this post, though I would add one more thing. It's not so much a downside of caching as a misuse: Application-level caches should never cache local data. Cache network responses. Cache the results of computations. Don't cache files or disk reads. Operating systems already implement disk caches, and they do a better job of it than you. That's in addition to a modern computer's numerous hardware caches. For example, take this code:

    ...
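    FILE *fp = fopen("example.txt", "r");
    char dest;
    /* fread returns a size_t item count; skip the read if fopen failed */
    size_t bytes_read = fp ? fread(&dest, 1, 1, fp) : 0;
    if (bytes_read == 1)
        putchar(dest);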
    ...
Think of how many caches likely contain the first byte of example.txt. There's the internal cache on the hard disk or SSD. There's the OS's filesystem cache in RAM. There's your copy (dest) in RAM, and also in L3, L2, and L1 cache. (These aren't inclusive on modern Intel CPUs. I'm just talking about likelihood.) Implementing your own software RAM cache puts you well into diminishing returns. The increased complexity simply isn't worth it.

I wouldn't say "never". Let's say you have a local file with 1e6 words. The file can be updated at some point. Your service gets a word in a request and needs to return "is this word in the list".

Do you really want to read the file every time a request comes in? No, you're going to read it once and store it in an indexed set for quick lookup. You just cached a local data file.

It's about the benefit vs. not caching. Not about local/remote.

Your example is quite valid, and I would probably implement something similar to solve the same problem. But it's not a cache. Caches can miss. Caches have a replacement policy. If it contains the complete, authoritative copy of the data, it's a memory-backed data store.
Re replacement policy - that's why I mentioned that the file can be changed. You'll need an mtime/inode/time check on each request / periodically.
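For what it's worth, here is a rough sketch of the word-list service discussed above, with the mtime check on each lookup. It is only an illustration: the names (word_set, word_set_load, word_set_contains) are made up, a sorted array plus bsearch stands in for the "indexed set", and it assumes one word per line with no word longer than 254 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical in-memory index of a one-word-per-line file. */
struct word_set {
    char **words;  /* sorted in strcmp order */
    size_t count;
    time_t mtime;  /* file mtime at the last successful load */
};

static int cmp_words(const void *a, const void *b) {
    return strcmp(*(char *const *)a, *(char *const *)b);
}

void word_set_free(struct word_set *ws) {
    for (size_t i = 0; i < ws->count; i++)
        free(ws->words[i]);
    free(ws->words);
    ws->words = NULL;
    ws->count = 0;
}

/* Replace the contents of ws with the words in path. Returns 0 on success;
 * on failure ws keeps its previous contents. */
int word_set_load(struct word_set *ws, const char *path) {
    struct stat st;
    FILE *fp;
    if (stat(path, &st) != 0 || (fp = fopen(path, "r")) == NULL)
        return -1;
    struct word_set fresh = { NULL, 0, st.st_mtime };
    size_t cap = 0;
    char line[256];  /* assumption: no word is longer than 254 bytes */
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';
        if (line[0] == '\0')
            continue;
        if (fresh.count == cap) {
            size_t new_cap = cap ? cap * 2 : 1024;
            char **grown = realloc(fresh.words, new_cap * sizeof *grown);
            if (grown == NULL)
                goto fail;
            fresh.words = grown;
            cap = new_cap;
        }
        char *copy = malloc(strlen(line) + 1);
        if (copy == NULL)
            goto fail;
        strcpy(copy, line);
        fresh.words[fresh.count++] = copy;
    }
    fclose(fp);
    if (fresh.count > 1)
        qsort(fresh.words, fresh.count, sizeof *fresh.words, cmp_words);
    word_set_free(ws);
    *ws = fresh;
    return 0;
fail:
    fclose(fp);
    word_set_free(&fresh);
    return -1;
}

/* Reload if the file's mtime changed, then binary-search for word.
 * ws must start zero-initialized. mtime has one-second granularity here,
 * so two writes within the same second can go unnoticed. */
int word_set_contains(struct word_set *ws, const char *path, const char *word) {
    assert(ws != NULL && path != NULL && word != NULL);
    struct stat st;
    if (stat(path, &st) == 0 && st.st_mtime != ws->mtime)
        (void)word_set_load(ws, path);  /* on failure, keep serving old data */
    if (ws->count == 0)
        return 0;
    return bsearch(&word, ws->words, ws->count, sizeof *ws->words,
                   cmp_words) != NULL;
}
```

Typical use would be a zero-initialized `struct word_set ws = {0};` kept for the life of the process, with `word_set_contains(&ws, "words.txt", request_word)` called per request. The stat() on every lookup is cheap compared to re-reading 1e6 words, but a busy service might prefer checking periodically instead.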

Caches can miss? I don't think that's required. It can miss in a general sense, as in it needs to be lazy loaded. But I'd still call it caching if you're getting a single value. For example, you can still cache the whole front page with server-side push into the cache. You also can't miss in that case.

But yeah, that's just details. Cache/memory structure is a rather vague separation.

I wouldn't call that a cache. It's certainly computed state, of the kind that can be thrown away and recomputed à la Crash-Only programming, but in my experience a "cache" is supposed to be transparent: to decorate a file/device/socket/RPC service/object/whatever and expose the same semantics as that whatever, but more performantly. Your indexed set doesn't expose the same semantics as the data file it was constructed from.
I don't think that's very good advice in a heavily-loaded shared hosting environment. A disk read could easily stall for tens of seconds, just because the kernel whimsically decided to throw out the cache (or because your server crowded its memory container). I actually don't want any server touching a disk while it's serving. Everything should be read before service begins and never again.
Your proposed solution (read from disk on startup and never again) is really a memory-backed data store, not a cache. Caches can miss.

But let's analyze your example. If disk reads take tens of seconds and memory usage is high enough to purge the kernel's disk cache, nothing can save you. Had your process read in everything at the start, it would be using even more memory. Given the same load, one of two things will happen:

1. If you have swap enabled, parts of your process's memory will be swapped-out. Accessing "memory" in this case would cause a page fault and tens of seconds of delay.

2. If you have swap disabled, the OOM-killer will reap your process. When it respawns, it's going to read lots of stuff from disk... and disk reads take tens of seconds. Oops.

Even if an application-level data cache improved performance on heavily-loaded shared hosts, the added costs of software development and maintenance far exceed the cost of better hardware. Hardware is cheap. Developers are expensive.

Here's an example. You have a 100MB C++ executable that needs 4GB for its own various purposes and 20GB of data that it's serving. The machine has 64GB of memory. If you allocate 24.1GB of memory to the container for this service, disable swap, and mlock the binary and the data files, nothing will go wrong.
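To make the "mlock the data files" part concrete, here is a minimal sketch of mapping one data file read-only and pinning it with mlock(). The struct and function names are hypothetical, and the sketch treats a failed mlock() as non-fatal, which is a choice made for illustration rather than anything the setup above requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only mapping of a data file. */
struct pinned_file {
    void *addr;
    size_t len;
    int locked;  /* 1 if mlock() succeeded, 0 otherwise */
};

/* Map path read-only and try to pin its pages in RAM.
 * Returns 0 if the mapping was created, -1 otherwise. */
int pinned_file_open(struct pinned_file *pf, const char *path) {
    assert(pf != NULL && path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);  /* mmap rejects zero-length mappings */
        return -1;
    }
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED)
        return -1;
    pf->addr = addr;
    pf->len = (size_t)st.st_size;
    /* mlock() can fail when the locked-memory limit (RLIMIT_MEMLOCK) is
     * too low; this sketch reports that and carries on unpinned. */
    pf->locked = (mlock(addr, pf->len) == 0);
    if (!pf->locked)
        perror("mlock");
    return 0;
}

void pinned_file_close(struct pinned_file *pf) {
    munmap(pf->addr, pf->len);  /* unmapping also removes the lock */
    pf->addr = NULL;
    pf->len = 0;
    pf->locked = 0;
}
```

For 20GB of data the locked-memory limit would have to be raised well above its usual default. Pinning the executable's own pages is more commonly done with mlockall(MCL_CURRENT | MCL_FUTURE) early in main().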

On the same machine is a batch process which is reading a 1TB file and writing another 1TB file. If your serving process was reliant on the OS page cache, it would find that its pages were routinely evicted in favor of the batch process.

You're right about swap; that's why only a crank would enable swap. The moment at which swap was a reasonable solution was already behind us 20 years ago.

In that example, I'm pretty sure forgoing containers and mlock would result in similar performance while using less memory. Process startup time would also be significantly improved. (If there's such high contention for disk I/O, reading 20GB on startup is going to take a very long time.)

The kernel's page cache eviction strategy is smarter than naïve LRU. On the first read, a page is placed in the inactive file list. If it's read again, it's moved to the active file list. Pages in the inactive file list are purged before the active file list.[1] So large sequential reads may cause disk contention, but they won't massacre the file cache.

This I/O situation isn't uncommon. Consumer systems also have big batch jobs that can pollute file caches: large copies, rsyncs, backup software (Déjà Dup, Time Machine, etc). They don't solve this with containers, limits, and mlock()ing. Some programs add a couple calls to posix_fadvise(), using the POSIX_FADV_NOREUSE or POSIX_FADV_DONTNEED flags.[2] But for the most part, doing nothing yields excellent performance. Operating systems are pretty good at their job.
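As an illustration of those couple of calls, here is a sketch of a one-pass file read that hints to the kernel that the pages won't be needed again. The function name is hypothetical and the hints are purely advisory: how much POSIX_FADV_NOREUSE actually does depends on the kernel version, so the return values are ignored here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read a whole file once (as a backup or checksum
 * tool might) while asking the kernel not to keep its pages cached.
 * Returns the number of bytes read, or -1 on error. */
long long stream_file_once(const char *path) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0, len 0 means "the whole file". posix_fadvise() returns an
     * error number rather than setting errno; it is only a hint, so this
     * sketch ignores failures. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

    char buf[1 << 16];
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;  /* a real program would process buf here */

    /* Done with the data: suggest dropping this file's cached pages. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return n < 0 ? -1 : total;
}
```

One caveat with this approach: POSIX_FADV_DONTNEED applies to the file's cached pages, not just to this process's reads, so using it on a file that other processes are actively reading can hurt them.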

1. https://www.kernel.org/doc/gorman/html/understand/understand...

2. This is handy for applications like bittorrent, where multiple reads of the same page are possible, but caching isn't desired.

If only O_STREAMING had made it to the kernel... https://lwn.net/Articles/12100/
In other words, you're advocating using more memory in a shared environment just so your server can (try to) be faster? What happens if everyone else does the same? Everyone ends up competing for the limited amount of memory, and no one wins.

"I'll grab all the memory I can so others can't use it" is a horrible way to think, as anyone who has attempted to simultaneously use multiple applications written with this mindset will know. One takes most of the memory, forcing other apps into swap, and then the opposite happens when you start working with one of the others, accompanied by massive swapping slowdowns.

Can I suggest that if you've got problems including "stalls for tens of seconds on disk reads" you are almost certainly better off directing available resources towards fixing your hosting problems, rather than going down the cache rabbit hole on a hosting platform that's not really suitable for production use?

(With caveats for zero-resource projects of course, but even for those I strongly suspect for many people paying $5 or $10 per month for "less crap hosting" is probably a better solution than prematurely optimising by adding caching and all its inherent complexity to a fundamentally broken platform.)

There's nothing you or I can do about the trend. They put more and more cores into a machine and the same number of disks (current-model Xeon servers have 72 threads and 1 or 2 disks), which guarantees that, at some point, the disk is highly oversubscribed.
Sure - the "race to the bottom" for hosting prices inevitably means there are going to be options like GoDaddy offering "a year's worth of webhosting for $5" which clusters 400 WordPress and Drupal sites onto a single Raspberry Pi or similar, but you don't _have_ to go there.

I can understand if you're an open source developer who gets paid in Uzbeki Som or Nigerian Naira, the calculation of "do I spend a day or two putting caching in place" versus "do I spend an extra $50 or $100 per year on hosting" might lean very much the other way, but I suspect for the vast majority of HN readers, the prudent approach is "pay a hundred or two dollars a year for hosting before bothering to implement complex caching strategies".

In that sort of environment, I wouldn't be surprised if your app's internal cache ended up paged out anyway...
I don't allow swap on my machines, and I mlock executable pages, so I'd personally be surprised if anything was paged out.