by geofft 1979 days ago
This is technically true, but the use case we're talking about is programs that are much smaller than their data. Postgres, for instance, is under 50 MB, but is often used to handle databases in the gigabyte or terabyte range. You can mlockall() the binary if you want, but you probably can't fit the entire database into RAM even if you wanted to.
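
To make the mlockall() point concrete, here is a minimal sketch, assuming Linux with glibc. The function name pin_process_memory is made up for illustration. The call locks everything the process has mapped (code, data, stack, shared libraries), and it commonly fails when RLIMIT_MEMLOCK is low, so the result has to be checked.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: ask the kernel to keep every page this process
 * has mapped (and every page it maps later) resident in RAM.
 * Returns 0 on success, -1 if the kernel refused, for example because
 * RLIMIT_MEMLOCK is too low or the process lacks CAP_IPC_LOCK. */
int pin_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        fprintf(stderr, "mlockall failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

This only works because the binary is small. Locking a multi-gigabyte mapping the same way would run into the memory limit almost immediately.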

Also, when processing a large data file (say you're walking a B-tree, or even just searching on an unindexed field), the code you're running tends to be a small loop within the same few pages. It might never leave the CPU's cache, let alone get swapped out of RAM. The data is another matter: you need to access a very large amount of it, so it's much more likely that the data you want has been swapped out. If you know some things about the data structure (e.g., there's an index or lookup table somewhere you care about, but you're traversing each node only once), you can use that to optimize which things are flushed from your cache and which aren't.
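
One way to pass that kind of knowledge to the kernel is madvise() on a memory-mapped file. The sketch below assumes a hypothetical file layout in which the first index_len bytes are the index and the rest is node data read once; the function name map_with_hints and that layout are illustrative assumptions, not anything a particular database does. The hints are advisory, so the kernel may ignore them.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only, then hint that the leading
 * index region should be brought in and that the remaining node data
 * will be read sequentially (so its pages can be dropped soon after use).
 * Returns the mapping and stores its length in *out_len, or NULL on error. */
void *map_with_hints(const char *path, size_t index_len, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    size_t len = (size_t)st.st_size;

    void *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return NULL;

    /* madvise() needs page-aligned start addresses, so round the index
     * region up to a whole number of pages. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t hot = (index_len + page - 1) / page * page;
    if (hot > len)
        hot = len;

    /* Advisory only: failures are deliberately ignored. */
    if (hot > 0)
        (void)madvise(base, hot, MADV_WILLNEED);
    if (len > hot)
        (void)madvise((char *)base + hot, len - hot, MADV_SEQUENTIAL);

    *out_len = len;
    return base;
}
```

The caller releases the mapping with munmap(base, len) when done. If the index really must stay resident, mlock() on just that region is the stronger option, at the cost of counting against the locked-memory limit.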

1 comment

Indeed. It's a question of scale: I write programs that can't afford to get blocked behind IO, ever, and at that level, I need to pay attention to things like code paging, and even more esoteric things like synchronous reclaim.

If you're just optimizing stuff generally instead of trying to guarantee invariants, sure, ignore code paging and use direct IO for your own data.
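
For reference, direct IO on Linux looks roughly like the sketch below. It assumes a 4096-byte alignment is enough for the buffer and read length, which is typical but depends on the device and filesystem. The names read_direct and DIO_ALIGN are made up for illustration, and the fallback to a buffered read when O_DIRECT is rejected is a design choice of this sketch, not a requirement.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0 /* assumption: without O_DIRECT, degrade to buffered IO */
#endif

#define DIO_ALIGN 4096 /* assumed alignment; real requirements vary */

/* Open, read once from offset 0, close. Preserves errno from read(). */
static ssize_t read_once(const char *path, int flags, void *buf, size_t cap)
{
    int fd = open(path, flags);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, cap);
    int saved = errno;
    close(fd);
    errno = saved;
    return n;
}

/* Hypothetical helper: read up to `len` bytes from the start of `path`,
 * bypassing the page cache when the filesystem allows it. On success,
 * returns the byte count and stores a malloc-style buffer in *out_buf
 * (release it with free()). Returns -1 on error. */
ssize_t read_direct(const char *path, void **out_buf, size_t len)
{
    /* O_DIRECT wants the buffer and length aligned, so round up. */
    size_t cap = (len + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
    void *buf = NULL;
    if (cap == 0 || posix_memalign(&buf, DIO_ALIGN, cap) != 0)
        return -1;

    ssize_t n = read_once(path, O_RDONLY | O_DIRECT, buf, cap);
    if (n < 0 && errno == EINVAL) {
        /* Some filesystems reject O_DIRECT; fall back to a buffered read. */
        n = read_once(path, O_RDONLY, buf, cap);
    }
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out_buf = buf;
    return n;
}
```

With the page cache out of the picture, caching your own data becomes the program's job, which is the trade being described here.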