by quazeekotl 2515 days ago
Unfortunately, enabling swap in Linux has a very annoying side effect: Linux will preferentially push out pages of running programs that have been untouched for X time to make room for more disk cache, pretty much no matter how much RAM you have.

This comes into play when you copy or access huge files that are going to be read exactly once: they will start pushing untouched program pages out to disk in exchange for disk cache that is completely, 100% useless, even to the tune of hundreds of gigabytes of it.

Programs can reduce the problem with madvise(MADV_DONTNEED), but that only applies to files you are mmap()ing, and every single program under the sun needs to be patched to issue these calls.
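A minimal sketch of what such a patched program might look like, in Python for brevity. The function name `checksum_mapped_file` is made up for illustration, and `mmap.madvise` needs Python 3.8+ on a platform that supports it. Note that for a shared file mapping, MADV_DONTNEED only tells the kernel this process no longer needs its mapping of those pages; whether the cached pages are actually reclaimed is still the kernel's decision.

```python
import mmap
import os

def checksum_mapped_file(path, chunk=1 << 20):
    """Sum every byte of a file through a read-only mmap, then hint
    that the mapped pages are no longer needed by this process."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # mmap cannot map an empty file
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, size, chunk):
                total += sum(m[off:off + chunk])
            # Advisory only; skipped where the platform lacks madvise.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_DONTNEED"):
                m.madvise(mmap.MADV_DONTNEED)
    return total
```

This only covers the mmap() case; plain read() callers need a different call.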

You can adjust the vm.swappiness sysctl to make X larger, but no matter what, programs will start to get pushed out to disk eventually, and cause unresponsiveness when activated. You can reduce vm.swappiness to 1, but if you do, the system only starts swapping in an absolutely critical low-RAM situation, and you encounter anywhere from 5 minutes to 1+ hour of total, complete unresponsiveness when that happens.
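For anyone checking their own system: the knob is spelled vm.swappiness and is exposed under /proc/sys/vm on Linux. A small sketch for reading it; the helper name `read_vm_sysctl` and its `root` parameter are made up here, the latter only so the function can be pointed at a test directory.

```python
from pathlib import Path

def read_vm_sysctl(name, root="/proc/sys/vm"):
    """Return the integer value of a vm.* sysctl by reading its file."""
    return int(Path(root, name).read_text().strip())
```

On a Linux box, `read_vm_sysctl("swappiness")` returns the current setting. Changing it needs root, e.g. via `sysctl -w vm.swappiness=...`.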

There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

8 comments

> There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

Here's the thing: a mapped program page is just another page in the page cache. Now, you could maybe say that "any page cache page that is mapped into at least one process will be pinned", but the problem there is that means that any unprivileged process can then pin an unlimited amount of memory, which is an obvious non-starter.

A workable alternative might be to add an extended file attribute like 'privileged.pinned_mapping', which if set indicates that any pages of the file that have active shared mappings are pinned. That means the superuser can go along and mark all the normal executables in this way, and the worst-case memory consumption a user can cause is limited by the total size of all the executables marked in this way that the user has access to.

SuSE solves this in their SuSE Linux Enterprise Server (SLES) with a new sysctl tunable, which soft-limits the size of the page cache.

https://www.suse.com/documentation/sles-for-sap-12/book_s4s/...

It is quite effective, although historically there have been issues with bugs causing server lockups in the kernel code around this tunable. It seems to be quite stable in SLES 15, however.

While the tunable is available in their regular SLES product, it is only supported in the "SLES for SAP". The two share the same kernel, that is probably why.

There's no reason extra data cannot be added to entries in the page cache to make smarter decisions. That’s how Windows and OS X do it in their equivalent subsystems.

Nobody is suggesting these pages be pinned, which is an extreme measure.

The problem I'm trying to point out here is that if the extra metadata in the page cache is entirely under user control (like for example "is mapped shared" and/or "is mapped executable") then it amounts to a user-specified QOS flag.

That might be OK on a single-user system but it doesn't fly on a multi-user one. That's why I suggested you could gate that kind of thing behind some kind of superuser control.

Why can’t a user make QoS decisions for their own pages? Root controlled pages should obviously have higher priority.

The kernel could still “fairly” evict pages across users - just letting them choose which N pages they prefer to go first.

> Why can’t a user make QoS decisions for their own pages?

Because then you just get everyone asking for maximum QOS / don't-page-me-out on everything they can.

The pages in the page cache are not owned by a particular user, they're shared. If there are three users running /usr/bin/firefox, they'll all have shared read-only executable mappings of the same page cache pages. If you do a buffered read of a file immediately after I do the same, we both get our data copied from the same page cache page. So it's not at all clear how you'd do the accounting on this to implement that user-based fairness criterion.

> but that only applies to files you are mmap()ing

fadvise provides the same for file descriptors. some tools such as rsync make use of it to prevent clobbering the page cache when streaming files.
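A rough sketch of the pattern for a read-once stream, using Python's `os.posix_fadvise` wrapper (Unix only). The function name `stream_once` and the 1 MiB chunk size are arbitrary choices for illustration, not taken from rsync or any other tool.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; arbitrary

def stream_once(path, sink):
    """Read a file sequentially, hand each chunk to sink(), and advise
    the kernel that the bytes already consumed will not be needed again."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 means "to end of file": hint sequential access so
        # kernel readahead can stay aggressive.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            data = os.read(fd, CHUNK)
            if not data:
                break
            sink(data)
            # Advisory: the range just consumed can be dropped from cache.
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
        return offset
    finally:
        os.close(fd)
```

Reads stay buffered, so readahead still works; the advice is only a hint and the kernel is free to ignore it.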

nice! was not aware of that syscall, however, patching the entire world remains...
it might be possible to create an LD_PRELOAD'd library that wraps open-type syscalls (i.e. those that return an fd; might just be open, I haven't kept up with all of Linux's syscalls) and that, based on a config file, calls fadvise on those fds that correspond to specific files/paths on disk. Won't help for statically linked binaries or those that call syscalls directly without glibc's shims, but that should be a small number of programs.

heck, if I were still a phd student, I'd want to run performance numbers on this in many different scenarios and see how performance behaves. feel like there could be a paper here.

@the8472, hah, so someone thought of the same thing. I'd try to leverage /etc/ld.so.preload with a config file as a more transparent solution, but your link proves the point that it's possible.

You probably don't want it in ld preload globally because it would also clobber the page cache in processes that do benefit from it.

And if you only do it in a container you can also limit the page cache size of the container to avoid impacting the other workloads.

hence why I said config-file based (i.e. include the path that stores one's media; it won't matter what program you use to play it), but yes, page cache does play a role (hence why I also said it would be interesting to explore how different applications behave with and without it and how that impacts other system performance)
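To make the config-file idea concrete, here is a sketch of just the decision step: given a list of path prefixes (as they might be read from such a config file), decide whether a file being opened should get the fadvise treatment. It is in Python and is not a preload shim itself; the name `should_bypass_cache` and the example paths are hypothetical.

```python
import os

def should_bypass_cache(path, prefixes):
    """True if path lies under any configured directory prefix."""
    real = os.path.realpath(path)
    for prefix in prefixes:
        base = os.path.realpath(prefix)
        # commonpath compares whole path components, so /srv/media
        # does not accidentally match /srv/mediax.
        if os.path.commonpath([real, base]) == base:
            return True
    return False
```

A real wrapper would run this check inside its open() hook and only then issue the fadvise call, leaving executables and other hot files alone.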

i.e. I really wonder, for desktop workloads, if one only caches "executable" data, how that would negatively impact perceived performance. I'd imagine it would have some impact, but I'd be interested in seeing it quantified.

Can't the file cache detect streaming loads and skip caching it? ZFS does this for its L2ARC[1].

[1]: https://wiki.freebsd.org/ZFSTuningGuide#L2ARC_discussion

Is there a way to enable such an option for an entire process, in the same way as e.g. ionice(1)?

Whenever I take a backup of my computer it winds up swapping everything else out to disk. Normally I'm perfectly happy letting unused pages get evicted in favor of more cache, but for this specific program this behavior is very much not ideal. I'm asking here since I've done some searching in the past and not found anything, but I'm not sure if I was using the right keywords.

Follow-up, for people who encounter this thread in the future: I did some more hunting and found `nocache` (https://github.com/Feh/nocache , though I installed it via the Ubuntu repositories) which does this by intercepting the open() system call and calling fadvise() immediately afterwards.

> Unfortunately, enabling swap in Linux has a very annoying side effect: Linux will preferentially push out pages of running programs that have been untouched for X time to make room for more disk cache, pretty much no matter how much RAM you have.

THIS. I ended up disabling swap because my kernel insisted on essentially reserving 50% of RAM for disk buffers, meaning even with 16GiB of RAM, I'd have processes getting swapped out and taking forever to run, because everything was stuck in 8GiB of RAM, and Firefox was taking 6GiB of that. I couldn't for the life of me figure out a way to get Linux to make that something more reasonable, like 20%. (And yes, I tried playing with `vm.swappiness`.)

Programs should really use unbuffered i/o for large files read only once (yes i know Linus doesn’t like unbuffered i/o but he’s wrong)

> This comes into play when you copy or access huge files that are going to be read exactly once

Readahead is still useful for large files read sequentially once, and that needs to be buffered. Such programs should use posix_fadvise().

you can readahead as far as you like with unbuffered io

If you are reading unbuffered (i.e. O_DIRECT) then you are reading directly into the memory block the user supplied, so you cannot read ahead - there's nowhere to put the extra data.

Of course you can read ahead using multiple buffers, you can issue as many reads as you want concurrently

I think it is pretty clear I was referring to kernel-mediated readahead. Sure, you can achieve the same thing in userspace using async IO or threads.

> There _NEEDS_ to be a setting where program pages don't get pushed out for disk cache, period, unless approaching a low ram situation, but BEFORE it causes long periods of total crushing unresponsiveness.

Did you try different vm.vfs_cache_pressure values?

Probably the best solution to this is something like memlockd, where you explicitly tell the kernel what memory must always be in the resident set.