What you're describing was actually the second problem. The first problem (which ~80% of the article was about) was having ~1k threads reading from mmap'ed files.

When you run out of memory, the kernel is supposed to drop some of those mmap'ed pages (you can always read the file again if you need the data). Since it's a cgroup (Docker), the kernel only scans for droppable pages when the cgroup memory limit is reached. The scan is single-threaded, and the kernel needs two scans with a zero access bit to decide a page is "droppable". OP's container runs ~1k threads on 4 CPUs, so the scan takes minutes of wall-clock time because of thread contention, and it has to scan all pages (having never scanned them before). By the time the second scan runs, the application's memory access pattern (database) has already set most of the access bits again.

Upshot is, even though the container would have plenty of free memory if all its mmap'ed pages were evicted, the kernel ends up repeatedly doing very slow scans that find very few evictable pages. The lock held by the scan causes some other symptoms (like ps hanging).

As far as a kernel patch, I would suggest these mitigation strategies:

- (1) Boost the priority of the scan thread if it doesn't seem to be getting at least ~0.5 core worth of runtime
- (2) Spontaneously initiate a scan of cgroup memory when ~75% (configurable) of its memory limit is used
- (3) Always bring memory usage down to at most ~95% (configurable) of the cgroup limit, randomly picking pages to be evicted if necessary

I say "would suggest" because OP eventually admits "Oh by the way, this issue stopped happening when we upgraded our kernel, and the new version's release notes said they completely redesigned this whole subsystem."
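To make the two-scan rule described above a bit more concrete, here is a rough toy model in Python. It is only a sketch of the idea as summarized in this comment, not how the kernel actually implements reclaim: `count_evictable`, `touch_prob`, and `passes` are made-up names, and `touch_prob` stands in for "how likely the application is to touch a given page between two scans", which grows as the scan gets slower.

```python
import random


def count_evictable(num_pages, touch_prob, passes=3, seed=0):
    """Toy model: a page is evictable once two consecutive scans see its
    accessed bit clear. All names and numbers here are illustrative."""
    rng = random.Random(seed)
    accessed = [True] * num_pages  # every page has been read at least once
    idle_scans = [0] * num_pages   # consecutive scans that saw a clear bit

    for scan in range(passes):
        # One scan pass: clear set bits, count pages whose bit was already clear.
        for page in range(num_pages):
            if accessed[page]:
                accessed[page] = False
                idle_scans[page] = 0
            else:
                idle_scans[page] += 1
        # The application keeps running between scans and re-touches pages.
        if scan < passes - 1:
            for page in range(num_pages):
                if rng.random() < touch_prob:
                    accessed[page] = True

    return sum(1 for n in idle_scans if n >= 2)


if __name__ == "__main__":
    # Quick scans leave little time for pages to be touched again.
    print("fast scan:", count_evictable(10_000, touch_prob=0.05))
    # Scans that take minutes let a database touch most of its pages again.
    print("slow scan:", count_evictable(10_000, touch_prob=0.90))
```

In this model the slow-scan case finds far fewer evictable pages than the fast-scan case, even though the same number of pages could in principle be dropped, which is the failure mode described above.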
The article would have been clearer with something like your summary near the front, to give at least a framework of what to look for further on.
After re-reading, it does sound like Linux 6.1 ended up having a fix, going by this portion near the end: