by araes
475 days ago
The article ended up being fairly interesting, although it involved so much investigation that I'd forgotten what I was reading by the end. The general idea is worth a look (apparently there's an issue open). The final result was (I think...) that the Least Recently Used (LRU) page-eviction code needs a spinlock to actually swap out memory pages, and there's a huge amount of contention on it. "during 3 seconds there were at least 138 threads active. 84% of stacktraces have 'evict_folios' frame according to the flamegraph, so it is very likely that more than 100 threads are constantly trying to do something with the spinlock."
So, basically 100+ threads constantly fighting over evict_folios and lru_lock, and it seems (although I admit my eyes started glazing over) they're all fighting over the same memory page regions every time they try to take lru_lock (the spinlock they need in order to release memory pages).
Note: This is way outside my usual programming realm, so if somebody has a clearer / better explanation / summary ...
|
The first problem (that ~80% of the article was about) was having ~1k threads reading from mmap'ed files. When you run out of memory, the kernel's supposed to drop some of those mmap'ed pages (you can always read the file again if you need the data).
Since it's a cgroup (Docker), the kernel only scans for droppable pages when the cgroup memory limit is reached. The scan is single-threaded, and the kernel needs two scans with a zero access bit to decide a page is "droppable".
OP's container runs ~1k threads on 4 CPUs, so the scan takes minutes of wall-clock time because of thread contention, and it has to scan all pages (having never scanned them before). By the time the second scan runs, the application's memory access pattern (database) has already set most of the access bits again.
Upshot is, even though the container would have plenty of free memory if all its mmap'ed pages were evicted, the kernel ends up repeatedly doing very slow scans that end up finding very few evictable pages. The lock held by the scan causes some other symptoms (like ps hanging).
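To make the "two scans with a zero access bit" problem concrete, here is a toy simulation. It is only a sketch of the behaviour described above, not the kernel's actual algorithm; the function name, page counts, and the uniform random access pattern are all made up for illustration.

```python
import random

def evictable_after_scans(n_pages, touches_between_scans, seed=0):
    """Toy model: a page counts as evictable once two consecutive scans
    both find its access bit clear. Not the real kernel logic."""
    rng = random.Random(seed)
    accessed = [True] * n_pages   # never scanned before, so every bit is set
    idle_scans = [0] * n_pages    # consecutive scans that saw a clear bit

    def scan():
        for page in range(n_pages):
            if accessed[page]:
                idle_scans[page] = 0
                accessed[page] = False   # the scan clears the bit it just read
            else:
                idle_scans[page] += 1

    def run_app():
        # The application touches random pages while the scanner is stalled.
        for _ in range(touches_between_scans):
            accessed[rng.randrange(n_pages)] = True

    scan()       # first pass only clears the bits
    run_app()
    scan()       # pages still untouched have now been seen idle once
    run_app()
    scan()       # pages untouched again have been seen idle twice
    return sum(1 for count in idle_scans if count >= 2)

# Quick scans (app barely runs in between) vs. scans that take "minutes".
quick = evictable_after_scans(n_pages=1000, touches_between_scans=0)
slow = evictable_after_scans(n_pages=1000, touches_between_scans=10_000)
```

With no accesses between scans every page ends up evictable; once the gap between scans is long enough for the app to touch most pages, almost nothing qualifies, which matches the "very slow scans that find very few evictable pages" symptom.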
As for a kernel patch, I would suggest these mitigation strategies:
- (1) Boost the priority of the scan thread if it doesn't seem to be getting at least ~0.5 core worth of runtime
- (2) Spontaneously initiate a scan of cgroup memory when ~75% (configurable) of its memory limit is used
- (3) Always bring memory usage down to at most ~95% (configurable) of the cgroup limit, randomly picking pages to be evicted if necessary
I say "would suggest" because OP eventually admits "Oh by the way, this issue stopped happening when we upgraded our kernel, and the new version's release notes said they completely redesigned this whole subsystem."
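For what it's worth, suggestions (2) and (3) boil down to two thresholds. A rough sketch of that policy in userspace Python, purely to illustrate the idea: the function names and percent parameters are hypothetical, and nothing here corresponds to an actual kernel interface.

```python
import random

def should_start_scan(used_pages, limit_pages, scan_percent=75):
    """Suggestion (2): start scanning early, before the hard limit is hit."""
    return used_pages * 100 >= limit_pages * scan_percent

def pick_random_victims(resident_pages, limit_pages, cap_percent=95, rng=None):
    """Suggestion (3): choose random pages to evict so that usage drops
    to at most cap_percent of the limit. Returns [] if already under the cap."""
    rng = rng or random.Random()
    target = limit_pages * cap_percent // 100
    excess = len(resident_pages) - target
    if excess <= 0:
        return []
    return rng.sample(resident_pages, excess)

# A cgroup sitting exactly at a 1000-page limit.
resident = list(range(1000))
victims = pick_random_victims(resident, limit_pages=1000, rng=random.Random(0))
```

Random eviction is obviously crude (it can throw out hot pages), but it guarantees forward progress even when the access-bit scan finds nothing.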