by thyrsus 1804 days ago
Are there recommendations for learning about Linux kernel memory management? Two anecdata:

* I had some compute servers that had been up for 200 days. The customers noticed that they were half as fast as identical hardware that had just been booted. Dropping the file system cache ("echo 3 | sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to that of the newly deployed servers. WTF? File system caches are supposed to be zero-cost discards as soon as processes ask for RAM, but something else is going on. I suspect the kernel is behaving badly with overpopulated RAM management data (TLB entries?), but I don't know how to measure that.

* If that is actually the problem, then a solution might be to shrink that management data by configuring a non-zero number of hugepages ("cat /proc/sys/vm/nr_hugepages"). I'd love to see recommendations on when to use that.
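One rough way to start measuring this is to watch how much memory the kernel reports for page tables, slab, and page cache before and after dropping caches. The sketch below is only an illustration: it parses text in the format of /proc/meminfo, and the sample numbers are made up. The helper names (parse_meminfo, share_of_total) are hypothetical, not part of any tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def share_of_total(fields, keys=("PageTables", "Slab", "SReclaimable",
                                 "SUnreclaim", "Cached")):
    """Return each present key as a percentage of MemTotal."""
    total = fields.get("MemTotal", 0)
    if not total:
        return {}
    return {k: round(100.0 * fields[k] / total, 2)
            for k in keys if k in fields}


# Made-up sample in the /proc/meminfo format, for illustration only.
SAMPLE = """MemTotal:        1000000 kB
Cached:           250000 kB
Slab:             100000 kB
PageTables:         5000 kB
HugePages_Total:       0
"""

sample_fields = parse_meminfo(SAMPLE)
sample_report = share_of_total(sample_fields)
```

On a real node you would pass in the contents of /proc/meminfo, captured once on the slow host and once after the cache drop, and compare the two reports. A large drop in Slab would point at kernel caches rather than the TLB.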

5 comments

I don’t remember the details now, but I’ve seen a situation where a Java app ran slower on a box with more RAM (and probably a bigger heap size) than on a box with the same CPU but half the RAM. I suspected that the TLB cache was the reason, but didn’t have time to test this.
Could have also been compressed OOPs
Explicit hugepages on x86 are difficult to manage. Most people using off-the-shelf software can only take advantage of it by configuring, for example, innodb buffer pools to use them. However if your compute server really is a database, then you'll find the performance benefit is well worth the configuration.

For other processes you'll need a hugepage-aware allocator such as tcmalloc (the new one, not the old one) and transparent hugepages enabled. Again, the benefits of this may be enormous, if page table management is expensive on your services.

You will find a great many blogs on the web recommending disabling transparent hugepages. These people are all misled. Hugepages are a major benefit.
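Before changing anything, it helps to know which THP mode a host is actually in. On Linux the file /sys/kernel/mm/transparent_hugepage/enabled lists the options with the active one in square brackets, for example "always [madvise] never". A minimal sketch of reading that format follows; the function name is hypothetical.

```python
def active_thp_mode(text):
    """Return the bracketed option from text like 'always [madvise] never'.

    Returns None if no option is bracketed.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None
```

On a real host you would pass in the contents of the sysfs file named above. With "madvise", only processes that explicitly ask for huge pages get them, which is where a hugepage-aware allocator comes in.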

THP is a net loss for many workloads, including PG: https://www.percona.com/blog/2019/03/06/settling-the-myth-of...

For workloads that fork and rely on CoW sharing, like Redis or CRuby, it negates the entire benefit of CoW, since flipping a single bit copies the entire huge page.
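To put a number on that, here is a back-of-the-envelope model. It assumes x86-64 page sizes (4 KiB base pages, 2 MiB huge pages) and assumes the first write to a shared page copies the whole page, which is the behaviour described above. It is a simplification for illustration, not a description of what any particular kernel version does.

```python
SMALL_PAGE = 4 * 1024          # assumed base page size: 4 KiB
HUGE_PAGE = 2 * 1024 * 1024    # assumed huge page size: 2 MiB


def cow_bytes_copied(write_offsets, page_size):
    """Bytes copied if the first write to each page copies that whole page."""
    touched_pages = {offset // page_size for offset in write_offsets}
    return len(touched_pages) * page_size


# A single one-byte write after fork:
single_small = cow_bytes_copied([0], SMALL_PAGE)   # 4096
single_huge = cow_bytes_copied([0], HUGE_PAGE)     # 2097152
amplification = single_huge // single_small        # 512
```

Under these assumptions a child that dirties one byte pays for 512 times as much copying with huge pages, so a process that scatters small writes across its heap quickly ends up with a private copy of most of it.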

That's what used to happen but since kernel 5.8, anonymous shared pages that are dirtied by child processes are instead divided into normal pages, in the same way they would be if they were named (file-backed) mappings.
3rd party closed source software; I think it's using the C library malloc - which uses sbrk for small things, but uses mmap for >= 128k. Fun historical fact: the Red Hat/CentOS 5 kernel ulimit didn't limit mmap allocations :-/
Memory fragmentation? Dropping the cache and restarting high mem services at the same time might clear things up.
The sysctl vm.vfs_cache_pressure controls how aggressively the kernel reclaims the dentry and inode caches relative to the page cache and swap.
Are you using any swap? If so, check the swappiness setting
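For checking the two knobs mentioned here (vm.vfs_cache_pressure and vm.swappiness) across a fleet, a small sketch that maps a dotted sysctl name to its file under /proc/sys and reads it. The helper names are hypothetical, and the simple dot-splitting assumes names like the vm.* ones, with no dots inside a component.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as 'vm.swappiness' to its file path."""
    return os.path.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()
```

For example, read_sysctl("vm.swappiness") returns the same value that "sysctl vm.swappiness" prints. The root parameter exists only so the function can be pointed at a copied directory tree when testing.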
No swap. These are large RAM (400G to 1000G) Kubernetes nodes.
This is likely due to a kernel bug that was caused by the way cgroup slab management is handled. Upgrade to 5.10 or later, and it should be fixed. I’d be interested to see if the problem continues.