by afr0ck
657 days ago
Because it speeds up virtual address translation [1]. When a program makes a memory access, the CPU must first translate the virtual address to a physical address by walking a hierarchical data structure called a page table [2]. Walking the page tables is slow, so CPUs implement a small on-CPU cache of virtual-to-physical translations called a TLB [1]. The TLB has a limited number of entries for each page size. With 4 KiB pages, contention on this cache is very high, especially if the workload has a very large working set, which causes frequent TLB evictions and slow page table walks. With 2 MiB or 1 GiB pages, there is less contention and the TLB covers more of the working set. For example, with 4 KiB pages a TLB with 1024 entries can cover at most 4 MiB of working set memory. With 2 MiB pages, it can cover up to 2 GiB. In practice, the CPU often has a different number of entries for each page size.

However, larger page sizes are known to have higher internal fragmentation and thus waste memory. It's a trade-off. Generally speaking, though, on modern systems the overhead of managing memory in 4 KiB pages is very high, and we are at a point where switching to 16/64 KiB is almost always a win. 2 MiB is still a bit of a stretch, but transparent 2 MiB pages for heap memory, aka THP [2], are enabled by default on most major Linux distributions.

Source: my PhD is on memory management and address translation on large memory systems, having worked both on hardware architecture of address translation and TLBs as well as the Linux kernel. I'm happy to talk about this all day!

[1] https://blogs.vmware.com/vsphere/2020/03/how-is-virtual-memo...
[2] https://docs.kernel.org/admin-guide/mm/transhuge.html
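
The coverage figures quoted above are simply the number of TLB entries multiplied by the page size. A minimal sketch of that arithmetic follows; the function names `tlb_reach` and `human` are made up for illustration, and the 1024-entry TLB is the hypothetical one from the comment, not a specific CPU.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page of page_size bytes."""
    return entries * page_size


def human(n: int) -> str:
    """Format a power-of-two byte count with binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} PiB"


if __name__ == "__main__":
    entries = 1024  # hypothetical TLB size, as in the comment above
    for page_size in (4 * KIB, 16 * KIB, 64 * KIB, 2 * MIB):
        reach = tlb_reach(entries, page_size)
        print(f"{human(page_size):>7} pages: reach = {human(reach)}")
```

Real TLBs are usually multi-level and have separate entry counts per page size, so this gives only an upper bound for an idealized single-level TLB.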
Oh really :)
I'd like to ask how applications should change their memory allocation or usage patterns to maximise the benefit of THP. Do memory allocators (mainly glibc) need config tweaking to coalesce tiny mallocs into 2MB+ mmaps, will they just always do that automatically, do you need to use a custom pool allocator so you're doing large allocations, or are you never going to get the full benefit of huge pages without madvise/libhugetlbfs? And does this apply to Mac/Windows/*BSD at all?
[Edit: ouch, I see /sys/kernel/mm/transparent_hugepage/enabled defaults to 'madvise' on my system (Slackware), and as a result THP is doing nearly nothing. But I have seen it enabled in the past. Well, that answers a lot of my questions: got to use madvise/libhugetlbfs.]
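
For what it's worth, here is a rough sketch of what opting in under 'madvise' mode can look like from user space: read the active mode from the sysfs file mentioned above, then hint an anonymous mapping with MADV_HUGEPAGE. It assumes Linux and Python 3.8+ (for `mmap.madvise`); the helper names `parse_thp_mode`, `current_thp_mode` and `alloc_thp_hinted` are made up for illustration, and the hint is only a request that the kernel is free to ignore.

```python
import mmap
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"
HUGE_PAGE = 2 * 1024 * 1024  # 2 MiB, the usual THP size on x86-64


def parse_thp_mode(text):
    """Return the bracketed (active) option from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path=THP_ENABLED_PATH):
    """Read the system-wide THP mode, or None if the file is unavailable."""
    try:
        with open(path) as f:
            return parse_thp_mode(f.read())
    except OSError:
        return None


def alloc_thp_hinted(size):
    """Map anonymous private memory rounded up to 2 MiB and ask for huge pages.

    Returns (buffer, hinted). The mapping's start address is not guaranteed to
    be 2 MiB aligned, so only aligned 2 MiB ranges inside it are candidates.
    """
    length = -(-size // HUGE_PAGE) * HUGE_PAGE  # round up to a 2 MiB multiple
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    hinted = False
    if hasattr(mmap, "MADV_HUGEPAGE"):
        try:
            buf.madvise(mmap.MADV_HUGEPAGE)
            hinted = True
        except OSError:
            pass  # kernel built without THP, or hint rejected
    return buf, hinted


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
    buf, hinted = alloc_thp_hinted(64 * 1024 * 1024)
    print("mapped", len(buf), "bytes; MADV_HUGEPAGE applied:", hinted)
```

Whether huge pages were actually used can be checked afterwards via the AnonHugePages field in /proc/self/smaps.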
I read you also need to ensure ELF segments are properly aligned to get transparent huge pages for code/data.
Another question. From your link [2]:
> An application may mmap a large region but only touch 1 byte of it, in that case a 2M page might be allocated instead of a 4k page for no good.
Do the heuristics used by Linux THP (khugepaged) really allow completely ignoring whether pages have actually been page-faulted in or even initialised? Or is that possibility unlikely to happen in practice?