Hacker News new | ask | show | jobs
by notepad0x90 356 days ago
What I mean is, there is always a minimum page size you're given by the kernel. Typically that is 4kb. if you malloc(sizeof(int)); you're not getting sizeof(int), you're getting a 4kb page allocated. You can change that value to something other than 4kb but then the entire system will use that value. I'm asking why the page size allocated can't be depending on the user space allocator's syscall parameter (dynamic page size). Or is it that the MMU at a hardware/microcode level can only handle static page sizes?
2 comments

MMU hardware has a hardware-defined minimum mapping granularity (i.e. page size). On x86-64 this is 4KB. On ARMv8/v9 the specification allows runtime configuration for any of 4KB, 16KB, or 64KB as the minimum mapping granularity on a per-process basis, though the actual chip implementation is free to only support a subset of such configurations and still be conformant.

Implementations may also define certain sizes that are more efficient. On modern x86-64, large pages may allow 2MB, 1GB, 512 GB, etc. (multiples of 512) to be more efficient. On modern ARMv8/v9 there is similar large page support (the exact details depend on the minimum mapping granularity) and they may also support aligned, contiguous mappings which making other sizes efficient (the exact details are even more complicated than simple large pages and highly optional).

As such, if you want to get just "1 byte", then you create a userspace allocator that asks for memory that conforms to the limits of the MMU and then it is the job of the userspace allocator to abstract that away for you by chopping it up under its own auspices.

I suppose having some sort of minimum granularity for the MMU makes a lot of sense, but above that point, why limit the "steps" or the maximum? why not 4KB+2 for example (4098)? For the sake or discussion, let's say I have 128GB ram and 50TB page file, can that work? Why is a fixed-size page more efficient, is it because of offset calculations being simpler arithmetic operations? I figured the TLB is (or can be) a simple (large) table implemented in hardware, the simplicity translating into O(N) complexity. I am suspecting it might have to do with hardware cost?
Note that the page file is essentially a datastructure which maps addresses to other addresses, but it also resides in the same physical memory. If you were to try to map every byte of a system to different physical address, it would need significantly more memory than the system had! Further, the datastructure needs to allow efficient lookup of data because the pagefile mapping occurs on every memory access (results are cached in the Translation Lookaside Buffer (TLB) and the speed and accuracy of this cache is often a limiting factor in the speed of the CPU). This all biases towards the mapping wanting to be quite coarse, in fact 4kB is probably already not optimal on today's systems with gigabytes of memory.
But why does it work that way, why does all memory need to be mapped before allocation? why can't it work like file systems where you have a table/datastructure mapping inodes to files? The smaller your allocations are, the less total ram you have, that would make sense. Some apps need huge chunks of memory, some don't, why the one-size-fits-all approach? TLB lookup can also be O(n), it will be slower when you have lots of small allocations because that means lots of lookups but that lets applications optimize performance that way instead of having it be a system-wide setting.