Hacker News new | ask | show | jobs
by notepad0x90 358 days ago
allocators can be simple but how about memory management kernel subsystems?

One question I have for those who know the topic at a greater depth: Are there simpler memory managers that don't use a fixed page size? Or does that make it a lot more complex? imagine requesting allocation of 1 byte and the kernel allocating a page of 1 byte and immediately you request 20GB and you get a 20GB page. Other than memory-management metadata taking up more memory on its own, what are the downsides of dynamic page sizes?

2 comments

Fragmentation is one major reason to use fixed size classes. After just a few allocate/free-a-bit cycles, you can very easily end up with a large amount of free space in unusably tiny chunks. (this is one reason why languages with a Garbage Generator tend to prefer being allowed to move and compact their objects - their whole thing is keeping the allocator as simple as possible to beat the benchmarks, so they can't use the usual solutions)

There's also the matter of space overhead.

A bitmap allocator is simple and has an overhead of 1 bit per chunk (allocated or not). Normally the chunk is at least 16 bytes, but if you know you're doing smaller allocations there's nothing stopping you from doing one at byte granularity.

If there exists an invalid value (for example, UTF-8 strings cannot contain the byte 0xff), you can implement allocators with zero overhead, though efficiency requires changing the API (`malloc` requires writing to all the memory to erase the tombstones, so it's more efficient for the allocator to support `strdup` and `realloc_strcat` directly).

By bucketing you can vary the overhead based on allocation size. But also remember, a lot of the trickery you can do here comes at the cost of becoming more vulnerable to use-after-free bugs.

Besides all the missing APIs, one major annoyance related to allocators is the fact that userland lacks support for CPU-local variables that the kernel can use for more efficient operations. `rseq` helps a little but still falls short.

That's essentially what an allocator is, it's an abstraction between a requested size and alignment from the user space program and the memory management API that interfaces with the MMU hardware (which maps virtual addresses into physical addresses where the virtual address is a "page" number + offset). The MMU is what's dealing in pages and offsets more than the kernel.
What I mean is, there is always a minimum page size you're given by the kernel. Typically that is 4kb. if you malloc(sizeof(int)); you're not getting sizeof(int), you're getting a 4kb page allocated. You can change that value to something other than 4kb but then the entire system will use that value. I'm asking why the page size allocated can't be depending on the user space allocator's syscall parameter (dynamic page size). Or is it that the MMU at a hardware/microcode level can only handle static page sizes?
MMU hardware has a hardware-defined minimum mapping granularity (i.e. page size). On x86-64 this is 4KB. On ARMv8/v9 the specification allows runtime configuration for any of 4KB, 16KB, or 64KB as the minimum mapping granularity on a per-process basis, though the actual chip implementation is free to only support a subset of such configurations and still be conformant.

Implementations may also define certain sizes that are more efficient. On modern x86-64, large pages may allow 2MB, 1GB, 512 GB, etc. (multiples of 512) to be more efficient. On modern ARMv8/v9 there is similar large page support (the exact details depend on the minimum mapping granularity) and they may also support aligned, contiguous mappings which making other sizes efficient (the exact details are even more complicated than simple large pages and highly optional).

As such, if you want to get just "1 byte", then you create a userspace allocator that asks for memory that conforms to the limits of the MMU and then it is the job of the userspace allocator to abstract that away for you by chopping it up under its own auspices.

I suppose having some sort of minimum granularity for the MMU makes a lot of sense, but above that point, why limit the "steps" or the maximum? why not 4KB+2 for example (4098)? For the sake or discussion, let's say I have 128GB ram and 50TB page file, can that work? Why is a fixed-size page more efficient, is it because of offset calculations being simpler arithmetic operations? I figured the TLB is (or can be) a simple (large) table implemented in hardware, the simplicity translating into O(N) complexity. I am suspecting it might have to do with hardware cost?
Note that the page file is essentially a datastructure which maps addresses to other addresses, but it also resides in the same physical memory. If you were to try to map every byte of a system to different physical address, it would need significantly more memory than the system had! Further, the datastructure needs to allow efficient lookup of data because the pagefile mapping occurs on every memory access (results are cached in the Translation Lookaside Buffer (TLB) and the speed and accuracy of this cache is often a limiting factor in the speed of the CPU). This all biases towards the mapping wanting to be quite coarse, in fact 4kB is probably already not optimal on today's systems with gigabytes of memory.
But why does it work that way, why does all memory need to be mapped before allocation? why can't it work like file systems where you have a table/datastructure mapping inodes to files? The smaller your allocations are, the less total ram you have, that would make sense. Some apps need huge chunks of memory, some don't, why the one-size-fits-all approach? TLB lookup can also be O(n), it will be slower when you have lots of small allocations because that means lots of lookups but that lets applications optimize performance that way instead of having it be a system-wide setting.