The page table on x86/amd64 is a trie. OSDEV[1] has a pretty good page on it.
All you need to remember is that this is what is inside the operating system and hardware layers, so you're already paying for it. What I propose is taking advantage of it. If you can arrange things correctly (see my other comment describing this more fully[2]) you might find things that seem expensive (when dealing only with von neumann memory) are suddenly very cheap when you accept you're programming a piece of real hardware.
I'm interested too. I've been thinking about the idea, and it seems like you'd have to be sure to re-use the hell out of your file descriptor on /dev/zero (otherwise, you could easily run out of file descriptors). It also seems like you're trading cache misses for system calls if you have to re-map pages a lot. Maybe it's a clear win, but I'd like to understand it better.
No need for /dev/zero: Linux has memfd[1] and OSX has vm_remap[2]. You only need one file descriptor per heap because Linux lets you poke holes with fallocate[3].
I'll define objects with a header that looks roughly like this:
struct header {
off_t phys;
size_t len;
int localrefs;
short flags;
char mstrategy, type;
};
phys is the offset in the heap file descriptor. len is the number of units and type is indexed into an array of unit sizes.
mstrategy is used to select the bucket size (allocated range is a power of two, so 1<<(mstrategy&31)) and the heap number.
localrefs is an optimisation which I'll get to.
If I want to allocate an array of type t, size n, I can use a BSR[4] to identify the bucket that it needs, and see if there's anything on the free list. If there isn't, I can see if I can split a larger bucket into two parts (this is effectively Knuth's buddy allocator).
I know that (mstrategy&31) < BSR(page size) needs to be moved just like the classical allocator for appending or prepending, but when (mstrategy&31) >= BSR(page size) I can take a virtual address of a bigger region, then either mmap the memfd (using phys) or vm_remap the region into the bigger region. Instead of copying the contents of the pages, the operating system will simply copy the page table (which is 3 levels deep[5], hence the log log log, although using 1G pages means log log with a lower coefficient). This is a tremendous win for problems that need to deal with several large arrays.
Now localrefs gives me a further optimisation: In-process, I can track the reference count of objects, and if my functions always consume their arguments I know inside the grow/append/prepend routine if this is the only holder of the object. If it is, I can potentially reuse this virtual address immediately, saving 3-4 syscalls.
When it's time to deallocate, I can put small objects on my free list, and garbage collect any big objects by calling fallocate() on the physical address to poke a hole (freeing system memory). OSX doesn't need fallocate() because mach has vm_unmap.
> It also seems like you're trading cache misses for system calls if you have to re-map pages a lot
Cache misses aren't the dominant force here.
The real trade off is illustrated with benchmarking: memcpy one page, versus 8 bytes (page table entry). How many pages do you need to copy before it is faster to pay the fixed (<100ns) cost of the system call and just copy the page table entries?
Memory streams at a rate of around 10GB/sec, but the TLB flush is ~100ns and the memory latency is only around 10ns, so it's easy to see how quick the gains add up when you're using 1GB pages.
All you need to remember is that this is what is inside the operating system and hardware layers, so you're already paying for it. What I propose is taking advantage of it. If you can arrange things correctly (see my other comment describing this more fully[2]) you might find things that seem expensive (when dealing only with von neumann memory) are suddenly very cheap when you accept you're programming a piece of real hardware.
[1]: http://wiki.osdev.org/Page_Tables
[2]: https://news.ycombinator.com/item?id=13269288