Hacker News new | ask | show | jobs
by geocar 3461 days ago
So do real arrays on real hardware (pagetable trie).

Small arrays are contiguous so memcpy is fast.

For large arrays, if you create a dummy file (or uses a memfd) you can mmap a region, and then copy it by mmaping the same region with the appropriate flags. You can resize (grow) it almost as easily.

They're so much faster than linked lists, that there's no (performance) reason to use linked-lists on real hardware.

4 comments

The Linux kernel uses intrusive linked lists extensively. Container based linked lists contain a copy of the data item. Intrusive linked lists force the data structures in the list to contain list pointers inside of the actual data structure, and the list operations manipulate those list-specific pointers in the data structure.

I am not sure if anyone has evaluated using alternatives, but my understanding of why has generally been memory efficiency. If you allocate a Foo, you also allocate everything you need for Foo to be on all of the lists it may appear on. I couldn't find any confirmation (with actual numbers) for this intuition, but this is what I could find:

A 2005 explanation of how intrusive lists work in the Linux kernel: "Linux Kernel Linked List Explained", https://isis.poly.edu/kulesh/stuff/src/klist/

HN submission on intrusive lists in Doom 3: https://news.ycombinator.com/item?id=8795745

And that points to technical note on the optimizations to Doom 3's BFG Edition to get it to perform well on PS3, XBox 360 and PC: http://fabiensanglard.net/doom3_documentation/DOOM-3-BFG-Tec... One of the optimizations was moving from intrusive lists.

I'm quite aware of how inappropriate linked lists are for most applications because of their poor cache locality. I thought the Linux kernel may be a case where linked lists are actually better, but I can't find any numbers or even arguments why they would be.

Linux uses linked lists because it is simple to code. That linked lists are slow doesn't matter very much because there are lots of slow parts in Linux that are a better use of attention.

Doom moved to vectors (arrays) because linked lists are slow, and because there wasn't enough other slow parts that needed attention.

Linked-Lists often get used in low-level code and embedded systems in situations where you might need multiple statically allocated entries to be turned into a list of unknown length. There certainly are times where this comes in useful, but it is also a memory constrained space with an emphasis on determinism, where malloc and new can be bad ideas.

There is a time and a place for them, but if you need speed I agree... there are better solutions

An OS kernel has no business maintaining, and even less business accessing in bulk, large data structures that are performance-critical enough to worry about coherent memory access. It is also an infrastructure-deprived environment in which swapping around pointers to implement a linked list with correct locking is relatively easy to get right but allocating vectors is out of the question.
> An OS kernel has no business maintaining, and even less business accessing in bulk, large data structures that are performance-critical enough to worry about coherent memory access

File systems are usually part of the kernel.

File systems tend not to have large data structures. They are relatively modest data structures which manage bulk access to large blocks of data. A subtle distinction, but important in this context.
Is this your understanding from intuition, or are you aware of kernel developers who have made this same argument?
Not sure exactly what you're asking.

We can evaluate the Linux kernel developers' priorities by benchmarking[1] and assuming they aren't stupid, because if they are stupid, then their opinion doesn't matter, and if they're not and they're not making things faster, then it's because they have other priorities.

That being said, there are a few[2] notable[3] moves away from linked lists that were ostensibly for performance reasons.

[1]: Even crap benchmarks: http://www.phoronix.com/scan.php?page=article&item=linux-44-...

[2]: https://lkml.org/lkml/2016/8/1/164

[3]: https://lkml.org/lkml/2008/4/1/458

You provided a reason for why the kernel does a certain thing (easier to code; performance in those places doesn't matter). I was asking if this was your understanding based on inference (your understanding of the performance trade offs in general combined with the fact things are done a certain way) or from fact (claims made directly by kernel developers, or experiments).
You can't necessarily replace intrusive linked lists with arrays. They have more operations. You can traverse so far in one list then switch to traversing in another.
No, but you also don't have to do that.

One list is traversed when committing blocks, so a linked list was never necessary - just a commit-list (vector of pointers).

Another list is traversed when dequeueing the next lock, so again: a linked list isn't necessary, just a dequeue (which might not have to be serialised).

Another list is traversed when finding the next blocked reader, but again serialisation wasn't required here.

And so on.

I don't know how the kernel is using intrusive lists, but that kind of traversing and switching is occasionally useful.
Yes. It is occasionally useful.

However when you have the right data structure, you'll find you won't need it.

Benchmarking is important because searching for the right data structure is time consuming (expensive for the programmer) and it's usually not necessary.

> memory efficiency. If you allocate a Foo, you also allocate everything you need for Foo to be on all of the lists it may appear on.

I can't speak for the kernel developers, but for the same line of reasoning it may be more significant that there is the lack of error paths.

When you allocate an 'object', further changes of state (eg. added or removal to other lists) in its lifetime can be done without the possibility of running out of memory or address space. This can be a huge benefit to overall program structure in certain types of application.

In contrast, addition or removal to a vector or flat list can ultimately require various actions to happen.

My example would probably be the work of Phil Bagwell. IMHO the biggest issue with Linked-lists is they do not support parallelism, they are inherently sequential.
Give up sequential! :)

If you want queues, and you can give up sequential, then you can get huge gains[1] in performance. Specific use cases will have better specific solutions.

[1]: e.g. https://github.com/cameron314/concurrentqueue

You can reduce some linked list operations (including multi-element ones) to a single CAS, which is often good enough.
Where I can find info about pagetable trie?
The page table on x86/amd64 is a trie. OSDEV[1] has a pretty good page on it.

All you need to remember is that this is what is inside the operating system and hardware layers, so you're already paying for it. What I propose is taking advantage of it. If you can arrange things correctly (see my other comment describing this more fully[2]) you might find things that seem expensive (when dealing only with von neumann memory) are suddenly very cheap when you accept you're programming a piece of real hardware.

[1]: http://wiki.osdev.org/Page_Tables

[2]: https://news.ycombinator.com/item?id=13269288

I'm interested too. I've been thinking about the idea, and it seems like you'd have to be sure to re-use the hell out of your file descriptor on /dev/zero (otherwise, you could easily run out of file descriptors). It also seems like you're trading cache misses for system calls if you have to re-map pages a lot. Maybe it's a clear win, but I'd like to understand it better.
No need for /dev/zero: Linux has memfd[1] and OSX has vm_remap[2]. You only need one file descriptor per heap because Linux lets you poke holes with fallocate[3].

I'll define objects with a header that looks roughly like this:

    struct header {
      off_t phys;
      size_t len;
      int localrefs;
      short flags;
      char mstrategy, type;
    };
phys is the offset in the heap file descriptor. len is the number of units and type is indexed into an array of unit sizes.

mstrategy is used to select the bucket size (allocated range is a power of two, so 1<<(mstrategy&31)) and the heap number.

localrefs is an optimisation which I'll get to.

If I want to allocate an array of type t, size n, I can use a BSR[4] to identify the bucket that it needs, and see if there's anything on the free list. If there isn't, I can see if I can split a larger bucket into two parts (this is effectively Knuth's buddy allocator).

I know that (mstrategy&31) < BSR(page size) needs to be moved just like the classical allocator for appending or prepending, but when (mstrategy&31) >= BSR(page size) I can take a virtual address of a bigger region, then either mmap the memfd (using phys) or vm_remap the region into the bigger region. Instead of copying the contents of the pages, the operating system will simply copy the page table (which is 3 levels deep[5], hence the log log log, although using 1G pages means log log with a lower coefficient). This is a tremendous win for problems that need to deal with several large arrays.

Now localrefs gives me a further optimisation: In-process, I can track the reference count of objects, and if my functions always consume their arguments I know inside the grow/append/prepend routine if this is the only holder of the object. If it is, I can potentially reuse this virtual address immediately, saving 3-4 syscalls.

When it's time to deallocate, I can put small objects on my free list, and garbage collect any big objects by calling fallocate() on the physical address to poke a hole (freeing system memory). OSX doesn't need fallocate() because mach has vm_unmap.

[1]: https://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

[2]: http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/vm_remap... because the osx manual page is pants

[3]: http://man7.org/linux/man-pages/man2/fallocate.2.html

[4]: http://x86.renejeschke.de/html/file_module_x86_id_20.html

[5]: http://wiki.osdev.org/Page_Tables#Long_mode_.2864-bit.29_pag...

> It also seems like you're trading cache misses for system calls if you have to re-map pages a lot

Cache misses aren't the dominant force here.

The real trade off is illustrated with benchmarking: memcpy one page, versus 8 bytes (page table entry). How many pages do you need to copy before it is faster to pay the fixed (<100ns) cost of the system call and just copy the page table entries?

Memory streams at a rate of around 10GB/sec, but the TLB flush is ~100ns and the memory latency is only around 10ns, so it's easy to see how quick the gains add up when you're using 1GB pages.

except when you want fast prepending.
There are many strategies to deal with that efficiently. Eg. negative indices (having memory before index 0), append-and-reverse.

You can also map memory before an existing mapping, but it's a bit system-and-platform dependent. (But an entirely feasible thing to do).

Naively you can also grow-and-shift or copy ... the first one is actually incredibly fast due to perfect locality, and also very simple to implement. It's still, technically, O(n^2) when building a list, so it's usually a better idea to go for append-and-reverse (which works outside the actual list implementation). Normally that's not a problem at all, and often no reversal has to be materialized.

If you know the format you want, and read infrequently, just implement it as an append and read it backwards.

If you have frequent reads, you can also consider writing your data to a file (or a memfd), and then using mmap to construct the memory layout you want. You will then see that prepend is worse-case log^3 (depth of the pagetable, assuming you have to keep moving the whole thing) time, which is much better than finger trees, and if you prepend often you can make it constant time (by choosing a high start virtual address: remember, there's 64 bits of address but only 48 bits of memory, so you can have 64k big objects without much work) which makes it faster than linked lists.

Another idea: use VMX extensions and collaborate between each layer to trade a lower constant mean and coefficient to parameterise max to log^k. Gains here might require very big data sets though.

deques support fast prepending. You can implement them on top of shared memory, although as the canonical implementation is block based and uses indirection for access, the dynamic resizing capability is less useful.