Hacker News new | ask | show | jobs
by kolbe 760 days ago
Linked list benchmarks are amazing.... if you don't thrash your cache on inserts so all its elements are contiguous. You get all the benefits of a vector and a linked list, without the reality that linked lists mostly don't get populated 100% consecutively, and thus can be anywhere in memory.
4 comments

Most languages with linked lists as an important part (lisps mostly) all have well optimized linked lists that end up with a lot better memory locality.
How does that work? I don't follow how any runtime or compile-time optimization can solve the problem of locality for a linked list. If the data wasn't allocated sequentially, then it's simply not going to be sequential (unless you move it).
Instead of thinking of a linked list structurally, think of it functionally. You can have a token that represents a location in the linked list. From that token you have a Fetch() operation that will fetch what is at that location, a Next() operation that will fetch the next token or some indication that you are done, and depending on your language, some sort of insert or append operation (mutability factors in here).

While there is a natural encoding of this process into a pointer to the target value, and a pointer to the next node, it is not the only encoding possible by any means.

A good exercise if you are having trouble with this is to implement a little linked list that performs those operations on something that simply backs to an array, including the blind O(n) copy when appending another list. It should be quite short. But that is not the only alternate implementation, either, and you can easily build others where the interface is maintained but you pay only log n additional factors maximum for operations, easily recovered from the constant factors which on modern hardware are often staggering.

Once you break the entanglement between memory representation and the API a data structure offers in your mind, many ways become obvious as to how to possibly improve this structure while maintaining the same operations, many of which of course already exist and have names.

Linked lists are, by definition, a value with a pointer to the next. That's what makes it "linked". The API you're describing is the iterator pattern, which indeed can be backed by almost anything (be it a linked list, array, tree, etc.).
Both replies as I write this are ignoring the question I was answering. The question was, how can a compiler implement a linked list as anything other than a linked list? The answer is what I gave. Being a compiler, it may also do things like an analysis of the use cases to prove that there are no relevant changes to performance (or that performance only improves) and fall back to a "true" linked list if it can't prove how it is being used, or, being a compiler, it may just tell you to get stuffed and deal with the performance differences. Depends on the choices the compiler author(s) made.

But just because Lisp has cons and cdr does not mean the compiler is obligated to only and exactly use things that structurally look like linked lists in the actual implementation. You need to break the relationship between memory layout and API to understand what is going on here. You may choose later to put them back together; the naive memory layout is often a good one. (Linked lists nowadays are arguably one of the larger exceptions to that, but there are still circumstances even so where that may not be an issue; see Erlang's memory management for instance and ponder how that impacts linked list locality.) But you can't understand what compilers may be doing if you do not realize that there is a distinction.

> Instead of thinking of a linked list structurally, think of it functionally

If your argument is that a linked-list-like data structure can be implemented using something other than a linked list, then I agree with you. Vector tries (for example) are great for that use case.

But a vector trie isn't a linked list, it's a vector trie. As such, it will be faster for some usage patterns, equal for some, and completely degenerate for some others. Just like any other data structure that isn't a linked list.

It wouldn't really be an "optimization" to implement my linked list code with something other than a linked list - it would be rewriting my code.

Every linked list discussion I've ever gotten into ends with someone finding a way to modify the std::list data structure in such a way that totally breaks the definition of the LL data structure. We may as well call it "the law of linked list arguments"
SBCL tries its hardest to allocate sequentially, then moves lists to be sequential in GC.
The entire theoretical advantage of linked lists over vectors (constant time insertion and deletion) comes from not sequentially allocating them.
I don't think this is what hayley-patton meant (although it's not crystal clear). I think SBCL does memory management with a bump allocator and a moving collector. So if you build a linked list sequentially, the cons cells will be allocated sequentially, and will be sequential in memory. If you build it in random order, they won't be. But when the collector runs, it will move the whole list, and it will move the cells in sequential order, and then it will be.
If you have a compacting GC you are going to move the nodes anyway, so why not reallocate them sequentially?
That implies you took the time to sort them, so your O(1) insertion time just became O(N log N). Sure, that cost is spread over how ever many inserts you did between GCs but it isn't O(1) anymore.
There are other advantages of linked lists over vectors besides those that you've listed.
Well, I once wrote a small scheme that did loop unrolling and allocated lists in as many steps as the unroll went, which was capped at a maximum of 5.

That worked surprisingly well, but there were lots of edge cases since I am a professional bassoonist and not a compiler guy.

I generalized it and made all lists allocated in chunks of 16 and then let the GC truncate it. I spent a stupid amount of work doing the first thing and the second thing was done in an afternoon.

Then there are all kinds of

A sufficiently smart compiler can detect that a new node gets allocated and then appended to a list, and allocate the node near the other nodes of that list.

To have a good chance that there is space near the other nodes, you’d have to reserve memory up front for each list you’d want to do that with, though, and that will kill cache locality before that optimization sets in. Corrections welcome, but I don’t see that (easily) being a net win. But hey, with a sufficiently smart compiler at hand, you can do wonders. (Edit: when looking at pure functions, it many times may not be that hard to discriminate between “this allocation will be returned” and “this allocation is temporary”. For example, if you use map to apply a function to all elements in a list, and that function is pure, detecting that only the return values of that function will be visible after the function returns isn’t that hard)

An easier win is that lisp garbage collectors tend to move memory, and thus can try to move nodes that reference each other close together. You get that for free with a semispace collector, but that has quite a few disadvantages, so you don’t want to use that. A generational collector also will do it a bit, though (say you create a few million nodes of garbage to build a list of a few hundred results. Then, after generational collection, those few hundred cells will be closer together than before)

If needed, linked list can be backed by a vector, that is resized, when needed.
If anyone knows of a detailed writeup of a language implementation like this, please link to it. The thread under this comment is disappointingly full of "a sufficiently smart compiler could ..." type stuff.
Why cant we 'simply' get processor extension to mark data as pointer so that the prefetcher would actually know what to fetch.

From my understanding this is what led to the recent 'unpatchable' exploit in Apple's M1, but rather then trying to guess it by some heuristic, why not just give compilers option to make that optimization?

Apple Silicon does this. If it prefetches something that looks like a pointer, it will also fetch the pointed-to memory. It's a cool feature, and is especially useful for Apple, since their Objective-C collections only store pointers – but it also can re-open the door for certain timing attacks by violating preconditions of constant-time cryptography algorithms.

https://en.wikipedia.org/wiki/GoFetch

I don't think making CPU issue (likely bogus) pre-fetches for every field in the cache line that's marked as a pointer is really that good idea. At best, you save couple of cycles because the fetches are started a one or two instructions earlier before the actual load instruction for "loading the linked pointer" is issued. At worst, you keep thrashing your cache loading data you're not going to read, delaying fetching the data you will read.
For everything? Obviously no.

But for all the crazy optimizations modern compiler do, I don't see how marking pointers for more then couple of them in a raw is that crazy

Because if you're issuing a bogus pre-fetch, you can't cancel it, can you? So that's 90 or something cycles that's the fetch for your actual data is being delayed. Pointer chasing already strains the memory bandwidth, trying to request even more data from memory will only worsen things.

And unrolling loop for traversing linked lists can be done, if you use a sentinel node instead of nullptr to signal then end:

        beqz    a0, .end
    .loop:
        ld      a1, 0(a0)   ; a1 = curr->data
        ld      a0, 8(a0)   ; curr = curr->next
        ; do something with payload in a1 here
        bnez    a1, .loop
    .end:
becomes

        la      s1, sentinel
        beq     a0, s1, .end
        ld      a1, 0(a0)
        ld      a2, 8(a0)
        ld      a3, 0(a2)
        ld      a4, 8(a2)
        ld      a5, 0(a4)
        ld      a6, 8(a4)
        beq     a6, s1, .trail
    .loop:
        ld      t0, 0(a6)
        ld      t1, 8(a6)
        ld      t2, 0(t1)
        ld      t3, 8(t1)
        ld      t4, 0(t3)
        ld      t5, 8(t3)
        ; do something with three payloads in a1, a3, a5 here
        mv      a1, t0
        mv      a2, t1
        mv      a3, t2
        mv      a4, t3
        mv      a5, t4
        mv      a6, t5
        bne     t5, s1, .loop
        mv      a0, a2
        beq     a2, s1, .end
     .trail:
        ld      a1, 0(a0)
        ld      a0, 8(a0)
        ; do something with payload in a1 here
        bne     a0, s1, .trail
     .end:
As you can see, "ld t3, 8(a2)" is almost right after to "ld t1, 8(a6)", with intervening load from 0(a2), so prefetch won't noticeably help here, and if the address that ends up in t3 is not in the cache, then "ld t5, 8(t3)" will stall no matter what. And moving the speculative loads up in the loop body before processing the payloads (using even more registers, as you can see) somewhat hurts the latency of processing the first three payloads.

Oh, and if you want to see something really crazy, look at e.g. splitting the branch instruction into prediction and resolution instructions [0].

[0] https://zilles.cs.illinois.edu/papers/branch_vanguard_isca_2...

This was the idea of Itanium. It failed mostly because of economics.

It turns out programmers, or rather their employers, don't really care about using hardware efficiency. They care about shipping things yesterday, because that's how business deals get closed, and making software efficient is secondary to money changing hands. Performance never really matters to business people after the checks are cashed.

Multicore computers have been ubiquitous for more than a decade, yet the overwhelming majority of software built today is single-threaded microservices, where in they spend most of their time serializing and deserializing message payloads.

This is all really to say that most performance is already being left on the table for the majority of what computers are used for today.

I do want to say that I think the Itanic would have fared way, way better in a post-LLVM world where the importance of smart, optimizing compilers is much more valued and understood and language designers actively work hand-in-hand with compiler devs far more often (with much more significant back-and-forth from hardware manufacturers).
I don't think LLVM is particularly good at optimizing VLIW code.

Very good optimizing compilers existed before LLVM. Intel had one specifically for Itanium. It wasn't enough.

Why would llvm be particularly good at optimizing vliw code when there’s no demand for it to be? You can’t believe everything else would remain the same in the hypothetical I posed.
A) optimizing for VLIW is hard. B) the null hypothesis would be no change.
Given how much of today's computer needs are dependent on a database query, this is no surprise. Who cares about the micros you gain with added efficiency while there's a 100ms db query return in the path?
Apparently Apple and Intel do, since they introduced those changes into their silicon
Not every pipeline involves a db query.
Where do you think DBs run?
I mean sure, I don't doubt 99% of end-user programmers wouldn't look twice at something like this, but compilers designers probably would care.

And it's not like the companies arent trying this idea (again, M1 exploit). But for whatever reason they want to keep cpus as black box, perfect abstract machines, even though we know they aren't

An exception to this is precisely parsing and validation in which linked lists can and ideally should be populated consecutively (using an arena allocator). I agree that general purpose linked lists are rarely optimal but they are hard to beat for parsers where they hit a pareto sweet spot of implementation simplicity and good performance.
This is such an important point.

Benchmarking is hard, and very often is done in ideal conditions, which never materialize in real use cases.