Hacker News new | ask | show | jobs
by eniotna 1922 days ago
In most real world problems, you don't really care about keeping a data structure sorted at all time. All you want, is to have it sorted when you're about to do X. So it's generally faster to just have a contiguous chunk, insert at the end, then sort when you need.
2 comments

Sorted linked lists are also especially-not-useful, because you can’t binary search them, which is often the main benefit to having something sorted.

The only thing a sorted linked list is really good for is being able to cheaply peek/pop the lowest- or highest-valued item, and a heap is almost always better for that use-case in practice.

> The only thing a sorted linked list is really good for is being able to cheaply peek/pop the lowest- or highest-valued item, and a heap is almost always better for that use-case in practice.

Or cheaply peek/pop the middle items, and reinsert a middle item. As is the most common operation in Knuth's solution to the exact cover problem (aka: Dancing Links / Algorithm X)

There's also benefits to middle-operations, such as text editing (although text files are so small that inefficient operations aren't a big deal anymore).

----------

Linked Lists also can be "merged" together, in a sort of "inverse tree" sort of way.

Consider the following data: "ABCDEFG", "123ABCDEFG", and "111ABCDEFG".

The two array representations are obvious. But Linked-Lists can optimize that into:

* A -> B -> C -> D -> E -> F -> G

* 1 -> 1 -> 1 -> A ...

* 1 -> 2 -> 3 -> A ...

This has come up a lot for me in a recent toy problem I've been working on. A lot of sub-lists happen to be have the same "ending" as other lists, so I'm merging the linked lists and saving precious RAM (I'm building up GBs of data: so saving redundant chunks like this really wins)

> There's also benefits to middle-operations, such as text editing (although text files are so small that inefficient operations aren't a big deal anymore).

Applications operating on text would normally use a rope (aka cord) or gap buffer. A linked list would have horrible performance because random access within the text is important for most applications.

> (I'm building up GBs of data: so saving redundant chunks like this really wins)

You’re likely spending a huge amount of memory to store pointers between elements within the unshared chunks (8 or 16 extra bytes per element adds up in a hurry) so it may be worth investigating some “chunking” so that you only have to use pointers to link between chunks, which can be stored continuously.

Alternatively, consider whether the index representation used by suffix arrays will work for your use-case: https://en.wikipedia.org/wiki/Suffix_array

Gap buffer is very good.

> You’re likely spending a huge amount of memory to store pointers between elements within the unshared chunks (8 or 16 extra bytes per element adds up in a hurry) so it may be worth investigating some “chunking” so that you only have to use pointers to link between chunks, which can be stored continuously.

Yeah, they're chunked or unrolled in practice. (Unrolled Linked List was one term I've seen. Chunking is also another term).

> Alternatively, consider whether the index representation used by suffix arrays will work for your use-case: https://en.wikipedia.org/wiki/Suffix_array

Haven't heard of these before. But I'll look into it.

EDIT: Its... kind of a complicated situation I'm in. I'm basically searching a game tree, and building a "linked list" for how the children relate to their parent. I need "breadcrumbs" to relate any position back to the root.

Not necessarily because I need the root, but because the root contains information that's relevant to all of its children. So a linked list of (Grandchild -> child -> root) is very natural for this application. I originally had an array and just copied the data to an array (for "more locality") to all the children. But this uses way more space.

And the depth of this tree is ~12+, exponentially increasing width as usual. You really don't want to copy the root's data to all of the children unnecessarily: even on a GPU, its far better to just pass a pointer and connect the children-to-parent relationships, and traverse the linked list.

Hooking "child.parent = parent" is very easy. And traversing the while(node != root) node = node.parent; loop is also easy as cake. There's only ~12 of these linked-list operations in practice (because the depth of the search tree only goes to ~12 deep or so), across literally billions of possibilities. The billions of possibilities lead to the ability to process the tree in parallel (billions of possibilities to search with "only" 16384 hardware threads suddenly makes the GPU seem small!)

Despite the linked-list in the algorithm, its clear to me that the GPU is the ideal architecture for processing all of these nodes in parallel.

And ironically, that's when we get into fragmentation issues for arrays!

Growing the array from size 8 -> 16 -> 32 -> 64 ... 1024->2048 makes it harder, and harder to find contiguous, exponentially growing chunks.

If you have a fragmented memory allocator, you may run out of memory due to fragmentation. In contrast, if each of those elements were a small and constant-sized 8-byte chunk (or 16-byte chunk), then you'd be able to fit that small chunk anywhere.

-----------------

Anyway, I agree with you that vector.push_back() is an outstanding methodology on modern systems. But Linked-Lists aren't as bad as people make them out to be.

What environment are you working in where this is a problem in practice?

In tiny embedded devices with very limited memory, you do all your allocation at startup and avoid malloc during runtime, so you’d never run into this.

On server, desktop, or even smartphone applications, I’ve never run into cases where “the allocator was unable to get a chunk of memory to complete a vector resize()” is a significant problem. If my vector is going to be a significant fraction of the memory available on the system, I generally know that up-front, and would just call reserve() with a conservative estimate of the upper-found size. That’s pretty rare though - not many problems call for vectors that are 1+ GB in size. For anything that’s not a significant fraction of the system’s available memory, the allocator can generally find you a chunk.

I've been experimenting with data-structures on GPUs. 8GBs of RAM to share across 4096 SIMD-cores (well... 64 "compute units" with 64-way SIMD... you know...). But Vega64 runs 4-threads per SIMD-core, so you actually need 16384-SIMD threads before you utilize the processor. (And at occupancy 10, you have 10-threads per hardware thread, or 163840 SIMD-threads total)

Anyway, you run out of RAM really, really quickly if you try to give data-structures to each of those SIMD individually. 8GBs RAM / 16384-GPU-threads is 500kB per GPU-thread... 50kB at the theoretical max occupancy 10.

Yeah, you want your data-structures to be read-mostly so that your 16384-threads can all be reading the same stuff. But every now and then, you need a per-GPU-thread data-structure. And... well... there's not a lot of per-GPU-thread data available (because you have so many darn threads...)

--------

You end up using Linked lists, even though GPU latency is wtf terrible. Like really, really, really bad. If you think a CPU's 50-nanosecond DDR4 access time is slow, try 500ns or even 1000ns for a linked-list "node = node->next" operation on GPUs. And GPUs are in-order too, so no out-of-order latency hiding for you...

Not sure what GPU algorithms you’re trying to implement, but linked lists (and generally anything with pointer-chasing) are almost maximally terrible on GPUs - they are really not designed for that.
> (and generally anything with pointer-chasing)

You mean like... going through a BVH tree to find what AABB bounding box collides with a ray? :-) I'm pretty sure its been demonstrated that GPUs are fastest at that.

Yeah, I know that linked lists take a latency hit. But even with that big hit, O(1) operations vs O(n) adds up. Don't avoid linked-lists, trees, or graphs just because you're trapped thinking about cache-locality or whatever.

A win in asymptotic complexity (especially O(1) vs O(n)) is utterly huge. On the one hand, its common for beginners to overestimate how much this matters. But on the other hand... its an asymptotic win. You gotta give it a shot.

Arrays win in many cases (and more cases in GPUs, because GPUs are worse at pointer chasing than arrays). Still, there are plenty of situations where the linked-list / tree / graph is simply unavoidable. Be it an oct-tree, linked list, or... BVH-tree traversals in Raytracing.

Naive tree traversal on a GPU actually has pretty bad performance, due to execution divergence. It takes a lot of application-specific reframing of the problem to making working with BVH trees efficient: https://developer.nvidia.com/blog/thinking-parallel-part-ii-...
MSVC STL uses the golden ratio instead of doubling for std::vector allocation which means that you can reuse contiguous runs of previously deallocated chunks to fulfill future allocations, while still meeting the standard asymptotic O(1) bound on push_back.

I believe that GCC's libstdc++ tested the strategy on a set of real programs and didn't measure any actual difference so they still use doubling.

This is extremely poor advice in practice.

Memory mapping, pages and organization of the heap means that allocating as much memory in as few chunks as possible and reusing it if possible stands out when you profile the two different approaches.

Even if tiny chunks are allocated in their own arenas, the overhead of the allocation and dealing with the pointer it returns is still unnecessary compared to just dealing with the numbers on a loop through an array and moving on.