Hacker News new | ask | show | jobs
by dragontamer 1923 days ago
And ironically, that's when we get into fragmentation issues for arrays!

Growing the array from size 8 -> 16 -> 32 -> 64 ... 1024->2048 makes it harder, and harder to find contiguous, exponentially growing chunks.

If you have a fragmented memory allocator, you may run out of memory due to fragmentation. In contrast, if each of those elements were a small and constant-sized 8-byte chunk (or 16-byte chunk), then you'd be able to fit that small chunk anywhere.

-----------------

Anyway, I agree with you that vector.push_back() is an outstanding methodology on modern systems. But Linked-Lists aren't as bad as people make them out to be.

3 comments

What environment are you working in where this is a problem in practice?

In tiny embedded devices with very limited memory, you do all your allocation at startup and avoid malloc during runtime, so you’d never run into this.

On server, desktop, or even smartphone applications, I’ve never run into cases where “the allocator was unable to get a chunk of memory to complete a vector resize()” is a significant problem. If my vector is going to be a significant fraction of the memory available on the system, I generally know that up-front, and would just call reserve() with a conservative estimate of the upper-found size. That’s pretty rare though - not many problems call for vectors that are 1+ GB in size. For anything that’s not a significant fraction of the system’s available memory, the allocator can generally find you a chunk.

I've been experimenting with data-structures on GPUs. 8GBs of RAM to share across 4096 SIMD-cores (well... 64 "compute units" with 64-way SIMD... you know...). But Vega64 runs 4-threads per SIMD-core, so you actually need 16384-SIMD threads before you utilize the processor. (And at occupancy 10, you have 10-threads per hardware thread, or 163840 SIMD-threads total)

Anyway, you run out of RAM really, really quickly if you try to give data-structures to each of those SIMD individually. 8GBs RAM / 16384-GPU-threads is 500kB per GPU-thread... 50kB at the theoretical max occupancy 10.

Yeah, you want your data-structures to be read-mostly so that your 16384-threads can all be reading the same stuff. But every now and then, you need a per-GPU-thread data-structure. And... well... there's not a lot of per-GPU-thread data available (because you have so many darn threads...)

--------

You end up using Linked lists, even though GPU latency is wtf terrible. Like really, really, really bad. If you think a CPU's 50-nanosecond DDR4 access time is slow, try 500ns or even 1000ns for a linked-list "node = node->next" operation on GPUs. And GPUs are in-order too, so no out-of-order latency hiding for you...

Not sure what GPU algorithms you’re trying to implement, but linked lists (and generally anything with pointer-chasing) are almost maximally terrible on GPUs - they are really not designed for that.
> (and generally anything with pointer-chasing)

You mean like... going through a BVH tree to find what AABB bounding box collides with a ray? :-) I'm pretty sure its been demonstrated that GPUs are fastest at that.

Yeah, I know that linked lists take a latency hit. But even with that big hit, O(1) operations vs O(n) adds up. Don't avoid linked-lists, trees, or graphs just because you're trapped thinking about cache-locality or whatever.

A win in asymptotic complexity (especially O(1) vs O(n)) is utterly huge. On the one hand, its common for beginners to overestimate how much this matters. But on the other hand... its an asymptotic win. You gotta give it a shot.

Arrays win in many cases (and more cases in GPUs, because GPUs are worse at pointer chasing than arrays). Still, there are plenty of situations where the linked-list / tree / graph is simply unavoidable. Be it an oct-tree, linked list, or... BVH-tree traversals in Raytracing.

Naive tree traversal on a GPU actually has pretty bad performance, due to execution divergence. It takes a lot of application-specific reframing of the problem to making working with BVH trees efficient: https://developer.nvidia.com/blog/thinking-parallel-part-ii-...
Its not as complicated as it sounds. Stream compaction solves execution divergence. The end. Instead of recursively searching the tree, you select the members of the tree with a child.

No, you can't do naïve recursion for this. GPUs just don't do that very well. But break it up with stream compaction, and everything is cake.

http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

----------

Its not the memory-link latency that gets you here. Its branch divergence. Solve branch divergence, and then you're far faster than a CPU at traversing that BVH tree. Even without Raytracing Hardware. Even with lol 1000ns latency per node = node->next (GPUs turn out to be decent at latency hiding if you up that occupancy a bit... and just double-check on the compiler / assembly language stuff to ensure that the access was rearranged to a sane location).

MSVC STL uses the golden ratio instead of doubling for std::vector allocation which means that you can reuse contiguous runs of previously deallocated chunks to fulfill future allocations, while still meeting the standard asymptotic O(1) bound on push_back.

I believe that GCC's libstdc++ tested the strategy on a set of real programs and didn't measure any actual difference so they still use doubling.

This is extremely poor advice in practice.

Memory mapping, pages and organization of the heap means that allocating as much memory in as few chunks as possible and reusing it if possible stands out when you profile the two different approaches.

Even if tiny chunks are allocated in their own arenas, the overhead of the allocation and dealing with the pointer it returns is still unnecessary compared to just dealing with the numbers on a loop through an array and moving on.