| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by alexhutcheson 1968 days ago

What environment are you working in where this is a problem in practice?

In tiny embedded devices with very limited memory, you do all your allocation at startup and avoid malloc during runtime, so you’d never run into this.

On server, desktop, or even smartphone applications, I’ve never run into cases where “the allocator was unable to get a chunk of memory to complete a vector resize()” is a significant problem. If my vector is going to be a significant fraction of the memory available on the system, I generally know that up-front, and would just call reserve() with a conservative estimate of the upper-found size. That’s pretty rare though - not many problems call for vectors that are 1+ GB in size. For anything that’s not a significant fraction of the system’s available memory, the allocator can generally find you a chunk.

1 comments

dragontamer 1968 days ago

I've been experimenting with data-structures on GPUs. 8GBs of RAM to share across 4096 SIMD-cores (well... 64 "compute units" with 64-way SIMD... you know...). But Vega64 runs 4-threads per SIMD-core, so you actually need 16384-SIMD threads before you utilize the processor. (And at occupancy 10, you have 10-threads per hardware thread, or 163840 SIMD-threads total)

Anyway, you run out of RAM really, really quickly if you try to give data-structures to each of those SIMD individually. 8GBs RAM / 16384-GPU-threads is 500kB per GPU-thread... 50kB at the theoretical max occupancy 10.

Yeah, you want your data-structures to be read-mostly so that your 16384-threads can all be reading the same stuff. But every now and then, you need a per-GPU-thread data-structure. And... well... there's not a lot of per-GPU-thread data available (because you have so many darn threads...)

--------

You end up using Linked lists, even though GPU latency is wtf terrible. Like really, really, really bad. If you think a CPU's 50-nanosecond DDR4 access time is slow, try 500ns or even 1000ns for a linked-list "node = node->next" operation on GPUs. And GPUs are in-order too, so no out-of-order latency hiding for you...

link

alexhutcheson 1968 days ago

Not sure what GPU algorithms you’re trying to implement, but linked lists (and generally anything with pointer-chasing) are almost maximally terrible on GPUs - they are really not designed for that.

link

dragontamer 1968 days ago

> (and generally anything with pointer-chasing)

You mean like... going through a BVH tree to find what AABB bounding box collides with a ray? :-) I'm pretty sure its been demonstrated that GPUs are fastest at that.

Yeah, I know that linked lists take a latency hit. But even with that big hit, O(1) operations vs O(n) adds up. Don't avoid linked-lists, trees, or graphs just because you're trapped thinking about cache-locality or whatever.

A win in asymptotic complexity (especially O(1) vs O(n)) is utterly huge. On the one hand, its common for beginners to overestimate how much this matters. But on the other hand... its an asymptotic win. You gotta give it a shot.

Arrays win in many cases (and more cases in GPUs, because GPUs are worse at pointer chasing than arrays). Still, there are plenty of situations where the linked-list / tree / graph is simply unavoidable. Be it an oct-tree, linked list, or... BVH-tree traversals in Raytracing.

link

alexhutcheson 1968 days ago

Naive tree traversal on a GPU actually has pretty bad performance, due to execution divergence. It takes a lot of application-specific reframing of the problem to making working with BVH trees efficient: https://developer.nvidia.com/blog/thinking-parallel-part-ii-...

link

dragontamer 1968 days ago

Its not as complicated as it sounds. Stream compaction solves execution divergence. The end. Instead of recursively searching the tree, you select the members of the tree with a child.

No, you can't do naïve recursion for this. GPUs just don't do that very well. But break it up with stream compaction, and everything is cake.

http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

----------

Its not the memory-link latency that gets you here. Its branch divergence. Solve branch divergence, and then you're far faster than a CPU at traversing that BVH tree. Even without Raytracing Hardware. Even with lol 1000ns latency per node = node->next (GPUs turn out to be decent at latency hiding if you up that occupancy a bit... and just double-check on the compiler / assembly language stuff to ensure that the access was rearranged to a sane location).

link