by alexhutcheson
1922 days ago
What environment are you working in where this is a problem in practice? In tiny embedded devices with very limited memory, you do all your allocation at startup and avoid malloc during runtime, so you’d never run into this. On server, desktop, or even smartphone applications, I’ve never run into cases where “the allocator was unable to get a chunk of memory to complete a vector resize()” is a significant problem. If my vector is going to be a significant fraction of the memory available on the system, I generally know that up-front, and would just call reserve() with a conservative estimate of the upper-bound size. That’s pretty rare though - not many problems call for vectors that are 1+ GB in size. For anything that’s not a significant fraction of the system’s available memory, the allocator can generally find you a chunk.
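To make the reserve() point concrete, here is a minimal sketch. The helper name `fill_with_reserve` and its parameters are made up for illustration; the only real API used is `std::vector::reserve` and `push_back`. It assumes you can name an upper bound before filling the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reserve a conservative upper bound once, then fill.
// While size() stays within the reserved capacity, push_back does not
// reallocate, so the only allocation request happens up-front.
std::vector<int> fill_with_reserve(std::size_t n, std::size_t upper_bound) {
    assert(n <= upper_bound);  // the estimate must really be an upper bound
    std::vector<int> v;
    v.reserve(upper_bound);    // single allocation; throws std::bad_alloc on failure
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
    }
    return v;
}
```

If the allocation is going to fail, it fails at the reserve() call, before any work has been done, which is usually the easiest place to handle it.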
Anyway, you run out of RAM really, really quickly if you try to give data structures to each of those SIMD lanes individually. 8GB of RAM / 16384 GPU threads is ~500kB per GPU thread... ~50kB at the theoretical max occupancy of 10.
Yeah, you want your data structures to be read-mostly so that your 16384 threads can all be reading the same stuff. But every now and then, you need a per-GPU-thread data structure. And... well... there's not a lot of per-GPU-thread memory available (because you have so many darn threads...)
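The per-thread budget above is just a division, sketched here in plain C++. The function name is hypothetical, and the sketch assumes "8GB" means 8 GiB and that occupancy 10 means ten resident threads per lane, so 163840 threads in total.

```cpp
#include <cstdint>

// Bytes available per GPU thread if device memory is split evenly.
// Integer division, so the result is rounded down.
constexpr std::uint64_t bytes_per_thread(std::uint64_t total_bytes,
                                         std::uint64_t threads) {
    return total_bytes / threads;
}

constexpr std::uint64_t kEightGiB = 8ull << 30;  // assumed size of device RAM

// 16384 threads: 524288 bytes each (512 KiB, the "500kB" figure).
static_assert(bytes_per_thread(kEightGiB, 16384) == 524288, "512 KiB per thread");

// 16384 * 10 threads at max occupancy: 52428 bytes each (the "50kB" figure).
static_assert(bytes_per_thread(kEightGiB, 16384 * 10) == 52428, "about 51 KiB per thread");
```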
--------
You end up using linked lists, even though GPU memory latency is wtf terrible. Like really, really, really bad. If you think a CPU's 50-nanosecond DDR4 access time is slow, try 500ns or even 1000ns for a linked-list "node = node->next" operation on GPUs. And GPUs are in-order too, so no out-of-order latency hiding for you...
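For anyone wondering why "node = node->next" hurts so much, here is a host-side C++ sketch of the access pattern. This is not GPU code. The `Node` layout, the index-based pool, and `sum_list` are all assumptions for illustration. The point is that each step needs the result of the previous load before it can issue the next one, so the latencies add up serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node stored in a preallocated pool; next is a pool index.
struct Node {
    int value;
    std::int32_t next;  // -1 terminates the list
};

// Walks the list from head and sums the values. The address of the next
// node is only known after the current node has been loaded, so a walk of
// N nodes costs roughly N full memory latencies.
long sum_list(const std::vector<Node>& pool, std::int32_t head) {
    long total = 0;
    for (std::int32_t i = head; i != -1;
         i = pool[static_cast<std::size_t>(i)].next) {
        total += pool[static_cast<std::size_t>(i)].value;
    }
    return total;
}
```

Using 32-bit indices into a fixed pool instead of raw pointers keeps the nodes small and avoids any allocation during the walk, which matters when the per-thread budget is only tens of kilobytes.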