|
|
|
|
|
by dragontamer
1922 days ago
|
|
I've been experimenting with data-structures on GPUs. 8GBs of RAM to share across 4096 SIMD-cores (well... 64 "compute units" with 64-way SIMD... you know...). But Vega64 runs 4-threads per SIMD-core, so you actually need 16384-SIMD threads before you utilize the processor. (And at occupancy 10, you have 10-threads per hardware thread, or 163840 SIMD-threads total) Anyway, you run out of RAM really, really quickly if you try to give data-structures to each of those SIMD individually. 8GBs RAM / 16384-GPU-threads is 500kB per GPU-thread... 50kB at the theoretical max occupancy 10. Yeah, you want your data-structures to be read-mostly so that your 16384-threads can all be reading the same stuff. But every now and then, you need a per-GPU-thread data-structure. And... well... there's not a lot of per-GPU-thread data available (because you have so many darn threads...) -------- You end up using Linked lists, even though GPU latency is wtf terrible. Like really, really, really bad. If you think a CPU's 50-nanosecond DDR4 access time is slow, try 500ns or even 1000ns for a linked-list "node = node->next" operation on GPUs. And GPUs are in-order too, so no out-of-order latency hiding for you... |
|