|
|
|
|
|
by Karhan
4999 days ago
|
|
I remember reading a blog post of the peculiarities of GPU programming and the post noting that for most modern graphics cards (at the time) if you can keep your computable data in chunks no bigger than 64kb a piece you can expect to see enormous performance gains even on top of what you'll see by using openCL/CUDA because of a physical memory limit on the actual GPU itself. I also remember thinking that a 64kb row size for DynamoDB was very odd. I wonder if these things are at all related. |
|
Early generation of NVIDIA gpus did not an automatic Caching mechanism or could not for CUDA, I forget) that could help solve this issue. But they did have memory available locally on each compute unit where you could manually read / write data into. This helped reduce the overall read/write overhead.
Even when the newer generations have the caches, it is beneficial to use this shared / local memory. Even when the shared / local memory limits are hit, there are alternatives like Textures in CUDA, Images in OpenCL that are slightly slower, but significantly better than reading from DRAM.