You are talking about shared memory in CUDA (called local memory in OpenCL). When you are reading from the same location over and over again (the most notable cases are linear algebra functions and filtering in signal processing), reading from DRAM every time is going to be costly. On CPUs this is solved by having multiple layers of caches.

Early generations of NVIDIA GPUs did not have an automatic caching mechanism (or could not use one for CUDA, I forget) that could help solve this issue. But they did have memory available locally on each compute unit that you could manually read data from and write data into. This helped reduce the overall read/write overhead.
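To make the manual staging concrete, here is a rough sketch of a 1D box filter in CUDA where each block first copies its inputs into shared memory and then reads neighbours from there instead of from DRAM. The names (box_filter_shared, run_box_filter, RADIUS, BLOCK) are made up for illustration, the kernel assumes it is launched with exactly BLOCK threads per block, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int RADIUS = 3;    // hypothetical filter half-width
constexpr int BLOCK  = 256;  // threads per block (kernel assumes this launch size)

// Each output is the sum of 2*RADIUS+1 neighbouring inputs, zero padded at the edges.
__global__ void box_filter_shared(const float* in, float* out, int n)
{
    // Per-block scratch space: the block's own elements plus a halo on each side.
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    // Stage the inputs: each element is read from DRAM once per block.
    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[l + BLOCK]  = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // all loads must finish before any thread reads the tile

    if (g < n) {
        float sum = 0.0f;
        // The repeated reads now hit shared memory, not DRAM.
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];
        out[g] = sum;
    }
}

// Host-side helper: copies data over, runs the kernel, copies the result back.
std::vector<float> run_box_filter(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    box_filter_shared<<<blocks, BLOCK>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Without the tile, every thread would issue 2*RADIUS+1 global loads; with it, each input is loaded from DRAM roughly once per block and the rest of the reads stay on chip. The OpenCL version would look much the same, with a __local buffer and barrier(CLK_LOCAL_MEM_FENCE) in place of __shared__ and __syncthreads().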

Even though newer generations do have caches, it is still beneficial to use this shared / local memory. And when the shared / local memory limits are hit, there are alternatives such as textures in CUDA or images in OpenCL, which are slightly slower than shared memory but significantly better than reading directly from DRAM.