| HN Mirror

> I'd hope that those conflicts only occur on writes and not reads?

They happen on reads and writes.

The best way to describe it is... GPUs internally not only have tons of cores and threads (albeit "SIMD" threads, not true threads...)... they also have multiple memory banks.

I know AMD's architecture better. So let me talk about GCN. The smallest "cohesive" structure of an AMD GCN GPU is the execution unit. An execution unit has its own L1 data-cache, a 64kB "shared-memory" region, and finally the 256 vALUs (vector ALUs). There are 4-instruction pointers (64-simd threads per instruction pointer).

I'll focus on the high performance "64kB Shared Memory" region.

The 64kB shared memory is organized into 32-channels. In effect, channel 0 handles all addresses ending in XXXX00. Channel 1 handles all addresses ending in XXXX04. Channel 2 handles all addresses ending in XXXX08. Etc. etc. Channel 31 handles all addresses ending in XXXX7C.

Each channel can serve ONE thread per clock cycle. So if all 256-threads of the execution unit try to grab address 400000 (ending in 00, so channel 0), they all hammer channel 0. Channels 1 through 31 don't do anything, because no one asked for data from them. In this case, ONLY one channel is working, while 31-channels are sitting around doing nothing.

In this case, the last thread will be waiting for ~256 clock cycles, because its got 255 threads in front of it trying to grab data.

If you instead wrote your program so that all the threads asked for data "equally" between all 32-channels, then your code would be 32x faster (because all 32-channels would be working). Each channel has 8 requests (32x8 == 256 reads), and the 256 threads all get their data in just 8 clock cycles.

--------

This isn't a problem in CPU world, because your L1 cache on Intel / AMD systems only has one channel per core. Actually, AMD offers TWO L1 channels per core (!!), and Intel offers 3x channels per core (2x reads + 1x write). So your one thread can do 2-reads + 1x write per clock tick.

If you have a massive 32-core CPU, you still have 2x reads + 1x write per core. So total of 64-reads + 32 writes across the whole system. This makes CPUs simple.

GPUs however, have a compromise. The 256-simd threads (or really, up to 2560-threads on an AMD system per EU) share the same 32 read/write channels.

--------

Technically speaking, there are "channels" and "banks", two different memory organization schemes in a GPU. In practice, you can just pretend that only channels exist, because a bank conflict more or less acts the same as a channel conflict.