
Broadcasts can be done efficiently on AMD systems. I dunno about NVIDIA, but I would assume NVIDIA PTX has some kind of low-level broadcast mechanism too.

A lot of optimization is just knowing all of the special ways you can move memory around. Broadcast was common enough that they've given AMD GPUs a special instruction just for it.

So in the case of neural networks all reading from the same input, you'd want to do it through the broadcast instructions, instead of through shared memory. Going through shared memory costs an extra write and read, and depending on the access pattern it can create bank conflicts.
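For the NVIDIA side, here is a rough sketch of the idea using CUDA's `__shfl_sync` warp shuffle intrinsic: lane 0 of each warp loads the shared input once, then hands it to the other 31 lanes register-to-register, with no shared memory involved. The kernel name, the `weights` array, the sizes, and the `count_broadcast_mismatches` helper are all made up for illustration, and it assumes a warp size of 32 and a block size that is a multiple of 32. It is not tuned code and says nothing about how the AMD instruction is exposed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every lane in a warp needs the same input value.
// Lane 0 loads it, then a warp shuffle broadcasts it to the other lanes.
__global__ void broadcast_input(const float* input, const float* weights,
                                float* out, int n_warps) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;            // assumes warp size 32
    int lane = threadIdx.x % 32;
    if (warp >= n_warps) return;    // whole warps exit together (blockDim % 32 == 0)

    float x = 0.0f;
    if (lane == 0) x = input[warp];        // one load per warp
    x = __shfl_sync(0xffffffffu, x, 0);    // all lanes read lane 0's register

    out[tid] = x * weights[lane];          // each lane applies its own weight
}

// Hypothetical host helper: runs the kernel on a tiny input and returns the
// number of wrong outputs (0 means the broadcast worked, -1 means a CUDA error).
int count_broadcast_mismatches() {
    const int n_warps = 4, n_threads = n_warps * 32;
    std::vector<float> h_in(n_warps), h_w(32), h_out(n_threads, 0.0f);
    for (int i = 0; i < n_warps; ++i) h_in[i] = float(i + 1);
    for (int i = 0; i < 32; ++i)      h_w[i]  = float(i + 1);

    float *d_in = nullptr, *d_w = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in,  n_warps   * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_w,   32        * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n_threads * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(d_in, h_in.data(), n_warps * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_w,  h_w.data(),  32 * sizeof(float),      cudaMemcpyHostToDevice);

    broadcast_input<<<1, n_threads>>>(d_in, d_w, d_out, n_warps);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h_out.data(), d_out, n_threads * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_w); cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int t = 0; t < n_threads; ++t) {
        float expected = h_in[t / 32] * h_w[t % 32];
        if (h_out[t] != expected) ++mismatches;
    }
    return mismatches;
}
```

Calling `count_broadcast_mismatches()` from a `main` and printing the result is enough to try it on a machine with a CUDA GPU. The point of the design is that the shared value never leaves registers after the first load, which is the property the comment above is after.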