by yarg 2598 days ago
Thanks for the reply. It makes me wonder how much of a slowdown GPU-accelerated neural nets will see from so many threads reading the same shared input values.

Broadcasts can be done efficiently on AMD systems. I dunno about NVidia, but I would assume NVidia PTX has some kind of low-level broadcast mechanism too.

A lot of optimization is just knowing all of the special ways you can move memory around. Broadcast is common enough that AMD gave its GPUs a special instruction just for it.

So in the case of neural networks where every thread reads the same input, you'd want to do it through the broadcast instructions instead of through shared memory. Going through shared memory risks bank conflicts, depending on the access pattern and the hardware.
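For the NVIDIA side, the closest mechanism I'm aware of is the warp shuffle intrinsic `__shfl_sync` in CUDA, which maps to PTX `shfl.sync`. Here's a rough sketch of the idea, not a tuned implementation: one lane per warp reads the shared input, then hands it to the other 31 lanes register-to-register, with no shared memory involved. The kernel name `scale_by_broadcast` and the host helper `run_broadcast_demo` are made up for illustration, and the sketch assumes a warp size of 32 and a block size that is a multiple of 32.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread multiplies its own weight by one shared input.
// Assumes blockDim.x is a multiple of 32 so every warp is full.
__global__ void scale_by_broadcast(const float* input, const float* weights,
                                   float* out, int n_weights)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;

    float x = 0.0f;
    if (lane == 0) {
        x = input[0];                      // only lane 0 of each warp touches memory
    }
    // Every lane in the warp receives lane 0's register value.
    // The shuffle runs before the bounds check so all 32 lanes take part.
    x = __shfl_sync(0xffffffffu, x, 0);

    if (tid < n_weights) {
        out[tid] = weights[tid] * x;
    }
}

// Hypothetical host helper: copies data over, launches the kernel, returns the result.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_broadcast_demo(float x, const std::vector<float>& w)
{
    int n = static_cast<int>(w.size());
    if (n == 0) return {};

    float *d_in = nullptr, *d_w = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(float));
    cudaMalloc(&d_w, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, &x, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, w.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 64;                      // multiple of the warp size
    int blocks  = (n + threads - 1) / threads;
    scale_by_broadcast<<<blocks, threads>>>(d_in, d_w, d_out, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_w);
    cudaFree(d_out);
    return out;
}
```

Calling `run_broadcast_demo(2.0f, {1.0f, 2.0f, 3.0f})` should hand back each weight scaled by the broadcast input. Whether this actually beats a plain same-address load depends on the hardware and caches, so treat it as something to benchmark rather than a guaranteed win.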