by jklontz 2020 days ago
> For example, NN-512 can exceed 48 effective FMADDs per cycle (on the 27 peak FMADD machine) with Winograd-Cook-Toom-Lavin, if the tensor is deep enough (enough channels)

Roughly how many channels do you need for this approach to be worthwhile?

1 comment

Enough channels that the data panel of the input tensor fills the thread's share of the L2 cache, and the output tensor is of similar depth.

So it depends on the cache size, but you can think of it as roughly 512 channels in and 512 channels out.
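
To make the cache dependence concrete, here is a back-of-envelope sketch. It is not NN-512's actual sizing logic: the function name, the 1 MiB L2 size, and the 2 KiB of panel data per channel are all assumptions, picked only so that the single-thread case lands near the 512 figure above.

```python
def channels_to_fill_l2_share(l2_bytes, threads_sharing_l2, panel_bytes_per_channel):
    """Number of channels whose data panel just fills one thread's share of L2.

    All three parameters are hypothetical inputs for illustration; the real
    panel layout depends on the generated code and the tensor shapes.
    """
    share = l2_bytes // threads_sharing_l2       # L2 bytes available to one thread
    return share // panel_bytes_per_channel      # whole channels that fit in that share


# Assumed numbers: 1 MiB of L2 per core, 2 KiB of panel data per channel.
L2_BYTES = 1024 * 1024
PANEL_BYTES_PER_CHANNEL = 2048

for threads in (1, 2):
    n = channels_to_fill_l2_share(L2_BYTES, threads, PANEL_BYTES_PER_CHANNEL)
    print(f"{threads} thread(s) sharing L2: about {n} channels")
```

Under these assumed numbers, halving the thread's share of L2 halves the channel count needed to fill it, which is why the answer is tied to the cache size rather than being a fixed number.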