by 37ef_ced3 2020 days ago
Enough that the data panel of the input tensor fills the thread's share of the L2 cache, and the output tensor is of similar depth.

So it depends on the cache size, but you can think of it as roughly 512 channels in and 512 channels out, something like that.
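
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes a "data panel" is a slab of channels by spatial positions stored as float32. The function name, the 1 MiB per-thread L2 share, and the 512-position panel width are all hypothetical numbers picked for illustration, not figures from any particular CPU or library.

```python
def channels_filling_cache(l2_share_bytes: int,
                           panel_positions: int,
                           bytes_per_value: int = 4) -> int:
    """Estimate how many input channels make one data panel fill the cache share.

    Assumed (hypothetical) layout: panel size in bytes is
    channels * panel_positions * bytes_per_value.
    """
    return l2_share_bytes // (panel_positions * bytes_per_value)


# Hypothetical inputs: 1 MiB of L2 per thread, float32 values,
# 512 spatial positions per panel.
print(channels_filling_cache(1 << 20, 512))  # 512
```

Halve the cache share (or double the panel width) and the estimate halves too, which is why the answer depends on the cache size.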