by lopuhin 945 days ago
There are two issues here. First, in big transformers more of the compute is in the attention layers, while this work improves only the feed-forward layers, which matter more for smaller models and shorter sequence lengths. Second, in many typical scenarios LLM inference is memory-bandwidth bound, and I'm not sure whether their approach can be used to reduce the required memory bandwidth.
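
A back-of-the-envelope sketch of the sequence-length part of the first point. It assumes a standard decoder layer with a 4x feed-forward expansion and counts one multiply-accumulate as 2 FLOPs. The function name and parameters are illustrative and not taken from the paper.

```python
def flops_per_token(d_model: int, seq_len: int, ffn_mult: int = 4) -> dict:
    """Rough forward-pass FLOPs per token for one decoder layer.

    Assumptions (not from the paper): plain multi-head attention,
    a two-matrix feed-forward block, 1 multiply-accumulate = 2 FLOPs.
    """
    # Feed-forward: two matrices of shape d_model x (ffn_mult * d_model).
    ffn = 2 * 2 * d_model * ffn_mult * d_model
    # Attention projections: Q, K, V and output, each d_model x d_model.
    attn_proj = 2 * 4 * d_model * d_model
    # Attention scores plus weighted sum over seq_len positions.
    attn_ctx = 2 * 2 * seq_len * d_model
    attention = attn_proj + attn_ctx
    return {
        "ffn": ffn,
        "attention": attention,
        "ffn_share": ffn / (ffn + attention),
    }


if __name__ == "__main__":
    for n in (512, 2048, 32768):
        share = flops_per_token(d_model=4096, seq_len=n)["ffn_share"]
        print(f"seq_len={n:6d}  feed-forward share of FLOPs = {share:.2f}")
```

Under these assumptions the feed-forward share shrinks as the sequence length grows relative to `d_model`, so speeding up only the feed-forward layers helps less at long contexts.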

Doesn't reducing the number of neurons drastically reduce memory requirements?
Yes, it might. The "reduction of number of neurons" is not static here: unlike traditional pruning approaches, they still keep all the weights, but the network dynamically selects which subset of them to use. There is a related discussion in section 3.2 (page 4), but I don't think they mention the actual memory bandwidth requirements or wins of their implementation, and there are probably different tradeoffs for different devices.
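
A minimal sketch of why the answer is "it might". It separates memory capacity (all weights are still stored) from memory traffic (only the selected neurons need to be read). It assumes each token independently picks `active` neurons uniformly at random, which is a simplification and not the paper's selection mechanism. All names and numbers are hypothetical.

```python
def weight_bytes_read(n_neurons: int, d_model: int, active: int,
                      batch: int = 1, bytes_per_weight: int = 2):
    """Return (dense_bytes, expected_sparse_bytes) read per forward step.

    Assumption: every token in the batch selects `active` neurons
    uniformly and independently; real routing will be more structured.
    """
    # Each neuron owns one input weight vector and one output weight vector.
    per_neuron = 2 * d_model * bytes_per_weight
    # A dense layer reads every neuron; this is also the storage footprint,
    # which dynamic selection does not reduce.
    dense = n_neurons * per_neuron
    # Probability that a given neuron is used by at least one token.
    p_touched = 1.0 - (1.0 - active / n_neurons) ** batch
    sparse = n_neurons * p_touched * per_neuron
    return dense, sparse


if __name__ == "__main__":
    for b in (1, 64, 4096):
        dense, sparse = weight_bytes_read(16384, 4096, active=16, batch=b)
        print(f"batch={b:5d}  fraction of weights read = {sparse / dense:.3f}")
```

With a single token only a small slice of the weights is touched, but under this assumption the saving erodes as the batch grows, because different tokens select different neurons. That is one way the tradeoffs could differ across devices and serving setups.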