| HN Mirror



	by lopuhin 945 days ago
	There are two issues here. First, in big transformers more of the compute is in the attention layers, while this work improves only the feed-forward layers, which matter more for smaller models and shorter sequence lengths. Second, in many typical scenarios LLM inference is memory-bandwidth bound, and I'm not sure whether their approach can be used to reduce the required memory bandwidth.
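To make the "memory-bandwidth bound" point concrete, here is a rough back-of-the-envelope sketch. It assumes batch size 1, counts only weight traffic (activations are ignored), and the device numbers in the usage note are made up for illustration, not taken from any real hardware or from the paper.

```python
def matvec_arithmetic_intensity(rows, cols, bytes_per_weight):
    """FLOPs per byte of weight traffic for one matrix-vector product.

    Assumption: batch size 1, so every weight is read once and used for
    exactly one multiply and one add. Activation traffic is ignored.
    """
    flops = 2 * rows * cols                      # multiply + add per weight
    weight_bytes = rows * cols * bytes_per_weight
    return flops / weight_bytes


def is_bandwidth_bound(intensity, peak_flops_per_s, peak_bytes_per_s):
    """True if the kernel needs more bytes per FLOP than the device can feed."""
    device_balance = peak_flops_per_s / peak_bytes_per_s  # FLOPs per byte
    return intensity < device_balance
```

For 16-bit weights the intensity comes out to 1 FLOP per byte regardless of layer size, so on a hypothetical device with 100e12 FLOP/s and 1e12 bytes/s (a balance of 100 FLOPs per byte), `is_bandwidth_bound(1.0, 100e12, 1e12)` is true: the time is spent reading weights, not multiplying them.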

1 comments

joelthelion 945 days ago

Doesn't reducing the number of neurons drastically reduce memory requirements?


lopuhin 945 days ago

Yes, it might. The "reduction of number of neurons" is not static here: unlike traditional pruning approaches, they still keep all the weights, but the network dynamically selects which sub-portion of them to use. There is a related discussion in section 3.2 (page 4), but I don't think they mention the actual memory bandwidth requirements or wins of their implementation, and there are probably different tradeoffs for different devices.
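A toy sketch of the distinction, in plain Python. This is not the paper's selection mechanism: the block partition, the linear `router`, and all names here are hypothetical, chosen only to show "all weights stored, only a sub-portion read per input".

```python
import random


def _matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def _matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def _count(matrix):
    return sum(len(row) for row in matrix)


def make_layer(d_model, n_blocks, block_size, seed=0):
    """Feed-forward layer whose hidden neurons are split into blocks.

    Every block's weights stay in memory (nothing is pruned).
    """
    rng = random.Random(seed)
    return {
        "router": _matrix(rng, n_blocks, d_model),  # hypothetical selector
        "w_in": [_matrix(rng, block_size, d_model) for _ in range(n_blocks)],
        "w_out": [_matrix(rng, d_model, block_size) for _ in range(n_blocks)],
    }


def stored_weights(layer):
    """Total number of weights kept in memory."""
    total = _count(layer["router"])
    for w_in, w_out in zip(layer["w_in"], layer["w_out"]):
        total += _count(w_in) + _count(w_out)
    return total


def forward(layer, x):
    """Pick one block per input and touch only that block's weights."""
    scores = _matvec(layer["router"], x)
    chosen = max(range(len(scores)), key=scores.__getitem__)
    hidden = [max(0.0, h) for h in _matvec(layer["w_in"][chosen], x)]  # ReLU
    y = _matvec(layer["w_out"][chosen], hidden)
    weights_read = (
        _count(layer["router"])
        + _count(layer["w_in"][chosen])
        + _count(layer["w_out"][chosen])
    )
    return y, chosen, weights_read
```

With `make_layer(d_model=8, n_blocks=4, block_size=16)`, the layer stores 1056 weights but a single `forward` call reads 288 of them. Whether that turns into a real bandwidth win depends on the device and on batching: different inputs in a batch can pick different blocks, so the set of weights read for the batch can approach the full layer.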
