by lopuhin
945 days ago
There are two issues here. First, in big transformers more of the compute is in the attention layers, while this work improves only the feed-forward layers, which matter more for smaller models and shorter sequence lengths. Second, in many typical scenarios LLM inference is memory-bandwidth bound, and I'm not sure their approach can be used to reduce the required memory bandwidth.
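To make both points concrete, here is a rough back-of-the-envelope sketch. It assumes a standard transformer layer with a 4x feed-forward expansion, counts 2 FLOPs per multiply-add, and ignores the KV cache, softmax, and normalization. The function names and the hardware numbers in the usage note are hypothetical, chosen only for illustration.

```python
def ffn_flop_share(d_model: int, seq_len: int, ffn_mult: int = 4) -> float:
    """Approximate fraction of per-token, per-layer forward FLOPs
    spent in the feed-forward block (assumed layer shape, see lead-in)."""
    # Q, K, V and output projections: four d x d matmuls.
    attn_proj = 4 * 2 * d_model * d_model
    # Attention scores (QK^T) plus the weighted sum over values,
    # for one token attending to seq_len positions.
    attn_mix = 2 * 2 * seq_len * d_model
    # Two matmuls: d -> ffn_mult * d -> d.
    ffn = 2 * 2 * d_model * ffn_mult * d_model
    return ffn / (attn_proj + attn_mix + ffn)


def decode_time_bounds(n_params: float, bytes_per_param: float,
                       peak_flops: float, mem_bandwidth: float):
    """Lower bounds, in seconds, on one decoding step at batch size 1.

    Returns (compute_bound, memory_bound). Every weight is read once per
    generated token and used for roughly one multiply-add.
    """
    compute_s = 2 * n_params / peak_flops
    memory_s = n_params * bytes_per_param / mem_bandwidth
    return compute_s, memory_s


if __name__ == "__main__":
    # The feed-forward share shrinks as the sequence gets longer.
    for n in (512, 4096, 32768):
        print(n, round(ffn_flop_share(4096, n), 3))

    # Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth,
    # running a 7B-parameter model stored in 16-bit weights.
    print(decode_time_bounds(7e9, 2, 100e12, 1e12))
```

Under these assumed hardware numbers the memory bound comes out about 100 times larger than the compute bound, so cutting feed-forward FLOPs would only help single-stream decoding if it also cut the number of weights that have to be read per token.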