Just from the abstract, I don't see why not: it just replaces the feed-forward network that's part of all of these models with a very sparse one. The bigger problem is that you seemingly have to retrain the model, so you couldn't just drop in the Llama 2 weights from Meta and have it work, which makes it much more limiting. Something that used existing weights would be a lot more practical (like quantization, for example).
For BERT, I can see this being useful if you had to make a really fast embedding model. There was a discussion about a fast embedding use case not long ago https://news.ycombinator.com/item?id=37898001
It certainly could, and I wouldn't be surprised if the authors want to try it out on those. One issue is that past improvements often don't enhance more powerful models nearly as much. I'd expect this might not work as well there: the bigger models probably end up with more polysemantic neurons because they're given more "incentive" (training time, neuron count, and dataset size, which they're encouraged to be able to reconstruct) to extract as much as possible. This might make it so the method performs worse due to this intermingling.
(See the transformer circuits website for that)
(Though I expect there are ways to recover a good chunk of the lost throughput/accuracy, maybe by adding steps that directly steer the training towards breaking apart polysemantic neurons)
There are two issues here. First, in big transformers, more of the compute is in the attention layers, while this work only improves the feed-forward layers, which matter more for smaller models and shorter sequence lengths. Second, in many typical scenarios LLM inference is memory-bandwidth bound, and I'm not sure whether their approach can be used to reduce the required memory bandwidth.
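For a rough feel of the first point, here is a back-of-envelope FLOP count per token for one standard transformer layer. This is only a sketch under my own assumptions, not anything from the paper: a multiply-add counts as 2 FLOPs, the feed-forward hidden size is 4x the model width, and softmax, layer norm, and biases are ignored. The function name `per_token_flops` and its parameters are made up for illustration.

```python
def per_token_flops(d_model, seq_len, ffn_mult=4):
    """Approximate FLOPs spent on one token in one transformer layer.

    Assumptions (mine, not the paper's): multiply-add = 2 FLOPs,
    standard multi-head attention, FFN hidden size = ffn_mult * d_model,
    softmax / layer norm / biases ignored.
    """
    # Q, K, V and output projections: four d_model x d_model matmuls.
    attn_proj = 2 * 4 * d_model * d_model
    # Scores against seq_len keys, plus the weighted sum over seq_len values.
    attn_mix = 2 * 2 * seq_len * d_model
    # Up-projection and down-projection of the feed-forward block.
    ffn = 2 * 2 * ffn_mult * d_model * d_model
    return {"attention": attn_proj + attn_mix, "ffn": ffn}


if __name__ == "__main__":
    # BERT-base-like width at a few context lengths.
    for seq_len in (128, 512, 4096):
        print(seq_len, per_token_flops(768, seq_len))
```

Under these assumptions the two costs are equal when the sequence length is twice the model width, so which side dominates depends on how long the context is relative to the width. It says nothing about memory bandwidth, which is the second issue.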
Yes, it might. The "reduction in the number of neurons" is not static here: unlike traditional pruning approaches, they still keep all the weights, but the network dynamically selects which subset of them to use. There is a related discussion in section 3.2 (page 4), but I don't think they mention the actual memory bandwidth requirements/wins of their implementation, and there are probably different tradeoffs for different devices.
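To make "keeps all weights but dynamically selects a subset" concrete, here is a toy sketch of input-dependent routing through a binary tree of neurons. It is my own simplification and not the authors' implementation: the ReLU activation, the sign-based branching rule, the heap-style layout, and the name `tree_routed_ffn` are all assumptions for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tree_routed_ffn(x, w_in, w_out, depth):
    """Evaluate only the neurons on one root-to-leaf path.

    Neurons are stored in heap order: neuron i has children 2i+1 and 2i+2.
    w_in[i] holds the input weights of neuron i, w_out[i] its output weights.
    All 2**(depth+1) - 1 neurons stay in memory, but only depth + 1 are touched.
    """
    node = 0
    y = [0.0] * len(w_out[0])
    used = []
    for _ in range(depth + 1):
        a = dot(x, w_in[node])          # pre-activation of the current neuron
        act = max(a, 0.0)               # ReLU (an assumption of this sketch)
        y = [yi + act * wi for yi, wi in zip(y, w_out[node])]
        used.append(node)
        # The sign of the pre-activation picks the child: left if a <= 0, else right.
        node = 2 * node + (1 if a <= 0 else 2)
    return y, used


if __name__ == "__main__":
    random.seed(0)
    d, depth = 4, 3
    n = 2 ** (depth + 1) - 1            # 15 neurons in total
    w_in = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    w_out = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    x = [random.gauss(0, 1) for _ in range(d)]
    y, used = tree_routed_ffn(x, w_in, w_out, depth)
    print(f"touched {len(used)} of {n} neurons: {used}")
```

In this toy version each input reads only the weight rows on its own path, which is why it could in principle help with bandwidth. But which rows get read changes per input, so the real win would depend on the device and memory layout.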