by andy99
945 days ago
Just from the abstract, I don't see why not: it's just replacing the feed-forward network that's part of all of these models with a very sparse one. The bigger problem is that you seemingly have to retrain the model, so you couldn't just drop in the Llama 2 weights from Meta and have it work, which makes it much more limiting. Something that used existing weights (like quantization, for example) would be a lot more practical.
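To make the "very sparse feed-forward" idea concrete, here's a rough sketch. It is not the paper's actual method (I'm only going from the abstract); the tree routing, the function names, and the sizes are all assumptions made up for illustration. The dense block evaluates every hidden unit, while the hypothetical sparse one walks a binary tree and only evaluates the units on one root-to-leaf path.

```python
import random

def dense_ffn(x, w_in, w_out):
    """Standard feed-forward block: every hidden unit is evaluated."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    d_out = len(w_out[0])
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden)) for k in range(d_out)]

def tree_sparse_ffn(x, w_in, w_out, depth):
    """Hypothetical sparse variant (an assumption, not the paper's design):
    hidden units are nodes of a binary tree stored in heap order, and only
    the depth + 1 units on a single root-to-leaf path are evaluated."""
    d_out = len(w_out[0])
    y = [0.0] * d_out
    visited = []
    node = 0  # start at the root
    for _ in range(depth + 1):
        pre = sum(w * xi for w, xi in zip(w_in[node], x))
        act = max(0.0, pre)  # ReLU
        for k in range(d_out):
            y[k] += act * w_out[node][k]
        visited.append(node)
        # The sign of the pre-activation picks the child: left = 2n+1, right = 2n+2.
        node = 2 * node + (2 if pre > 0 else 1)
    return y, visited

if __name__ == "__main__":
    random.seed(0)
    depth, d_model = 3, 4
    n_hidden = 2 ** (depth + 1) - 1  # number of nodes in the full tree
    w_in = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_hidden)]
    w_out = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_hidden)]
    x = [random.gauss(0, 1) for _ in range(d_model)]
    y, visited = tree_sparse_ffn(x, w_in, w_out, depth)
    print(f"evaluated {len(visited)} of {n_hidden} hidden units: {visited}")
```

The routing decision is part of what gets learned, which is roughly why weights trained for a dense block wouldn't be expected to work if you just dropped them into something like this.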
For BERT, I can see this being useful if you had to make a really fast embedding model. There was a discussion about a fast embedding use case not long ago: https://news.ycombinator.com/item?id=37898001