by harles 497 days ago
That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.

I haven’t read the paper, but my guess is that it’s for the same reason as with sparse attention networks (where they zero out many weights): the sparse tensors are simply larger.
In this paper, we don't zero out the weights. We remove them.
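For readers following the thread, here is a minimal sketch of the general difference between zeroing weights and removing them. It is a toy illustration only, not the paper's method: the helper names (`zero_out_columns`, `remove_columns`, `param_count`) and the example matrix are hypothetical.

```python
# Toy illustration (hypothetical names and values, not the paper's method).
# A weight matrix is represented as a list of rows.

def zero_out_columns(weights, cols):
    """Masking: set the given columns to 0.0 but keep the matrix shape."""
    drop = set(cols)
    return [[0.0 if j in drop else w for j, w in enumerate(row)]
            for row in weights]

def remove_columns(weights, cols):
    """Removal: delete the given columns, so the matrix actually shrinks."""
    drop = set(cols)
    return [[w for j, w in enumerate(row) if j not in drop]
            for row in weights]

def param_count(weights):
    """Number of stored entries, zeros included."""
    return sum(len(row) for row in weights)

weights = [[0.5, -1.0, 2.0, 0.1],
           [1.5, 0.3, -0.7, 0.9]]

masked = zero_out_columns(weights, [1, 3])
pruned = remove_columns(weights, [1, 3])

print(param_count(masked))  # 8: zeros are still stored as parameters
print(param_count(pruned))  # 4: the parameters are gone
```

In the masked case the tensor keeps its full size unless a dedicated sparse format is used, whereas in the removed case the dense tensor itself is smaller.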
Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?