by vlovich123 503 days ago
I haven't read the paper, but my guess is that it's the same reason as with sparse attention networks (where they zero out many weights): the sparse tensors just end up being larger.

In this paper, we don't zero out the weights. We remove them.
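To make the distinction concrete, here is a toy sketch in plain Python. It is not taken from the paper: the helper names (`zero_out_rows`, `remove_rows`), the 4x2 matrix, and the choice of which rows to prune are all assumptions made up for illustration. It only shows that zeroing keeps the tensor the same size, while removal actually shrinks it.

```python
# Toy illustration only; not the paper's method. All names are hypothetical.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def zero_out_rows(W, rows):
    """Masking: overwrite the selected rows with zeros; the shape is unchanged."""
    return [[0.0] * len(row) if i in rows else list(row)
            for i, row in enumerate(W)]

def remove_rows(W, rows):
    """Structural removal: drop the selected rows; the matrix gets smaller."""
    return [list(row) for i, row in enumerate(W) if i not in rows]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [1.0, 1.0]
pruned = {1, 3}  # arbitrary rows chosen for the example

masked = zero_out_rows(W, pruned)   # still 4 rows, two of them all zeros
removed = remove_rows(W, pruned)    # only 2 rows remain

print(len(masked), matvec(masked, x))
print(len(removed), matvec(removed, x))
```

The masked version still stores and multiplies the zeroed rows, so a dense implementation pays the same memory and compute cost; the removed version has fewer parameters and produces a shorter output vector.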
Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?