Hacker News
by
harles
497 days ago
That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.
vlovich123
496 days ago
I haven't read the paper, but my guess is that it's the same reason as with sparse attention networks (where they zero out many weights): the sparse tensors can just be made larger.
mayukhdeb
495 days ago
In this paper, we don't zero out the weights. We remove them.
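To make the distinction concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy layer stored as a list of rows, and the helper names (`count_stored`, `zero_out`, `remove`) and the `keep` set are hypothetical. Zeroing a weight leaves it in storage, while removing it means it is never stored at all:

```python
# Illustrative only: a toy dense layer stored as a list of rows.
# All names here are hypothetical, not the paper's implementation.

def count_stored(weights):
    """Number of weight values that must be stored for a dense layer."""
    return sum(len(row) for row in weights)

def zero_out(weights, keep):
    """Masking: dropped weights become 0.0 but still occupy storage."""
    return [[w if (i, j) in keep else 0.0 for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

def remove(weights, keep):
    """Removal: only the kept weights are stored, keyed by position."""
    return {(i, j): w
            for i, row in enumerate(weights)
            for j, w in enumerate(row)
            if (i, j) in keep}

dense = [[0.5, -1.2, 0.3],
         [0.8, 0.1, -0.7]]
keep = {(0, 0), (1, 2)}  # assumed choice of connections to keep

masked = zero_out(dense, keep)
pruned = remove(dense, keep)

print(count_stored(masked))  # 6: zeroed weights are still parameters
print(len(pruned))           # 2: removed weights are gone
```

The masked layer still has six parameters, four of which happen to be zero. The pruned one has two.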
vlovich123
494 days ago
Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?