Hacker News
by
vlovich123
503 days ago
I haven't read the paper, but my guess is that it's the same reason as with sparse attention networks (where many weights are zeroed out): the sparse tensors just end up being larger.
mayukhdeb
502 days ago
In this paper, we don't zero out the weights. We remove them.
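To make the distinction concrete, here is a toy sketch (not the paper's method; the matrix, `keep_rows`, and the helper names are all hypothetical and chosen only for illustration). Zeroing out keeps the tensor the same size, while removing actually shrinks it:

```python
# Hypothetical 4x3 weight matrix, stored as a list of rows.
weights = [
    [0.5, -1.2, 0.3],
    [0.0, 0.1, -0.2],
    [2.0, 0.7, 1.1],
    [-0.1, 0.0, 0.05],
]
keep_rows = [0, 2]  # assumed pruning decision: keep rows 0 and 2


def zero_out(w, keep):
    """Masking: pruned rows become zeros, so the shape is unchanged."""
    keep_set = set(keep)
    return [row[:] if i in keep_set else [0.0] * len(row)
            for i, row in enumerate(w)]


def remove(w, keep):
    """Removal: pruned rows are dropped, so the matrix gets smaller."""
    return [w[i][:] for i in keep]


def num_params(w):
    """Count the stored entries, zeros included."""
    return sum(len(row) for row in w)


masked = zero_out(weights, keep_rows)
pruned = remove(weights, keep_rows)

print(num_params(masked))  # 12: zeros still occupy storage
print(num_params(pruned))  # 6: storage actually shrinks
```

With masking, a dense matmul still touches every entry unless a sparse kernel is used; with removal, the smaller matrix is cheaper by construction.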
vlovich123
501 days ago
Thanks for the correction! Can it be retrofitted into existing models through distillation or do you have to train the model from scratch?