harles 544 days ago
	That could explain compute efficiency, but has nothing to do with the parameter efficiency pointed at in the paper.

vlovich123 543 days ago

Haven’t read the paper, but my guess is that it’s the same story as with sparse attention networks (where they zero out many weights): the sparse tensors just end up being larger.

mayukhdeb 542 days ago

In this paper, we don't zero out the weights. We remove them.
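
To make the distinction concrete, here is a minimal sketch (not the paper's code; the tiny matrix, the `keep` set, and the helper names are all hypothetical) of why zeroing out and removing differ for parameter count. A zeroed weight still occupies a slot in the tensor, while a removed weight is not stored at all.

```python
# Illustrative sketch only: names and sizes are hypothetical, not from the paper.

def count_stored(weights):
    """Number of values that must be stored for a dense list-of-rows matrix."""
    return sum(len(row) for row in weights)

def zero_out(weights, keep):
    """Masking: dropped connections become 0.0 but still occupy a slot."""
    return [[w if (i, j) in keep else 0.0 for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

def remove(weights, keep):
    """Removal: only kept connections are stored, keyed by (row, col)."""
    return {(i, j): w
            for i, row in enumerate(weights)
            for j, w in enumerate(row)
            if (i, j) in keep}

dense = [[0.5, -1.2, 0.3],
         [0.8, 0.1, -0.7]]
keep = {(0, 0), (1, 2)}  # connections that survive

masked = zero_out(dense, keep)
pruned = remove(dense, keep)

print(count_stored(masked))  # 6: zeros still count as stored parameters
print(len(pruned))           # 2: removed weights are simply gone
```

In the masked version the parameter count is unchanged; only in the removed version does it actually shrink.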

vlovich123 541 days ago

Thanks for the correction! Can it be retrofitted into existing models through distillation, or do you have to train the model from scratch?
