| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by leogao 250 days ago
	Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.

2 comments

tripplyons 250 days ago

Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying didn't specifically mention which kind of sparsity.

For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.

Another kind of sparsity, while on the topic is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to make more of the models activations set to 0.

link

p1esk 250 days ago

“Useful” does not mean “better”. It just means “we could not do dense”. All modern state of the art models use dense layers (both weight and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.

Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.

link

yorwba 250 days ago

Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.

From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.

link