Hacker News
by ma2rten 2184 days ago
Fair enough, sparse usually means weights are sparse and not activations.

Obviously you can compare parameter count if you really want to, but from a technical point of view training a densely activated model is a much bigger feat. Also, I have personally spoken to one of the authors of this paper and they said sparsely activated models tend to do better on tasks that require knowledge but not on tasks that require intelligence (e.g. GLUE).

1 comment

I agree, training a dense model with the same number of parameters would be a much bigger feat.

Otherwise, as I mentioned elsewhere on this page, we routinely describe the size of the human brain in terms of numbers of synapses (connections), even though they are sparsely activated. Only a small subset of your brain 'lights up' for a given input. Number of parameters (connections) is a perfectly sensible way to measure model size.

Anyway, I expect we will see both much larger sparsely and densely activated models going forward. We live in interesting times :-)