
It depends on your goal. If you want to measure the number of "artificial synapses" or connections, total parameters is the right figure to use, because each weight is one such connection. If you want to measure the computational cost of training or inference, then wall-clock hours and FLOPs would be better figures.

The hundreds of trillions of connections (synapses) in the human brain are sparsely used; that is, your entire brain doesn't light up in response to every single stimulus. But we still talk about hundreds of trillions of synapses when we refer to the size of the human brain's connectome. Counting every connection, used or not, is a perfectly valid way of measuring model size.

More to your point, the authors report the computational cost of training in Table 3 of the paper, in TPU-core-years, for the various mixture-of-experts models, and compare them with a dense variant that always uses all of its parameters.
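
To make the two measures concrete, here is a rough Python sketch that counts total parameters versus the parameters actually touched per token in a single mixture-of-experts feed-forward layer. The function name `moe_ffn_counts`, the layer shape (two weight matrices per expert plus a linear router, biases ignored), and the dimensions are my own illustrative assumptions, not figures or architecture details from the paper.

```python
def moe_ffn_counts(d_model, d_ff, num_experts, experts_per_token=1):
    """Return (total_params, active_params_per_token) for one MoE feed-forward layer.

    Hypothetical layer shape, assumed for illustration only:
    each expert is two weight matrices, and the router is one linear map.
    """
    # One expert: d_model x d_ff followed by d_ff x d_model (biases ignored).
    params_per_expert = 2 * d_model * d_ff
    # Router: maps a d_model vector to one logit per expert.
    router_params = d_model * num_experts
    # "Model size": every weight counts, whether or not a given token uses it.
    total = num_experts * params_per_expert + router_params
    # "Per-token work": only the selected experts (plus the router) are used.
    active = experts_per_token * params_per_expert + router_params
    return total, active


# Illustrative dimensions, not taken from the paper.
total, active = moe_ffn_counts(d_model=1024, d_ff=4096, num_experts=64)
print(f"total parameters:      {total:,}")
print(f"active per token:      {active:,}")
print(f"total / active ratio:  {total / active:.1f}")
```

With one expert the two counts coincide, which is the dense case. As experts are added, the total grows roughly linearly while the per-token count barely moves, so the total describes size and the active count tracks compute more closely.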