by gwern 2184 days ago
Probably the most relevant comparison here would be a mix of wallclock-hours and FLOPs. The MoE may be inefficient on a per-parameter level, but it may be the most efficient way to convert FLOPs into model power (sort of like how you currently do better making models wider rather than deeper - experts are the ultimate 'width').
1 comment

It depends on your goal. If you want to measure the number of "artificial synapses" or connections, total parameters is the right figure to use, because each weight is one such connection. If you want to measure the computational cost of training or inference, then wallclock-hours and FLOPs would be better figures.
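
As a rough illustration of how far the two measures can diverge, here is a back-of-the-envelope sketch for a single mixture-of-experts feed-forward layer. Everything in it is an assumption made for the example rather than something taken from the paper: the function name `moe_ffn_counts`, the layer sizes, top-k routing with a linear gate, ignoring biases, and the rule of thumb of 2 FLOPs per weight used per token.

```python
def moe_ffn_counts(d_model, d_ff, num_experts, top_k):
    """Return (total_params, active_params_per_token, flops_per_token)
    for one hypothetical MoE feed-forward layer."""
    # Each expert is assumed to be a two-matrix FFN (up- and down-projection), biases ignored.
    expert_params = 2 * d_model * d_ff
    # Assumed linear gate that produces one logit per expert.
    router_params = d_model * num_experts
    # "Connections": every weight that exists in the layer.
    total_params = num_experts * expert_params + router_params
    # Weights actually touched by one token when it is routed to top_k experts.
    active_params = top_k * expert_params + router_params
    # Rule of thumb: one multiply and one add per weight used (forward pass only).
    flops_per_token = 2 * active_params
    return total_params, active_params, flops_per_token


# num_experts=1 approximates a dense layer (its one-logit gate is negligible).
dense = moe_ffn_counts(d_model=1024, d_ff=4096, num_experts=1, top_k=1)
sparse = moe_ffn_counts(d_model=1024, d_ff=4096, num_experts=64, top_k=2)

print("dense  total/active/FLOPs:", dense)
print("sparse total/active/FLOPs:", sparse)
print("param ratio:", sparse[0] / dense[0], "FLOPs ratio:", sparse[2] / dense[2])
```

With these made-up sizes the sparse layer has roughly 64x the parameters of the dense one but only about 2x the per-token FLOPs, which is why the two figures answer different questions.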

The hundreds of trillions of connections (synapses) in the human brain are sparsely used -- i.e., your entire brain doesn't light up in response to every single stimulus. But we still talk about hundreds of trillions of synapses when we refer to the size of the human brain's connectome. Counting total parameters is a perfectly valid way of measuring model size in the same sense.

More to your point, the authors measure the computational cost of training in Table 3 of the paper, in TPU-core-years, for the various mixture-of-experts models, and compare them to an always-densely-used variant.